Citability Audit

Quick facts

Difficulty: Intermediate
Time: ~2–4 hours per coherent surface; less on re-audits
Prerequisites: Citability, GEO Metrics
What this is: A per-page diagnostic that asks whether each passage survives being lifted into an AI answer, alone
Method spine: Seven structural signals from Citability §4 + a manual chunk-extraction test
Output: Pass/fail/partial per signal × per chunk, severity-tagged, with a rewrite route per finding
No composite score: A 0–100 “citability score” is rejected on principle (see §8). Per-signal verdicts are the load-bearing output

1. What this audit is — and what its output looks like

A citability audit walks one coherent surface — a page, a template, a content cluster — and tests, passage by passage, whether an AI engine can lift the chunk it just retrieved into a synthesized answer. The seven structural signals defined in Citability — self-contained chunks, direct-answer blocks, Q&A, steps, citable tables, heading discipline, liftable quotes — become the audit’s per-passage checklist, and a manual chunk-extraction test (§4) makes each pass-or-fail concrete by running the actual engine on the actual text.

The output is a per-signal pass/fail/partial matrix across the sampled chunks, severity-tagged per finding, with a one-line rewrite route attached. The academic anchor is the mechanism that the paper's visibility figures support: content substance beats keyword tricks (see Aggarwal et al. 2024 for the bounded reading). Microsoft put the operational frame plainly in May 2026: “The unit of value shifts from documents to groundable information” (Bing — Evolving role of the index). The unit being audited here is the passage, not the page.

“Citability audit” is a generic descriptor, not a coinage. The specific seven-signal + chunk-extraction procedure is GEO Wiki’s organizing device, anchored to Citability §4, and the Full GEO Audit consumes it as the deep dive for its Layer-4 content findings.

2. Before you audit — surface, sampling, engine, baseline

Four decisions fix what every later finding means. Get one wrong and the report is uninterpretable — the same decide-before-you-measure discipline the Full GEO Audit and AI Citation Tracking playbooks open with.

| Decision | Options | Rule of thumb |
| --- | --- | --- |
| Surface | One page / one template / a content cluster / a locale | Audit one coherent surface; a mixed-surface finding is not actionable. For templates, audit the highest-traffic instance |
| Sampling | Whole page / a representative passage set of ~5–8 chunks | Sample the TL;DR + one H2 first-paragraph + one table row + one FAQ + one quotable mid-section claim; do not just audit the top of the page |
| Engine | The engine your audience actually uses, declared | Different surfaces reward different chunk shapes (§6 explains the deltas). A clean pass on Perplexity does not guarantee a pass elsewhere |
| Baseline | First audit / delta against a prior audit | Without a baseline you have a snapshot, not a trend; say which you have in the report header |

When to run. As the deep dive when the Full GEO Audit’s Layer-4 step flags a “read but not cited” finding; as a quarterly check on flagship or evergreen pages; plus triggers — a content restructure, a CMS or template migration, a major answer-block rewrite, or new tracking evidence that a competitor is being cited on your queries while you are not.

Inputs on hand before you start. The page rendered the way a crawler sees it (fetched HTML, not the painted DOM; borrow the discipline from AI Crawlers), the page’s heading list as a flat outline, and a fresh incognito session in the engine you declared under the Engine decision above.

3. The seven-signal audit ladder

The audit walks the same seven structural signals Citability §4 names — same order, same definitions — and for each one asks two questions: does this passage carry the signal? and if not, what tax does the failure impose on grounding? The 7-row table below is the page’s spine; each H3 in §5 walks one row.

| # | Signal | Audit question | Governing definition |
| --- | --- | --- | --- |
| 1 | Self-contained chunk | Pick a paragraph at random: does it stand alone, lifted out of context? | Citability §4.1 |
| 2 | Direct-answer / TL;DR block | Is the answer in the first 1–2 sentences of the section? | Citability §4.2 |
| 3 | Q&A / FAQ structure | Do question-shaped headings match real user sub-queries? | Citability §4.3 |
| 4 | Step / HowTo structure | If the page has a procedure, is it a numbered, imperative list? | Citability §4.4 |
| 5 | Citable table / list | Does each row read alone, with caption and column labels? | Citability §4.5 |
| 6 | Heading-hierarchy discipline | Clean H2 → H3 nesting, no skipped levels, no decorative headings? | Citability §4.6 |
| 7 | Liftable quotable sentence | Is there a crisp standalone claim per H2 that survives extraction? | Citability §4.7 |

Signals 1, 2, and 7 are the highest-leverage in practice: they govern whether any atom of the page is liftable. Signals 3 and 4 are conditional on the page being the right form for Q&A or HowTo at all; do not invent either where it does not fit, because that is the §7 fake-fix anti-pattern. Signal 5 scales with how tabular the page is; Signal 6 is cheap to audit and cheap to fix. Walk the table top-down, but rank findings by severity, not by signal number (§8).

4. Step 1 — The chunk-extraction test (start here, always)

The load-bearing manual test, named explicitly. Lift one passage out of context, paste it alone into ChatGPT search or Perplexity, and ask the engine to summarize or answer using only that text. If the engine fills in obvious gaps, asks for context, or completes incorrectly, the chunk is not self-contained. Every other check in the playbook is a proxy for can this be lifted? — this test asks the question directly, with the actual engine, in the actual mode.

The procedure:

  1. Pick the sample. A representative 5–8 chunks per the §2 sampling decision: the TL;DR, one H2 first paragraph, one table row, one FAQ answer, one quotable claim mid-section.
  2. Lift each chunk verbatim. Strip surrounding paragraphs, headings, and “as above” or “see §3” references. The lifted version is what an AI engine sees after retrieval has fired.
  3. Paste alone into a fresh engine session. No prior turn, no history, no system prompt — personalized sessions are not reproducible. Ask: “What is this passage saying?” or “Use only this passage to answer: [the page’s target query].”
  4. Score the chunk on three outcomes:
    • ✅ liftable: engine answers cleanly using only the chunk.
    • ⚠️ partial: engine hedges, asks for context, or completes the missing setup incorrectly. Note exactly what context was missing.
    • ❌ broken: engine cannot parse the chunk at all (pronoun chain, table-without-caption, hedged multi-clause sentence).
  5. Log the failure shape per signal. Map each ⚠️ and ❌ back to one of the seven signals in §3 — that mapping is the finding.
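
Before pasting chunks into the engine, a rough pre-screen can help decide which passages to sample first. The sketch below is only a surface-pattern heuristic, assuming a few hypothetical regular expressions for the failure shapes named in this playbook (back-references, section pointers, pronoun openers). It does not replace step 3: only the engine run produces a verdict.

```python
import re

# Hypothetical surface patterns (not part of the playbook's method); tune per site.
DANGLING_PATTERNS = {
    "back-reference": re.compile(
        r"\bas (?:above|(?:we )?(?:noted|mentioned))\b", re.IGNORECASE
    ),
    "section pointer": re.compile(
        r"\b(?:see|in)\s+§\s*\d+|\bsee the (?:earlier|previous|above)\b",
        re.IGNORECASE,
    ),
    "pronoun opener": re.compile(
        r"^\s*(?:it|this|that|they|these|those)\b", re.IGNORECASE
    ),
}


def prescreen_chunk(chunk: str) -> list[str]:
    """Return the names of the surface patterns found in a lifted chunk.

    An empty list does NOT mean the chunk is liftable; it only means none of
    the listed patterns matched. False positives are expected as well, for
    example a paragraph that legitimately opens with "This".
    """
    return [name for name, pattern in DANGLING_PATTERNS.items() if pattern.search(chunk)]


if __name__ == "__main__":
    sample = "As above, it applies; but as we noted in §2, the precedence rules can override."
    print(prescreen_chunk(sample))
```

Chunks that trip a pattern are good candidates for the sample, and the matched pattern name can seed the failure_shape note once the engine test confirms the problem.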

The canonical row schema (so manual results aggregate into the §8 report):

audit_date          UTC date of the audit
page_url            URL of the audited page
chunk_excerpt       first 120 chars of the lifted passage
signal_n            which of the seven signals (1–7) the failure maps to
outcome             liftable | partial | broken
failure_shape       short note (pronoun ref / no caption / hedged / …)
severity            blocker | major | minor (see §8)
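
One way to keep manual results consistent with this schema is to validate each row as it is logged. The following is a minimal sketch, not a prescribed tool: the class name `AuditRow`, the helper `rows_to_csv`, and the choice to allow an empty severity on liftable rows are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

OUTCOMES = {"liftable", "partial", "broken"}
SEVERITIES = {"blocker", "major", "minor"}


@dataclass
class AuditRow:
    """One logged result, mirroring the row schema above (names are from the schema)."""

    audit_date: str      # UTC date of the audit, e.g. "2026-05-25"
    page_url: str        # URL of the audited page
    chunk_excerpt: str   # truncated to the first 120 chars on creation
    signal_n: int        # 1-7, the signal the failure maps to
    outcome: str         # liftable | partial | broken
    failure_shape: str   # short free-text note
    severity: str        # blocker | major | minor

    def __post_init__(self) -> None:
        self.chunk_excerpt = self.chunk_excerpt[:120]
        if not 1 <= self.signal_n <= 7:
            raise ValueError(f"signal_n must be 1-7, got {self.signal_n}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        # Assumption: a liftable row has no finding, so an empty severity is allowed.
        if self.severity not in SEVERITIES and not (
            self.outcome == "liftable" and self.severity == ""
        ):
            raise ValueError(f"unknown severity: {self.severity!r}")


def rows_to_csv(rows: list[AuditRow]) -> str:
    """Serialize rows to CSV text with the schema's column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AuditRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()
```

Rejecting malformed rows at logging time keeps the later per-chunk × per-signal aggregation from silently absorbing typos such as an outcome of "fail" or a signal number of 8.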

4.1 Worked example — three variants of one passage

Three versions of the same fact about robots.txt, each lifted alone and pasted into a fresh ChatGPT search session with the prompt “What is this passage saying?”. The contrast is the test.

Version ✅ — liftable. “A robots.txt file is a plain-text file at the root of a domain that tells crawlers which URLs they may fetch. Each rule names a user-agent and a path.” The engine returns a clean restatement: definition, location, structure. No hedging, no question back. Verdict: ✅ — Signal 1 passes.

Version ⚠️ — partial. “It tells crawlers which URLs they may fetch — see the earlier diagram for the path-matching rules.” The engine hedges: “This appears to describe a file that controls crawler access, but the subject is unclear — what file is being referenced?” The pronoun and the dangling reference to “the earlier diagram” force the engine to ask. Verdict: ⚠️ — Signal 1 partial; pronoun reference + external pointer break self-containment.

Version ❌ — broken. “As above, it applies; but as we noted in §2, the precedence rules can override.” The engine cannot resolve any subject or operation; it returns a request for the original document. Verdict: ❌ — Signal 1 broken; pure pronoun chain with no anchor.

The same exercise on the TL;DR, on an H2 first paragraph, and on a table row gives you the per-chunk row of the §8 finding matrix. Tools that score citability without running this test are predicting from surface features; the test produces ground truth.

5. Step 2 — Walk the seven signals (per-signal audit micro-tables)

Each H3 below uses the same 4-row micro-table — audit question / what good looks like / failure shape / rewrite route — so findings are directly comparable across signals. Definitions live in Citability §4; examples come from §4 above.

5.1 Signal 1 — Self-contained chunk

| Audit question | Lifted alone, does this paragraph resolve without its neighbors? |
| What good looks like | One paragraph stating its own subject + claim + (where needed) attribution |
| Failure shape | Pronoun chains, “as above” / “see §X”, dangling references to a diagram or earlier table |
| Rewrite route | Writing for AI Citation: Self-contained chunks |

5.2 Signal 2 — Direct-answer / TL;DR block

| Audit question | Is the answer in the first one or two sentences of the section? |
| What good looks like | An inverted-pyramid lede stating the claim, then justification |
| Failure shape | Two or three paragraphs of preamble before the claim appears |
| Rewrite route | Writing for AI Citation: Inverted-pyramid sections |

5.3 Signal 3 — Q&A / FAQ structure

| Audit question | Do question-shaped headings match queries a real user would type? |
| What good looks like | A heading such as “### My page was retrieved but not cited — why?”, matched to a real query |
| Failure shape | Topic headings no one searches for, or invented FAQs (the §7 anti-pattern) |
| Rewrite route | Writing for AI Citation: Question-shaped headings |

Conditional — do not invent Q&A where the page form does not call for it (that is the §7 anti-pattern). Question shape ties to query fan-out, see Answer Loop §3.1.

5.4 Signal 4 — Step / HowTo structure

| Audit question | If the page contains a procedure, is it a numbered, imperative list? |
| What good looks like | One action per step, lifted cleanly as a unit, no surrounding prose required |
| Failure shape | “First you should consider… and then it may be worth…” prose with steps embedded |
| Rewrite route | Writing for AI Citation: Step lists |

5.5 Signal 5 — Citable table / list

| Audit question | Does each row read alone, with caption and self-explanatory column labels? |
| What good looks like | Discrete, captioned rows the engine can quote whole |
| Failure shape | Tables whose rows mean nothing without the surrounding paragraph |
| Rewrite route | Writing for AI Citation: Self-labeling tables |

Microsoft is explicit: “Clear headings, tables, and FAQ sections help surface key information and make content easier for AI systems to reference accurately” (Bing AI Performance).
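
The mechanical half of this check (does each table carry a caption and header cells at all?) can be sketched against the fetched HTML. This is an illustration under stated assumptions: the class `TableLabelCheck` and function `audit_tables` are hypothetical names, tables are assumed not to be nested, and the presence of `<caption>` and `<th>` says nothing about whether the labels are actually self-explanatory, which still needs a human read.

```python
from html.parser import HTMLParser


class TableLabelCheck(HTMLParser):
    """Record, per <table>, whether a <caption> and any <th> cell is present.

    Assumes tables are not nested; a nested table would be counted separately
    and would end tracking of its parent early.
    """

    def __init__(self) -> None:
        super().__init__()
        self.tables: list[dict[str, bool]] = []
        self._open: dict[str, bool] | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._open = {"has_caption": False, "has_header_cells": False}
            self.tables.append(self._open)
        elif self._open is not None and tag == "caption":
            self._open["has_caption"] = True
        elif self._open is not None and tag == "th":
            self._open["has_header_cells"] = True

    def handle_endtag(self, tag):
        if tag == "table":
            self._open = None


def audit_tables(html: str) -> list[dict[str, bool]]:
    """Return one dict per table found in the HTML, in document order."""
    parser = TableLabelCheck()
    parser.feed(html)
    parser.close()
    return parser.tables
```

A table that reports False for either key is a likely Signal 5 candidate for the chunk-extraction test; a table that reports True for both still has to pass the row-reads-alone question.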

5.6 Signal 6 — Heading-hierarchy discipline

| Audit question | Clean H2 → H3 nesting, no skipped levels, no decorative headings? |
| What good looks like | Every H2/H3 names a real unit; flat outline reads like a table of contents |
| Failure shape | Skipped levels (H2 → H4), headings used for visual size, or duplicated H1s |
| Rewrite route | Writing for AI Citation: Heading discipline |
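
Because this signal is largely structural, part of it can be checked mechanically. The sketch below, using only the Python standard library, builds the flat heading outline from fetched HTML and flags two of the failure shapes above: skipped levels and duplicated H1s. The names `HeadingOutline` and `heading_findings` are hypothetical, and decorative headings (used for visual size) are not detectable this way.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for h1-h6 elements in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.outline: list[tuple[int, str]] = []
        self._level: int | None = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = " ".join("".join(self._buffer).split())
            self.outline.append((self._level, text))
            self._level = None


def heading_findings(html: str) -> tuple[list[tuple[int, str]], list[str]]:
    """Return the flat outline plus a list of structural findings."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()

    findings: list[str] = []
    h1_count = sum(1 for level, _ in parser.outline if level == 1)
    if h1_count > 1:
        findings.append(f"duplicated H1 ({h1_count} found)")

    previous = None
    for level, text in parser.outline:
        # Going deeper by more than one level is a skip; going shallower is fine.
        if previous is not None and level > previous + 1:
            findings.append(f"skipped level: H{previous} -> H{level} at '{text}'")
        previous = level
    return parser.outline, findings
```

The returned outline doubles as the flat heading list that §2 asks for as an audit input, so it can be read top to bottom as a table of contents before any findings are logged.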

5.7 Signal 7 — Liftable quotable sentence

| Audit question | Is there a crisp standalone claim per H2 that survives extraction with attribution intact? |
| What good looks like | “Retrieval makes you a candidate; grounding decides if you are used.” |
| Failure shape | “It could perhaps be argued that, in some cases, retrieval may not always lead to use.” |
| Rewrite route | Writing for AI Citation: Quotable claims |

6. How the audit varies by surface — invariant vs delta

The seven signals are invariant — they win everywhere. What varies is which failure each surface penalizes hardest, which shapes the §2 engine choice.

| Surface | Most load-bearing signals | Why |
| --- | --- | --- |
| Perplexity | 1, 5, 7 | Citation-dense by design; rewards tight liftable chunks and quotable claims the most |
| ChatGPT search | 2 | Live fetch; rewards a direct-answer block near the top of the fetched page |
| Google AI Overviews | 3, 6 | Index-based; rewards heading discipline and Q&A structure that match query fan-out |

A clean pass on one surface does not generalize. The audit is also not language-invariant in practice — chunk and answer-block citability shifts in Chinese versus English, see Multilingual GEO.

7. Fake-fix anti-patterns — when a remediation trips a different filter

Patterns practitioners reach for after an audit finds a gap, that look like a remediation but trip a different AI-spam or trust filter. Concept-level over-application examples sit in Citability §6; these are the operational complement.

| Anti-pattern | Looks like the fix for | Why it actually fails |
| --- | --- | --- |
| Over-chunking the whole page into one-sentence paragraphs | Signal 1 (self-contained) | Fragments lose meaning; nothing is a coherent liftable answer |
| Inventing FAQ entries no real user asks | Signal 3 (Q&A) | Recognized as boilerplate, down-weighted as low-effort content |
| Adding manufactured statistics to look “citable” | Signal 7 (quotable claim) | Unsourced numbers fail trust filtering (see E-E-A-T) |
| Lifting boilerplate from another page of yours | Signal 1 (self-contained) | Near-duplicate detection (see AI Content Detection) |

Google says the quiet part out loud in its May 2026 optimization guide: “There’s no requirement to break your content into tiny pieces for AI to better understand it. Google systems are able to understand the nuance of multiple topics on a page” (AI Optimization Guide). Over-chunking is the dominant fake fix because it superficially mimics Signal 1 while erasing the very property Signal 1 measures — coherent self-containment.

The position, stated plainly: citability is necessary, not sufficient. Structure without substance is detectable and penalized; trust gaps still defeat perfect chunking. Under competition, many such rewrites stop working — see C-SEO Bench.

8. Scoring & the report deliverable

The load-bearing output is per-signal pass / fail / partial, one row per chunk audited, with severity attached. Signals 1, 2, and 7 default to major when broken; 3, 4, and 6 default to minor unless the page is the wrong form entirely; 5 scales with how much of the page is tabular. The severity model is anchored the same way Full GEO Audit §5 anchors its findings — to which signal failed, not to a global rubric.
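
The default-severity rule in this paragraph can be written down as a small lookup so that two auditors start from the same value. This is a sketch under loud assumptions: the function name `default_severity`, the `tabular_share` parameter, and especially the 0.5 cut-off for Signal 5 are hypothetical, since the playbook only says that Signal 5 "scales with how much of the page is tabular". The "wrong form entirely" exception and any promotion to blocker are left to the auditor's judgment.

```python
def default_severity(signal_n: int, tabular_share: float = 0.0) -> str:
    """Starting severity for a broken signal; the auditor may override it.

    tabular_share: rough fraction (0.0-1.0) of the page that is tables or lists.
    The 0.5 threshold below is an illustrative assumption, not a playbook rule.
    """
    if signal_n in (1, 2, 7):
        return "major"
    if signal_n in (3, 4, 6):
        return "minor"
    if signal_n == 5:
        return "major" if tabular_share >= 0.5 else "minor"
    raise ValueError(f"signal_n must be 1-7, got {signal_n}")


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, default_severity(n, tabular_share=0.6))
```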

A 0–100 “citability score” is rejected on principle. Every score on the market ships without a published formula: Topify sells “a 0–100 grade of how AI-ready your website is” with no method disclosed; Citability.ai returns a “Combined Score: 62” aggregating three subscores with no weights published; Mangools’ AI Search Grader states the score is “weighted by market share” without naming the weights. A bare number whose method is opaque is a rumour, not a measurement — the same provenance discipline GEO Metrics enforces for every reported number applies in stronger form here. Per-signal verdicts are reproducible; a single score is not.

What the report ships, every time: a header (date, surface audited, sampling decision, engine tested), the per-chunk × per-signal matrix (✅/⚠️/❌ per cell), severity-ranked findings with the §4 failure shape inline, a one-line rewrite route per finding linking out to Writing for AI Citation, and — if a prior audit exists — a baseline delta. A re-audit re-runs the chunk-extraction test on the changed chunks only; signals you did not touch are trusted forward.
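
The matrix and the baseline delta can be assembled from logged rows with very little code. The sketch below assumes rows are plain dicts carrying the chunk_excerpt, signal_n, and outcome fields from the §4 schema; `build_matrix` and `baseline_delta` are hypothetical helper names, and chunks are matched between audits by their excerpt text, which is itself an assumption that breaks if a rewrite changes the first 120 characters.

```python
from collections import defaultdict

MARK = {"liftable": "✅", "partial": "⚠️", "broken": "❌"}


def build_matrix(rows):
    """Map chunk_excerpt -> {signal_n: mark} from an iterable of row dicts."""
    matrix = defaultdict(dict)
    for row in rows:
        matrix[row["chunk_excerpt"]][row["signal_n"]] = MARK[row["outcome"]]
    return dict(matrix)


def baseline_delta(previous, current):
    """Return {(chunk, signal_n): (before, after)} for cells whose mark changed.

    Cells that exist only in the current audit are not reported as changes,
    because there is no prior verdict to compare against.
    """
    changes = {}
    for chunk, cells in current.items():
        for signal_n, mark in cells.items():
            before = previous.get(chunk, {}).get(signal_n)
            if before is not None and before != mark:
                changes[(chunk, signal_n)] = (before, mark)
    return changes
```

Keeping the delta limited to cells with a prior verdict matches the re-audit rule above: only changed chunks are re-tested, and untouched signals carry forward unchanged.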

9. Validity threats & pitfalls

Do not ship a report without clearing every item.

  • Auditing the painted DOM, not the fetched HTML — client-side-rendered content hides signals from the crawler; test what a non-JS fetch sees (SSR for AI Crawlers).
  • Sampling only the top of the page — Signal 1 fails most often in the middle of long pages; sample H2 first paragraphs across the whole document.
  • Running the chunk-extraction test in a personalized session — logged-in or history-on sessions return non-reproducible answers.
  • Treating a 6-of-7 pass as a page pass — a page can pass six signals and still fail to be cited if Signal 1 broke.
  • Single-engine generalization — a clean Perplexity pass does not generalize to AI Overviews; declare the engine in the report header.
  • Locale-mixing — see Multilingual GEO; do not generalize a Chinese-page audit to English or vice versa.
  • Unverified citation reported as a pass — a passage the engine quoted may not actually support the claim it was attached to (Liu et al. 2023); reconcile via AI Citation Tracking §4.1.
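
The first pitfall in this list can be partly guarded against in code: before testing a chunk, confirm that its text is present in the HTML a non-JS fetch returns. The sketch below takes the fetched HTML as a string (how it is fetched is left out) and uses only the standard library; `VisibleText` and `chunks_missing_from_fetch` are hypothetical names, and the whitespace-normalized substring match is a simplifying assumption that can misfire when inline markup splits a word.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, and <template> elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return " ".join(text.split()).lower()


def chunks_missing_from_fetch(fetched_html: str, chunks: list[str]) -> list[str]:
    """Return the chunks whose text does not appear in the fetched HTML's text."""
    parser = VisibleText()
    parser.feed(fetched_html)
    parser.close()
    page_text = _normalise(" ".join(parser.parts))
    return [chunk for chunk in chunks if _normalise(chunk) not in page_text]
```

Any chunk this returns was probably copied from the painted DOM and may be injected client-side, so it should be re-lifted from the fetched HTML (or flagged as a render-layer finding) before the chunk-extraction test is run on it.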

10. Further reading

References

Academic:

  1. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K. & Deshpande, A. (2024). GEO: Generative Engine Optimization. KDD ’24. arXiv:2311.09735
  2. Puerto, H., Gubri, M., Green, C., Oh, S. J. & Yun, S. (2025). C-SEO Bench: Does Conversational SEO Work? NeurIPS ’25 Datasets & Benchmarks. arXiv:2506.11097
  3. Liu, N. F., Zhang, T. & Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. Findings of EMNLP 2023. arXiv:2304.09848

Official platform documentation (verified 2026-05):

Frequently asked questions

Is a 'citability audit' a real thing or just GEO Wiki's term?
The label is generic — agencies and tools have shipped audit products under various names since 2024. The specific seven-signal diagnostic + chunk-extraction test in this playbook is GEO Wiki's organizing device, anchored to the Citability concept entry, not an established standard. Use it because the chunk-extraction test produces ground truth, not because anyone ratified the procedure.
How is this different from the Full GEO Audit?
The Full GEO Audit is the bottom-up walk across six layers — access, render, structure, content, off-site authority, outcome — and it consumes this playbook as the deep dive for the content & trust layer. If your full audit surfaces a 'retrieved but not cited' finding, this is the per-page procedure that diagnoses why. As a standalone, this audit answers one narrower question; as a sub-audit it feeds the bigger ladder.
Do I need ChatGPT or Perplexity to run the audit?
Any chat interface that does retrieval and shows you what it pulled will work — ChatGPT search, Perplexity, Gemini, or Bing Copilot. The two engines this playbook quotes verbatim in §4 are ChatGPT and Perplexity because they expose enough of the retrieved passage to verify the test. Run the test on the engine your audience actually uses; a clean pass on Perplexity does not guarantee a pass on AI Overviews — see §6.
Why does this playbook refuse to give a 0–100 citability score?
Because every score on the market ships without a published formula. Topify advertises 'a 0–100 grade of how AI-ready your website is'; Citability.ai returns a 'Combined Score: 62' that mixes three subscores with no weights disclosed; Mangools states the score is 'weighted by market share' without publishing the weights. A bare number whose method is opaque is a rumour, not a measurement. The per-signal pass/fail matrix in §8 is what ships.
What's the cheapest, highest-signal check if I only have 30 minutes?
Pick three paragraphs at random from your most important page, lift each one verbatim out of context, paste it alone into ChatGPT search or Perplexity, and ask 'what is this passage saying?'. Every passage the engine hedges on, asks for context for, or completes incorrectly is a citability finding — usually a Signal 1 (self-contained chunk) failure. This is §4 in compressed form, and it is the test no automated tool actually runs.

Sources

Primary

  1. GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
  2. GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
  3. A new resource for optimizing for generative AI in Google Search · Google Search Central · 2026-05-15
  4. AI Optimization Guide · Google Search Central · 2026-05-15
  5. Evolving role of the index: From ranking pages to supporting answers · Microsoft Bing · 2026-05-06
  6. Introducing AI Performance in Bing Webmaster Tools (Public Preview) · Microsoft Bing · 2026-02-10
  7. AI features and your website · Google Search Central · 2025-12-10
  8. Top ways to ensure your content performs well in Google's AI experiences on Search · Google Search Central · 2025-05-01
  9. ChatGPT search — OpenAI Help Center · OpenAI
  10. What is an answer engine, and how does Perplexity work as one? · Perplexity AI

Secondary

  1. C-SEO Bench: Does Conversational SEO Work? (Puerto et al., NeurIPS '25 D&B) · arXiv / NeurIPS '25 D&B
  2. Evaluating Verifiability in Generative Search Engines (Liu et al., EMNLP '23 Findings) · arXiv / EMNLP '23 Findings
Last updated: 2026-05-25 · Authors: Ray Yang · Topic: Practice