Citability Audit
Quick facts
- Difficulty: Intermediate
- Time: ~2–4 hours per coherent surface; less on re-audits
- Prerequisites: Citability, GEO Metrics
- What this is: Per-page diagnostic. Does each passage survive being lifted into an AI answer, alone?
- Method spine: Seven structural signals from Citability §4 + a manual chunk-extraction test
- Output: Pass/fail/partial per signal × per chunk, severity-tagged, with a rewrite route per finding
- No composite score: A 0–100 'citability score' is rejected on principle (see §8). Per-signal verdicts are the load-bearing output
1. What this audit is — and what its output looks like
A citability audit walks one coherent surface — a page, a template, a content cluster — and tests, passage by passage, whether an AI engine can lift the chunk it just retrieved into a synthesized answer. The seven structural signals defined in Citability — self-contained chunks, direct-answer blocks, Q&A, steps, citable tables, heading discipline, liftable quotes — become the audit’s per-passage checklist, and a manual chunk-extraction test (§4) makes each pass-or-fail concrete by running the actual engine on the actual text.
The output is a per-signal pass/fail/partial matrix across the sampled chunks, severity-tagged per finding, with a one-line rewrite route attached. The academic anchor is the mechanism the paper’s visibility figures support, that content substance beats keyword tricks, with the bounded reading at Aggarwal et al. 2024. Microsoft put the operational frame plainly in May 2026: “The unit of value shifts from documents to groundable information” (Bing — Evolving role of the index). The unit being audited here is the passage, not the page.
“Citability audit” is a generic descriptor, not a coinage. The specific seven-signal + chunk-extraction procedure is GEO Wiki’s organizing device, anchored to Citability §4, and the Full GEO Audit consumes it as the deep dive for its Layer-4 content findings.
2. Before you audit — surface, sampling, engine, baseline
Four decisions fix what every later finding means. Get one wrong and the report is uninterpretable — the same decide-before-you-measure discipline the Full GEO Audit and AI Citation Tracking playbooks open with.
| Decision | Options | Rule of thumb |
|---|---|---|
| Surface | One page / one template / a content cluster / a locale | Audit one coherent surface; a mixed-surface finding is not actionable. For templates, audit the highest-traffic instance |
| Sampling | Whole page / a representative passage set of ~5–8 chunks | Sample the TL;DR + one H2 first-paragraph + one table row + one FAQ + one quotable mid-section claim; do not just audit the top of the page |
| Engine | The engine your audience actually uses, declared | Different surfaces reward different chunk shapes — §6 explains the deltas. A clean pass on Perplexity does not guarantee a pass elsewhere |
| Baseline | First audit / delta against a prior audit | Without a baseline you have a snapshot, not a trend; say which you have in the report header |
When to run. As the deep dive when the Full GEO Audit’s Layer-4 step flags a “read but not cited” finding; as a quarterly check on flagship or evergreen pages; plus triggers — a content restructure, a CMS or template migration, a major answer-block rewrite, or new tracking evidence that a competitor is being cited on your queries while you are not.
Inputs on hand before you start. The page rendered the way a crawler sees it (fetched HTML, not the painted DOM; borrow the discipline from AI Crawlers), the page’s heading list as a flat outline, and a fresh incognito session in the engine you declared in the Engine decision above.
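The flat heading outline can be produced from the fetched HTML with the standard library alone. The sketch below is one possible way to do it, not a prescribed tool: the `HeadingOutline` class, the `heading_outline` helper, and the sample markup are all hypothetical names and data introduced for illustration. It parses a string of HTML as fetched, so anything a client-side script would add later never appears in the outline.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for h1-h6 elements in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (level, text) tuples
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments seen inside the open heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "h2" also covers "<H2>".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace left over from inline markup.
            text = " ".join("".join(self._parts).split())
            self.outline.append((self._level, text))
            self._level = None


def heading_outline(html: str):
    """Return the flat heading outline of an HTML string."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline


# Hypothetical fetched markup; the empty div stands in for content that a
# client-side script would paint later and that a non-JS fetch never sees.
sample_html = """
<h1>Citability Audit</h1>
<h2>Before you audit</h2>
<h3>Inputs <em>on hand</em></h3>
<div id="app"></div>
"""

outline = heading_outline(sample_html)
for level, text in outline:
    print("  " * (level - 1) + f"H{level} {text}")
```

Feeding the parser the raw response body, rather than a browser's serialized DOM, is what keeps the outline aligned with what a crawler receives.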
3. The seven-signal audit ladder
The audit walks the same seven structural signals Citability §4 names — same order, same definitions — and for each one asks two questions: does this passage carry the signal? and if not, what tax does the failure impose on grounding? The 7-row table below is the page’s spine; each H3 in §5 walks one row.
| # | Signal | Audit question | Governing definition |
|---|---|---|---|
| 1 | Self-contained chunk | Pick a paragraph at random — does it stand alone, lifted out of context? | Citability §4.1 |
| 2 | Direct-answer / TL;DR block | Is the answer in the first 1–2 sentences of the section? | Citability §4.2 |
| 3 | Q&A / FAQ structure | Do question-shaped headings match real user sub-queries? | Citability §4.3 |
| 4 | Step / HowTo structure | If the page has a procedure, is it a numbered, imperative list? | Citability §4.4 |
| 5 | Citable table / list | Does each row read alone, with caption and column labels? | Citability §4.5 |
| 6 | Heading-hierarchy discipline | Clean H2 → H3 nesting, no skipped levels, no decorative headings? | Citability §4.6 |
| 7 | Liftable quotable sentence | Is there a crisp standalone claim per H2 that survives extraction? | Citability §4.7 |
Signals 1, 2, and 7 are the highest-leverage in practice: they govern whether any atom of the page is liftable. Signals 3 and 4 are conditional on the page being the right form for Q&A or HowTo at all; do not invent either where it does not fit, because that is the §7 fake-fix anti-pattern. Signal 5 scales with how tabular the page is; Signal 6 is cheap to audit and cheap to fix. Walk the table top-down, but rank findings by severity, not by signal number (§8).
4. Step 1 — The chunk-extraction test (start here, always)
The load-bearing manual test, named explicitly. Lift one passage out of context, paste it alone into ChatGPT search or Perplexity, and ask the engine to summarize or answer using only that text. If the engine fills in obvious gaps, asks for context, or completes incorrectly, the chunk is not self-contained. Every other check in the playbook is a proxy for can this be lifted? — this test asks the question directly, with the actual engine, in the actual mode.
The procedure:
- Pick the sample. A representative 5–8 chunks per the §2 sampling decision: the TL;DR, one H2 first paragraph, one table row, one FAQ answer, one quotable claim mid-section.
- Lift each chunk verbatim. Strip surrounding paragraphs, headings, and “as above” or “see §3” references. The lifted version is what an AI engine sees after retrieval has fired.
- Paste alone into a fresh engine session. No prior turn, no history, no system prompt — personalized sessions are not reproducible. Ask: “What is this passage saying?” or “Use only this passage to answer: [the page’s target query].”
- Score the chunk on three outcomes:
  - ✅ liftable — engine answers cleanly using only the chunk.
  - ⚠️ partial — engine hedges, asks for context, or completes the missing setup incorrectly. Note exactly what context was missing.
  - ❌ broken — engine cannot parse the chunk at all (pronoun chain, table-without-caption, hedged multi-clause sentence).
- Log the failure shape per signal. Map each ⚠️ and ❌ back to one of the seven signals in §3 — that mapping is the finding.
The canonical row schema (so manual results aggregate into the §8 report):
| Field | Meaning |
|---|---|
| `audit_date` | UTC date of the audit |
| `page_url` | URL of the audited page |
| `chunk_excerpt` | first 120 chars of the lifted passage |
| `signal_n` | which of the seven signals (1–7) the failure maps to |
| `outcome` | liftable \| partial \| broken |
| `failure_shape` | short note (pronoun ref / no caption / hedged / …) |
| `severity` | blocker \| major \| minor (see §8) |
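One way to keep manually logged rows consistent is to give the schema a typed container that rejects values outside the listed vocabularies. The sketch below is a minimal illustration under assumptions: the `AuditRow` class name, the choice of a dataclass, and the example URL and values are hypothetical, while the field names and allowed values come from the schema above.

```python
from dataclasses import dataclass, asdict
from datetime import date

OUTCOMES = ("liftable", "partial", "broken")
SEVERITIES = ("blocker", "major", "minor")


@dataclass
class AuditRow:
    audit_date: date      # UTC date of the audit
    page_url: str         # URL of the audited page
    chunk_excerpt: str    # truncated to the first 120 characters
    signal_n: int         # 1-7, the signal the failure maps to
    outcome: str          # one of OUTCOMES
    failure_shape: str    # short free-text note
    severity: str         # one of SEVERITIES

    def __post_init__(self):
        # Enforce the 120-character excerpt and the closed vocabularies.
        self.chunk_excerpt = self.chunk_excerpt[:120]
        if self.signal_n not in range(1, 8):
            raise ValueError(f"signal_n must be 1-7, got {self.signal_n}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


# Hypothetical example row; the URL and date are placeholders.
row = AuditRow(
    audit_date=date(2026, 5, 20),
    page_url="https://example.com/robots-guide",
    chunk_excerpt="It tells crawlers which URLs they may fetch",
    signal_n=1,
    outcome="partial",
    failure_shape="pronoun ref",
    severity="major",
)
print(asdict(row))
```

`asdict(row)` yields a plain dictionary, which makes it straightforward to write rows out as CSV or JSON for later aggregation.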
4.1 Worked example — three variants of one passage
Three versions of the same fact about robots.txt, each lifted alone and pasted into a fresh ChatGPT search session with the prompt “What is this passage saying?”. The contrast is the test.
Version ✅ — liftable. “A robots.txt file is a plain-text file at the root of a domain that tells crawlers which URLs they may fetch. Each rule names a user-agent and a path.” The engine returns a clean restatement: definition, location, structure. No hedging, no question back. Verdict: ✅ — Signal 1 passes.
Version ⚠️ — partial. “It tells crawlers which URLs they may fetch — see the earlier diagram for the path-matching rules.” The engine hedges: “This appears to describe a file that controls crawler access, but the subject is unclear — what file is being referenced?” The pronoun and the dangling reference to “the earlier diagram” force the engine to ask. Verdict: ⚠️ — Signal 1 partial; pronoun reference + external pointer break self-containment.
Version ❌ — broken. “As above, it applies; but as we noted in §2, the precedence rules can override.” The engine cannot resolve any subject or operation; it returns a request for the original document. Verdict: ❌ — Signal 1 broken; pure pronoun chain with no anchor.
The same exercise on the TL;DR, on an H2 first paragraph, and on a table row gives you the per-chunk row of the §8 finding matrix. Tools that score citability without running this test are predicting from surface features; the test produces ground truth.
5. Step 2 — Walk the seven signals (per-signal audit micro-tables)
Each H3 below uses the same 4-row micro-table — audit question / what good looks like / failure shape / rewrite route — so findings are directly comparable across signals. Definitions live in Citability §4; examples come from §4 above.
5.1 Signal 1 — Self-contained chunk
| Field | Detail |
|---|---|
| Audit question | Lifted alone, does this paragraph resolve without its neighbors? |
| What good looks like | One paragraph stating its own subject + claim + (where needed) attribution |
| Failure shape | Pronoun chains, “as above” / “see §X”, dangling references to a diagram or earlier table |
| Rewrite route | Writing for AI Citation — Self-contained chunks |
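The failure shapes in this row are regular enough that a quick pattern pre-screen can help decide which chunks to prioritize for the manual test. The sketch below is only a triage aid predicting from surface features, so it does not replace pasting the chunk into the engine. The pattern names, the regular expressions, and the `context_flags` helper are assumptions made for illustration; the sample passages are lightly adapted from the §4.1 worked example.

```python
import re

# Hypothetical surface patterns for context-dependent wording.
CONTEXT_PATTERNS = {
    "pronoun opener": re.compile(
        r'^\s*["“]?(it|this|that|they|these|those)\b', re.IGNORECASE
    ),
    "back-reference": re.compile(
        r"\bas (above|noted|mentioned|discussed)\b|\bas we noted\b", re.IGNORECASE
    ),
    "section pointer": re.compile(
        r"§\s*\d+|\bsee (above|below|section\s*\d+)\b", re.IGNORECASE
    ),
    "dangling figure/table": re.compile(
        r"\b(the|this) (earlier|previous|above|following) (diagram|table|figure)\b",
        re.IGNORECASE,
    ),
}


def context_flags(chunk: str):
    """Return the names of every pattern that matches the lifted chunk."""
    return [name for name, pattern in CONTEXT_PATTERNS.items() if pattern.search(chunk)]


liftable = (
    "A robots.txt file is a plain-text file at the root of a domain that tells "
    "crawlers which URLs they may fetch. Each rule names a user-agent and a path."
)
partial = (
    "It tells crawlers which URLs they may fetch. "
    "See the earlier diagram for the path-matching rules."
)
broken = "As above, it applies; but as we noted in §2, the precedence rules can override."

for label, chunk in [("liftable", liftable), ("partial", partial), ("broken", broken)]:
    print(label, context_flags(chunk))
```

An empty list means no obvious surface marker was found, not that the chunk is self-contained; the engine test remains the deciding check.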
5.2 Signal 2 — Direct-answer / TL;DR block
| Field | Detail |
|---|---|
| Audit question | Is the answer in the first one or two sentences of the section? |
| What good looks like | An inverted-pyramid lede stating the claim, then justification |
| Failure shape | Two or three paragraphs of preamble before the claim appears |
| Rewrite route | Writing for AI Citation — Inverted-pyramid sections |
5.3 Signal 3 — Q&A / FAQ structure
| Field | Detail |
|---|---|
| Audit question | Do question-shaped headings match queries a real user would type? |
| What good looks like | `### My page was retrieved but not cited — why?` matched to a real query |
| Failure shape | Topic headings no one searches for, or invented FAQs (the §7 anti-pattern) |
| Rewrite route | Writing for AI Citation — Question-shaped headings |
Conditional: do not invent Q&A where the page form does not call for it (that is the §7 anti-pattern). Question shape ties to query fan-out; see Answer Loop §3.1.
5.4 Signal 4 — Step / HowTo structure
| Field | Detail |
|---|---|
| Audit question | If the page contains a procedure, is it a numbered, imperative list? |
| What good looks like | One action per step, lifted cleanly as a unit, no surrounding prose required |
| Failure shape | “First you should consider… and then it may be worth…” prose with steps embedded |
| Rewrite route | Writing for AI Citation — Step lists |
5.5 Signal 5 — Citable table / list
| Field | Detail |
|---|---|
| Audit question | Does each row read alone, with caption and self-explanatory column labels? |
| What good looks like | Discrete, captioned rows the engine can quote whole |
| Failure shape | Tables whose rows mean nothing without the surrounding paragraph |
| Rewrite route | Writing for AI Citation — Self-labeling tables |
Microsoft is explicit: “Clear headings, tables, and FAQ sections help surface key information and make content easier for AI systems to reference accurately” (Bing AI Performance).
5.6 Signal 6 — Heading-hierarchy discipline
| Field | Detail |
|---|---|
| Audit question | Clean H2 → H3 nesting, no skipped levels, no decorative headings? |
| What good looks like | Every H2/H3 names a real unit; flat outline reads like a table of contents |
| Failure shape | Skipped levels (H2 → H4), headings used for visual size, or duplicated H1s |
| Rewrite route | Writing for AI Citation — Heading discipline |
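Two of the failure shapes in this row, skipped levels and duplicated H1s, can be checked mechanically against a flat outline. The sketch below assumes the outline is already available as a list of `(level, text)` pairs; the `heading_issues` function and the sample outline are hypothetical. Headings used only for visual size are a judgment call and are not detected here.

```python
def heading_issues(outline):
    """Report duplicated H1s and skipped levels in a flat (level, text) outline."""
    issues = []

    h1_count = sum(1 for level, _ in outline if level == 1)
    if h1_count > 1:
        issues.append(f"duplicated H1 ({h1_count} found)")

    previous = None
    for level, text in outline:
        # Going deeper by more than one level at once skips a level;
        # moving back up by any amount is fine.
        if previous is not None and level > previous + 1:
            issues.append(f"skipped level: H{previous} -> H{level} at '{text}'")
        previous = level

    return issues


# Hypothetical outline with one skipped level and a second H1.
outline = [
    (1, "Guide"),
    (2, "Setup"),
    (4, "Flags"),
    (2, "Usage"),
    (1, "Guide again"),
]

for issue in heading_issues(outline):
    print(issue)
```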
5.7 Signal 7 — Liftable quotable sentence
| Field | Detail |
|---|---|
| Audit question | Is there a crisp standalone claim per H2 that survives extraction with attribution intact? |
| What good looks like | “Retrieval makes you a candidate; grounding decides if you are used.” |
| Failure shape | “It could perhaps be argued that, in some cases, retrieval may not always lead to use.” |
| Rewrite route | Writing for AI Citation — Quotable claims |
6. How the audit varies by surface — invariant vs delta
The seven signals are invariant — they win everywhere. What varies is which failure each surface penalizes hardest, which shapes the §2 engine choice.
| Surface | Most load-bearing signals | Why |
|---|---|---|
| Perplexity | 1, 5, 7 | Citation-dense by design; rewards tight liftable chunks and quotable claims the most |
| ChatGPT search | 2 | Live fetch; rewards a direct-answer block near the top of the fetched page |
| Google AI Overviews | 3, 6 | Index-based; rewards heading discipline and Q&A structure that match query fan-out |
A clean pass on one surface does not generalize. The audit is also not language-invariant in practice: chunk and answer-block citability shifts between Chinese and English (see Multilingual GEO).
7. Fake-fix anti-patterns — when a remediation trips a different filter
Patterns practitioners reach for after an audit finds a gap, that look like a remediation but trip a different AI-spam or trust filter. Concept-level over-application examples sit in Citability §6; these are the operational complement.
| Anti-pattern | Looks like the fix for | Why it actually fails |
|---|---|---|
| Over-chunking the whole page into one-sentence paragraphs | Signal 1 (self-contained) | Fragments lose meaning; nothing is a coherent liftable answer |
| Inventing FAQ entries no real user asks | Signal 3 (Q&A) | Recognized as boilerplate, down-weighted as low-effort content |
| Adding manufactured statistics to look “citable” | Signal 7 (quotable claim) | Unsourced numbers fail trust filtering — see E-E-A-T |
| Lifting boilerplate from another page of yours | Signal 1 (self-contained) | Near-duplicate detection — see AI Content Detection |
Google says the quiet part out loud in its May 2026 optimization guide: “There’s no requirement to break your content into tiny pieces for AI to better understand it. Google systems are able to understand the nuance of multiple topics on a page” (AI Optimization Guide). Over-chunking is the dominant fake fix because it superficially mimics Signal 1 while erasing the very property Signal 1 measures — coherent self-containment.
The position, stated plainly: citability is necessary, not sufficient. Structure without substance is detectable and penalized; trust gaps still defeat perfect chunking. Under competition, many such rewrites stop working — see C-SEO Bench.
8. Scoring & the report deliverable
The load-bearing output is per-signal pass / fail / partial, one row per chunk audited, with severity attached. Signals 1, 2, and 7 default to major when broken; 3, 4, and 6 default to minor unless the page is the wrong form entirely; 5 scales with how much of the page is tabular. The severity model is anchored the same way Full GEO Audit §5 anchors its findings — to which signal failed, not to a global rubric.
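The default severities can be written down as a small lookup so that every auditor starts from the same value before applying judgment. The sketch below covers only the broken case described above. Several details are assumptions and not part of the playbook: the `default_severity` name, the `tabular_share` input with its 0.5 threshold for Signal 5, and the choice to escalate Signals 3, 4, and 6 to major when the page is the wrong form. The playbook does not state when a finding becomes a blocker, so that level is left to the auditor.

```python
HIGH_LEVERAGE = {1, 2, 7}   # default to major when broken
CONDITIONAL = {3, 4, 6}     # default to minor unless the page is the wrong form


def default_severity(signal_n, tabular_share=0.0, wrong_form=False,
                     tabular_threshold=0.5):
    """Starting severity for a broken chunk; the auditor may override it.

    tabular_share: fraction of the page that is tabular (assumed input).
    tabular_threshold: assumed cut-off above which Signal 5 counts as major.
    wrong_form: True when the page is the wrong form entirely (assumed to
        escalate Signals 3, 4, and 6 to major).
    """
    if signal_n in HIGH_LEVERAGE:
        return "major"
    if signal_n in CONDITIONAL:
        return "major" if wrong_form else "minor"
    if signal_n == 5:
        return "major" if tabular_share >= tabular_threshold else "minor"
    raise ValueError(f"signal_n must be 1-7, got {signal_n}")


for n in range(1, 8):
    print(n, default_severity(n, tabular_share=0.7))
```

Keeping the thresholds as named parameters makes the assumptions visible in the report instead of burying them in a rubric.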
A 0–100 “citability score” is rejected on principle. Every score on the market ships without a published formula: Topify sells “a 0–100 grade of how AI-ready your website is” with no method disclosed; Citability.ai returns a “Combined Score: 62” aggregating three subscores with no weights published; Mangools’ AI Search Grader states the score is “weighted by market share” without naming the weights. A bare number whose method is opaque is a rumour, not a measurement — the same provenance discipline GEO Metrics enforces for every reported number applies in stronger form here. Per-signal verdicts are reproducible; a single score is not.
What the report ships, every time: a header (date, surface audited, sampling decision, engine tested), the per-chunk × per-signal matrix (✅/⚠️/❌ per cell), severity-ranked findings with the §4 failure shape inline, a one-line rewrite route per finding linking out to Writing for AI Citation, and — if a prior audit exists — a baseline delta. A re-audit re-runs the chunk-extraction test on the changed chunks only; signals you did not touch are trusted forward.
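Assembling the matrix and the ranked findings from logged rows is mostly bookkeeping. The sketch below shows one possible shape under assumptions: rows are plain dictionaries, the `chunk` label, the `build_matrix`, `ranked_findings`, and `render` helpers, and the `n/a` marker for cells that were not tested are all hypothetical choices, and the sample rows are invented for illustration.

```python
SYMBOL = {"liftable": "✅", "partial": "⚠️", "broken": "❌"}
SEVERITY_RANK = {"blocker": 0, "major": 1, "minor": 2}

# Hypothetical logged rows; liftable rows carry no severity.
rows = [
    {"chunk": "TL;DR", "signal_n": 2, "outcome": "liftable",
     "severity": None, "failure_shape": ""},
    {"chunk": "H2 first paragraph", "signal_n": 1, "outcome": "broken",
     "severity": "major", "failure_shape": "pronoun ref"},
    {"chunk": "Table row", "signal_n": 5, "outcome": "partial",
     "severity": "minor", "failure_shape": "no caption"},
]


def build_matrix(rows):
    """Map chunk -> {signal number -> outcome symbol}."""
    matrix = {}
    for r in rows:
        matrix.setdefault(r["chunk"], {})[r["signal_n"]] = SYMBOL[r["outcome"]]
    return matrix


def ranked_findings(rows):
    """Non-liftable rows, most severe first, then by signal number."""
    findings = [r for r in rows if r["outcome"] != "liftable"]
    return sorted(findings, key=lambda r: (SEVERITY_RANK[r["severity"]], r["signal_n"]))


def render(matrix):
    """Render the matrix as a pipe table; untested cells show n/a."""
    lines = [
        "| Chunk | " + " | ".join(f"S{n}" for n in range(1, 8)) + " |",
        "|---|" + "---|" * 7,
    ]
    for chunk, cells in matrix.items():
        row = " | ".join(cells.get(n, "n/a") for n in range(1, 8))
        lines.append(f"| {chunk} | {row} |")
    return "\n".join(lines)


matrix = build_matrix(rows)
print(render(matrix))
for f in ranked_findings(rows):
    print(f'{f["severity"]}: signal {f["signal_n"]} on {f["chunk"]} ({f["failure_shape"]})')
```

Sorting on severity first keeps the report aligned with the rule that findings are ranked by severity, not by signal number.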
9. Validity threats & pitfalls
Do not ship a report without clearing every item.
- Auditing the painted DOM, not the fetched HTML — client-side-rendered content hides signals from the crawler; test what a non-JS fetch sees (SSR for AI Crawlers).
- Sampling only the top of the page — Signal 1 fails most often in the middle of long pages; sample H2 first paragraphs across the whole document.
- Running the chunk-extraction test in a personalized session — logged-in or history-on sessions return non-reproducible answers.
- Treating a 6-of-7 pass as a page pass — a page can pass six signals and still fail to be cited if Signal 1 broke.
- Single-engine generalization — a clean Perplexity pass does not generalize to AI Overviews; declare the engine in the report header.
- Locale-mixing — see Multilingual GEO; do not generalize a Chinese-page audit to English or vice versa.
- Unverified citation reported as a pass — a passage the engine quoted may not actually support the claim it was attached to (Liu et al. 2023); reconcile via AI Citation Tracking §4.1.
10. Further reading
- Concept layer — Citability (definitions of every signal this playbook audits), E-E-A-T (the trust half of grounding)
- Adjacent playbooks — Full GEO Audit (this is its Layer-4 deep dive), Writing for AI Citation (rewrite recipes per signal), AI Citation Tracking (the outcome-side measurement loop)
- Per-engine surfaces — Perplexity, ChatGPT search
- Academic anchor — Aggarwal et al. 2024 — GEO: Generative Engine Optimization; follow-up bounded reading via C-SEO Bench
References
Academic:
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K. & Deshpande, A. (2024). GEO: Generative Engine Optimization. KDD ’24. arXiv:2311.09735 · ACM DL · paper summary
- Puerto, H., Gubri, M., Green, C., Oh, S. J. & Yun, S. (2025). C-SEO Bench: Does Conversational SEO Work? NeurIPS ’25 Datasets & Benchmarks. arXiv:2506.11097
- Liu, N. F., Zhang, T. & Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. Findings of EMNLP 2023. arXiv:2304.09848
Official platform documentation (verified 2026-05):
- Google Search Central — A new resource for optimizing for generative AI in Google Search · AI Optimization Guide · AI features and your website · Succeeding in AI search
- Microsoft Bing — Evolving role of the index: From ranking pages to supporting answers · AI Performance in Bing Webmaster Tools
- OpenAI — ChatGPT search Help Center
- Perplexity — What is an answer engine, and how does Perplexity work as one?
Sources
Primary
- GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
- GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
- A new resource for optimizing for generative AI in Google Search · Google Search Central · 2026-05-15
- AI Optimization Guide · Google Search Central · 2026-05-15
- Evolving role of the index: From ranking pages to supporting answers · Microsoft Bing · 2026-05-06
- Introducing AI Performance in Bing Webmaster Tools (Public Preview) · Microsoft Bing · 2026-02-10
- AI features and your website · Google Search Central · 2025-12-10
- Top ways to ensure your content performs well in Google's AI experiences on Search · Google Search Central · 2025-05-01
- ChatGPT search — OpenAI Help Center · OpenAI
- What is an answer engine, and how does Perplexity work as one? · Perplexity AI
Secondary
- C-SEO Bench: Does Conversational SEO Work? (Puerto et al., NeurIPS '25 D&B) · arXiv / NeurIPS '25 D&B
- Evaluating Verifiability in Generative Search Engines (Liu et al., EMNLP '23 Findings) · arXiv / EMNLP '23 Findings