AI Citation Tracking

Quick facts

Difficulty: Intermediate
Time: Half-day setup, then ~30 min/week
Prerequisites: GEO Metrics, Citation vs Mention vs Link
What this is: Operational playbook — how to collect the numbers, not what they mean
Two modes: Manual (ground truth) first → automated (scale) second
Operational metric core: 4 — Citation Rate, Citation Share, Average Position, Source Diversity
Definitions source: Every metric is defined in GEO Metrics; this playbook collects them

1. What AI citation tracking is

A repeatable measurement instrument: a frozen prompt set, queried against a declared engine set on a fixed cadence, with each answer sampled, verified, and logged to a single canonical schema. The output is a time series of how often, how prominently, and on which engines you appear.

This is the operational companion to GEO Metrics — every metric name in this playbook links to its formula there. Two pages, one place per definition, no drift.

A scope clarification, because the question comes up: this is citation tracking. Monitoring unlinked mentions as an operation — detecting them in answer text, de-duping, normalizing across engines — is a harder, separate problem handled by Brand Mention Tracking. A mention here is logged as an appearance value (§4) for completeness.

Concretely, the instrument is:

frozen prompt set → chosen engines → sampled answers → verified, normalized log → delta report

There are two delivery modes, in this order: manual (ground truth) then automated (scale). The order is not negotiable — §4 explains why.

Why this matters: “are we being cited?” is the most-asked GEO KPI question, and answering it badly — a biased prompt set, mixed metric definitions — is worse than not answering it, because a confident wrong number gets acted on. The right mental model is the one the academic origin of GEO set out: measure prominence continuously, not citation as a yes/no. See Aggarwal et al. 2024 for the position-adjusted “impression” idea this playbook operationalizes, and Generative Engine Optimization for why ranked-link metrics no longer apply.

2. Decide before you measure

Four decisions fix the meaning of every later number. Get one wrong and the rest is uninterpretable.

Decision	Options	Rule of thumb
Which metric(s)	The operational core: Citation Rate, Citation Share, Average Position, Source Diversity	Start with Citation Rate + Source Diversity; they need no competitor set
Mention or citation	Count citations only, mentions only, or both (tagged separately)	They are different constructs — never sum them. See Citation vs Mention, Brand Mentions
Which engines	Your audience’s actual engines, not all of them	The engine set is itself a reported variable — declare it
Time window + cadence	e.g. weekly sample, 7-day window	Answers refresh fast; the window is part of the metric, not a footnote

The operational core is four metrics — Citation Rate, Citation Share, Average Position (definition A, citation order), Source Diversity — each defined in GEO Metrics (§3.2, §3.3, §3.5, §3.10 respectively). For any other metric in this playbook, look up its formula in the same place.

3. Step 1 — Build the prompt set (your measurement instrument)

The prompt set is the single biggest source of bias in AI citation tracking (it is pitfall #3 in GEO Metrics §7). The prompt set is the experiment — treat it with the rigour of a survey instrument.

Rules:

Derive prompts from real user intents, not your own keyword list. What would a buyer actually type into ChatGPT?
Size: start with 30–50 prompts. Below ~30 the sampling noise swamps the signal; you can grow later.
Balance categories (commercial / informational / comparison) so one intent class does not dominate the average.
Freeze and version the set. Changing prompts silently makes a time series meaningless. Add prompts in a new version tag; never edit a live one in place.
Store it under version control — the prompt set is data, not config.

Anti-pattern to name explicitly: hand-picking the prompts you already win. A set you already win starts near the ceiling, so the number can only move down, and the strategy ends up optimized for a flattering sample.
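One way to make "freeze and version" checkable is to store a fingerprint of the prompt set next to each version tag and compare it before every run. The sketch below is an assumption about how you might do that, not part of the playbook's tooling; `prompt_set_fingerprint` is a hypothetical helper that hashes only the `id` and `query` columns.

```python
import hashlib


def prompt_set_fingerprint(rows):
    """Return a short, order-independent hash of the (id, query) pairs.

    `rows` is any iterable of dicts with "id" and "query" keys, for example
    the rows produced by csv.DictReader over prompts.csv.
    """
    canonical = sorted(f'{r["id"]}\t{r["query"].strip()}' for r in rows)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]  # shortened for readability in reports (an arbitrary choice)
```

If the fingerprint recorded for a version tag differs from the one computed at run time, a prompt was edited in place and the run should be logged under a new version instead.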

Deliverable — a versioned prompts.csv:

id,query,intent,category,locale,prompt_set_v,added_date,retired_date
q001,"best crm for startups",commercial,software,en,v3,2026-05-19,
q002,"how does retrieval augmented generation work",informational,technical,en,v3,2026-05-19,
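A small pre-flight check can catch the most common problems with this file before a run. The following is a minimal sketch that assumes the column layout shown above and treats a non-empty `retired_date` as "inactive"; the function names and the `min_size` default (taken from the 30-prompt floor in the rules) are illustrative.

```python
import csv
import io
from collections import Counter

REQUIRED = ["id", "query", "intent", "category", "locale",
            "prompt_set_v", "added_date", "retired_date"]


def check_prompt_set(csv_text, min_size=30):
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    rows = [r for r in reader if not r["retired_date"]]  # active prompts only
    problems = []
    dupes = [i for i, n in Counter(r["id"] for r in rows).items() if n > 1]
    if dupes:
        problems.append(f"duplicate ids: {', '.join(sorted(dupes))}")
    if len(rows) < min_size:
        problems.append(f"only {len(rows)} active prompts (< {min_size})")
    return problems


def intent_counts(csv_text):
    """Count active prompts per intent, to eyeball category balance."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return dict(Counter(r["intent"] for r in reader if not r["retired_date"]))
```

`intent_counts` only reports the mix; what counts as "balanced" is left to the reader, since the playbook does not fix a ratio.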

4. Step 2 — The manual method (start here, always)

Run this by hand before you automate anything. Manual sampling is the ground truth: it builds judgement about what “cited” actually looks like per engine, carries zero vendor lock-in, and is the only way to validate a tool you later buy.

Procedure:

Schedule it — same prompts, same day-of-week, same time window. Cadence is part of the data.
One fresh session per prompt per engine — no logged-in history, no personalization. Personalized answers are not reproducible.
Record the raw answer and its sources verbatim. Raw capture is immutable; metrics are derived from it later, never the reverse.
Tag your domain’s appearance as cited, mentioned, or absent (per the Citation vs Mention test).
Capture citation rank — your position in the engine’s source list (this feeds Average Position, definition A).

Deliverable — one canonical citation-log row schema (the automated path in §5 converges on the same schema):

run_date           date the sample was taken (UTC)
prompt_id          FK -> prompts.csv
prompt_set_v       version tag of the frozen prompt set
engine             perplexity | chatgpt | google-aio | ...
appearance         cited | mentioned | absent
citation_rank      integer position in the source list (null unless cited)
source_url         the URL the engine attributed (null unless cited)
citation_verified  true | false   (see §4.1)
snippet            the sentence/section the engine actually used
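The schema's "null unless cited" rules are easy to break when logging by hand, so it can help to encode them once. This is a sketch under stated assumptions: the class name `CitationLogRow` is hypothetical, `citation_rank` is taken to be 1-based, and `citation_verified` is assumed to default to false for rows that are not cited.

```python
from dataclasses import dataclass
from typing import Optional

APPEARANCES = {"cited", "mentioned", "absent"}


@dataclass(frozen=True)  # frozen: a logged row is not edited after capture
class CitationLogRow:
    run_date: str                      # ISO date, UTC
    prompt_id: str                     # FK -> prompts.csv
    prompt_set_v: str
    engine: str
    appearance: str                    # cited | mentioned | absent
    citation_rank: Optional[int] = None
    source_url: Optional[str] = None
    citation_verified: bool = False
    snippet: str = ""

    def __post_init__(self):
        if self.appearance not in APPEARANCES:
            raise ValueError(f"unknown appearance: {self.appearance!r}")
        cited = self.appearance == "cited"
        if cited and (self.citation_rank is None or self.source_url is None):
            raise ValueError("cited rows need citation_rank and source_url")
        if not cited and (self.citation_rank is not None or self.source_url is not None):
            raise ValueError("citation_rank/source_url must be null unless cited")
        if self.citation_rank is not None and self.citation_rank < 1:
            raise ValueError("citation_rank is assumed to be 1-based")
        if self.citation_verified and not cited:
            raise ValueError("only cited rows can be verified")
```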

4.1 Verify every citation (the step everyone skips)

An AI-attributed URL is a claim, not a fact. It may 404, or resolve to a page that does not support the sentence it was attached to. This failure mode is specific to AI tracking, and every citation metric downstream depends on catching it.

For each cited URL, record citation_verified = true only if both hold:

the URL resolves (not dead, not a redirect to an unrelated page), and
the page actually supports the claim the engine attached to it.

Report verified and unverified citations separately. A hallucinated or unsupported citation is a finding — see Liu et al. 2023, Evaluating Verifiability in Generative Search Engines for why this failure mode is common enough to measure deliberately.
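The two conditions can be sketched as a single function that also records why a citation failed. Everything here is illustrative: `fetch` and `supports_claim` are hypothetical callables you supply (an HTTP client and a human or model judgement), and "same host after redirects" is only a rough proxy for "not redirected to an unrelated page".

```python
from urllib.parse import urlparse


def same_site(url_a, url_b):
    """True if both URLs share a host, ignoring a leading 'www.'."""
    def host(u):
        return (urlparse(u).hostname or "").removeprefix("www.")
    return host(url_a) != "" and host(url_a) == host(url_b)


def verify_citation(url, claim, fetch, supports_claim):
    """Apply both verification conditions and return (verified, reason).

    fetch(url) -> (status_code, final_url, page_text)   # hypothetical callable
    supports_claim(page_text, claim) -> bool             # hypothetical callable
    """
    status, final_url, text = fetch(url)
    if status != 200:
        return False, f"http {status}"
    if not same_site(url, final_url):
        return False, "redirected off-site"
    if not supports_claim(text, claim):
        return False, "page does not support claim"
    return True, "ok"
```

Keeping the reason string alongside the boolean makes the "report separately" step easier, because dead links and unsupported claims can be counted apart. A same-site redirect to an unrelated page would still pass the host check, so the support check remains the real gate.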

5. Step 3 — The automated method (scale, once manual works)

Automate only a manual process you have already validated. The build-vs-buy question is secondary to that rule.

The per-engine API reality (verified 2026-05). There is no uniform “give me my citations” API across engines — state capability per engine, never as one promise:

Engine	Programmatic source list?	Mechanism	Order guarantee	Note
Perplexity (Sonar API)	Yes	`search_results[]` — `title`, `url`, `snippet`, `date`; legacy `citations[]` deprecated & removed	Array order; no documented relevance ranking	Cleanest path — see Perplexity AI
ChatGPT (OpenAI Responses API, `web_search`)	Yes	inline `url_citation` annotations plus a fuller `sources[]` list	Not documented as ranked	Citations ≠ full source list; reconcile both — see ChatGPT Search
Google AI Overviews	No official API	Blended into Search Console “Web”; no per-citation attribution	—	Third-party SERP scrapers only — see Google AI Overviews

Sources: Perplexity Chat Completions reference and changelog; OpenAI web search tool; Google Search Central — AI features.

The automation is a thin wrapper that converges on the §4 schema — engine-agnostic by design:

for engine in engines:
  for prompt in prompts:                                   # frozen set, version prompt_set_v
    answer, sources = engine.ask(prompt.query)             # fresh session, no history
    appearance = classify(answer, sources, my_domain)      # cited|mentioned|absent
    rank       = citation_rank(sources, my_domain)         # null if not cited
    source_url = sources[rank - 1].url if rank else None   # URL attributed to my_domain
    snippet    = used_snippet(answer, source_url)          # sentence the engine drew on
    verified   = (source_url is not None and url_resolves(source_url)
                  and supports_claim(source_url, snippet)) # §4.1
    log.append(run_date, prompt.id, prompt_set_v, engine.name,
               appearance, rank, source_url, verified, snippet)
# log now has the SAME schema (and column order) as the manual method
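The `classify` and `citation_rank` helpers in the loop are left undefined, so here is one possible shape for them. This is a naive sketch, not the playbook's implementation: it takes the source list as plain URL strings, treats any subdomain of `my_domain` as yours, and detects mentions with a simple whole-word match on names you pass in via the hypothetical `brand_names` parameter. Robust mention detection is the separate problem covered by Brand Mention Tracking.

```python
import re
from urllib.parse import urlparse


def _on_domain(url, domain):
    """True if url's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def citation_rank(source_urls, my_domain):
    """1-based position of the first source on my_domain, else None."""
    for i, url in enumerate(source_urls, start=1):
        if _on_domain(url, my_domain.lower()):
            return i
    return None


def classify(answer, source_urls, my_domain, brand_names=()):
    """Return 'cited', 'mentioned', or 'absent' for one answer."""
    if citation_rank(source_urls, my_domain) is not None:
        return "cited"
    names = [my_domain, *brand_names]
    pattern = "|".join(re.escape(n) for n in names)
    if re.search(rf"(?<!\w)(?:{pattern})(?!\w)", answer, flags=re.IGNORECASE):
        return "mentioned"
    return "absent"
```

Matching on the parsed hostname rather than on a substring keeps a domain such as `notexample.com` from being counted as `example.com`.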

Buy path. If you buy instead of build, the tool-selection question is really a metrics-definition question — see the vendor matrix in GEO Metrics §4 (Profound, Otterly, Ahrefs, BrightEdge, Similarweb and how each defines the KPI). Vendor selection is downstream of definition; rank tools by which definition you trust, not the other way around.

6. Step 4 — Normalize, store, and compute deltas

Manual and automated paths now share one schema, so they are comparable. Two storage rules:

Raw answers are immutable. Store the captured answer + source list as-is.
Metrics are derived, never hand-edited. If a number looks wrong, fix the query, not the cell.

Compute the operational core from the log — for the formulas, see GEO Metrics; below are the query shapes:

-- Citation Rate  (definition: GEO Metrics §3.2)
SELECT engine,
       COUNT(*) FILTER (WHERE appearance = 'cited') AS cited,
       COUNT(*)                                     AS answers
FROM citation_log WHERE prompt_set_v = 'v3' GROUP BY engine;

-- Average Position, definition A / citation order  (GEO Metrics §3.5)
SELECT engine, AVG(citation_rank) AS avg_position
FROM citation_log
WHERE appearance = 'cited' AND prompt_set_v = 'v3'
GROUP BY engine;

-- Source Diversity  (GEO Metrics §3.10)
SELECT COUNT(DISTINCT engine) AS engines_citing
FROM citation_log
WHERE appearance = 'cited' AND prompt_set_v = 'v3';
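For a manual log kept in a spreadsheet or CSV rather than a database, the same shapes can be computed directly. The sketch below assumes rows are dicts keyed by the schema's column names and that Citation Rate is the ratio of cited answers to all answers per engine; the authoritative formula is the one in GEO Metrics.

```python
from collections import defaultdict


def citation_rate_by_engine(rows):
    """Share of answers per engine whose appearance is 'cited'."""
    cited, total = defaultdict(int), defaultdict(int)
    for r in rows:
        total[r["engine"]] += 1
        cited[r["engine"]] += r["appearance"] == "cited"  # bool counts as 0/1
    return {engine: cited[engine] / total[engine] for engine in total}


def average_position_by_engine(rows):
    """Mean citation_rank over cited rows; engines with no citations are omitted."""
    ranks = defaultdict(list)
    for r in rows:
        if r["appearance"] == "cited":
            ranks[r["engine"]].append(r["citation_rank"])
    return {engine: sum(v) / len(v) for engine, v in ranks.items()}
```

Both functions expect rows already filtered to a single prompt-set version, so that samples taken with different instruments are never averaged together.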

Is this change real? A week-over-week delta is not automatically a result. Before you report movement, clear this checklist:

Was the prompt-set version identical across both samples?
Is the underlying citation count large enough? A jump on 5 citations is noise — the same small-sample caution GEO Metrics flags for First-Cite Rate.
Did an engine change behaviour between runs (model or retrieval update)?
Did citation_verified rates move? A “gain” in unverified citations is not a gain.
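The "large enough" question in the checklist can be made more concrete with a confidence interval on the Citation Rate. The playbook does not prescribe a method; the sketch below swaps in a Wilson score interval and a deliberately conservative rule (report movement only when the two intervals do not overlap). Both choices are assumptions for illustration.

```python
import math


def wilson_interval(cited, answers, z=1.96):
    """Approximate 95% Wilson score interval for cited / answers."""
    if answers == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = cited / answers
    denom = 1 + z * z / answers
    center = (p + z * z / (2 * answers)) / denom
    half = z * math.sqrt(p * (1 - p) / answers + z * z / (4 * answers ** 2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def clearly_moved(prev, curr):
    """prev and curr are (cited, answers); True only if the intervals are disjoint."""
    lo_a, hi_a = wilson_interval(*prev)
    lo_b, hi_b = wilson_interval(*curr)
    return hi_a < lo_b or hi_b < lo_a
```

With a 40-prompt set, going from 5 to 9 cited answers looks like a near-doubling, yet `clearly_moved((5, 40), (9, 40))` is false because the intervals overlap. That is the small-sample caution expressed as a check.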

7. Step 5 — Report it without lying to yourself

Every reported number ships with its provenance, or it is not comparable to anything:

Prompt-set version (e.g. v3)
Engine set (the exact engines, not “AI”)
Time window (e.g. 7-day, sampled weekly)
Mention or citation (which you counted)
Position definition (A / B / C — see GEO Metrics §3.5)

A “Citation Rate of 18%” with none of the above is a rumour, not a metric.
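One way to enforce this is to make the reporting code refuse to print a number that lacks its provenance. The following is a sketch only; the `Provenance` class and `format_metric` function are hypothetical names, with one field for each of the five items listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Provenance:
    prompt_set_v: str          # e.g. "v3"
    engines: tuple             # the exact engines sampled
    window: str                # e.g. "7-day, sampled weekly"
    counted: str               # "citations", "mentions", or "both (tagged)"
    position_definition: str   # "A", "B", or "C"


def format_metric(name, value, prov):
    """Render a rate with its provenance, or raise if any field is empty."""
    missing = [f.name for f in fields(prov) if not getattr(prov, f.name)]
    if missing:
        raise ValueError(f"refusing to report {name}: missing {', '.join(missing)}")
    return (f"{name} = {value:.1%} [prompts {prov.prompt_set_v}; "
            f"engines {', '.join(prov.engines)}; window {prov.window}; "
            f"counted {prov.counted}; position def {prov.position_definition}]")
```

Rendering the provenance inline, rather than in a footnote, keeps it attached when a number is copied out of the report.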

Tie movement to outcomes carefully, and do not over-claim: tracking proves visibility moved, not that revenue moved. The business bridge is a separate model — route it to GEO ROI Models.

Where this fits the bigger loop: this playbook produces a recurring snapshot that the GEO Audit consumes — tracking is the heartbeat, the audit is the periodic physical.

8. Validity threats & pitfalls

Do not ship a report without clearing every item. The linked source explains each one in depth:

Prompt-set bias — a flattering or unversioned set (§3; GEO Metrics §7)
Time-window bias — windows of different length are different metrics
Multilingual slicing — Chinese and English answer pools are not summable (GEO Metrics §7)
Mention / citation mixing — counting both as one inflates results (Citation vs Mention)
Position-definition mixing — A vs B vs C are incomparable (GEO Metrics §3.5)
Personalization leakage — logged-in / history-on sessions are not reproducible
Hallucinated or dead citations — unverified citations reported as wins (§4.1)
Over-extrapolating a single-actor lift: a gain you measured alone may not survive once competitors optimize against the same engine; see the C-SEO Bench caveat (Puerto et al. 2025), discussed in Aggarwal et al. 2024 §6

9. Further reading

Definition layer: GEO Metrics · Citation vs Mention · Brand Mentions
Business framing: GEO ROI Models
Adjacent operation: GEO Audit
Per-engine specifics: Perplexity AI · ChatGPT Search · Google AI Overviews
Academic anchor: Aggarwal et al. 2024 — GEO: Generative Engine Optimization

References

Academic:

Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD '24. arXiv:2311.09735 · ACM DL
Puerto, H. et al. (2025). C-SEO Bench: Does Conversational SEO Work? NeurIPS '25 D&B. arXiv:2506.11097
Liu, N., Zhang, T., Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. Findings of EMNLP '23. arXiv:2304.09848

API & platform documentation (verified 2026-05):

Perplexity — Chat Completions API Reference · Changelog
OpenAI — Web Search tool (Responses API)
Google Search Central — AI features and your site

Vendor KPI methodology (for the buy path, via GEO Metrics):

Otterly.ai — Brand Report KPI Definitions
Ahrefs — Brand Radar Methodology

Frequently asked questions

Should I start with manual tracking or an automated tool?

Manual, always. A manual sample is the ground truth that builds judgement, has zero vendor lock-in, and lets you validate any tool you later buy. Automate only a process you have already run by hand and trust. A dashboard you cannot reproduce by hand is a dashboard you cannot defend.

Can't I just pull citations from each engine's API?

Partially, and not uniformly. Perplexity's Sonar API returns a search_results array (the legacy citations field is deprecated and removed). OpenAI's Responses API web_search tool returns inline url_citation annotations plus a fuller sources list. Google AI Overviews has no official content API — it is blended into Search Console's 'Web' totals with no per-citation attribution, so practitioners rely on third-party SERP scrapers. Never assume one engine's behavior generalizes.

How many prompts do I need, and can I change them later?

Start with 30–50 prompts derived from real user intents, then freeze and version the set. The prompt set IS the experiment — changing it silently breaks your time series. Add prompts in a new version; never edit a live one in place.

An AI cited a URL that doesn't exist or doesn't support the claim. Do I count it?

Log it with citation_verified = false and report verified and unverified citations separately. A hallucinated or dead citation is a finding, not noise — it is exactly the kind of signal a binary 'were we cited?' metric hides.

Where do I find the formula for Visibility Score?

Every metric definition — Visibility Score included — sits in GEO Metrics. Keeping definitions in one place stops two pages drifting apart. This playbook covers how to collect whichever metric you chose; GEO Metrics covers what each one means and how each vendor computes it.

Sources

Primary

GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
Perplexity API — Chat Completions Reference · Perplexity
Perplexity API — Changelog (citations field deprecation) · Perplexity
OpenAI — Web Search tool (Responses API) · OpenAI
Google Search Central — AI features and your site · Google
Otterly.ai — Brand Report KPI Definitions · Otterly.ai
Ahrefs Brand Radar Methodology · Ahrefs

Secondary

C-SEO Bench: Does Conversational SEO Work? (Puerto et al. 2025) · arXiv / NeurIPS '25 D&B
Evaluating Verifiability in Generative Search Engines (Liu et al. 2023) · arXiv / EMNLP '23 Findings