AI Citation Tracking
Quick facts
- Difficulty: Intermediate
- Time: Half-day setup, then ~30 min/week
- Prerequisites: GEO Metrics, Citation vs Mention vs Link
- What this is: an operational playbook (how to collect the numbers, not what they mean)
- Two modes: Manual (ground truth) first → automated (scale) second
- Operational metric core: 4 (Citation Rate, Citation Share, Average Position, Source Diversity)
- Definitions source: Every metric is defined in GEO Metrics; this playbook collects them
1. What AI citation tracking is
A repeatable measurement instrument: a frozen prompt set, queried against a declared engine set on a fixed cadence, with each answer sampled, verified, and logged to a single canonical schema. The output is a time series of how often, how prominently, and on which engines you appear.
This is the operational companion to GEO Metrics — every metric name in this playbook links to its formula there. Two pages, one place per definition, no drift.
A scope clarification, because the question comes up: this is citation tracking. Monitoring unlinked mentions as an operation — detecting them in answer text, de-duping, normalizing across engines — is a harder, separate problem handled by Brand Mention Tracking. A mention here is logged as an appearance value (§4) for completeness.
Concretely, the instrument is:
frozen prompt set → chosen engines → sampled answers → verified, normalized log → delta report
There are two delivery modes, in this order: manual (ground truth) then automated (scale). The order is not negotiable — §4 explains why.
Why this matters: “are we being cited?” is the most-asked GEO KPI question, and answering it badly — a biased prompt set, mixed metric definitions — is worse than not answering it, because a confident wrong number gets acted on. The right mental model is the one the academic origin of GEO set out: measure prominence continuously, not citation as a yes/no. See Aggarwal et al. 2024 for the position-adjusted “impression” idea this playbook operationalizes, and Generative Engine Optimization for why ranked-link metrics no longer apply.
2. Decide before you measure
Four decisions fix the meaning of every later number. Get one wrong and the rest is uninterpretable.
| Decision | Options | Rule of thumb |
|---|---|---|
| Which metric(s) | The operational core: Citation Rate, Citation Share, Average Position, Source Diversity | Start with Citation Rate + Source Diversity; they need no competitor set |
| Mention or citation | Count citations only, mentions only, or both (tagged separately) | They are different constructs — never sum them. See Citation vs Mention, Brand Mentions |
| Which engines | Your audience’s actual engines, not all of them | The engine set is itself a reported variable — declare it |
| Time window + cadence | e.g. weekly sample, 7-day window | Answers refresh fast; the window is part of the metric, not a footnote |
The operational core is four metrics — Citation Rate, Citation Share, Average Position (definition A, citation order), Source Diversity — each defined in GEO Metrics (§3.2, §3.3, §3.5, §3.10 respectively). For any other metric in this playbook, look up its formula in the same place.
3. Step 1 — Build the prompt set (your measurement instrument)
The prompt set is the single biggest source of bias in AI citation tracking (it is pitfall #3 in GEO Metrics §7). The prompt set is the experiment — treat it with the rigour of a survey instrument.
Rules:
- Derive prompts from real user intents, not your own keyword list. What would a buyer actually type into ChatGPT?
- Size: start with 30–50 prompts. Below ~30 the sampling noise swamps the signal; you can grow later.
- Balance categories (commercial / informational / comparison) so one intent class does not dominate the average.
- Freeze and version the set. Changing prompts silently makes a time series meaningless. Add prompts in a new version tag; never edit a live one in place.
- Store it under version control — the prompt set is data, not config.
Anti-pattern to name explicitly: hand-picking the prompts you already win. That produces a number that only moves down, and a strategy optimized for a flattering sample.
Deliverable — a versioned prompts.csv:
id,query,intent,category,locale,prompt_set_v,added_date,retired_date
q001,"best crm for startups",commercial,software,en,v3,2026-05-19,
q002,"how does retrieval augmented generation work",informational,technical,en,v3,2026-05-19,
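The rules above can be checked mechanically before a version is frozen. The following is a minimal sketch, not part of the playbook itself: the function name `check_prompt_set`, the 30-prompt floor taken from the size rule, and the "no intent above half of the set" balance threshold are all assumptions chosen for illustration.

```python
import csv
import io
from collections import Counter

MIN_PROMPTS = 30  # assumption: the lower bound from the size rule above


def check_prompt_set(csv_text: str, min_prompts: int = MIN_PROMPTS) -> list[str]:
    """Return human-readable problems; an empty list means the set passes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Retired prompts stay in the file for history but are not in the live set.
    live = [r for r in rows if not (r.get("retired_date") or "").strip()]
    problems = []

    # Every prompt id must be unique, including retired ones.
    dupes = sorted(i for i, n in Counter(r["id"] for r in rows).items() if n > 1)
    if dupes:
        problems.append(f"duplicate ids: {dupes}")

    if len(live) < min_prompts:
        problems.append(f"only {len(live)} live prompts; want at least {min_prompts}")

    # Assumed balance rule: no single intent class may exceed half of the set.
    intents = Counter(r["intent"] for r in live)
    if live and max(intents.values()) > len(live) / 2:
        top = intents.most_common(1)[0][0]
        problems.append(f"intent '{top}' is more than half of the set")

    return problems


sample = '''id,query,intent,category,locale,prompt_set_v,added_date,retired_date
q001,"best crm for startups",commercial,software,en,v3,2026-05-19,
q002,"how does retrieval augmented generation work",informational,technical,en,v3,2026-05-19,
'''
print(check_prompt_set(sample))
```

Running a check like this in version control before tagging a new prompt-set version keeps the "freeze and version" rule enforceable rather than aspirational.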
4. Step 2 — The manual method (start here, always)
Run this by hand before you automate anything. Manual sampling is the ground truth: it builds judgement about what “cited” actually looks like per engine, carries zero vendor lock-in, and is the only way to validate a tool you later buy.
Procedure:
- Schedule it — same prompts, same day-of-week, same time window. Cadence is part of the data.
- One fresh session per prompt per engine — no logged-in history, no personalization. Personalized answers are not reproducible.
- Record the raw answer and its sources verbatim. Raw capture is immutable; metrics are derived from it later, never the reverse.
- Tag your domain’s appearance as cited, mentioned, or absent (per the Citation vs Mention test).
- Capture citation rank: your position in the engine’s source list (this feeds Average Position, definition A).
Deliverable — one canonical citation-log row schema (the automated path in §5 converges on the same schema):
run_date date the sample was taken (UTC)
prompt_id FK -> prompts.csv
prompt_set_v version tag of the frozen prompt set
engine perplexity | chatgpt | google-aio | ...
appearance cited | mentioned | absent
citation_rank integer position in the source list (null unless cited)
source_url the URL the engine attributed (null unless cited)
citation_verified true | false (see §4.1)
snippet the sentence/section the engine actually used
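One way to keep hand-entered and automated rows honest is to encode the schema as a type that rejects inconsistent rows. This is a sketch under stated assumptions: the class name `CitationLogRow` is hypothetical, `citation_rank` is assumed to be 1-based, and `citation_verified` is assumed to be null for rows that are not cited (the schema above only says true or false).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

APPEARANCES = {"cited", "mentioned", "absent"}


@dataclass(frozen=True)
class CitationLogRow:
    run_date: date                      # date the sample was taken (UTC)
    prompt_id: str                      # FK -> prompts.csv
    prompt_set_v: str                   # version tag of the frozen prompt set
    engine: str
    appearance: str                     # cited | mentioned | absent
    citation_rank: Optional[int] = None
    source_url: Optional[str] = None
    citation_verified: Optional[bool] = None  # assumption: None unless cited
    snippet: str = ""

    def __post_init__(self) -> None:
        if self.appearance not in APPEARANCES:
            raise ValueError(f"unknown appearance: {self.appearance!r}")
        cited = self.appearance == "cited"
        if cited and (self.citation_rank is None or self.source_url is None):
            raise ValueError("a cited row needs citation_rank and source_url")
        if not cited and (self.citation_rank is not None or self.source_url is not None):
            raise ValueError("citation_rank and source_url must be null unless cited")
        # Assumption: ranks count from 1 (first entry in the source list).
        if self.citation_rank is not None and self.citation_rank < 1:
            raise ValueError("citation_rank is a 1-based position")
```

Because the dataclass is frozen, a stored row cannot be edited in place, which matches the rule that raw capture is immutable.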
4.1 Verify every citation (the step everyone skips)
An AI-attributed URL is a claim, not a fact. It may 404, or resolve to a page that does not support the sentence it was attached to. This is unique to AI tracking and load-bearing.
For each cited URL, record citation_verified = true only if both hold:
- the URL resolves (not dead, not a redirect to an unrelated page), and
- the page actually supports the claim the engine attached to it.
Report verified and unverified citations separately. A hallucinated or unsupported citation is a finding — see Liu et al. 2023, Evaluating Verifiability in Generative Search Engines for why this failure mode is common enough to measure deliberately.
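The two-condition rule can be written down so that it is applied the same way every week. The sketch below is illustrative only: `citation_verified` and `split_by_verification` are hypothetical helpers, the HTTP status and final URL are assumed to have been recorded by whoever fetched the page, `supports_claim` is a reviewer's judgement passed in as a boolean, and "same host" is used as a crude proxy for "not redirected to an unrelated page".

```python
from typing import Optional
from urllib.parse import urlsplit


def _site(url: str) -> str:
    """Lower-cased hostname without a leading 'www.' (empty if the URL has none)."""
    return (urlsplit(url).hostname or "").lower().removeprefix("www.")


def citation_verified(cited_url: str, final_url: Optional[str],
                      status: Optional[int], supports_claim: bool) -> bool:
    """True only if the URL resolves to the same site AND a reviewer confirmed support."""
    resolves = status is not None and 200 <= status < 300 and final_url is not None
    # Assumption: landing on a different host counts as an unrelated redirect.
    same_site = resolves and _site(final_url) == _site(cited_url)
    return bool(same_site and supports_claim)


def split_by_verification(rows: list[dict]) -> dict[str, int]:
    """Count cited rows by verification status so the two are reported separately."""
    cited = [r for r in rows if r["appearance"] == "cited"]
    verified = sum(1 for r in cited if r["citation_verified"])
    return {"verified": verified, "unverified": len(cited) - verified}
```

The host comparison will miss same-site redirects to an unrelated page (for example, a dead article redirected to the homepage), so the human check on whether the page supports the claim remains the deciding condition.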
5. Step 3 — The automated method (scale, once manual works)
Automate only a manual process you have already validated. The build-vs-buy question is secondary to that rule.
The per-engine API reality (verified 2026-05). There is no uniform “give me my citations” API across engines — state capability per engine, never as one promise:
| Engine | Programmatic source list? | Mechanism | Order guarantee | Note |
|---|---|---|---|---|
| Perplexity (Sonar API) | Yes | search_results[] — title, url, snippet, date; legacy citations[] deprecated & removed | Array order; no documented relevance ranking | Cleanest path — see Perplexity AI |
| ChatGPT (OpenAI Responses API, web_search) | Yes | inline url_citation annotations plus a fuller sources[] list | Not documented as ranked | Citations ≠ full source list; reconcile both — see ChatGPT Search |
| Google AI Overviews | No official API | Blended into Search Console “Web”; no per-citation attribution | — | Third-party SERP scrapers only — see Google AI Overviews |
Sources: Perplexity Chat Completions reference and changelog; OpenAI web search tool; Google Search Central — AI features.
The automation is a thin wrapper that converges on the §4 schema — engine-agnostic by design:
for engine in engines:
    for prompt in prompt_set:                               # frozen, versioned (prompt_set_v)
        answer, sources = engine.ask(prompt.query)          # fresh session, no history
        appearance = classify(answer, sources, my_domain)   # cited | mentioned | absent
        rank = citation_rank(sources, my_domain)            # null if not cited; 1-based otherwise
        source_url = sources[rank - 1].url if rank else None
        verified = (source_url is not None
                    and url_resolves(source_url)
                    and supports_claim(source_url, answer)) # §4.1
        snippet = used_passage(answer, source_url)          # hypothetical helper: text the engine used
        log.append(run_date, prompt.id, prompt_set_v, engine,
                   appearance, rank, source_url, verified, snippet)
# log now has the SAME schema as the manual method
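The two helpers the loop relies on, `classify` and `citation_rank`, could look roughly like the following. This is a sketch under assumptions, not the playbook's implementation: sources are taken to be a plain list of URL strings in the engine's order, ranks are 1-based, subdomains count as your domain, and the extra `brand_names` parameter with its substring match is a deliberately crude stand-in for mention detection, which the playbook treats as a separate and harder problem.

```python
from typing import Optional
from urllib.parse import urlsplit


def on_domain(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def citation_rank(sources: list[str], domain: str) -> Optional[int]:
    """1-based position of the first source on the domain, or None if not cited."""
    for position, url in enumerate(sources, start=1):
        if on_domain(url, domain):
            return position
    return None


def classify(answer: str, sources: list[str], domain: str,
             brand_names: tuple[str, ...] = ()) -> str:
    """Return 'cited', 'mentioned', or 'absent' for one sampled answer."""
    if citation_rank(sources, domain) is not None:
        return "cited"
    text = answer.lower()
    # Assumption: a case-insensitive substring match is good enough for a first pass.
    if any(name.lower() in text for name in brand_names):
        return "mentioned"
    return "absent"
```

Matching on the parsed hostname rather than on the raw URL string avoids counting a competitor page whose path or query string merely contains your domain.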
Buy path. If you buy instead of build, the tool-selection question is really a metrics-definition question — see the vendor matrix in GEO Metrics §4 (Profound, Otterly, Ahrefs, BrightEdge, Similarweb and how each defines the KPI). Vendor selection is downstream of definition; rank tools by which definition you trust, not the other way around.
6. Step 4 — Normalize, store, and compute deltas
Manual and automated paths now share one schema, so they are comparable. Two storage rules:
- Raw answers are immutable. Store the captured answer + source list as-is.
- Metrics are derived, never hand-edited. If a number looks wrong, fix the query, not the cell.
Compute the operational core from the log — for the formulas, see GEO Metrics; below are the query shapes:
-- Citation Rate (definition: GEO Metrics §3.2)
SELECT engine,
COUNT(*) FILTER (WHERE appearance = 'cited') AS cited,
COUNT(*) AS answers
FROM citation_log WHERE prompt_set_v = 'v3' GROUP BY engine;
-- Average Position, definition A / citation order (GEO Metrics §3.5)
SELECT engine, AVG(citation_rank)
FROM citation_log WHERE appearance = 'cited' GROUP BY engine;
-- Source Diversity (GEO Metrics §3.10)
SELECT COUNT(DISTINCT engine)
FROM citation_log WHERE appearance = 'cited';
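For teams that keep the log as CSV or in memory rather than in a database, the first two query shapes translate directly. The sketch below assumes rows are dictionaries keyed by the schema's column names and, as the first query's columns suggest, that Citation Rate is cited answers divided by all answers; the authoritative formula remains the one in GEO Metrics. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def citation_rate_by_engine(rows: list[dict]) -> dict[str, float]:
    """Share of sampled answers per engine in which the domain was cited."""
    cited: dict[str, int] = defaultdict(int)
    answers: dict[str, int] = defaultdict(int)
    for row in rows:
        answers[row["engine"]] += 1
        if row["appearance"] == "cited":
            cited[row["engine"]] += 1
    return {engine: cited[engine] / answers[engine] for engine in answers}


def average_position_by_engine(rows: list[dict]) -> dict[str, float]:
    """Mean citation_rank per engine, over cited rows only (definition A)."""
    ranks: dict[str, list[int]] = defaultdict(list)
    for row in rows:
        if row["appearance"] == "cited":
            ranks[row["engine"]].append(row["citation_rank"])
    return {engine: mean(values) for engine, values in ranks.items()}


log = [
    {"engine": "perplexity", "appearance": "cited", "citation_rank": 2},
    {"engine": "perplexity", "appearance": "cited", "citation_rank": 4},
    {"engine": "perplexity", "appearance": "absent", "citation_rank": None},
    {"engine": "chatgpt", "appearance": "mentioned", "citation_rank": None},
    {"engine": "chatgpt", "appearance": "cited", "citation_rank": 1},
]
print(citation_rate_by_engine(log))
print(average_position_by_engine(log))
```

An engine with no cited rows is simply missing from the average-position result, which is safer than reporting a position of zero.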
Is this change real? A week-over-week delta is not automatically a result. Before you report movement, clear this checklist:
- Was the prompt-set version identical across both samples?
- Is the underlying citation count large enough? A jump on 5 citations is noise — the same small-sample caution GEO Metrics flags for First-Cite Rate.
- Did an engine change behaviour between runs (model or retrieval update)?
- Did citation_verified rates move? A “gain” in unverified citations is not a gain.
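The "large enough" item on the checklist can be made concrete with a confidence interval on each week's rate. The playbook does not prescribe a statistical test; the Wilson score interval below is a swapped-in, standard technique offered as one reasonable sketch, and the overlap check is a conservative heuristic rather than a formal significance test. The function names and the 95% level (z = 1.96) are assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical numbers: 5 of 40 answers cited last week, 9 of 40 this week.
last_week = wilson_interval(5, 40)
this_week = wilson_interval(9, 40)
if intervals_overlap(last_week, this_week):
    print("Intervals overlap: treat the change as noise until it repeats.")
else:
    print("Intervals are disjoint: the change is worth investigating.")
```

With a 30 to 50 prompt set the intervals are wide, so a movement usually needs to persist across several runs before it deserves a line in the report.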
7. Step 5 — Report it without lying to yourself
Every reported number ships with its provenance, or it is not comparable to anything:
- Prompt-set version (e.g. v3)
- Engine set (the exact engines, not “AI”)
- Time window (e.g. 7-day, sampled weekly)
- Mention or citation (which you counted)
- Position definition (A / B / C — see GEO Metrics §3.5)
A “Citation Rate of 18%” with none of the above is a rumour, not a metric.
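One way to make the provenance rule hard to skip is to refuse to format a number without it. The sketch below is illustrative: the `Provenance` class, its field names, and the `label` helper are hypothetical, and the output format is only one possible layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    prompt_set_v: str            # e.g. "v3"
    engines: tuple[str, ...]     # the exact engines sampled
    window: str                  # e.g. "7-day, sampled weekly"
    counted: str                 # "citations", "mentions", or "both (tagged separately)"
    position_definition: str     # "A", "B", or "C"

    def __post_init__(self) -> None:
        # Every field is mandatory; empty strings and empty tuples are rejected.
        missing = [name for name, value in vars(self).items() if not value]
        if missing:
            raise ValueError(f"missing provenance: {missing}")


def label(metric: str, value: float, prov: Provenance) -> str:
    """Render a rate together with everything needed to compare it."""
    return (f"{metric} {value:.0%} "
            f"[prompts {prov.prompt_set_v}; engines {', '.join(prov.engines)}; "
            f"window {prov.window}; counted {prov.counted}; "
            f"position def {prov.position_definition}]")


prov = Provenance("v3", ("perplexity", "chatgpt"), "7-day, sampled weekly", "citations", "A")
print(label("Citation Rate", 0.18, prov))
```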
Tie movement to outcomes carefully, and do not over-claim: tracking proves visibility moved, not that revenue moved. The business bridge is a separate model — route it to GEO ROI Models.
Where this fits the bigger loop: this playbook produces a recurring snapshot that the GEO Audit consumes — tracking is the heartbeat, the audit is the periodic physical.
8. Validity threats & pitfalls
Do not ship a report without clearing every item. The linked source explains each one in depth:
- Prompt-set bias — a flattering or unversioned set (§3; GEO Metrics §7)
- Time-window bias — windows of different length are different metrics
- Multilingual slicing — Chinese and English answer pools are not summable (GEO Metrics §7)
- Mention / citation mixing — counting both as one inflates results (Citation vs Mention)
- Position-definition mixing — A vs B vs C are incomparable (GEO Metrics §3.5)
- Personalization leakage — logged-in / history-on sessions are not reproducible
- Hallucinated or dead citations — unverified citations reported as wins (§4.1)
- Over-extrapolating a single-actor lift: a gain you measured alone may not survive once competitors optimize against the same engine; see C-SEO Bench (Puerto et al. 2025) and the related caveat in Aggarwal et al. 2024 §6
9. Further reading
- Definition layer: GEO Metrics · Citation vs Mention · Brand Mentions
- Business framing: GEO ROI Models
- Adjacent operation: GEO Audit
- Per-engine specifics: Perplexity AI · ChatGPT Search · Google AI Overviews
- Academic anchor: Aggarwal et al. 2024 — GEO: Generative Engine Optimization
References
Academic:
- Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD '24. arXiv:2311.09735 · ACM DL
- Puerto, H. et al. (2025). C-SEO Bench: Does Conversational SEO Work? NeurIPS '25 D&B. arXiv:2506.11097
- Liu, N., Zhang, T., Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. Findings of EMNLP '23. arXiv:2304.09848
API & platform documentation (verified 2026-05):
- Perplexity — Chat Completions API Reference · Changelog
- OpenAI — Web Search tool (Responses API)
- Google Search Central — AI features and your site
Vendor KPI methodology (for the buy path, via GEO Metrics):
- Otterly.ai — Brand Report KPI Definitions
- Ahrefs — Brand Radar Methodology
Sources
Primary
- GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
- GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
- Perplexity API — Chat Completions Reference · Perplexity
- Perplexity API — Changelog (citations field deprecation) · Perplexity
- OpenAI — Web Search tool (Responses API) · OpenAI
- Google Search Central — AI features and your site · Google
- Otterly.ai — Brand Report KPI Definitions · Otterly.ai
- Ahrefs Brand Radar Methodology · Ahrefs
Secondary
- C-SEO Bench: Does Conversational SEO Work? (Puerto et al. 2025) · arXiv / NeurIPS '25 D&B
- Evaluating Verifiability in Generative Search Engines (Liu et al. 2023) · arXiv / EMNLP '23 Findings