Brand Mention Tracking
Quick facts
- Difficulty: Intermediate
- Time: Half-day detector build, then ~30 min/week
- Prerequisites: GEO Metrics, Brand Mentions, Citation vs Mention vs Link
- What this is: Operational playbook on how to count unlinked brand mentions in AI answer prose, not why they matter
- Companion playbook: Citation tracking owns URL fields the engine emits; mention tracking owns the detector for in-prose names
- Mention metric core: 4 (Mention Frequency, Share of Voice, Answer Inclusion Rate, Brand Sentiment)
- Definitions source: Every metric is defined in GEO Metrics; this playbook collects them
1. What brand mention tracking is
A repeatable measurement instrument: a frozen prompt set, queried against a declared engine set on a fixed cadence, with each answer sampled, scanned for brand names in the prose (yours and your competitors’), verified, and logged to a single canonical schema. The output is a time series of how often, where in the answer, and with what sentiment your brand is named.
The companion playbook is AI Citation Tracking: same measurement spine, different detection problem. Citations are URLs the engine emits as fields — Perplexity’s search_results[], OpenAI’s url_citation annotations, Gemini’s groundingChunks — so the detector is trivial and the work is reconciliation. Mentions live in generated prose with no field anywhere across major engines; the detector is what you build, and it is where almost all of the error lives. The mechanism for why an unlinked mention is worth measuring is the subject of Brand Mentions; the three-outcome attribution taxonomy (citation vs mention vs link) is Citation vs Mention; the metric formulas this playbook collects sit in GEO Metrics.
Three specifics, not editorializing, make mention tracking the harder of the two siblings:
- No mentions[] field on any engine API. Confirmed for Perplexity (Chat Completions reference), OpenAI (Responses API web search), Google AI Overviews (Search Central: AI features), and Gemini (Grounding with Google Search). Each surface tags source URLs and (sometimes) credit spans for citations; none labels brand entities in the generated prose.
- The unit of “one mention” is a choice, not a given (§4.3). “Acme is fast. Acme integrates with X. Acme costs less.” is one mention, three mentions, or one-per-occurrence depending on the unit you declare: three different numbers from the same answer.
- Brand-name collisions (“Apple” the company vs “apple” the fruit; “Meta” the platform vs “meta” the modifier; competitor sub-brands shared with yours) make precision the harder metric, not recall. The inverse of citation tracking, where the engine can never invent a URL.
2. Decide before you measure
Five decisions fix the meaning of every later number. Get one wrong and the rest is uninterpretable.
| Decision | Options | Rule of thumb |
|---|---|---|
| Which metric(s) | Mention Frequency, Share of Voice, Answer Inclusion Rate, Brand Sentiment | Start with Mention Frequency + AIR (no competitor set required); SOV needs the competitor-set decision below |
| Unit of “one mention” | Sentence-level / answer-level / phrase-level | Default sentence-level; declare it once, then never mix units across runs |
| Competitor set | Closed (predefined N) / open (everyone named) | Closed for SOV stability; open for landscape sweeps — report mode in every header |
| Engine set | Your audience’s engines, not all of them | The engine set is itself a reported variable — declare it |
| Time window + cadence | e.g. weekly sample, 7-day window | Answers refresh fast; the window is part of the metric, not a footnote |
The four mention metrics live in GEO Metrics — §3.4 (Share of Voice), §3.6 (Mention Frequency), §3.7 (Answer Inclusion Rate), §3.9 (Brand Sentiment). For any other metric in this playbook, look up its formula in the same place.
3. Step 1 — Build the prompt set + competitor set
The prompt-set rules are the same as the citation-tracking sibling — 30–50 prompts derived from real user intents, balanced across categories, frozen and versioned, stored under version control as data, not config. See AI Citation Tracking §3 for the discipline in full; do not re-derive it per playbook.
The new instrument for mention tracking is a competitor set — the dimension citation tracking does not strictly need. Two modes:
- Closed — your N declared competitors. Stable SOV denominator, runs are comparable.
- Open — every brand mentioned in any answer. Better landscape awareness, but the denominator drifts run-over-run, so SOV is not comparable without explicit re-baselining.
Why this matters: a Share of Voice number without a declared competitor set is a metric without a definition — and is exactly why vendor SOV figures disagree (Otterly publishes its formula, Ahrefs weights by impressions, Profound and BrightEdge do not disclose; the inventory is in GEO Metrics §3.4). Declare both the competitor-set mode and the member list, and version both.
Deliverable — a brands.yaml companion to prompts.csv:
# brands.yaml (versioned alongside prompts.csv)
my_brand:
  canonical: "Acme"
  aliases: ["Acme Inc.", "Acme Corp", "Acme.ai"]
  negative: ["acme"]                            # dictionary-word collision
  disambiguation: ["SaaS", "CRM", "acme.com"]   # required-context terms
  parent: null
  brands_set_v: v1
  added_date: 2026-05-20
competitor_set:
  mode: closed
  members: [my_brand, competitor_a, competitor_b, ...]
  competitor_set_v: v1
4. Step 2 — The manual method (start here, always)
Run this by hand before you automate anything. Manual sampling is the ground truth that builds judgement about what a “mention” actually looks like per engine, validates a tool you later buy, and is the only way to debug a detector that confidently miscounts. The rule from AI Citation Tracking §4 applies unchanged; the new manual steps unique to mentions are below.
4.1 Detection rules — alias, case, and boundary
The canonical name plus alias list lives in brands.yaml. Four boundary rules to declare and freeze:
- Case-sensitivity — default case-insensitive matching, but record the original casing in the log row (useful for sentiment context, “ACME LAUNCHED” reads differently from “acme launched”).
- Possessives and plurals — include “Acme’s” and “Acmes” by default; exclude only via an explicit negative-list entry.
- Hyphenation and spacing variants (“Acme AI”, “Acme.AI”, “Acme-AI”): accept all unless brands.yaml says otherwise.
- URL / email collisions in prose: exclude. The answer’s sources tray is the citation surface, not the mention surface; a URL containing the brand string is a citation event, not a prose mention.
Rules are data, not code — store the boundary policy alongside brands.yaml and version it.
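The four boundary rules can be sketched as one compiled pattern derived from the alias list. This is a minimal illustration, not the playbook's reference detector: `build_alias_pattern` and `find_mentions` are hypothetical names, the URL/email mask is deliberately simple (bare domains such as `acme.com` without a scheme are not handled), the separator between alias tokens is optional so a run-together form like "AcmeAI" also matches, and the negative list is assumed to be applied in a separate step.

```python
import re

# Hypothetical helpers; the policy mirrors the four boundary rules above.
URL_OR_EMAIL = re.compile(r"(?:https?://|www\.)\S+|\S+@\S+")

def build_alias_pattern(aliases):
    """Compile one case-insensitive pattern covering every alias variant."""
    parts = []
    for alias in sorted(aliases, key=len, reverse=True):  # longest alias first
        tokens = re.split(r"[\s.\-]+", alias.strip("."))
        # Allow space, dot, or hyphen between tokens: "Acme AI", "Acme.AI", "Acme-AI".
        parts.append(r"[\s.\-]?".join(re.escape(t) for t in tokens))
    body = "|".join(parts)
    # Optional possessive or plural suffix; lookarounds act as word boundaries.
    return re.compile(rf"(?<![\w@/.])(?:{body})(?:['’]s|s)?(?!\w)", re.IGNORECASE)

def find_mentions(sentence, aliases):
    """Return matched surface strings, preserving the original casing."""
    # Blank out URLs and emails so a brand string inside them is not counted.
    masked = URL_OR_EMAIL.sub(lambda m: " " * len(m.group()), sentence)
    pattern = build_alias_pattern(aliases)
    return [sentence[m.start():m.end()] for m in pattern.finditer(masked)]

hits = find_mentions(
    "Acme's API beats ACME-AI, see https://acme.com/docs or mail hi@acme.com.",
    ["Acme", "Acme AI"],
)
print(hits)
```

Because the pattern is built from the alias list at run time, changing `brands.yaml` changes the matcher without touching code, which keeps the rules versionable as data.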
4.2 Disambiguation — the precision step
The failure mode unique to mention tracking: collision with dictionary words (“Apple”, “Amazon” the river, “Meta” the modifier), with people’s names, and with competitor sub-brands. False positives silently inflate Mention Frequency and corrupt SOV — a detector that runs hot but uncalibrated is the most common silent failure of a bought-in tool whose internals you cannot audit.
Rule: when the alias is ambiguous, require local-context disambiguation — a co-occurring product, domain, or topic phrase in the same sentence or the previous sentence. Promote the rule to the brands.yaml disambiguation: field rather than hard-coding it. This is also where the LLM-judge upgrade in §5.2 stage 3 earns its keep.
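The local-context rule can be sketched as a co-occurrence check over a two-sentence window. This is an assumption-laden illustration: `disambiguate` here takes a sentence list and an index (a hypothetical signature), and it uses plain substring matching, so a short term such as "CRM" could also match inside a longer word.

```python
def disambiguate(sentences, idx, context_terms, ambiguous=True):
    """Accept an ambiguous hit in sentences[idx] only if a context term
    co-occurs in the same sentence or the previous one."""
    if not ambiguous:
        return True  # unambiguous aliases pass without a context check
    window = sentences[max(0, idx - 1): idx + 1]
    text = " ".join(window).lower()
    return any(term.lower() in text for term in context_terms)

terms = ["SaaS", "CRM", "acme.com"]  # the disambiguation: field from brands.yaml
answer = ["Looking for a CRM?", "Acme is a solid pick.", "It ships weekly."]
accepted = disambiguate(answer, 1, terms)
rejected = disambiguate(["This dress is the acme of style."], 0, terms)
```

The first hit passes because "CRM" appears in the previous sentence; the second has no confirming context and would be dropped or routed to the stage 3 judge.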
4.3 The dedup unit — declare it once, then live with it
The single most consequential numeric decision in the playbook. Three options, three different numbers from the same answer:
- Sentence-level — count one mention per sentence containing a hit. Maps cleanly to how an answer “spends words on you.” Default.
- Answer-level — at most one mention per answer per brand. Closest to vendor “mentioned in answer? yes/no” framing — Otterly’s headline Brand Mentions KPI is in this family (see Otterly Brand Report KPI Definitions). Effectively the Answer Inclusion Rate view.
- Phrase-level — every occurrence. Inflates Mention Frequency; only useful when tracking density per sentence, which is rare.
Declare the dedup unit in every report header (it is a provenance field, not a footnote), exactly as the citation-tracking sibling declares position definition A/B/C in AI Citation Tracking §7. The same log can be rolled up under all three units — keep the option open by storing the raw occurrence count per sentence (§4 schema below), then derive the unit-specific rollup at query time.
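To make the three units concrete, here is a small sketch that derives each rollup from raw per-sentence occurrence counts for one brand in one answer. The function name `rollup` and the list-of-counts input shape are illustrative assumptions, not part of the schema.

```python
def rollup(per_sentence_counts, unit):
    """per_sentence_counts: raw occurrences of one brand in each sentence of one answer."""
    hits = [c for c in per_sentence_counts if c > 0]
    if unit == "phrase":
        return sum(hits)            # every occurrence
    if unit == "sentence":
        return len(hits)            # one per sentence containing a hit
    if unit == "answer":
        return 1 if hits else 0     # at most one per answer
    raise ValueError(f"unknown dedup unit: {unit}")

# An answer whose first sentence names the brand twice and third sentence once.
counts = [2, 0, 1]
by_unit = {u: rollup(counts, u) for u in ("phrase", "sentence", "answer")}
print(by_unit)
```

For this answer the phrase-level count is 3, the sentence-level count is 2, and the answer-level count is 1, which is why the unit has to travel with every reported number.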
4.4 Sentiment — capture at detection, route the formula
Tag each retained mention with sentiment: pos / neg / neu / comparative (the brand is named against a competitor, e.g. “X is faster than Acme”). Heuristics that survive most cases without an ML detector:
- adjective proximity within the same sentence (positive / negative adjective ≤ 4 tokens from the brand string);
- comparative phrasing — “faster than”, “unlike”, “compared to” — within ±1 sentence;
- list position in an “X vs Y” / “alternatives to X” table-form answer.
The aggregation formula for Brand Sentiment lives in GEO Metrics §3.9; this playbook only captures the tag. Sentiment cannot be retro-tagged from a derived log — if you skip it now, you re-sample later. That is why it sits in the manual method, not deferred to a future playbook.
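A rough sketch of the first two heuristics follows. Everything here is an assumption for illustration: the tiny lexicons, the single-token brand name, the cue list, and the choice to check comparative cues only in the same sentence rather than within ±1 sentence. The list-position heuristic is omitted.

```python
import re

# Toy lexicons, illustrative only; a real run would version its own lists.
POSITIVE = {"fast", "reliable", "great", "best"}
NEGATIVE = {"slow", "buggy", "expensive", "worst"}
COMPARATIVE_CUES = ("faster than", "unlike", "compared to")

def tag_sentiment(sentence, brand, max_distance=4):
    """Return pos / neg / neu / comparative for one sentence naming the brand."""
    lowered = sentence.lower()
    if any(cue in lowered for cue in COMPARATIVE_CUES):
        return "comparative"
    tokens = re.findall(r"[\w']+", lowered)
    brand_positions = [i for i, t in enumerate(tokens) if t.startswith(brand.lower())]
    score = 0
    for i in brand_positions:
        # Look at most max_distance tokens to either side of the brand token.
        start, stop = max(0, i - max_distance), i + max_distance + 1
        for tok in tokens[start:stop]:
            score += (tok in POSITIVE) - (tok in NEGATIVE)
    return "pos" if score > 0 else "neg" if score < 0 else "neu"

tags = [
    tag_sentiment("Acme is fast and reliable.", "Acme"),
    tag_sentiment("X is faster than Acme.", "Acme"),
    tag_sentiment("Acme was slow today.", "Acme"),
    tag_sentiment("Acme offers a CRM.", "Acme"),
]
```

Whatever lexicon and window are used, they belong in the sentiment scheme version so that a tag from one run means the same thing in the next.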
4.5 Verify every mention (the precision gate)
The mention analogue of AI Citation Tracking §4.1 (citation_verified). For each detected mention, set mention_verified = true only if both hold:
- the named string actually refers to your entity (passes disambiguation, not a collision), and
- the sentence is about the brand rather than a passing appearance — not “Acme.com” embedded in a competitor’s URL, not the brand named only as a caveat that contradicts your reading.
Report verified and unverified counts separately. The same “use vs credit is decoupled” pattern Liu et al. 2023 measured for citations applies here — an emitted name is a claim the engine made, not a fact about your entity. An inflated SOV from unverified mentions is the most common silent failure mode of a vendor whose detector you can’t audit; report verified-only as the headline and unverified as a footnote.
Deliverable — the canonical mention-log row schema, designed to be reconcilable with the AI Citation Tracking §4 citation-log schema (join the two for “did this brand get any credit at all?” analysis):
run_date          date the sample was taken (UTC)
prompt_id         FK -> prompts.csv
prompt_set_v      version tag of the frozen prompt set
brands_set_v      version tag of brands.yaml + competitor set
engine            perplexity | chatgpt | google-aio | gemini | ...
brand_id          FK -> brands.yaml (my_brand or competitor_x)
unit              sentence | answer | phrase (the dedup unit)
occurrence_count  int, within-unit hits (>= 1)
mention_position  1-indexed position in the answer text
sentiment         pos | neg | neu | comparative
mention_verified  true | false (see §4.5)
detector_v        rule pack version + (optional) judge model + prompt
snippet           the sentence(s) the mention appears in
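One way to keep rows honest at write time is to validate them against the schema before they reach the log. The sketch below mirrors the thirteen fields as a frozen dataclass; the class name `MentionRow`, the Python types, and the validation choices are assumptions layered on top of the schema, not part of it.

```python
from dataclasses import dataclass, asdict
from datetime import date

UNITS = {"sentence", "answer", "phrase"}
SENTIMENTS = {"pos", "neg", "neu", "comparative"}

@dataclass(frozen=True)
class MentionRow:
    run_date: date
    prompt_id: str
    prompt_set_v: str
    brands_set_v: str
    engine: str
    brand_id: str
    unit: str
    occurrence_count: int
    mention_position: int
    sentiment: str
    mention_verified: bool
    detector_v: str
    snippet: str

    def __post_init__(self):
        # Reject rows that would silently corrupt later rollups.
        if self.unit not in UNITS:
            raise ValueError(f"bad unit: {self.unit}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"bad sentiment: {self.sentiment}")
        if self.occurrence_count < 1 or self.mention_position < 1:
            raise ValueError("occurrence_count and mention_position must be >= 1")

row = MentionRow(date(2026, 5, 20), "p1", "v3", "v1", "perplexity", "my_brand",
                 "sentence", 2, 1, "pos", True, "det-v1", "Acme is fast.")
record = asdict(row)  # plain dict, ready for CSV or Parquet writers
```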
5. Step 3 — The automated method (the detector is the product)
Automate only a manual process you have already validated; that rule is unchanged from AI Citation Tracking §5. What changes is that in citation tracking the engine ships the detector and you wrap it. In mention tracking, the detector is what you ship.
5.1 The per-engine API reality (mention edition)
Symmetric to the citation-tracking table, but worse for mentions — no engine exposes a “brands named in prose” field anywhere:
| Engine | Programmatic mention extraction? | What you get | Note |
|---|---|---|---|
| Perplexity (Sonar API) | No | Answer string only; brands appear in prose. search_results[] is the citation surface; the legacy citations[] is deprecated and removed | Detector is yours — see Perplexity AI |
| ChatGPT (OpenAI Responses API, web_search) | No | Answer text + url_citation annotations + sources[] (all citation surfaces, not mention surfaces) | Detector is yours; see ChatGPT Search |
| Google AI Overviews | No | No official content API; Search Console aggregates “Web” traffic with no per-citation attribution | Detector is yours — see Google AI Overviews |
| Gemini (Google Search Grounding) | Partial — groundingSupports gives credit spans for sources, not for brand entities (docs) | groundingChunks (sources) and groundingSupports (cited text spans); brand spans need NLP on top | Detector is yours — see Google Gemini |
| Bing Copilot | No | Answer text + a sources panel; no public mention API | Detector is yours — see Bing Copilot |
Every row says “Detector is yours.” That is the point of §5.2.
5.2 The detector — three honest stages
Three stages, in order of cost and ambition. Most production runs settle at stage 2; stage 3 is QA shadow rather than always-on.
- Stage 1: rule-based. Alias dictionary, regex with word-boundary anchors, the boundary rules from §4.1, the negative list. Cheap, deterministic, auditable. Misses paraphrase (“the team behind X”) and resolves collisions poorly.
- Stage 2: rule-based + heuristic disambiguator. Layer a co-occurrence check from brands.yaml.disambiguation: accept an ambiguous hit only if a confirming context phrase is in the same sentence or the prior sentence. This is the production baseline for most brands.
- Stage 3: LLM-as-judge for the residue. For hits flagged ambiguous by stage 2, or as a periodic shadow run for QA, pass the sentence + brand candidate to a separate, cheap, declared model with a prompt such as: “Is "Acme" in this sentence referring to the SaaS company or another sense? Return yes/no/uncertain + 1-sentence reason.” Pin the judge model and prompt; version both as part of detector_v. Promote stage 3 to always-on only for brands with high-collision aliases (Apple, Amazon, Meta-named-as-Facebook).
Discipline: never let an LLM judge silently change a count between runs. A judge upgrade is a detector_v bump and a re-baseline, exactly the way a prompt-set change is in AI Citation Tracking §3.
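One hedged way to enforce that discipline is to derive `detector_v` from the content it is supposed to describe, so any change to the rule pack, judge model, or judge prompt produces a new tag automatically. The function name, the `det-` prefix, and the 12-character truncation below are arbitrary illustrative choices.

```python
import hashlib
import json

def detector_version(rule_pack, judge_model=None, judge_prompt=None):
    """Fingerprint the rule pack plus the (optional) judge configuration."""
    payload = json.dumps(
        {"rules": rule_pack, "judge_model": judge_model, "judge_prompt": judge_prompt},
        sort_keys=True,          # key order must not change the fingerprint
        ensure_ascii=False,
    )
    return "det-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

rules = {"case_insensitive": True, "possessives": True, "negative": ["acme"]}
v_a = detector_version(rules, "judge-model-x", "Is this the SaaS company?")
v_b = detector_version(rules, "judge-model-x", "Is this the SaaS company? Answer yes/no.")
```

Here `judge-model-x` is a placeholder, not a real model identifier. Two runs can only be compared when their tags are equal; `v_a != v_b` signals that a re-baseline is due.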
Pseudocode wrapper, converging on the §4 schema — engine-agnostic by design:
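for engine in engines:
    for prompt in prompts:                       # frozen, versioned (prompt_set_v)
        answer = engine.ask(prompt)              # fresh session, no history
        for s in split_sentences(answer):
            for brand in brands:                 # loaded from brands.yaml (brands_set_v)
                hit, occ = rule_match(s, brand.aliases, brand.negative)
                if not hit:
                    continue
                # 5.2 stages 2 and 3: context check first, then the judge
                ok = disambiguate(s, brand) and llm_judge(s, brand)
                tag = tag_sentiment(s, brand)
                # one row per (sentence, brand), field names as in the §4 schema
                log.append(dict(
                    run_date=run_date, engine=engine, prompt_id=prompt.id,
                    prompt_set_v=prompt_set_v, brands_set_v=brands_set_v,
                    brand_id=brand.id, unit="sentence", occurrence_count=occ,
                    mention_position=position(s, answer), sentiment=tag,
                    mention_verified=ok, detector_v=detector_v, snippet=s))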
5.3 Buy path — vendor detector reality
The build-vs-buy question reduces to a definition question: which vendor’s mention construct do you trust? Vendor selection is downstream of definition, not the other way around. Compact pointer-table (formulas live in GEO Metrics §3.4):
| Vendor | What they call a “mention” | Detector visibility | Where to read |
|---|---|---|---|
| Otterly.AI | Brand Mentions (answer-level binary) / Share of Voice (raw count share) | Detector private; formulas public | Brand Report KPI Definitions |
| Ahrefs Brand Radar | AI Share of Voice, impression-weighted (Google search volume) | Detector private; methodology published | Brand Radar methodology |
| Profound | Visibility Score / Share of Voice | Black box | How to Track Your Visibility in AI Search |
| BrightEdge | Brand-mention-share variants (extends their SEO SOV patent to AI) | Not disclosed | SOV in 2026 |
Anti-pattern: stacking two vendor SOV numbers in one report as if they were the same metric. They are not — Ahrefs’ impression-weighted SOV and Otterly’s raw-count SOV answer different questions, and Profound and BrightEdge are not falsifiable without methodology. The inventory and the disagreement are already in GEO Metrics §3.4; this playbook enforces it at the reporting layer.
6. Step 4 — Normalize, store, compute deltas
The storage rules are identical to the citation-tracking sibling: raw answers are immutable, metrics are derived, never hand-edit a number. Fix the query, not the cell. Three queries you actually run (formulas defined in GEO Metrics):
-- Mention Frequency (GEO Metrics §3.6)
SELECT engine, brand_id,
       SUM(occurrence_count) FILTER (WHERE mention_verified) AS mentions
FROM mention_log
WHERE prompt_set_v = 'v3' AND unit = 'sentence'
GROUP BY engine, brand_id;

-- Share of Voice, closed competitor set, raw count (GEO Metrics §3.4)
WITH m AS (
  SELECT brand_id, SUM(occurrence_count) AS n
  FROM mention_log
  WHERE prompt_set_v = 'v3' AND mention_verified
    AND brand_id IN (SELECT member FROM competitor_set_v3)
  GROUP BY brand_id
)
SELECT brand_id, n * 1.0 / SUM(n) OVER () AS sov FROM m;

-- Answer Inclusion Rate (GEO Metrics §3.7)
-- Caveat: the denominator counts only prompts with at least one logged row for the engine.
-- If zero-hit prompts are never logged, take the denominator from prompts.csv instead.
SELECT engine,
       COUNT(DISTINCT prompt_id) FILTER
         (WHERE brand_id = 'my_brand' AND mention_verified) * 1.0
       / COUNT(DISTINCT prompt_id) AS air
FROM mention_log
WHERE prompt_set_v = 'v3'
GROUP BY engine;
The queries run unchanged as DuckDB over a mention_log.parquet file — no relational store required.
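As a cross-check on the three rollups, the same arithmetic can be sketched in plain Python over a handful of made-up rows. The rows, the four-prompt set, and the function names are illustrative assumptions; unlike the SQL, this sketch takes the Answer Inclusion Rate denominator from the full prompt list, so prompts that produced no mentions at all still count.

```python
from collections import Counter

# Toy rows: (prompt_id, brand_id, occurrence_count, mention_verified)
rows = [
    ("p1", "my_brand", 2, True),
    ("p1", "competitor_a", 1, True),
    ("p2", "competitor_a", 3, True),
    ("p3", "my_brand", 1, False),   # unverified: excluded from headline numbers
]
prompt_ids = ["p1", "p2", "p3", "p4"]          # p4 produced no mentions
closed_set = ["my_brand", "competitor_a"]

def mention_frequency(rows):
    freq = Counter()
    for _, brand, occ, verified in rows:
        if verified:
            freq[brand] += occ
    return freq

def share_of_voice(rows, members):
    freq = mention_frequency(rows)
    total = sum(freq[b] for b in members)
    return {b: (freq[b] / total if total else 0.0) for b in members}

def answer_inclusion_rate(rows, brand, prompt_ids):
    included = {p for p, b, _, verified in rows if b == brand and verified}
    return len(included) / len(prompt_ids)

freq = mention_frequency(rows)
sov = share_of_voice(rows, closed_set)
air = answer_inclusion_rate(rows, "my_brand", prompt_ids)
```

With these rows the verified counts are 2 for `my_brand` and 4 for `competitor_a`, so Share of Voice is 2/6 and 4/6, and `my_brand` is included in 1 of 4 prompts.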
Is this delta real? Before reporting movement, clear this checklist (the citation-tracking sibling’s checklist plus the mention-specific items in bold):
- Was the prompt-set version identical across both samples?
- Were brands.yaml and the competitor set version-identical across both samples?
- Was the detector version (rule pack + judge model + judge prompt) identical?
- Is the underlying mention base large enough? Below ~30 mentions, the same small-sample caution GEO Metrics §3.7 flags for AIR applies.
- Did an engine change behaviour between runs?
- Did mention_verified rates move? A “gain” in unverified mentions is not a gain.
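Most of the checklist can be turned into a mechanical gate that runs before any delta is reported. The sketch below is one possible shape under stated assumptions: `RunMeta` and `delta_blockers` are hypothetical names, the 30-mention floor comes from the checklist, and the 0.10 tolerance on the verified rate is an arbitrary placeholder. Engine behaviour changes still need a human check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunMeta:
    prompt_set_v: str
    brands_set_v: str
    detector_v: str
    verified_mentions: int
    total_mentions: int

def delta_blockers(prev, curr, min_base=30, max_rate_shift=0.10):
    """Return the reasons a run-over-run delta should not be reported yet."""
    blockers = []
    for name in ("prompt_set_v", "brands_set_v", "detector_v"):
        if getattr(prev, name) != getattr(curr, name):
            blockers.append(f"{name} changed: re-baseline before comparing")
    if min(prev.verified_mentions, curr.verified_mentions) < min_base:
        blockers.append(f"mention base below {min_base}")

    def rate(run):
        return run.verified_mentions / run.total_mentions if run.total_mentions else 0.0

    if abs(rate(prev) - rate(curr)) > max_rate_shift:
        blockers.append("mention_verified rate moved")
    return blockers

last_week = RunMeta("v3", "v1", "det-a", 40, 50)
clean = delta_blockers(last_week, RunMeta("v3", "v1", "det-a", 45, 55))
dirty = delta_blockers(last_week, RunMeta("v3", "v1", "det-b", 12, 40))
```

An empty list means the comparison clears the automated items; any entry is a reason to re-baseline or hold the number back.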
7. Step 5 — Report it without lying to yourself
Every reported number ships with its provenance, or it is not comparable to anything. The provenance superset for mention tracking:
- prompt-set version (e.g. v3)
- brands-set version + competitor-set mode (closed/open) + member list
- detector version — rule pack + LLM-judge model + judge prompt
- engine set (the exact engines, not “AI”)
- time window (e.g. 7-day, sampled weekly)
- dedup unit (sentence / answer / phrase)
- sentiment scheme version
A “Share of Voice of 18%” with none of the above is a rumour, not a metric.
Tie movement to outcomes carefully: Mention Frequency moved, not revenue. The business bridge is a separate model — route it to GEO ROI Models. The mention-log this playbook produces is one of two inputs the GEO Audit Layer 6 consumes as a tracking snapshot (the other being the citation-log) — tracking is the heartbeat, the audit is the periodic physical.
8. Validity threats & pitfalls
Do not ship a report without clearing every item. Mention-specific failure modes are in bold; shared-with-citation-tracking failure modes are summarized with a route-out:
- Prompt-set bias — see AI Citation Tracking §3
- Time-window bias — see AI Citation Tracking §6
- Brand-name collisions — the precision killer; §4.2 disambiguation + §5.2 stage 3 LLM-judge are the defense
- Dedup-unit drift — comparing sentence-level last month to answer-level this month is, accidentally, reporting fraud
- Competitor-set drift (open mode) — SOV denominators change silently as new brands enter the answer space; lock and version
- Vendor SOV stacking — §5.3 above; GEO Metrics §3.4 already inventories the methodology disagreement
- Negative-mention blind-spot — high Mention Frequency with negative sentiment is not a win; without §4.4 sentiment capture you cannot see it
- Multilingual splitting — English and Chinese mention pools are not summable; see GEO Metrics §7
- Aggarwal transfer — Aggarwal et al. 2024 measures impression (being used) on on-page rewrites, not unlinked-mention prevalence. Their headline up-to-40% lift is not a Mention Frequency forecast for this playbook; the bounded reading in the paper entry holds here too.
- One-actor lift over-extrapolation — C-SEO Bench (Puerto et al. 2025) finds many conversational-SEO rewrites lose effectiveness under competition. The same caution applies to mention tracking: a SOV gain you measured alone may not survive once competitors optimize for the same prompt set.
9. Further reading
- Definition layer: GEO Metrics · Brand Mentions · Citation vs Mention
- Business framing: GEO ROI Models
- Companion playbook: AI Citation Tracking — the citation half of the same measurement pair
- Adjacent operation: GEO Audit — consumes this playbook’s snapshot at Layer 6
- Per-engine specifics: Perplexity AI · ChatGPT Search · Google AI Overviews
- Academic anchor (boundary): Aggarwal et al. 2024
References
Academic:
- Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD ‘24. arXiv:2311.09735 · ACM DL
- Liu, N., Zhang, T., Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. Findings of EMNLP ‘23. arXiv:2304.09848
- Puerto, H. et al. (2025). C-SEO Bench: Does Conversational SEO Work? NeurIPS ‘25 D&B. arXiv:2506.11097
API & platform documentation (verified 2026-05):
- Perplexity — Chat Completions API Reference · Changelog
- OpenAI — Web Search tool (Responses API)
- Google Search Central — AI features and your site
- Google — Grounding with Google Search (Gemini API)
Vendor KPI methodology (for the buy path, via GEO Metrics §3.4):
- Otterly.AI — Brand Report KPI Definitions
- Ahrefs — Brand Radar Methodology
- Profound — How to Track Your Visibility in AI Search
- BrightEdge — What Share of Voice Really Means for Search in 2026