Brand Mention Tracking

Quick facts

Difficulty: Intermediate
Time: Half-day detector build, then ~30 min/week
Prerequisites: GEO Metrics, Brand Mentions, Citation vs Mention vs Link
What this is: Operational playbook — how to count unlinked brand mentions in AI answer prose, not why they matter
Companion playbook: Citation tracking owns URL fields the engine emits; mention tracking owns the detector for in-prose names
Mention metric core: 4 — Mention Frequency, Share of Voice, Answer Inclusion Rate, Brand Sentiment
Definitions source: Every metric is defined in GEO Metrics; this playbook collects them

1. What brand mention tracking is

A repeatable measurement instrument: a frozen prompt set, queried against a declared engine set on a fixed cadence, with each answer sampled, detected for brand names in the prose — yours and your competitors’ — verified, and logged to a single canonical schema. The output is a time series of how often, where in the answer, and with what sentiment your brand is named.

The companion playbook is AI Citation Tracking: same measurement spine, different detection problem. Citations are URLs the engine emits as fields — Perplexity’s search_results[], OpenAI’s url_citation annotations, Gemini’s groundingChunks — so the detector is trivial and the work is reconciliation. Mentions live in generated prose with no field anywhere across major engines; the detector is what you build, and it is where almost all of the error lives. The mechanism for why an unlinked mention is worth measuring is the subject of Brand Mentions; the three-outcome attribution taxonomy (citation vs mention vs link) is Citation vs Mention; the metric formulas this playbook collects sit in GEO Metrics.

Three specifics, not editorializing, make mention tracking the harder of the two siblings:

No mentions[] field on any engine API. Confirmed for Perplexity (Chat Completions reference), OpenAI (Responses API web search), Google AI Overviews (Search Central — AI features), and Gemini (Grounding with Google Search). Each surface tags source URLs and (sometimes) credit spans for citations — none labels brand entities in the generated prose.
The unit of “one mention” is a choice, not a given (§4.3). “Acme is fast. Acme integrates with X. Acme costs less.” is one mention, three mentions, or one-per-occurrence depending on the unit you declare — three different numbers from the same answer.
Brand-name collisions (“Apple” the company vs “apple” the fruit; “Meta” the platform vs “meta” the modifier; competitor sub-brands shared with yours) make precision the harder metric, not recall. The inverse of citation tracking, where the engine can never invent a URL.

2. Decide before you measure

Five decisions fix the meaning of every later number. Get one wrong and the rest is uninterpretable.

Decision	Options	Rule of thumb
Which metric(s)	Mention Frequency, Share of Voice, Answer Inclusion Rate, Brand Sentiment	Start with Mention Frequency + AIR (no competitor set required); SOV needs the competitor-set decision below
Unit of “one mention”	Sentence-level / answer-level / phrase-level	Default sentence-level; declare it once, then never mix units across runs
Competitor set	Closed (predefined N) / open (everyone named)	Closed for SOV stability; open for landscape sweeps — report mode in every header
Engine set	Your audience’s engines, not all of them	The engine set is itself a reported variable — declare it
Time window + cadence	e.g. weekly sample, 7-day window	Answers refresh fast; the window is part of the metric, not a footnote

The four mention metrics live in GEO Metrics — §3.4 (Share of Voice), §3.6 (Mention Frequency), §3.7 (Answer Inclusion Rate), §3.9 (Brand Sentiment). For any other metric in this playbook, look up its formula in the same place.

3. Step 1 — Build the prompt set + competitor set

The prompt-set rules are the same as the citation-tracking sibling — 30–50 prompts derived from real user intents, balanced across categories, frozen and versioned, stored under version control as data, not config. See AI Citation Tracking §3 for the discipline in full; do not re-derive it per playbook.

The new instrument for mention tracking is a competitor set — the dimension citation tracking does not strictly need. Two modes:

Closed — your N declared competitors. Stable SOV denominator, runs are comparable.
Open — every brand mentioned in any answer. Better landscape awareness, but the denominator drifts run-over-run, so SOV is not comparable without explicit re-baselining.

Why this matters: a Share of Voice number without a declared competitor set is a metric without a definition — and is exactly why vendor SOV figures disagree (Otterly publishes its formula, Ahrefs weights by impressions, Profound and BrightEdge do not disclose; the inventory is in GEO Metrics §3.4). Declare both the competitor-set mode and the member list, and version both.
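To make the denominator effect concrete, here is a minimal sketch of raw-count Share of Voice under the two competitor-set modes. The function name share_of_voice and the mention counts are hypothetical illustration data, not figures from any vendor or from GEO Metrics.

```python
def share_of_voice(counts, members=None):
    """Raw-count SOV. counts maps brand -> verified mentions.

    members=None is open mode (every brand seen is in the denominator);
    a list of members is closed mode (only declared brands count).
    """
    if members is None:
        pool = dict(counts)
    else:
        pool = {brand: counts.get(brand, 0) for brand in members}
    total = sum(pool.values())
    return {brand: (n / total if total else 0.0) for brand, n in pool.items()}


# Hypothetical counts from one run; "newcomer" is not a declared competitor.
counts = {"my_brand": 18, "competitor_a": 42, "competitor_b": 30, "newcomer": 10}

closed = share_of_voice(counts, ["my_brand", "competitor_a", "competitor_b"])
opened = share_of_voice(counts)

print(round(closed["my_brand"], 2))  # 0.2
print(round(opened["my_brand"], 2))  # 0.18
```

The same 18 mentions read as 20% or 18% depending only on the mode, which is why the mode and member list belong in the report header.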

Deliverable — a brands.yaml companion to prompts.csv:

# brands.yaml — versioned alongside prompts.csv
my_brand:
  canonical: "Acme"
  aliases: ["Acme Inc.", "Acme Corp", "Acme.ai"]
  negative: ["acme"]                  # dictionary-word collision
  disambiguation: ["SaaS", "CRM", "acme.com"]   # required-context terms
  parent: null
  brands_set_v: v1
  added_date: 2026-05-20

competitor_set:
  mode: closed
  members: [my_brand, competitor_a, competitor_b, ...]
  competitor_set_v: v1

4. Step 2 — The manual method (start here, always)

Run this by hand before you automate anything. Manual sampling is the ground truth that builds judgement about what a “mention” actually looks like per engine, validates a tool you later buy, and is the only way to debug a detector that confidently miscounts. The rule from AI Citation Tracking §4 applies unchanged; the new manual steps unique to mentions are below.

4.1 Detection rules — alias, case, and boundary

The canonical name plus alias list lives in brands.yaml. Four boundary rules to declare and freeze:

Case-sensitivity — default case-insensitive matching, but record the original casing in the log row (useful for sentiment context, “ACME LAUNCHED” reads differently from “acme launched”).
Possessives and plurals — include “Acme’s” and “Acmes” by default; exclude only via an explicit negative-list entry.
Hyphenation and spacing variants — “Acme AI”, “Acme.AI”, “Acme-AI” — accept all unless brands.yaml says otherwise.
URL / email collisions in prose — exclude. The answer’s sources tray is the citation surface, not the mention surface; a URL containing the brand string is a citation event, not a prose mention.

Rules are data, not code — store the boundary policy alongside brands.yaml and version it.
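A minimal sketch of the four boundary rules as one compiled pattern, assuming Python's standard re module. The names build_alias_pattern and find_mentions are illustrative, and the rules are hard-coded only for brevity; in practice the toggles would be read from the versioned policy stored alongside brands.yaml. The negative list is not handled here.

```python
import re


def build_alias_pattern(aliases):
    """Compile one case-insensitive pattern covering every alias.

    Separators inside an alias (space, dot, hyphen) are interchangeable,
    and a trailing possessive or plural "s" is absorbed into the match.
    """
    alternatives = []
    for alias in sorted(aliases, key=len, reverse=True):  # longest alias first
        parts = [re.escape(p) for p in re.split(r"[ .\-]+", alias.strip(" .-")) if p]
        alternatives.append(r"[ .\-]?".join(parts))
    body = "|".join(alternatives)
    # Lookarounds reject hits glued to word, URL, or email characters,
    # so "https://acme.com" and "sales@acme.com" are not prose mentions.
    return re.compile(
        rf"(?<![\w@/.])(?:{body})(?:['’]s|s)?(?![\w@/]|\.\w)",
        re.IGNORECASE,
    )


def find_mentions(sentence, aliases):
    """Return (matched_text, start_offset) pairs with original casing kept."""
    pattern = build_alias_pattern(aliases)
    return [(m.group(0), m.start()) for m in pattern.finditer(sentence)]


aliases = ["Acme", "Acme AI"]
print(find_mentions("Acme's pricing beats Acme-AI.", aliases))
print(find_mentions("See https://acme.com/pricing or mail sales@acme.com.", aliases))
```

Keeping the matched text in its original casing is what lets the log row record "ACME" versus "acme" even though matching itself is case-insensitive.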

4.2 Disambiguation — the precision step

The failure mode unique to mention tracking: collision with dictionary words (“Apple”, “Amazon” the river, “Meta” the modifier), with people’s names, and with competitor sub-brands. False positives silently inflate Mention Frequency and corrupt SOV — a detector that runs hot but uncalibrated is the most common silent failure of a bought-in tool whose internals you cannot audit.

Rule: when the alias is ambiguous, require local-context disambiguation — a co-occurring product, domain, or topic phrase in the same sentence or the previous sentence. Promote the rule to the brands.yaml disambiguation: field rather than hard-coding it. This is also where the LLM-judge upgrade in §5.2 stage 3 earns its keep.
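The local-context rule can be sketched as a small function. This is an assumption-laden illustration: disambiguate is a hypothetical helper, context_terms stands in for the brands.yaml disambiguation: list, and plain substring matching is used where a production detector would likely want token-aware matching.

```python
def disambiguate(sentences, index, context_terms):
    """Accept an ambiguous hit in sentences[index] only with confirming context.

    A context term counts if it appears, case-insensitively, in the same
    sentence or in the sentence immediately before it.
    """
    window = sentences[max(0, index - 1): index + 1]
    text = " ".join(window).lower()
    return any(term.lower() in text for term in context_terms)


sentences = [
    "Many teams pick a CRM first.",
    "Acme is a popular choice.",
    "The acme of good design is simplicity.",
]
context_terms = ["SaaS", "CRM", "acme.com"]

print(disambiguate(sentences, 1, context_terms))  # True: "CRM" is in the prior sentence
print(disambiguate(sentences, 2, context_terms))  # False: no confirming term nearby
```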

4.3 The dedup unit — declare it once, then live with it

The single most consequential numeric decision in the playbook. Three options, three different numbers from the same answer:

Sentence-level — count one mention per sentence containing a hit. Maps cleanly to how an answer “spends words on you.” Default.
Answer-level — at most one mention per answer per brand. Closest to vendor “mentioned in answer? yes/no” framing — Otterly’s headline Brand Mentions KPI is in this family (see Otterly Brand Report KPI Definitions). Effectively the Answer Inclusion Rate view.
Phrase-level — every occurrence. Inflates Mention Frequency; only useful when tracking density per sentence, which is rare.

Declare the dedup unit in every report header (it is a provenance field, not a footnote), exactly as the citation-tracking sibling declares position definition A/B/C in AI Citation Tracking §7. The same log can be rolled up under all three units: store the raw occurrence count per sentence (see the schema in §4.5), then derive the unit-specific rollup at query time.
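A short sketch of how one answer yields three different counts. The helper count_by_unit and its naive sentence splitter are assumptions made for illustration; a real detector would reuse the alias and boundary rules from §4.1 rather than a bare word-boundary match.

```python
import re


def count_by_unit(answer, brand):
    """Count mentions of `brand` in `answer` under each dedup unit."""
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    per_sentence = [len(pattern.findall(s)) for s in sentences]
    return {
        "phrase": sum(per_sentence),                        # every occurrence
        "sentence": sum(1 for n in per_sentence if n > 0),  # sentences with a hit
        "answer": 1 if any(per_sentence) else 0,            # named at all?
    }


answer = "Acme is fast, and Acme is cheap. Acme integrates with X. Rivals cost more."
print(count_by_unit(answer, "Acme"))
# {'phrase': 3, 'sentence': 2, 'answer': 1}
```

Because the per-sentence occurrence counts are the only stored quantity, all three rollups are derived from the same raw data.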

4.4 Sentiment — capture at detection, route the formula

Tag each retained mention with sentiment: pos / neg / neu / comparative (the brand is named against a competitor, e.g. “X is faster than Acme”). Heuristics that survive most cases without an ML detector:

adjective proximity within the same sentence (positive / negative adjective ≤ 4 tokens from the brand string);
comparative phrasing — “faster than”, “unlike”, “compared to” — within ±1 sentence;
list position in an “X vs Y” / “alternatives to X” table-form answer.

The aggregation formula for Brand Sentiment lives in GEO Metrics §3.9; this playbook only captures the tag. Sentiment cannot be retro-tagged from a derived log — if you skip it now, you re-sample later. That is why it sits in the manual method, not deferred to a future playbook.
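The first two heuristics can be sketched in a few lines. Everything here is an illustrative assumption: the tiny POSITIVE and NEGATIVE lexicons, the cue list, and the tag_sentiment name are placeholders, and this version checks comparative cues only in the same sentence rather than within ±1 sentence.

```python
import re

POSITIVE = {"fast", "reliable", "affordable"}   # illustrative lexicon only
NEGATIVE = {"slow", "buggy", "expensive"}       # illustrative lexicon only
COMPARATIVE_CUES = (" than ", " unlike ", " compared to ", " vs ")


def tag_sentiment(sentence, brand, window=4):
    """Return pos | neg | neu | comparative for one sentence naming `brand`."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    padded = " " + " ".join(tokens) + " "
    if any(cue in padded for cue in COMPARATIVE_CUES):
        return "comparative"
    # Brand positions; "acme's" counts as the brand token "acme".
    hits = [i for i, t in enumerate(tokens) if t.split("'")[0] == brand.lower()]
    near = {t for i in hits for t in tokens[max(0, i - window): i + window + 1]}
    if near & POSITIVE and not near & NEGATIVE:
        return "pos"
    if near & NEGATIVE and not near & POSITIVE:
        return "neg"
    return "neu"


print(tag_sentiment("Acme is fast and reliable.", "Acme"))            # pos
print(tag_sentiment("X is faster than Acme.", "Acme"))                # comparative
print(tag_sentiment("Support said Acme was slow last week.", "Acme")) # neg
```

Mixed signals (a positive and a negative adjective both within the window) fall through to neu in this sketch, which keeps the tag conservative until a human or judge reviews it.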

4.5 Verify every mention (the precision gate)

The mention analogue of AI Citation Tracking §4.1 (citation_verified). For each detected mention, set mention_verified = true only if both hold:

the named string actually refers to your entity (passes disambiguation, not a collision), and
the sentence is about the brand rather than a passing appearance — not “Acme.com” embedded in a competitor’s URL, not the brand named only as a caveat that contradicts your reading.

Report verified and unverified counts separately. The same “use vs credit is decoupled” pattern Liu et al. 2023 measured for citations applies here — an emitted name is a claim the engine made, not a fact about your entity. An inflated SOV from unverified mentions is the most common silent failure mode of a vendor whose detector you can’t audit; report verified-only as the headline and unverified as a footnote.

Deliverable — the canonical mention-log row schema, designed to be reconcilable with the AI Citation Tracking §4 citation-log schema (join the two for “did this brand get any credit at all?” analysis):

run_date            date the sample was taken (UTC)
prompt_id           FK -> prompts.csv
prompt_set_v        version tag of the frozen prompt set
brands_set_v        version tag of brands.yaml + competitor set
engine              perplexity | chatgpt | google-aio | gemini | ...
brand_id            FK -> brands.yaml (my_brand or competitor_x)
unit                sentence | answer | phrase  (the dedup unit)
occurrence_count    int — within-unit hits (>= 1)
mention_position    1-indexed position in the answer text
sentiment           pos | neg | neu | comparative
mention_verified    true | false   (see §4.5)
detector_v          rule pack version + (optional) judge model + prompt
snippet             the sentence(s) the mention appears in
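One way to keep rows honest at write time is a typed record that rejects values outside the declared enumerations. This is a sketch, not the canonical implementation: the MentionRow class and its validation choices are assumptions, and the field types (for example str for prompt_id) are guesses the schema above does not specify.

```python
from dataclasses import asdict, dataclass
from datetime import date

UNITS = {"sentence", "answer", "phrase"}
SENTIMENTS = {"pos", "neg", "neu", "comparative"}


@dataclass(frozen=True)
class MentionRow:
    run_date: date
    prompt_id: str
    prompt_set_v: str
    brands_set_v: str
    engine: str
    brand_id: str
    unit: str
    occurrence_count: int
    mention_position: int
    sentiment: str
    mention_verified: bool
    detector_v: str
    snippet: str

    def __post_init__(self):
        # Reject rows that would silently corrupt later rollups.
        if self.unit not in UNITS:
            raise ValueError(f"unknown dedup unit: {self.unit!r}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment tag: {self.sentiment!r}")
        if self.occurrence_count < 1 or self.mention_position < 1:
            raise ValueError("occurrence_count and mention_position must be >= 1")


row = MentionRow(date(2026, 5, 20), "p001", "v3", "v1", "perplexity", "my_brand",
                 "sentence", 2, 1, "pos", True, "rules-v1",
                 "Acme is fast, and Acme is cheap.")
print(asdict(row)["unit"])  # sentence
```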

5. Step 3 — The automated method (the detector is the product)

Automate only a manual process you have already validated; that rule is unchanged from AI Citation Tracking §5. What changes is that in citation tracking the engine ships the detector and you wrap it. In mention tracking, the detector is what you ship.

5.1 The per-engine API reality (mention edition)

Symmetric to the citation-tracking table, but worse for mentions — no engine exposes a “brands named in prose” field anywhere:

Engine	Programmatic mention extraction?	What you get	Note
Perplexity (Sonar API)	No	Answer string only; brands appear in prose. `search_results[]` is the citation surface; the legacy `citations[]` is deprecated and removed	Detector is yours — see Perplexity AI
ChatGPT (OpenAI Responses API, `web_search`)	No	Answer text + `url_citation` annotations + `sources[]` — all citation surfaces, not mention surfaces	Detector is yours — see ChatGPT Search
Google AI Overviews	No	No official content API; Search Console aggregates “Web” traffic with no per-citation attribution	Detector is yours — see Google AI Overviews
Gemini (Google Search Grounding)	Partial — `groundingSupports` gives credit spans for sources, not for brand entities (docs)	`groundingChunks` (sources) and `groundingSupports` (cited text spans); brand spans need NLP on top	Detector is yours — see Google Gemini
Bing Copilot	No	Answer text + a sources panel; no public mention API	Detector is yours — see Bing Copilot

Every row says “Detector is yours.” That is the point of §5.2.

5.2 The detector — three honest stages

Three stages, in order of cost and ambition. Most production runs settle at stage 2; stage 3 is QA shadow rather than always-on.

Stage 1 — rule-based. Alias dictionary, regex with word-boundary anchors, the boundary rules from §4.1, the negative list. Cheap, deterministic, auditable. Misses paraphrase (“the team behind X”) and resolves collisions poorly.
Stage 2 — rule-based + heuristic disambiguator. Layer a co-occurrence check from brands.yaml.disambiguation — accept an ambiguous hit only if a confirming context phrase is in the same sentence or the prior sentence. This is the production baseline for most brands.
Stage 3 — LLM-as-judge for the residue. For hits flagged ambiguous by stage 2, or as a periodic shadow run for QA, pass the sentence + brand candidate to a separate, cheap, declared model: Is "Acme" in this sentence referring to the SaaS company or another sense? Return yes/no/uncertain + 1-sentence reason. Pin the judge model and prompt; version both as part of detector_v. Promote stage 3 to always-on only for brands with high-collision aliases (Apple, Amazon, Meta-named-as-Facebook).

Discipline: never let an LLM judge silently change a count between runs. A judge upgrade is a detector_v bump and a re-baseline, exactly the way a prompt-set change is in AI Citation Tracking §3.
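One way to make a silent judge change impossible is to derive detector_v from the detector's contents rather than typing it by hand. The sketch below is an assumption about how that could be done; the detector_version helper, the "det-" prefix, and the judge model names are hypothetical.

```python
import hashlib
import json


def detector_version(rule_pack, judge_model, judge_prompt):
    """Derive a short, stable fingerprint from everything that can change a count."""
    payload = json.dumps(
        {"rules": rule_pack, "judge_model": judge_model, "judge_prompt": judge_prompt},
        sort_keys=True,       # key order must not change the fingerprint
        ensure_ascii=False,
    )
    return "det-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


rules = {"aliases": ["Acme", "Acme AI"], "negative": ["acme"]}
prompt = 'Is "Acme" in this sentence referring to the SaaS company or another sense?'

before = detector_version(rules, "judge-small-1", prompt)   # hypothetical model name
after = detector_version(rules, "judge-small-2", prompt)    # judge upgraded
print(before != after)  # True
```

Any change to the rule pack, judge model, or judge prompt produces a new tag, so mismatched runs show up in the delta checklist instead of in the numbers.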

Pseudocode wrapper, converging on the §4 schema — engine-agnostic by design:

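for engine in engines:
  for prompt in prompts:                               # frozen set, tagged prompt_set_v
    answer = engine.ask(prompt.text)                   # fresh session, no history
    for s in split_sentences(answer):
      for brand in brands:                             # from brands.yaml, tagged brands_set_v
        hit, occ = rule_match(s, brand.aliases, brand.negative)   # stage 1
        if not hit:
          continue
        verdict = disambiguate(s, brand)               # stage 2: yes | no | ambiguous
        if verdict == "ambiguous":
          verdict = llm_judge(s, brand)                # stage 3: residue only
        log.append(dict(
          run_date=run_date, engine=engine.name, prompt_id=prompt.id,
          prompt_set_v=prompt_set_v, brands_set_v=brands_set_v,
          brand_id=brand.id, unit="sentence", occurrence_count=occ,
          mention_position=position(s, answer),
          sentiment=tag_sentiment(s, brand),
          mention_verified=(verdict == "yes"),
          detector_v=detector_v, snippet=s))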

5.3 Buy path — vendor detector reality

The build-vs-buy question reduces to a definition question: which vendor’s mention construct do you trust? Vendor selection is downstream of definition, not the other way around. Compact pointer-table (formulas live in GEO Metrics §3.4):

Vendor	What they call a “mention”	Detector visibility	Where to read
Otterly.AI	Brand Mentions (answer-level binary) / Share of Voice (raw count share)	Detector private; formulas public	Brand Report KPI Definitions
Ahrefs Brand Radar	AI Share of Voice, impression-weighted (Google search volume)	Detector private; methodology published	Brand Radar methodology
Profound	Visibility Score / Share of Voice	Black box	How to Track Your Visibility in AI Search
BrightEdge	Brand-mention-share variants (extends their SEO SOV patent to AI)	Not disclosed	SOV in 2026

Anti-pattern: stacking two vendor SOV numbers in one report as if they were the same metric. They are not — Ahrefs’ impression-weighted SOV and Otterly’s raw-count SOV answer different questions, and Profound and BrightEdge are not falsifiable without methodology. The inventory and the disagreement are already in GEO Metrics §3.4; this playbook enforces it at the reporting layer.

6. Step 4 — Normalize, store, compute deltas

The storage rules are identical to the citation-tracking sibling: raw answers are immutable, metrics are derived, never hand-edit a number. Fix the query, not the cell. Three queries you actually run (formulas defined in GEO Metrics):

-- Mention Frequency, sentence-level unit  (GEO Metrics §3.6)
-- One row = one sentence containing a hit, so count rows.
-- For the phrase-level rollup, use SUM(occurrence_count) instead.
SELECT engine, brand_id,
       COUNT(*) FILTER (WHERE mention_verified) AS mentions
FROM mention_log
WHERE prompt_set_v = 'v3' AND unit = 'sentence'
GROUP BY engine, brand_id;

-- Share of Voice, closed competitor set, raw count, sentence-level unit  (GEO Metrics §3.4)
WITH m AS (
  SELECT brand_id, COUNT(*) AS n
  FROM mention_log
  WHERE prompt_set_v = 'v3' AND unit = 'sentence' AND mention_verified
    AND brand_id IN (SELECT member FROM competitor_set_v3)
  GROUP BY brand_id
)
SELECT brand_id, n * 1.0 / SUM(n) OVER () AS sov FROM m;

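-- Answer Inclusion Rate  (GEO Metrics §3.7)
-- Denominator = every prompt in the frozen set. A prompt whose answer names
-- no brand writes no row to mention_log, so counting prompts from the log
-- alone would shrink the denominator and inflate AIR.
-- Assumes prompts.csv is loaded as a table `prompts` with a prompt_set_v column.
SELECT engine,
       COUNT(DISTINCT prompt_id) * 1.0
       / (SELECT COUNT(*) FROM prompts WHERE prompt_set_v = 'v3') AS air
FROM mention_log
WHERE prompt_set_v = 'v3' AND brand_id = 'my_brand' AND mention_verified
GROUP BY engine;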

The queries run unchanged as DuckDB over a mention_log.parquet file — no relational store required.

Is this delta real? Before reporting movement, clear this checklist (the citation-tracking sibling’s checklist plus the mention-specific items):

Was the prompt-set version identical across both samples?
Were brands.yaml and the competitor set version-identical across both samples?
Was the detector version (rule pack + judge model + judge prompt) identical?
Is the underlying mention base large enough? Below ~30 mentions, the same small-sample caution GEO Metrics §3.7 flags for AIR applies.
Did an engine change behaviour between runs?
Did mention_verified rates move? A “gain” in unverified mentions is not a gain.
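The mechanical half of the checklist can be automated. The sketch below assumes each run is summarized as a dictionary carrying its provenance tags and verified-mention count; the delta_is_comparable helper and the dictionary keys are illustrative, and the engine-behaviour check still needs a human.

```python
PROVENANCE_KEYS = ("prompt_set_v", "brands_set_v", "detector_v", "unit")


def delta_is_comparable(run_a, run_b, min_mentions=30):
    """Return (ok, reasons). ok is False if any provenance or sample check fails."""
    reasons = [f"{key} differs" for key in PROVENANCE_KEYS if run_a[key] != run_b[key]]
    for label, run in (("first", run_a), ("second", run_b)):
        if run["verified_mentions"] < min_mentions:
            reasons.append(f"{label} run has fewer than {min_mentions} verified mentions")
    return (not reasons, reasons)


last_week = {"prompt_set_v": "v3", "brands_set_v": "v1", "detector_v": "det-a",
             "unit": "sentence", "verified_mentions": 120}
this_week = dict(last_week, unit="answer", verified_mentions=12)

ok, reasons = delta_is_comparable(last_week, this_week)
print(ok)       # False
print(reasons)
```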

7. Step 5 — Report it without lying to yourself

Every reported number ships with its provenance, or it is not comparable to anything. The provenance superset for mention tracking:

prompt-set version (e.g. v3)
brands-set version + competitor-set mode (closed/open) + member list
detector version — rule pack + LLM-judge model + judge prompt
engine set (the exact engines, not “AI”)
time window (e.g. 7-day, sampled weekly)
dedup unit (sentence / answer / phrase)
sentiment scheme version

A “Share of Voice of 18%” with none of the above is a rumour, not a metric.

Tie movement to outcomes carefully: Mention Frequency moved, not revenue. The business bridge is a separate model — route it to GEO ROI Models. The mention-log this playbook produces is one of two inputs the GEO Audit Layer 6 consumes as a tracking snapshot (the other being the citation-log) — tracking is the heartbeat, the audit is the periodic physical.

8. Validity threats & pitfalls

Do not ship a report without clearing every item. Mention-specific failure modes are described in place; failure modes shared with citation tracking are summarized with a route-out:

Prompt-set bias — see AI Citation Tracking §3
Time-window bias — see AI Citation Tracking §6
Brand-name collisions — the precision killer; §4.2 disambiguation + §5.2 stage 3 LLM-judge are the defense
Dedup-unit drift — comparing sentence-level last month to answer-level this month is, accidentally, reporting fraud
Competitor-set drift (open mode) — SOV denominators change silently as new brands enter the answer space; lock and version
Vendor SOV stacking — §5.3 above; GEO Metrics §3.4 already inventories the methodology disagreement
Negative-mention blind-spot — high Mention Frequency with negative sentiment is not a win; without §4.4 sentiment capture you cannot see it
Multilingual splitting — English and Chinese mention pools are not summable; see GEO Metrics §7
Aggarwal transfer — Aggarwal et al. 2024 measures impression (being used) on on-page rewrites, not unlinked-mention prevalence. Their headline up-to-40% lift is not a Mention Frequency forecast for this playbook; the bounded reading in the paper entry holds here too.
One-actor lift over-extrapolation — C-SEO Bench (Puerto et al. 2025) finds many conversational-SEO rewrites lose effectiveness under competition. The same caution applies to mention tracking: a SOV gain you measured alone may not survive once competitors optimize for the same prompt set.

9. Further reading

Definition layer: GEO Metrics · Brand Mentions · Citation vs Mention
Business framing: GEO ROI Models
Companion playbook: AI Citation Tracking — the citation half of the same measurement pair
Adjacent operation: GEO Audit — consumes this playbook’s snapshot at Layer 6
Per-engine specifics: Perplexity AI · ChatGPT Search · Google AI Overviews
Academic anchor (boundary): Aggarwal et al. 2024

References

Academic:

Aggarwal, P. et al. (2024). GEO: Generative Engine Optimization. KDD ‘24. arXiv:2311.09735 · ACM DL
Liu, N., Zhang, T., Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. Findings of EMNLP ‘23. arXiv:2304.09848
Puerto, H. et al. (2025). C-SEO Bench: Does Conversational SEO Work? NeurIPS ‘25 D&B. arXiv:2506.11097

API & platform documentation (verified 2026-05):

Perplexity — Chat Completions API Reference · Changelog
OpenAI — Web Search tool (Responses API)
Google Search Central — AI features and your site
Google — Grounding with Google Search (Gemini API)

Vendor KPI methodology (for the buy path, via GEO Metrics §3.4):

Otterly.AI — Brand Report KPI Definitions
Ahrefs — Brand Radar Methodology
Profound — How to Track Your Visibility in AI Search
BrightEdge — What Share of Voice Really Means for Search in 2026

Frequently asked questions

How is mention tracking different from citation tracking?

Same measurement spine — frozen prompt set, declared engines, fixed cadence, single canonical log — but a different detection problem. Citations come back in URL fields the engine emits: Perplexity's search_results[], OpenAI's url_citation annotations, Gemini's groundingChunks. Mentions live in the generated prose with no field anywhere across major engines, so you build the detector — alias dictionary, disambiguation, dedup unit, sentiment — and the detector is where almost all of the error lives.

Sentence-level or answer-level dedup — which should I default to?

Sentence-level. It maps cleanly to how an answer 'spends words on you', tolerates the LLM-judge step on ambiguous hits, and is the unit most analyses end up comparing on. Also report an answer-level rollup from the same log when you need to compare to vendor numbers — most vendors report a 'mentioned in answer? yes/no' figure that is the answer-level view. The log supports both; what you must not do is silently switch units between runs.

Can I just buy a vendor (Otterly, Profound, Ahrefs) instead of building a detector?

Yes, and most teams should. The catch is that vendor selection is a metrics-definition question in disguise — Otterly publishes its SOV formula, Ahrefs Brand Radar weights mentions by Google search volume to estimate impressions, Profound and BrightEdge do not publish detector internals. Rank vendors by which definition you trust (see [GEO Metrics §3.4](/geo-metrics)); two vendors' SOV numbers are not comparable even when both pages say 'Share of Voice'.

My brand name collides with a dictionary word — how do I avoid false positives?

Three layers, in order. (1) Negative list in brands.yaml — exclude lowercase dictionary forms, URL collisions, common-noun matches. (2) Disambiguation context — require a co-occurring product, domain, or topic phrase in the same or prior sentence before accepting an ambiguous hit. (3) LLM-judge for the residue — pass the sentence + brand candidate to a small declared model and log its verdict + reason as part of provenance. The three stages are §5.2; precision, not recall, is the harder metric for mention tracking — the inverse of citation tracking.

Does Aggarwal's '+40%' headline apply to mention tracking?

No. Aggarwal et al. measures position-adjusted *impression* on on-page rewrites — being *used*. Brand mention tracking measures unlinked off-site naming — being *named*. They are different constructs, and the +40% is a per-method, per-domain upper bound against 2023–24 engines, not a flat expectation that transfers across constructs. The mechanism for why mentions matter is [Brand Mentions](/brand-mentions); the bounded reading of Aggarwal's lift is in the [paper entry](/papers/aggarwal-geo-benchmark-2024).

Sources

Primary

GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
Perplexity API — Chat Completions Reference · Perplexity
Perplexity API — Changelog (citations field deprecation) · Perplexity
OpenAI — Web Search tool (Responses API) · OpenAI
Google Search Central — AI features and your site · Google
Grounding with Google Search (Gemini API — groundingChunks / groundingSupports) · Google
Otterly.AI — Brand Report KPI Definitions · Otterly.AI
Ahrefs Brand Radar Methodology · Ahrefs
Profound — How to Track Your Visibility in AI Search · Profound

Secondary

BrightEdge — What Share of Voice Really Means for Search in 2026 · BrightEdge
Evaluating Verifiability in Generative Search Engines (Liu et al. 2023) · arXiv / EMNLP '23 Findings
C-SEO Bench: Does Conversational SEO Work? (Puerto et al. 2025) · arXiv / NeurIPS '25 D&B