GEO: Generative Engine Optimization (Aggarwal et al. 2024)
Quick facts
- Authors: Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande
- Venue: KDD 2024 (Proc. 30th ACM SIGKDD)
- Year: 2024
- DOI: 10.1145/3637528.3671900
- URL: https://arxiv.org/abs/2311.09735
- Reproducibility: code-and-data
Plain-English summary
Aggarwal et al. asked a then-new question: if an AI search engine writes the answer instead of listing links, how does your content get into that answer? They formalized this as Generative Engine Optimization, built GEO-bench (10,000 queries across 25 domains) to measure it, and tested nine content rewrites. Citing sources, adding statistics, and adding quotations reliably increased how prominently a page was used in the synthesized answer — by up to 40% in their setup — while keyword stuffing, the classic SEO reflex, did not.
Key findings
- Content-level rewrites (add citations, statistics, quotations) lifted visibility by up to ~40% on the paper's Position-Adjusted Word Count metric.
- 'Up to 40%' is an upper bound for specific methods and domains, not an average effect — the headline figure is routinely over-generalized in practitioner writing.
- Method effectiveness is domain- and engine-dependent: Cite Sources wins on factual queries, Authoritative on debate/history, Statistics Addition on law and opinion.
- Keyword Stuffing — the traditional-SEO reflex — did not help and could hurt, early evidence that GEO is not SEO tactics relabelled.
- Lower-ranked pages gained the most (Cite Sources gave a +115.1% lift to rank-5 pages), suggesting GEO partly rebalances search incumbency.
- Effects held on a live engine (Perplexity.ai, up to ~22%), materially smaller than the internal-engine figure — external validity is bounded.
1. What this paper is — and why it anchors the whole field
This is the paper that gave the field its name. GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) is the academic origin of the term GEO.
Its move was to take a fuzzy practitioner intuition — “get my content into the AI answer” — and turn it into a measurable optimization problem with a public benchmark.
One note on scope up front: the paper’s definition is narrow and benchmark-scoped. It studies content rewrites against a fixed evaluation harness. Practitioners later broadened “GEO” to mean the whole discipline. This entry is a deep read of the paper, not of the broadened term.
| Attribute | Value |
|---|---|
| Authors | Aggarwal, Murahari, Rajpurohit, Kalyan, Narasimhan, Deshpande |
| Venue | KDD 2024 (Proc. 30th ACM SIGKDD) |
| Identifiers | arXiv 2311.09735 · DOI 10.1145/3637528.3671900 |
| Artifacts | Code + GEO-bench data, Apache-2.0 |
2. The problem it formalizes
The structural break the paper builds on: a results page used to be a ranked list of links; a generative engine returns a single synthesized answer. Visibility is no longer the same thing as ranking.
The paper’s abstraction is deliberately minimal:
- The generative engine is a black box — the content creator cannot change the model, only the input web content.
- The objective is to maximize the content’s visibility inside the synthesized answer, not its rank in a link list.
- Success is therefore a property of how the answer uses your text, which is why the paper needs a new metric (see §4) rather than reusing rank.
The retrieval-and-grounding machinery that makes this possible — why an engine pulls some sources and grounds its answer on them — is the prerequisite, covered by the answer loop and surveyed in Gao et al. 2023. This paper takes that machinery as given and asks the optimization question on top of it.
3. Methodology — the GEO framework and GEO-bench
The paper evaluates nine content-optimization methods, applied as rewrites to a candidate web source:
| # | Method | One-line intent |
|---|---|---|
| 1 | Authoritative | Rewrite in a more authoritative tone |
| 2 | Statistics Addition | Add relevant quantitative data |
| 3 | Keyword Stuffing | Add more query keywords (the SEO reflex) |
| 4 | Cite Sources | Add citations to credible sources |
| 5 | Quotation Addition | Add relevant quotations |
| 6 | Easy-to-Understand | Simplify the language |
| 7 | Fluency Optimization | Improve fluency |
| 8 | Unique Words | Add uncommon/unique vocabulary |
| 9 | Technical Terms | Add domain technical terms |
GEO-bench, the benchmark, is constructed as:
- 10,000 queries (8K train / 1K validation / 1K test).
- Drawn from 9 datasets: MS MARCO, ORCAS-I, Natural Questions, AllSouls, LIMA, Davinci-Debate, Perplexity.ai Discover, ELI5, and GPT-4-generated queries.
- 25 domains (e.g. Arts, Health, Games) with seven categorizations.
- Each query carries its top-5 Google results as the candidate source set.
Two engines are tested: an internal generative engine (GPT-3.5-turbo prompted over the top-5 Google results) and a deployed engine, Perplexity.ai, used as a real-world check.
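The evaluation loop described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' harness: it assumes lift is reported as relative change against the unrewritten baseline, and `toy_engine`, `word_share`, and `append_filler` are hypothetical stand-ins chosen only so the loop runs. They say nothing about how GPT-3.5 or Perplexity.ai behave.

```python
from typing import Callable, List, Tuple

Answer = List[Tuple[str, int]]  # (sentence text, index of the source it cites)


def relative_lift(
    query: str,
    sources: List[str],
    target: int,
    rewrite: Callable[[str], str],
    engine: Callable[[str, List[str]], Answer],
    visibility: Callable[[Answer, int], float],
) -> float:
    """Percent change in the target source's visibility after one rewrite."""
    baseline = visibility(engine(query, sources), target)

    # Only the target source is rewritten; the engine itself is a black box.
    optimized_sources = list(sources)
    optimized_sources[target] = rewrite(sources[target])
    optimized = visibility(engine(query, optimized_sources), target)

    if baseline == 0:
        return float("inf") if optimized > 0 else 0.0
    return 100.0 * (optimized - baseline) / baseline


# Hypothetical stand-ins, for illustration only.
def toy_engine(query: str, sources: List[str]) -> Answer:
    # Quotes every source verbatim, in order, each citing itself.
    return [(text, i) for i, text in enumerate(sources)]


def word_share(answer: Answer, target: int) -> float:
    # Fraction of answer words that sit in sentences citing the target.
    total = sum(len(text.split()) for text, _ in answer)
    mine = sum(len(text.split()) for text, i in answer if i == target)
    return mine / total if total else 0.0


def append_filler(source: str) -> str:
    # Placeholder rewrite: doubles a four-word source.
    return source + " w x y z"


example_lift = relative_lift(
    "example query", ["a b c d", "e f g h"], 0,
    append_filler, toy_engine, word_share,
)
```

Keeping the engine and the visibility measure as parameters mirrors the paper's black-box framing: the same loop can be pointed at a different engine or metric without touching the rewrite.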
4. The “impression” metric — the paper’s most-cited contribution
The single most reused idea in this paper is not a tactic; it is a metric. It replaces the binary question of whether a source was cited or merely mentioned with a continuous, position-aware score.
The paper proposes two visibility measures:
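Position-Adjusted Word Count (Imp_pwc):
$$\mathrm{Imp}_{pwc} = \frac{\sum_{s \in S_{\text{cited}}} |s| \cdot e^{-\mathrm{pos}(s)/|S|}}{\sum_{s \in S} |s|}$$
Here S is the set of sentences in the response, S_cited is the subset of S that cites your source, |s| is the word count of sentence s, and pos(s) is its position in the response. In words: the word count of your cited sentences, exponentially discounted by how late they appear in the answer, as a share of the total response word count.
Subjective Impression: a GPT-3.5-scored composite over 7 sub-dimensions: relevance, influence, uniqueness, subjective position, subjective count, click-likelihood, and diversity.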
Why this matters more than “was I cited, yes/no”: it captures how prominently a source shapes the answer, which is what citability work is actually trying to move. Most later vendor KPIs and GEO measurement frameworks are descendants of this position-weighted idea.
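The Position-Adjusted Word Count definition above can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the authors' implementation. It assumes the answer has already been split into sentences with their cited source ids, that positions are 0-indexed, and that a sentence citing several sources splits its word count equally among them. The function name and input shape are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def position_adjusted_word_count(
    sentences: List[Tuple[str, Sequence[int]]],
    source_id: int,
) -> float:
    """Position-adjusted share of the answer attributed to one source.

    `sentences` is the answer in reading order: (text, cited source ids).
    """
    n = len(sentences)
    total_words = sum(len(text.split()) for text, _ in sentences)
    if n == 0 or total_words == 0:
        return 0.0

    score = 0.0
    for pos, (text, cited) in enumerate(sentences):
        if source_id in cited:
            # Assumption of this sketch: shared citations split the credit.
            words = len(text.split()) / len(cited)
            # Later sentences are discounted by e^(-pos / number of sentences).
            score += words * math.exp(-pos / n)
    return score / total_words


answer = [("a b c d", [1]), ("e f", [2])]
first = position_adjusted_word_count(answer, 1)   # 4 / 6, about 0.667
second = position_adjusted_word_count(answer, 2)  # 2 * e^(-0.5) / 6, about 0.202
```

The second source is penalized twice in this toy answer: it contributes fewer words, and those words appear later.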
5. Key findings
The headline and — more importantly — its boundaries:
| Finding | Detail |
|---|---|
| Headline lift | GEO methods boosted visibility up to ~40% on Position-Adjusted Word Count |
| Best methods | Quotation Addition +41% (PAWC); Statistics Addition +37% (Subjective); Cite Sources +30% (PAWC) |
| Not an average | “Up to 40%” is a per-method, per-domain upper bound, not a flat expected gain |
| Domain-dependent | Cite Sources → factual queries; Authoritative → debate/history; Statistics → law & opinion |
| SEO reflex fails | Keyword Stuffing did not help, and could hurt |
| Incumbency rebalance | Rank-5 pages gained most — Cite Sources gave them +115.1% |
| Live-engine check | On Perplexity.ai the lift was up to ~22% on Position-Adjusted Word Count, smaller than on the internal engine |
Read the “SEO reflex fails” and “Not an average” rows together: the actionable signal is that content substance beats keyword tricks, and the actionable caution is that the percentage is a ceiling, not a promise.
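The gap between an “up to” figure and an expected gain is plain arithmetic, sketched below. Every number and label here is invented for illustration and is NOT taken from the paper; the point is only that the maximum over (method, domain) cells can sit far above the mean.

```python
from statistics import mean

# Hypothetical relative lifts in percent, one per (method, domain) cell.
# Invented values for illustration only; not results from the paper.
hypothetical_lifts = {
    ("method_a", "domain_x"): 40.0,
    ("method_a", "domain_y"): 10.0,
    ("method_b", "domain_x"): -5.0,
    ("method_b", "domain_y"): -1.0,
}

headline = max(hypothetical_lifts.values())  # the "up to" figure: 40.0
average = mean(hypothetical_lifts.values())  # the flat expectation: 11.0
```

A planning estimate built on `headline` would overstate the typical cell in this toy table by almost a factor of four.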
6. GEO Wiki critique
Steel-man first. Three contributions are genuinely foundational and have held up:
- Naming and framing. Turning “get into the AI answer” into a black-box optimization problem is the move the whole field is built on.
- The impression metric. Position-adjusted, continuous visibility is the right unit and propagated everywhere.
- A public benchmark. GEO-bench made the claim contestable — which is exactly what lets the next critique exist.
Bounded critique — four points:
- External validity. The engines tested are 2023–24 in form (an internal GPT-3.5 harness plus Perplexity.ai). Today’s ChatGPT Search, Gemini and AI Overviews differ in retrieval and synthesis; the paper cannot speak to them, and should not be quoted as if it does.
- Benchmark drift. GEO-bench’s corpus and the engines both age. The “40%” is bound to a 2024 snapshot and is not transferable across time as a constant.
- Replication points the other way. Puerto et al., C-SEO Bench (NeurIPS 2025 Datasets & Benchmarks), finds many conversational-SEO rewrites ineffective or counterproductive once multiple parties optimize against the same engine. The single-actor lift in this paper is an upper bound, not the equilibrium.
- Headline framing. “Up to 40%” travels through practitioner writing as “~40%”. The honest read is a per-method, per-domain maximum that shrank to ~22% on a live engine and may shrink further under competition and trust filtering (E-E-A-T pressures).
Position. Foundational is not turnkey. Take the direction — substance (sources, statistics, quotations) over keyword manipulation — and discard the number as a planning input.
7. Reproducibility
Verified at draft time (2026-05-17), not assumed:
| Artifact | Status |
|---|---|
| Source code | Public — github.com/GEO-optim/GEO (run_geo.py, geo_functions.py) |
| Benchmark data | Public — HuggingFace GEO-optim/geo-bench |
| License | Apache-2.0 |
| Project page | generative-engines.com/GEO/ |
Field value: code-and-data. Both the method implementation and GEO-bench are openly available, so the headline experiments are independently reproducible — a meaningful strength relative to most papers in this space.
8. What it means for practitioners
What to carry away — and what not to:
- Use: content-substance rewrites. Adding credible citations, concrete statistics, and relevant quotations is the durable, repeatedly-validated direction.
- Use: the measurement mindset. Track prominence continuously, not citation as a binary — wire it into AI citation tracking.
- Do not carry: the specific percentages, or any cross-engine / cross-domain extrapolation. The same rewrite behaves differently on ChatGPT Search vs Perplexity.ai.
- Do not assume: single-actor gains survive once competitors optimize too (see §6, C-SEO Bench).
9. Further reading
- Gao et al. 2023 — RAG: A Survey — the retrieval/grounding mechanism this paper sits on top of.
- C-SEO Bench (Puerto et al. 2025) — the key replication counterweight; read alongside §6.
- Generative Engine Optimization — how the broadened practitioner term relates to this narrow academic origin.
References
- Aggarwal, Murahari, Rajpurohit, Kalyan, Narasimhan, Deshpande — GEO: Generative Engine Optimization, KDD 2024. arXiv:2311.09735 · DOI:10.1145/3637528.3671900
- GEO — official code & experiments: github.com/GEO-optim/GEO
- GEO-bench dataset: huggingface.co/datasets/GEO-optim/geo-bench
- GEO project page: generative-engines.com/GEO
- Gao et al. — Retrieval-Augmented Generation for LLMs: A Survey, 2023. arXiv:2312.10997
- Puerto, Gubri, Green, Oh, Yun — C-SEO Bench: Does Conversational SEO Work?, NeurIPS 2025 D&B. arXiv:2506.11097
- Liu, Zhang, Liang — Evaluating Verifiability in Generative Search Engines, Findings of EMNLP 2023. arXiv:2304.09848
Critique & limitations
The paper's three contributions — naming the problem, the impression/visibility metric, and the first public GEO-bench — are real and have anchored the entire field. The bounded reading: the headline 40% is a per-method, per-domain upper bound measured against 2023–24 engines (an internal GPT-3.5 harness plus Perplexity.ai), so it should not be carried across time, engines, or domains as a flat expectation. The strongest counter-evidence is Puerto et al.'s C-SEO Bench (NeurIPS 2025 D&B), which finds many conversational-SEO rewrites ineffective or counterproductive under competition. Foundational does not mean turnkey: use the direction (content substance over keyword tricks), not the number.
Sources
Primary
- GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
- GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
- GEO — official code & experiments repository · GEO-optim
- GEO-bench dataset (HuggingFace) · HuggingFace
- GEO project page · GEO-optim
- Retrieval-Augmented Generation for LLMs: A Survey (Gao et al. 2023) · arXiv · 2023-12-18
Secondary
- C-SEO Bench: Does Conversational SEO Work? (Puerto et al. 2025) · arXiv / NeurIPS '25 D&B
- Evaluating Verifiability in Generative Search Engines (Liu et al. 2023) · arXiv / EMNLP '23 Findings