GEO: Generative Engine Optimization (Aggarwal et al. 2024)

Quick facts

Authors
Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande
Venue
KDD 2024 (Proc. 30th ACM SIGKDD)
Year
2024
DOI
10.1145/3637528.3671900
URL
https://arxiv.org/abs/2311.09735
Reproducibility
code-and-data

Plain-English summary

Aggarwal et al. asked a then-new question: if an AI search engine writes the answer instead of listing links, how does your content get into that answer? They formalized this as Generative Engine Optimization, built GEO-bench (10,000 queries across 25 domains) to measure it, and tested nine content rewrites. Citing sources, adding statistics, and adding quotations reliably increased how prominently a page was used in the synthesized answer — by up to 40% in their setup — while keyword stuffing, the classic SEO reflex, did not.

Key findings

  • Content-level rewrites (add citations, statistics, quotations) lifted visibility by up to ~40% on the paper's Position-Adjusted Word Count metric.
  • 'Up to 40%' is an upper bound for specific methods and domains, not an average effect — the headline figure is routinely over-generalized in practitioner writing.
  • Method effectiveness is domain- and engine-dependent: Cite Sources wins on factual queries, Authoritative on debate/history, Statistics Addition on law and opinion.
  • Keyword Stuffing — the traditional-SEO reflex — did not help and could hurt, early evidence that GEO is not SEO tactics relabelled.
  • Lower-ranked pages gained the most (Cite Sources gave a +115.1% lift to rank-5 pages), suggesting GEO partly rebalances search incumbency.
  • Effects held on a live engine (Perplexity.ai, up to ~22%), materially smaller than the internal-engine figure — external validity is bounded.

1. What this paper is — and why it anchors the whole field

This is the paper that gave the field its name. GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) is the academic origin of the term GEO.

Its move was to take a fuzzy practitioner intuition — “get my content into the AI answer” — and turn it into a measurable optimization problem with a public benchmark.

One note on scope up front: the paper’s definition is narrow and benchmark-scoped. It studies content rewrites against a fixed evaluation harness. Practitioners later broadened “GEO” to mean the whole discipline. This entry is a deep read of the paper, not of the broadened term.

| Attribute | Value |
| --- | --- |
| Authors | Aggarwal, Murahari, Rajpurohit, Kalyan, Narasimhan, Deshpande |
| Venue | KDD 2024 (Proc. 30th ACM SIGKDD) |
| Identifiers | arXiv 2311.09735 · DOI 10.1145/3637528.3671900 |
| Artifacts | Code + GEO-bench data, Apache-2.0 |

2. The problem it formalizes

The structural break the paper builds on: a results page used to be a ranked list of links; a generative engine returns a single synthesized answer. Visibility is no longer the same thing as ranking.

The paper’s abstraction is deliberately minimal:

  • The generative engine is a black box — the content creator cannot change the model, only the input web content.
  • The objective is to maximize the content’s visibility inside the synthesized answer, not its rank in a link list.
  • Success is therefore a property of how the answer uses your text, which is why the paper needs a new metric (see §4) rather than reusing rank.

The retrieval-and-grounding machinery that makes this possible — why an engine pulls some sources and grounds its answer on them — is the prerequisite, covered by the answer loop and surveyed in Gao et al. 2023. This paper takes that machinery as given and asks the optimization question on top of it.
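
The black-box framing above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the type aliases, `score_rewrites`, `toy_engine` and `toy_visibility` are all hypothetical names, and the toy engine simply echoes its inputs so that the sketch runs. The point is the shape of the problem: the engine is passed in and never modified, only one source changes, and the score is read off the answer.

```python
from typing import Callable, Dict, Sequence, Tuple

# All names below are hypothetical, chosen for illustration only.
Engine = Callable[[str, Sequence[str]], str]   # (query, sources) -> synthesized answer
Visibility = Callable[[str, int], float]       # (answer, source index) -> score
Rewrite = Callable[[str], str]                 # source text -> rewritten source text


def score_rewrites(
    query: str,
    sources: Sequence[str],
    target: int,
    rewrites: Dict[str, Rewrite],
    engine: Engine,
    visibility: Visibility,
) -> Tuple[str, Dict[str, float]]:
    """Score each rewrite of sources[target]; return the best name and all scores."""
    scores: Dict[str, float] = {}
    for name, rewrite in rewrites.items():
        candidate = list(sources)
        candidate[target] = rewrite(sources[target])  # only the creator's page changes
        answer = engine(query, candidate)             # the engine itself is never modified
        scores[name] = visibility(answer, target)     # success is measured on the answer
    best = max(scores, key=lambda name: scores[name])
    return best, scores


# Toy stand-ins so the sketch runs; they do not model any real engine.
def toy_engine(query: str, sources: Sequence[str]) -> str:
    # Echoes each (single-line) source as one tagged line: "[index] text".
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(sources))


def toy_visibility(answer: str, source_index: int) -> float:
    # Share of answer words that sit on the line tagged with source_index.
    total = mine = 0
    for line in answer.splitlines():
        tag, _, text = line.partition(" ")
        words = len(text.split())
        total += words
        if tag == f"[{source_index}]":
            mine += words
    return mine / total if total else 0.0


rewrites = {"unchanged": lambda t: t, "padded": lambda t: t + " with more words"}
best, scores = score_rewrites(
    "example query", ["first page text", "second page text"], 0,
    rewrites, toy_engine, toy_visibility,
)
print(best, scores)
```

Because the toy engine only echoes text, the longer rewrite wins by construction; that says nothing about real engines. Swapping in a real engine call and a real visibility metric leaves `score_rewrites` unchanged, which is the sense in which the paper treats the engine as a black box.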

3. Methodology — the GEO framework and GEO-bench

The paper evaluates nine content-optimization methods, applied as rewrites to a candidate web source:

| # | Method | One-line intent |
| --- | --- | --- |
| 1 | Authoritative | Rewrite in a more authoritative tone |
| 2 | Statistics Addition | Add relevant quantitative data |
| 3 | Keyword Stuffing | Add more query keywords (the SEO reflex) |
| 4 | Cite Sources | Add citations to credible sources |
| 5 | Quotation Addition | Add relevant quotations |
| 6 | Easy-to-Understand | Simplify the language |
| 7 | Fluency Optimization | Improve fluency |
| 8 | Unique Words | Add uncommon/unique vocabulary |
| 9 | Technical Terms | Add domain technical terms |

GEO-bench, the benchmark, is constructed as:

  • 10,000 queries (8K train / 1K validation / 1K test).
  • Drawn from 9 datasets: MS MARCO, ORCAS-I, Natural Questions, AllSouls, LIMA, Davinci-Debate, Perplexity.ai Discover, ELI5, and GPT-4-generated queries.
  • 25 domains (e.g. Arts, Health, Games), with each query tagged along seven categorization dimensions.
  • Each query carries its top-5 Google results as the candidate source set.

Two engines are tested: an internal generative engine (GPT-3.5-turbo prompted over the top-5 Google results) and a deployed engine, Perplexity.ai, used as a real-world check.

4. The “impression” metric — the paper’s most-cited contribution

The single most reused idea in this paper is not a tactic; it is a metric. It replaces the binary question of whether a source was cited with a continuous, position-aware measure of how much of the answer that source shapes.

The paper proposes two visibility measures:

Position-Adjusted Word Count (Imp_pwc):
  sum over cited sentences s of  |s| · e^(-pos(s)/|S|)
  divided by total response word count
  = word count of your cited sentences, exponentially
    discounted by how late they appear in the answer.

Subjective Impression:
  a GPT-3.5-scored composite over 7 sub-dimensions:
  relevance · influence · uniqueness · subjective position ·
  subjective count · click-likelihood · diversity

Why this matters more than “was I cited, yes/no”: it captures how prominently a source shapes the answer, which is the quantity citability work is actually trying to move. Many later vendor KPIs and GEO measurement frameworks build on this position-weighted idea.
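
A minimal Python sketch of Position-Adjusted Word Count, following the formula as stated in this entry. Several details are assumptions rather than confirmed specifics of the paper's code: `pos(s)` is taken as the zero-based sentence index, `|S|` as the number of sentences in the answer, words are split on whitespace, and a sentence that cites several sources splits its word count equally among them. The function and type names are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

# One answer sentence: (text, indices of the sources it cites).
Sentence = Tuple[str, Set[int]]


def word_count(text: str) -> int:
    # Assumption: whitespace tokenization is good enough for a sketch.
    return len(text.split())


def position_adjusted_word_count(answer: Sequence[Sentence], source: int) -> float:
    """Position-discounted words credited to `source`, over total answer words."""
    n_sentences = len(answer)
    total_words = sum(word_count(text) for text, _ in answer)
    if n_sentences == 0 or total_words == 0:
        return 0.0
    weighted = 0.0
    for pos, (text, cited) in enumerate(answer):  # assumption: pos is zero-based
        if source in cited:
            # Assumption: words are shared equally among all sources a sentence cites.
            share = word_count(text) / len(cited)
            weighted += share * math.exp(-pos / n_sentences)
    return weighted / total_words


answer = [
    ("Solar output rose sharply last year", {0}),
    ("Costs fell as well", {1}),
    ("Analysts expect this to continue", {0, 1}),
]
for source in (0, 1):
    print(source, round(position_adjusted_word_count(answer, source), 4))
```

Both sources in the example are cited, so a binary flag treats them alike, yet source 0 scores higher because its sentence is longer and appears first. That gap is the information a yes/no citation check throws away.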

5. Key findings

The headline and — more importantly — its boundaries:

| Finding | Detail |
| --- | --- |
| Headline lift | GEO methods boosted visibility up to ~40% on Position-Adjusted Word Count |
| Best methods | Quotation Addition +41% (PAWC); Statistics Addition +37% (Subjective); Cite Sources +30% (PAWC) |
| Not an average | “Up to 40%” is a per-method, per-domain upper bound, not a flat expected gain |
| Domain-dependent | Cite Sources → factual queries; Authoritative → debate/history; Statistics → law & opinion |
| SEO reflex fails | Keyword Stuffing did not help, and could hurt |
| Incumbency rebalance | Rank-5 pages gained most: Cite Sources gave them +115.1% |
| Live-engine check | On Perplexity.ai the lift was up to ~22%, smaller than internal |

Read the “SEO reflex fails”, “Not an average” and “Live-engine check” findings together: the actionable signal is that content substance beats keyword tricks, and the actionable caution is that the percentage is a ceiling, not a promise.
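
The difference between a ceiling and an expected gain is plain arithmetic. The sketch below computes relative lift as 100 × (optimized − baseline) / baseline and then contrasts the maximum over a grid of (method, domain) cells with the mean over the same grid. Every number in it is invented for illustration and is not a result from the paper; the domain labels and variable names are hypothetical.

```python
from statistics import mean
from typing import Dict, Tuple


def relative_lift(baseline: float, optimized: float) -> float:
    """Relative improvement in percent: 100 * (optimized - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline visibility must be positive")
    return 100.0 * (optimized - baseline) / baseline


# Hypothetical (baseline, optimized) visibility per (method, domain) cell.
# Invented values for illustration only; not taken from the paper.
cells: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("Quotation Addition", "domain A"): (0.20, 0.28),
    ("Quotation Addition", "domain B"): (0.20, 0.22),
    ("Keyword Stuffing", "domain A"): (0.20, 0.19),
    ("Keyword Stuffing", "domain B"): (0.20, 0.20),
}

lifts = {cell: relative_lift(base, opt) for cell, (base, opt) in cells.items()}
headline = max(lifts.values())   # the "up to" figure: best single cell
average = mean(lifts.values())   # what a flat expectation would have to be

print(f"up to {headline:.1f}%, average {average:.2f}%")
```

In this made-up grid the best cell shows a lift of about 40% while the average across cells is about 11%. Quoting the first number as if it were the second is the over-generalization this entry warns about.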

6. GEO Wiki critique

Steel-man first. Three contributions are genuinely foundational and have held up:

  1. Naming and framing. Turning “get into the AI answer” into a black-box optimization problem is the move the whole field is built on.
  2. The impression metric. Position-adjusted, continuous visibility is the right unit and propagated everywhere.
  3. A public benchmark. GEO-bench made the claim contestable — which is exactly what lets the next critique exist.

Bounded critique — four points:

  1. External validity. The engines tested are 2023–24 in form (an internal GPT-3.5 harness plus Perplexity.ai). Today’s ChatGPT Search, Gemini and AI Overviews differ in retrieval and synthesis; the paper cannot speak to them, and should not be quoted as if it does.
  2. Benchmark drift. GEO-bench’s corpus and the engines both age. The “40%” is bound to a 2024 snapshot and is not transferable across time as a constant.
  3. Replication points the other way. Puerto et al., C-SEO Bench (NeurIPS 2025 Datasets & Benchmarks), finds many conversational-SEO rewrites ineffective or counterproductive once multiple parties optimize against the same engine. The single-actor lift in this paper is an upper bound, not the equilibrium.
  4. Headline framing. “Up to 40%” travels through practitioner writing as “~40%”. The honest read is a per-method, per-domain maximum that shrank to ~22% on a live engine and may shrink further under competition and trust filtering (E-E-A-T pressures).
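
One way to see why a single-actor lift is an upper bound is a toy share model. The sketch assumes, purely for illustration, that each source's visibility is its "strength" divided by the summed strength of all sources, and that a rewrite multiplies strength by a fixed factor. This model is not taken from the paper or from C-SEO Bench, and the names and the 1.4 factor are hypothetical; it only shows that share-based visibility is zero-sum.

```python
from typing import Dict, Set


def visibility_shares(strengths: Dict[str, float]) -> Dict[str, float]:
    """Toy model: each source's share of the answer is proportional to its strength."""
    total = sum(strengths.values())
    if total <= 0:
        raise ValueError("total strength must be positive")
    return {name: value / total for name, value in strengths.items()}


def apply_boost(
    strengths: Dict[str, float], optimizers: Set[str], factor: float
) -> Dict[str, float]:
    """Multiply the strength of every optimizing source by `factor`."""
    return {
        name: value * factor if name in optimizers else value
        for name, value in strengths.items()
    }


# Five equally strong sources, mirroring a top-5 candidate set.
base = {name: 1.0 for name in ("a", "b", "c", "d", "e")}

alone = visibility_shares(apply_boost(base, {"a"}, 1.4))         # only "a" optimizes
everyone = visibility_shares(apply_boost(base, set(base), 1.4))  # all five optimize

print(round(alone["a"], 3), round(everyone["a"], 3))
```

When only source "a" applies the rewrite its share rises above the 0.2 baseline; when all five apply the same rewrite every share returns to 0.2. Real engines are not this simple, but the direction matches the multi-party caution in point 3.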

Position. Foundational is not turnkey. Take the direction — substance (sources, statistics, quotations) over keyword manipulation — and discard the number as a planning input.

7. Reproducibility

Verified at draft time (2026-05-17), not assumed:

| Artifact | Status |
| --- | --- |
| Source code | Public: github.com/GEO-optim/GEO (run_geo.py, geo_functions.py) |
| Benchmark data | Public: HuggingFace GEO-optim/geo-bench |
| License | Apache-2.0 |
| Project page | generative-engines.com/GEO/ |

Field value: code-and-data. Both the method implementation and GEO-bench are openly available, so the headline experiments are independently reproducible — a meaningful strength relative to most papers in this space.

8. What it means for practitioners

What to carry away — and what not to:

  • Use: content-substance rewrites. Adding credible citations, concrete statistics, and relevant quotations is the durable, repeatedly-validated direction.
  • Use: the measurement mindset. Track prominence continuously, not citation as a binary — wire it into AI citation tracking.
  • Do not carry: the specific percentages, or any cross-engine / cross-domain extrapolation. The same rewrite behaves differently on ChatGPT Search vs Perplexity.ai.
  • Do not assume: single-actor gains survive once competitors optimize too (see §6, C-SEO Bench).

9. Further reading

References

  1. Aggarwal, Murahari, Rajpurohit, Kalyan, Narasimhan, Deshpande — GEO: Generative Engine Optimization, KDD 2024. arXiv:2311.09735 · DOI:10.1145/3637528.3671900
  2. GEO — official code & experiments: github.com/GEO-optim/GEO
  3. GEO-bench dataset: huggingface.co/datasets/GEO-optim/geo-bench
  4. GEO project page: generative-engines.com/GEO
  5. Gao et al. — Retrieval-Augmented Generation for LLMs: A Survey, 2023. arXiv:2312.10997
  6. Puerto, Gubri, Green, Oh, Yun — C-SEO Bench: Does Conversational SEO Work?, NeurIPS 2025 D&B. arXiv:2506.11097
  7. Liu, Zhang, Liang — Evaluating Verifiability in Generative Search Engines, Findings of EMNLP 2023. arXiv:2304.09848

Critique & limitations

The paper's three contributions — naming the problem, the impression/visibility metric, and the first public GEO-bench — are real and have anchored the entire field. The bounded reading: the headline 40% is a per-method, per-domain upper bound measured against 2023–24 engines (an internal GPT-3.5 harness plus Perplexity.ai), so it should not be carried across time, engines, or domains as a flat expectation. The strongest counter-evidence is Puerto et al.'s C-SEO Bench (NeurIPS 2025 D&B), which finds many conversational-SEO rewrites ineffective or counterproductive under competition. Foundational does not mean turnkey: use the direction (content substance over keyword tricks), not the number.

Frequently asked questions

What is the 'GEO paper' and why does it matter?
Aggarwal et al., GEO: Generative Engine Optimization (KDD 2024). It is the academic origin of the term GEO: the first paper to formalize optimizing content for AI-synthesized answers, with a public benchmark (GEO-bench) and a visibility metric.

What is the impression / visibility metric it introduced?
Two metrics: Position-Adjusted Word Count (word count of cited sentences, decayed by their position in the answer) and Subjective Impression (a GPT-3.5-scored composite over seven sub-dimensions like relevance, influence and uniqueness).

Does the 'up to 40%' figure still hold today?
Treat it as a bounded upper estimate, not a guarantee. It is a per-method, per-domain maximum measured against 2023–24 engines; on a live engine (Perplexity.ai) the lift was up to ~22%, and later work (C-SEO Bench) finds many such rewrites fail under competition.

What does the paper say about GEO versus SEO?
Keyword Stuffing, the classic SEO reflex, did not improve visibility and could reduce it, while content-substance rewrites (cite sources, add statistics) did. This is early evidence that GEO is not traditional SEO tactics relabelled.

Is the code and benchmark available?
Yes. Code is at github.com/GEO-optim/GEO and GEO-bench is on HuggingFace (GEO-optim/geo-bench), Apache-2.0, so the headline experiments are independently reproducible.

Sources

Primary

  1. GEO: Generative Engine Optimization (Aggarwal et al., KDD 2024) · arXiv / KDD '24 · 2024-08-25
  2. GEO: Generative Engine Optimization (KDD '24 Proceedings) · ACM SIGKDD · 2024-08-25
  3. GEO — official code & experiments repository · GEO-optim
  4. GEO-bench dataset (HuggingFace) · HuggingFace
  5. GEO project page · GEO-optim
  6. Retrieval-Augmented Generation for LLMs: A Survey (Gao et al. 2023) · arXiv · 2023-12-18

Secondary

  1. C-SEO Bench: Does Conversational SEO Work? (Puerto et al. 2025) · arXiv / NeurIPS '25 D&B
  2. Evaluating Verifiability in Generative Search Engines (Liu et al. 2023) · arXiv / EMNLP '23 Findings

Last updated: 2026-05-17 · Authors: Ray Yang · Topic: Ecosystem