robots.txt

Quick facts

What it is: RFC 9309 — a plain-text file at the host root declaring which paths a crawler is asked not to fetch. A voluntary request, not access authorization
The load-bearing distinction: Declared control ≠ enforcement. The file expresses intent; the network layer is what actually blocks anything
The AI-bot access pattern: Per category, in named user-agent groups, never one blanket `*` block. Each named bot consumes only its own group and never inherits a `*` Disallow, so a `*` default plus per-bot Allow silently fails
What it does not do: Grant access authorization, stop spoofed or non-compliant crawlers, or retract content already absorbed into model weights
Where it sits: The declared-control gate — upstream of crawler-layer enforcement, far upstream of citability. Necessary, not sufficient

1. What robots.txt is

robots.txt is a plain-text file at the host root (/robots.txt) declaring which paths a crawler is asked not to fetch. The Robots Exclusion Protocol behind it is older than the modern web: Martijn Koster proposed it in 1994, and it ran as a de facto convention for roughly 28 years before the IETF standardized it as RFC 9309 in September 2022.

Definition (GEO Wiki working definition): robots.txt is a UTF-8 plain-text file at /robots.txt that addresses crawlers by user-agent and asks them — non-bindingly — to skip certain paths. It is the standardized expression of crawler-access intent, not access control.

RFC 9309 states its own limits before this entry says anything about AI: the rules “are not a form of access authorization” (RFC 9309 §1), and the protocol “is not a substitute for valid content security measures” (RFC 9309 §3, Security Considerations). The gap between declared intent and actual enforcement is the thread §5 picks up.

What changed for GEO is not the file. It is the bot population the file addresses. Before AI crawlers, the policy decision was effectively “block Googlebot, yes or no” — one decision, one bot. After AI crawlers, the same file addresses roughly 30 declared agents split into three categories with opposite access consequences (training, retrieval, user-triggered; the categorical model lives in AI Crawlers §2). Same protocol, new audience, new mistakes.

2. How the protocol works

A robots.txt file is composed of groups — sets of rules bound to one or more user-agents — served at the host root, with a small fixed grammar for path matching. The mechanic is short enough to state in one section, and the rest of this page assumes it.

The file. UTF-8 plain text, host-scoped, served at the literal path /robots.txt with a 2xx response. A 4xx response is treated as “no rules declared” — the crawler is free to fetch any path (RFC 9309 §2.3.1.3). A 5xx is conservative: crawlers retry or treat the site as fully disallowed during the outage, depending on operator policy.
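The location and status rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any crawler's actual implementation: `robots_url_for`, `policy_for_status`, and the `CrawlPolicy` names are hypothetical, and the 5xx branch picks the conservative "fully disallowed" reading, where a real crawler might retry instead.

```python
from enum import Enum
from urllib.parse import urlsplit, urlunsplit


class CrawlPolicy(Enum):
    PARSE_RULES = "parse the file and apply its rules"
    ALLOW_ALL = "no rules declared, any path may be fetched"
    DISALLOW_ALL = "treat the host as fully disallowed for now"


def robots_url_for(page_url: str) -> str:
    """Return the one location a crawler checks for this page's host."""
    parts = urlsplit(page_url)
    # Scheme, host, and port are kept; path, query, and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def policy_for_status(status: int) -> CrawlPolicy:
    """Map the HTTP status of the /robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return CrawlPolicy.PARSE_RULES
    if 400 <= status < 500:
        return CrawlPolicy.ALLOW_ALL
    if 500 <= status < 600:
        # Assumption: the conservative choice; some crawlers retry instead.
        return CrawlPolicy.DISALLOW_ALL
    # Redirects and other codes must be resolved before classification.
    raise ValueError(f"status {status} needs handling before classification")
```

Because the URL is rebuilt from scheme and host, `https://blog.example.com/a/b` and `https://example.com/a/b` resolve to two different files, which is what host-scoped means in practice.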

Records. A group consists of one or more User-agent: lines followed by one or more Allow: and Disallow: rules. Blank lines separate groups. Comments start with #.

Group selection is the load-bearing rule writers most often miss: a crawler follows only the group that names its product token, and falls back to the * group only when no named group matches (RFC 9309 §2.2.1). If several groups name the same token, their rules are combined into one; rules from non-matching groups do not merge into it. An * group never applies to a bot that has a named group elsewhere in the file.
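A minimal Python sketch of group selection may make the rule concrete. The function names `parse_groups` and `rules_for` are hypothetical, the parser handles only `User-agent`, `Allow`, and `Disallow` lines, and product tokens are compared as exact case-insensitive strings, which is a simplification of how production crawlers match.

```python
def parse_groups(text: str) -> list[tuple[list[str], list[tuple[str, str]]]]:
    """Split a robots.txt body into (agents, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(groups, product_token: str) -> list[tuple[str, str]]:
    """Return the rules one crawler follows: its named group(s), else '*'."""
    token = product_token.lower()
    if any(token in agents for agents, _ in groups):
        # Groups naming the same token are combined; '*' is ignored entirely.
        return [rule for agents, rs in groups if token in agents for rule in rs]
    return [rule for agents, rs in groups if "*" in agents for rule in rs]
```

The point of the sketch is the early return: once a named group exists for the token, the `*` rules are never consulted.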

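Path matching is longest-match-wins: among the rules that match a URL, the one with the longest path is applied (RFC 9309 §2.2.2). When Allow and Disallow rules of equal path length disagree, Google’s documented tiebreak is “the least restrictive rule wins”, i.e. Allow outranks Disallow on equal-length matches (How Google Interprets the robots.txt Specification). RFC 9309 §2.2.2 points the same way: when an Allow and a Disallow rule are equivalent, the Allow rule SHOULD be used. That is a SHOULD rather than a MUST, so a crawler may deviate, but most major crawlers follow the convention.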

Wildcards. * matches zero or more characters; $ anchors to the end of the URL. Both are supported by Google, Bing, OpenAI, Anthropic, and Perplexity, evaluated per rule rather than as one combined regex.
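The matching rules above (longest match wins, Allow wins a tie, `*` and `$` evaluated per rule) can be sketched as follows. This is a simplified illustration: `pattern_to_regex` and `is_allowed` are hypothetical names, rule length is measured in characters of the pattern, and percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # '*' matches zero or more characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match-wins; Allow outranks Disallow on equal length."""
    best_len, verdict = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, verdict = len(pattern), allow
    return verdict
```

With the rules `Disallow: /` and `Allow: /blog/`, a request for `/blog/post` matches both, and the longer `/blog/` pattern decides the outcome.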

Sitemap directive. Sitemap: https://... is group-independent: it applies file-wide and is not bound to any user-agent group. Sitemap mechanics belong in Sitemap & IndexNow.

The worked precedence example:

User-agent: GPTBot
Disallow: /
Allow: /blog/

User-agent: *
Disallow: /private/

Reading: GPTBot picks the GPTBot group, sees Disallow: / plus Allow: /blog/, and concludes “fetch nothing except /blog/.” Every other bot picks the * group and only loses /private/. The * rule never reaches GPTBot, even though it looks like a sensible default — GPTBot has its own group and consumes only that group. The same logic powers §3’s recipes and §6’s headline anti-pattern.

3. The AI-bot picture, expressed in robots.txt syntax

The categorical model (training / retrieval / user-triggered) translates into robots.txt as named user-agent groups, one cluster per category. The categorical model itself — why the splits matter — sits in AI Crawlers §2; the recipes below are how you write the file once that policy is decided.

Opt out of training only — keep AI citation, drop future parametric memory:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
Disallow: /

Retrieval and user-triggered fetchers stay untouched, so live ChatGPT, Claude, Perplexity, and Google AI Overviews answers can still cite you.

Opt out of retrieval — accept that you will not appear in AI answers now:

User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: Claude-SearchBot
User-agent: Bingbot
Disallow: /

One caveat the protocol cannot express: blocking Googlebot removes you from Google Search and AI Overviews together — Google does not publish a separate “Search only, no AI Overviews” token. Google-Extended governs training use of fetched content, not crawling, and is therefore in the training-only block above, not this one.
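The two opt-out recipes above can also be generated from a category policy rather than hand-edited. The sketch below is one hedged way to do that: `POLICY_TOKENS`, `render_group`, and `render_robots` are hypothetical names, and the token lists are copied from the recipes in this section, so they need the same periodic review as the recipes themselves.

```python
# Token lists copied from the recipes above; review them as vendors change.
POLICY_TOKENS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended",
                 "Applebot-Extended", "CCBot"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot",
                  "Claude-SearchBot", "Bingbot"],
}


def render_group(tokens: list[str], rules: list[tuple[str, str]]) -> str:
    """Render one group: stacked User-agent lines, then the shared rules."""
    lines = [f"User-agent: {token}" for token in tokens]
    lines += [f"{field}: {path}" for field, path in rules]
    return "\n".join(lines)


def render_robots(blocked_categories: list[str]) -> str:
    """Emit one Disallow-all group per blocked category."""
    groups = [
        render_group(POLICY_TOKENS[category], [("Disallow", "/")])
        for category in blocked_categories
    ]
    return "\n\n".join(groups) + "\n"
```

Calling `render_robots(["training"])` reproduces the training-only recipe; categories left out of the list get no group at all and therefore stay unrestricted.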

Allow everything, named explicitly:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

This is identical in effect to an empty robots.txt, but stating it explicitly serves two purposes: it documents intent for internal review and audit tools, and it survives the unfortunate case where a downstream config layer prepends a blanket Disallow: / for *.

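The merge gotcha, stated plainly. Because each named bot picks only its own group, the most common writer’s bug is a blanket Disallow: / under * combined with per-bot Allow: lines in named groups. It silently fails: the named bot never sees the * group’s Disallow: /, so a partial Allow: with no Disallow: of its own leaves the whole site open to that bot. See §6 for the canonical broken pattern.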
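A small evaluator shows the gotcha in action. This is a deliberately reduced sketch, not a full RFC 9309 parser: `can_fetch` is a hypothetical helper, it matches plain path prefixes only (no wildcards), and the `BROKEN` and `FIXED` files are illustrative.

```python
def can_fetch(robots_text: str, product_token: str, path: str) -> bool:
    """Evaluate one path for one crawler; prefix matching only."""
    groups, agents, rules = [], [], []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))

    token = product_token.lower()
    # A named group, if present, is used alone; '*' is only the fallback.
    key = token if any(token in a for a, _ in groups) else "*"
    active = [rule for a, rs in groups if key in a for rule in rs]

    best = (-1, True)  # (matched length, allowed); the default is allowed
    for kind, prefix in active:
        if prefix and path.startswith(prefix):
            # Longer prefixes win; on equal length, allow (True) sorts higher.
            best = max(best, (len(prefix), kind == "allow"))
    return best[1]


BROKEN = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /blog/\n"
FIXED = ("User-agent: *\nDisallow: /\n\n"
         "User-agent: GPTBot\nDisallow: /\nAllow: /blog/\n")
```

Under `BROKEN`, `can_fetch(BROKEN, "GPTBot", "/pricing")` is true because the GPTBot group contains no Disallow at all. Under `FIXED`, the restriction is restated inside the named group and `/pricing` is refused while `/blog/` stays open.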

The product-token list each vendor publishes — and the IP ranges and verification recipes — lives in the per-bot entries: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, OAI-SearchBot, ChatGPT-User. The fresh-as-of-now, audit-grade version of the recipe — what to grep in logs, how to verify what actually reached you — is the AI Crawler Access Audit playbook’s job.

4. What each major operator documents about robots.txt compliance

Documented compliance posture varies cleanly by operator. The table summarizes what each major AI-crawler operator publishes about how its bots handle robots.txt — and the carve-out each one makes for user-initiated fetches.

Operator	Tokens addressed	Documented robots.txt posture	Caveat
OpenAI	GPTBot · OAI-SearchBot · ChatGPT-User	GPTBot and OAI-SearchBot honor robots.txt as documented. As of late 2025, OpenAI’s bots docs state “because these actions are initiated by a user, robots.txt rules may not apply” for ChatGPT-User	The user-triggered carve-out was reframed explicitly in the December 2025 docs revision (OpenAI bots)
Anthropic	ClaudeBot · Claude-SearchBot · Claude-User	All three documented as honoring robots.txt. The 2025 docs introduced a granular three-bot framework with per-bot disallow recipes	Anthropic explicitly notes honoring the non-standard `Crawl-delay:` extension alongside standard `Disallow:` rules (Anthropic crawler docs)
Perplexity	PerplexityBot · Perplexity-User	PerplexityBot respects robots.txt — disallow it and page text is not indexed. Perplexity-User is documented as user-initiated, so robots.txt restrictions “generally do not apply”	The user-triggered carve-out is the operator’s own framing, in place since 2024 (Perplexity Crawlers; How Perplexity follows robots.txt)
Google	Googlebot · Google-Extended	Googlebot honors robots.txt. Google-Extended is a training-use opt-out token only — it governs whether fetched content is used to train Google’s generative AI models, not whether Googlebot crawls	Blocking Googlebot removes you from Search and AI Overviews together — there is no separate AIO opt-out (Google’s common crawlers)
Apple	Applebot · Applebot-Extended	Both honor robots.txt. Applebot-Extended is a training-use opt-out for Apple Intelligence and other generative AI; primary Applebot continues serving Spotlight, Siri, and Safari	Disallowing Applebot-Extended while leaving Applebot allowed preserves Spotlight/Siri visibility (About Applebot)
Microsoft (Bing)	Bingbot	Honors robots.txt; feeds both Bing Search and Bing Copilot	No separate AI-training opt-out token published as of 2026-05, unlike Google or Apple (Which crawlers does Bing use?)
Common Crawl	CCBot	Honors robots.txt and the non-standard `Crawl-delay:` extension	Common Crawl data feeds many third-party AI training corpora indirectly; an opt-out here propagates downstream to anyone training on Common Crawl dumps (CCBot)

The user-triggered footnote, recurring as a documented operator pattern rather than a bug. ChatGPT-User, Claude-User, and Perplexity-User are each described by their operators as user-initiated and generally not subject to standard robots.txt restrictions. The reasoning each operator offers is identical: a fetch driven by one human asking about one URL is a proxy for human browsing, not autonomous crawling. OpenAI surfaced the framing most explicitly in the late-2025 docs revision; Perplexity’s help page has carried the same framing since 2024. It is the cleanest counter-example to “robots.txt covers all AI access.”

The per-bot product-token strings, the IP-range JSON endpoints, and the exact disallow recipes belong to the spoke entries linked in the table. The kept-fresh, audit-grade version — what to grep in logs, how to verify what reached you — belongs to AI Crawler Access Audit.

5. Declared control vs enforcement

RFC 9309 does not present itself as access control. The standard’s own preamble names the gap that the rest of this section unpacks: declared intent is not enforcement.

RFC 9309’s own disclaimer. §1 of the standard states that the rules “are not a form of access authorization.” The Security Considerations section (§3) adds that the protocol “is not a substitute for valid content security measures” and points operators who need real access control toward an actual security measure, such as HTTP authentication. Both statements are in the published standard (RFC 9309): editorial commentary did not add the disclaimer; the standards body did.

User-agent is a self-asserted claim, not an identity. “I see GPTBot in my logs” is not “OpenAI fetched this.” A spoofed UA bypasses any rule keyed on it; a non-compliant crawler ignores the file entirely. Real identity is established by published IP ranges plus forward-confirmed reverse DNS — every major AI-crawler operator publishes IP lists for exactly this purpose. The verification mechanic itself belongs to the AI Crawler Access Audit playbook.
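Forward-confirmed reverse DNS, the check named above, can be sketched with the standard library. Treat this as an outline under assumptions: `forward_confirmed_rdns` is a hypothetical helper, the allowed hostname suffixes must come from the operator's own documentation, addresses are compared as plain strings (so differently formatted IPv6 literals would not match), and the resolver functions are injectable so the logic can be exercised without network access.

```python
import socket


def _reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> set[str]:
    """A/AAAA lookup: hostname to the set of its addresses."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def forward_confirmed_rdns(ip, allowed_suffixes, reverse=None, forward=None) -> bool:
    """True only if ip -> hostname -> ip round-trips inside an allowed domain."""
    reverse = reverse or _reverse
    forward = forward or _forward
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record: identity cannot be confirmed
    in_domain = any(
        hostname == suffix or hostname.endswith("." + suffix)
        for suffix in allowed_suffixes
    )
    if not in_domain:
        return False  # the hostname belongs to someone else
    try:
        return ip in forward(hostname)  # the forward lookup must agree
    except OSError:
        return False
```

The suffix test requires a dot boundary, so a hostname such as `googlebot.com.evil.example` does not pass as `googlebot.com`. The forward lookup is what stops an attacker who controls only their own PTR record.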

The observed compliance record. Documented behavior and observed behavior have an honest gap:

What holds	The bounded reading
Major first-party AI bots publish that they honor robots.txt as documented	Honoring is a stated policy, not a technical guarantee — a vendor can change its docs (and has — see §4’s ChatGPT-User note)
The protocol expresses your intent unambiguously to any compliant agent	Intent ≠ enforcement; a non-compliant or spoofed agent ignores the file regardless of how cleanly it is written
Independent measurement of the expressed policy exists	Cloudflare’s 2025 survey found only ~14% of sampled domains’ robots.txt files target AI bots at all (Cloudflare, 2025-07-01) — most sites have no AI policy expressed in the first place
Documented compliance covers first-party operators	Independent reporting documents non-compliance at scale: in 2024 multiple news publishers reported a major answer engine fetching content from sites that had disallowed its crawler (TechCrunch, 2024-07-02); in 2025 Cloudflare reported an undeclared crawler impersonating a normal browser and rotating IPs to evade no-crawl directives across tens of thousands of domains (Cloudflare, 2025-08-04)
Enforcement is possible	But it is a network-layer problem — WAF rules, verified-bot allowlists, request-signing schemes — not a robots.txt one

The bot-specific account of the Perplexity incidents lives in PerplexityBot; the principle stated here is what generalizes. The broader access-decision account, to which this protocol-level section is the counterpart, lives in AI Crawlers §5.

The position, plainly. Write the policy in robots.txt because compliant crawlers obey it — that is a real, useful effect for the documented first-party AI bots. But verifying that the policy is holding is the network layer’s job. Do not conflate the two layers; doing so is how a site quietly bleeds traffic to a crawler that ignored the file while believing it is protected.
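At the network layer, the first verification step is usually to compare a request's claimed user-agent against the operator's published IP ranges. The sketch below shows only that comparison: `classify_request` is a hypothetical helper, and the ranges in the usage note are documentation-reserved placeholder addresses, not any operator's real list.

```python
import ipaddress


def classify_request(user_agent: str, remote_ip: str,
                     published_ranges: dict[str, list[str]]) -> str:
    """Return 'verified', 'spoofed', or 'unclaimed' for one request."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in published_ranges.items():
        if token.lower() in ua:
            # The UA claims this bot; the source address must back the claim.
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            return "verified" if in_range else "spoofed"
    return "unclaimed"  # no known bot token in the user-agent
```

For example, with `{"GPTBot": ["192.0.2.0/24"]}` as a placeholder range, a request claiming GPTBot from `198.51.100.7` is classified as spoofed. What to do with that verdict (block, challenge, log) is a WAF policy decision outside the file.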

6. Anti-patterns

Most robots.txt mistakes for AI bots fall into four pattern classes: precedence and group-merge errors, mechanism confusion (treating the file as enforcement, or as a meta-tag, or as rate limiting), location errors, and freshness errors. Each row below names a pattern that looks right and fails.

Anti-pattern	Why it looks right	Why it actually fails
`User-agent: *` `Disallow: /` followed by per-bot `Allow:` in named groups	“I’m setting a default then allowing the bots I trust”	Each named bot picks only its own group (RFC 9309 §2.2.1). Rules from the `*` group never reach a bot that has a named group, so nothing merges: `Allow: /blog/` in the GPTBot group is not layered on top of `Disallow: /` in the `*` group, because the `*` group is invisible to GPTBot. With no `Disallow:` of its own, the GPTBot group leaves the whole site open to GPTBot, not only `/blog/`
Blocking `GPTBot` “to stay out of ChatGPT”	“GPTBot is OpenAI’s bot”	GPTBot is training-only. Live ChatGPT search answers are served by `OAI-SearchBot` (indexing) and `ChatGPT-User` (user-triggered). This is a wrong-bot error; the categorical asymmetry is laid out in AI Crawlers §3
Treating robots.txt as enforcement against bad actors	“I disallowed it, so it can’t fetch”	The protocol is voluntary; RFC 9309 §1 says the rules “are not a form of access authorization.” Spoofed or non-compliant agents ignore the file entirely. Enforcement is a WAF / verified-bot-allowlist problem (AI Crawler Access Audit)
`Crawl-delay:` to rate-limit AI bots	“Rate limiting is a robots.txt directive”	`Crawl-delay:` is not in RFC 9309. Honoring is non-uniform: Common Crawl’s CCBot and Anthropic’s Claude bots document honoring it; Google documents that Googlebot ignores it. Network-layer rate limits are the reliable control across the AI-crawler set
`Noindex:` directives in robots.txt	“It controls indexing of my pages”	`Noindex` was never a robots.txt directive in any standard. Google retired support for the unofficial `Noindex:` line on 2019-09-01 (A note on unsupported rules in robots.txt). The correct mechanism is `<meta name="robots" content="noindex">` or the `X-Robots-Tag` HTTP header
robots.txt at `/foo/robots.txt`, or a subdomain left without a file of its own	“The file exists, just in a convenient place”	RFC 9309 §2.3 requires the file at the literal `/robots.txt` path of each host. Subdomains and different schemes (http vs https) each need their own. A file at a non-root path is ignored
A static AI-bot allowlist, never revisited	“Only the bots I trust get through”	The AI-crawler population grew from ~3 named bots in 2023 to ~30 in 2026. A static allowlist silently excludes every newly shipped bot by default, including retrieval bots whose blocking costs you citation immediately
Disallowing the training crawler after the page already shipped	“Now my content is out of the training set”	robots.txt is forward-looking. It cannot retract content already absorbed into a prior training corpus. Whatever retraction mechanism exists is vendor-by-vendor policy (opt-out request forms, takedown channels), not the robots.txt file

The load-bearing line. Most “robots.txt for AI” mistakes are category or precedence errors, not protocol-syntax errors. The syntax is forgiving; the consequences are not.
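Several rows in the table above are mechanical enough to lint for. The following is a rough sketch of such a check, not an existing tool: `lint_robots` is a hypothetical function, it flags only three patterns (`Crawl-delay:`, `Noindex:`, and a partial `Allow:` in a named group sitting next to a blanket `*` Disallow), and its messages are illustrative.

```python
def lint_robots(text: str) -> list[str]:
    """Flag a few anti-patterns that parse cleanly but mislead the writer."""
    findings, groups, current = [], [], None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value.lower())
        elif field in ("allow", "disallow"):
            if current is not None:
                current["rules"].append((field, value))
        elif field == "crawl-delay":
            findings.append(f"line {number}: Crawl-delay is not in RFC 9309")
        elif field == "noindex":
            findings.append(f"line {number}: Noindex is not a robots.txt rule")

    blanket = any("*" in g["agents"] and ("disallow", "/") in g["rules"]
                  for g in groups)
    for g in groups:
        if not blanket or "*" in g["agents"]:
            continue
        partial_allow = any(f == "allow" and v != "/" for f, v in g["rules"])
        has_disallow = any(f == "disallow" for f, _ in g["rules"])
        if partial_allow and not has_disallow:
            # The '*' Disallow is not inherited, so this group is fully open.
            names = ", ".join(g["agents"])
            findings.append(f"group {names}: partial Allow without its own Disallow")
    return findings
```

A lint like this catches syntax-level symptoms only. The category errors in the table (blocking the wrong bot, a stale allowlist) still need a policy review.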

7. Three root files, three jobs

The single most common category error in AI publishing infrastructure is conflating three root-level files — robots.txt, sitemap.xml, llms.txt. They have non-overlapping jobs.

File	What it controls	What it does not do
`robots.txt`	Declared crawl access: is a bot asked not to fetch this path? Standardized as RFC 9309	Does not curate, render, rank, or enforce. A request, not authorization
`sitemap.xml`	Discovery / completeness — here is everything, for indexing (Sitemap & IndexNow)	Does not curate, grant access, or signal quality. Not a “best of” list
`llms.txt`	Curation / clean rendering — read these pages first, in clean markdown (llms.txt)	Does not grant or deny access; not a ranking signal; not a discovery file

Three files, three jobs, none substitutes for another. robots.txt is no more “AI access control” than it was “Google access control” — it is the same file doing the same job, now addressed to a wider bot population with a categorical split. The change is in the audience, not in the file’s role.

8. Why this matters for GEO + how to act

Being reachable is necessary, not sufficient: robots.txt opens the gate to crawler access, and access in turn opens the gate to citation. Skip the file and most compliant AI bots assume nothing is restricted; write it wrong and the cost can be silent disappearance from AI answers.

Your intent	First stop
Decide the per-category access policy (training vs retrieval vs user-triggered)	AI Crawlers §2 — the categorical model
Write the actual robots.txt directives	This entry §3
Get the per-bot UA strings, IP ranges, and verification recipes	GPTBot · ClaudeBot · PerplexityBot · Google-Extended · Applebot-Extended · OAI-SearchBot · ChatGPT-User
Verify what actually reached your site	AI Crawler Access Audit
Curate which pages an LLM should read first	llms.txt
Make sure the page is liftable once fetched	Citability
The method that ties this all together	Generative Engine Optimization

Write the policy in robots.txt because compliant crawlers obey it; enforce it on the network layer because the rest do not. And remember: access is a gate, not a prize. Being reachable makes you a candidate — being citable is what wins the answer slot.

References

Standards & specifications:

IETF — RFC 9309: Robots Exclusion Protocol
Google Search Central — How Google Interprets the robots.txt Specification
Google Search Central — A note on unsupported rules in robots.txt (the 2019 Noindex: retirement)

Vendor crawler documentation:

OpenAI — Overview of OpenAI Crawlers · Publishers and Developers FAQ
Anthropic — Does Anthropic crawl data from the web, and how can site owners block the crawler?
Perplexity — Perplexity Crawlers · How does Perplexity follow robots.txt?
Google Search Central — Google’s common crawlers (including Google-Extended)
Microsoft Bing — Which crawlers does Bing use?
Apple — About Applebot
Common Crawl — CCBot

Independent measurement & reporting:

Cloudflare — From Googlebot to GPTBot: who’s crawling your site in 2025
Cloudflare — Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
TechCrunch — News outlets are accusing Perplexity of plagiarism and unethical web scraping
Search Engine Land — Anthropic clarifies how Claude bots crawl sites and how to block them

Frequently asked questions

Does robots.txt actually stop AI crawlers?

robots.txt is a voluntary request, not enforcement — RFC 9309 itself states the rules 'are not a form of access authorization.' Compliant first-party AI bots (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, CCBot) honor robots.txt as documented; non-compliant or spoofed agents do not. Enforcement against the second group is a network-layer problem (WAF rules, verified-bot allowlists), not a robots.txt one.

Will blocking GPTBot remove me from ChatGPT search?

No — that is the headline wrong-bot mistake. GPTBot is training-only. Live ChatGPT search answers are served by OAI-SearchBot (indexing) and ChatGPT-User (user-triggered fetch). Blocking GPTBot keeps your content out of future model weights while leaving ChatGPT search citation intact. The categorical model — what each bot does and why blocking each one has opposite consequences — is owned by [AI Crawlers](/ai-crawlers); this entry is where you write the disallow lines.

Why is `User-agent: *` plus a per-bot `Allow:` not granting anything?

Because RFC 9309 §2.2.1 says each crawler picks the *single most specific group* matching its product token. Rules from non-matching groups do not merge. So a `User-agent: *` `Disallow: /` followed by a `User-agent: GPTBot` `Allow: /blog/` group means GPTBot only sees the GPTBot group. The `*` group's `Disallow: /` never reaches it, so GPTBot is left free to fetch the whole site, not only `/blog/`. To restrict GPTBot to `/blog/`, the GPTBot group needs its own `Disallow: /` next to the `Allow: /blog/`, as in §2's worked example. This is the headline writer's bug, and it is silent: the file looks correct.

Is `Crawl-delay:` a valid robots.txt directive?

No — `Crawl-delay:` is explicitly *not* in RFC 9309. Honoring is vendor-specific and non-uniform: Common Crawl's CCBot and Anthropic's Claude bots document honoring it; Google publicly states Googlebot ignores it. Because compliance varies across the AI-crawler set, network-layer rate limiting is the control you can actually rely on, not Crawl-delay.

Can robots.txt remove my content from a model that already trained on it?

No. robots.txt is forward-looking — it tells future crawler visits to skip your paths, but it cannot retract content already fetched into a prior training corpus. If a page shipped before you disallowed the training crawler, the only retraction mechanism is whatever the operator publishes (e.g. opt-out request forms), and that is vendor-by-vendor policy, not the robots.txt file's job.

Sources

Primary

RFC 9309: Robots Exclusion Protocol · IETF · 2022-09-01
How Google Interprets the robots.txt Specification · Google Search Central
A note on unsupported rules in robots.txt · Google Search Central · 2019-07-02
Overview of OpenAI Crawlers (GPTBot / OAI-SearchBot / ChatGPT-User) · OpenAI
Publishers and Developers FAQ — OpenAI Help Center · OpenAI
Does Anthropic crawl data from the web, and how can site owners block the crawler? · Anthropic
Perplexity Crawlers (PerplexityBot / Perplexity-User) · Perplexity AI
How does Perplexity follow robots.txt? · Perplexity AI
Google's common crawlers (Google-Extended) · Google Search Central
Which crawlers does Bing use? · Microsoft Bing
About Applebot · Apple
Common Crawl — CCBot · Common Crawl

Secondary

From Googlebot to GPTBot: who's crawling your site in 2025 · Cloudflare
Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives · Cloudflare
News outlets are accusing Perplexity of plagiarism and unethical web scraping · TechCrunch
Anthropic clarifies how Claude bots crawl sites and how to block them · Search Engine Land