robots.txt
Quick facts
- What it is: a plain-text file at the host root, standardized as RFC 9309, declaring which paths a crawler is asked not to fetch. A voluntary request, not access authorization.
- The load-bearing distinction: declared control ≠ enforcement. The file expresses intent; the network layer is what actually blocks anything.
- The AI-bot access pattern: per category, in named user-agent groups, never one blanket `*` block. Each named bot consumes only its own group, so a `*` block and a per-bot Allow never merge, and the combination silently fails.
- What it does not do: grant access authorization, stop spoofed or non-compliant crawlers, or retract content already absorbed into model weights.
- Where it sits: the declared-control gate, upstream of crawler-layer enforcement and far upstream of citability. Necessary, not sufficient.
1. What robots.txt is
robots.txt is a plain-text file at the host root (/robots.txt) declaring which paths a crawler is asked not to fetch. The Robots Exclusion Protocol behind it is older than the modern web: Martijn Koster proposed it in 1994, and it ran as a de-facto convention for about 28 years before the IETF standardized it as RFC 9309 in September 2022.
Definition (GEO Wiki working definition): robots.txt is a UTF-8 plain-text file at /robots.txt that addresses crawlers by user-agent and asks them, non-bindingly, to skip certain paths. It is the standardized expression of crawler-access intent, not access control.
RFC 9309 makes the key clarification itself, before this entry says anything about AI: the rules “are not a form of access authorization” (RFC 9309 §1), and the protocol “is not a substitute for valid content security measures” (RFC 9309 §3, Security Considerations). The gap between declared intent and actual enforcement is the thread §5 picks up.
What changed for GEO is not the file. It is the bot population the file addresses. Before AI crawlers, the policy decision was effectively “block Googlebot, yes or no” — one decision, one bot. After AI crawlers, the same file addresses roughly 30 declared agents split into three categories with opposite access consequences (training, retrieval, user-triggered; the categorical model lives in AI Crawlers §2). Same protocol, new audience, new mistakes.
2. How the protocol works
A robots.txt file is composed of groups — sets of rules bound to one or more user-agents — served at the host root, with a small fixed grammar for path matching. The mechanic is short enough to state in one section, and the rest of this page assumes it.
The file. UTF-8 plain text, host-scoped, served at the literal path /robots.txt with a 2xx response. A 4xx response is treated as “no rules declared”: the crawler is free to fetch any path (RFC 9309 §2.3.1.3). A 5xx response is handled conservatively: RFC 9309 §2.3.1.4 tells crawlers to assume complete disallow while the file is unreachable, and operators differ on how long they retry or rely on a cached copy before relaxing that.
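As a rough illustration of the fetch rules above, the sketch below derives the robots.txt URL that governs a page and maps the HTTP status of that fetch to a crawl policy. The function names and policy labels are hypothetical, and treating every 5xx as a full disallow is an assumption that takes the conservative option.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the scheme, host, and port of page_url."""
    parts = urlsplit(page_url)
    # The file is host-scoped: keep scheme and netloc, replace everything else.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def policy_for_status(status: int) -> str:
    """Map the robots.txt fetch status to a coarse crawl policy (labels are illustrative)."""
    if 200 <= status < 300:
        return "parse"            # read the rules from the body
    if 300 <= status < 400:
        return "follow-redirect"  # the rules live wherever the redirect chain ends
    if 400 <= status < 500:
        return "allow-all"        # no rules declared
    if 500 <= status < 600:
        return "disallow-all"     # assumption: conservative handling of server errors
    raise ValueError(f"unexpected HTTP status: {status}")


print(robots_url("https://example.com/blog/post?x=1"))
print(policy_for_status(404))
```

Note that `http://` and `https://`, and each subdomain, produce different robots.txt URLs, which is why each needs its own file.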
Records. A group consists of one or more User-agent: lines followed by one or more Allow: and Disallow: rules. Blank lines separate groups. Comments start with #.
Group selection is the load-bearing rule writers most often miss: a crawler picks the single most specific group matching its product token (RFC 9309 §2.2.1). Rules from non-matching groups do not merge into it. An * group never applies to a bot that has a named group elsewhere in the file.
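A minimal sketch of group selection, assuming exact case-insensitive matching on the product token (real crawlers may match more loosely). `parse_groups` and `select_rules` are illustrative names, not part of any library, and the sample file is made up for the demonstration.

```python
def parse_groups(text):
    """Parse robots.txt text into a list of (user_agents, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def select_rules(groups, product_token):
    """Return the rules for a crawler: its named group(s), else the * group."""
    token = product_token.lower()
    matching = [rules for agents, rules in groups if token in agents]
    if not matching:  # only fall back to * when no named group exists
        matching = [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in matching for rule in rules]


ROBOTS = """\
User-agent: GPTBot
Disallow: /
Allow: /blog/

User-agent: *
Disallow: /private/
"""

groups = parse_groups(ROBOTS)
print(select_rules(groups, "GPTBot"))        # [('disallow', '/'), ('allow', '/blog/')]
print(select_rules(groups, "SomeOtherBot"))  # [('disallow', '/private/')]
```

The `*` rules appear only in the second result: once a named group matches, the fallback is never consulted.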
Path matching is longest-match-wins: among the rules that match a URL path, the one with the longest path decides. When Allow and Disallow rules of equal path length disagree, Google’s documented tiebreak is “the least restrictive rule wins”, i.e. Allow outranks Disallow on equal-length matches (How Google Interprets the robots.txt Specification). RFC 9309 §2.2.2 points the same way, saying the Allow rule SHOULD be used when equivalent Allow and Disallow rules match; most major crawlers follow that convention.
Wildcards. * matches zero or more characters; $ anchors to the end of the URL. Both are supported by Google, Bing, OpenAI, Anthropic, and Perplexity, evaluated per rule rather than as one combined regex.
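The matching rules in the two paragraphs above can be sketched as follows. This is a simplified illustration, not any crawler's actual matcher: it measures specificity by pattern length in characters, resolves ties in favour of Allow, and treats an empty pattern as matching nothing. `pattern_to_regex` and `is_allowed` are hypothetical names.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches zero or more characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """Apply longest-match-wins to (kind, pattern) rules; Allow wins ties."""
    best_len, verdict = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow/Allow value matches nothing
            continue
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, verdict = len(pattern), allow
    return verdict


rules = [("disallow", "/"), ("allow", "/blog/"), ("disallow", "/*.pdf$")]
print(is_allowed(rules, "/blog/post"))       # True: /blog/ is longer than /
print(is_allowed(rules, "/about"))           # False: only / matches
print(is_allowed(rules, "/blog/paper.pdf"))  # False: /*.pdf$ is the longest match
```

Each rule is compiled and tested on its own, which mirrors the per-rule evaluation described above rather than one combined expression.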
Sitemap directive. Sitemap: https://... is group-independent: it applies file-wide and is not bound to any user-agent group. Sitemap mechanics belong in Sitemap & IndexNow.
The worked precedence example:

    User-agent: GPTBot
    Disallow: /
    Allow: /blog/

    User-agent: *
    Disallow: /private/

Reading: GPTBot picks the GPTBot group, sees Disallow: / plus Allow: /blog/, and concludes “fetch nothing except /blog/.” Every other bot picks the * group and only loses /private/. The * rule never reaches GPTBot, even though it looks like a sensible default — GPTBot has its own group and consumes only that group. The same logic powers §3’s recipes and §6’s headline anti-pattern.
3. The AI-bot picture, expressed in robots.txt syntax
The categorical model (training / retrieval / user-triggered) translates into robots.txt as named user-agent groups, one cluster per category. The categorical model itself — why the splits matter — sits in AI Crawlers §2; the recipes below are how you write the file once that policy is decided.
Opt out of training only — keep AI citation, drop future parametric memory:

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: Google-Extended
    User-agent: Applebot-Extended
    User-agent: CCBot
    Disallow: /

Retrieval and user-triggered fetchers stay untouched, so live ChatGPT, Claude, Perplexity, and Google AI Overviews answers can still cite you.
Opt out of retrieval — accept that you will not appear in AI answers now:

    User-agent: OAI-SearchBot
    User-agent: PerplexityBot
    User-agent: Claude-SearchBot
    User-agent: Bingbot
    Disallow: /

One caveat the protocol cannot express: blocking Googlebot removes you from Google Search and AI Overviews together, because Google does not publish a separate “Search only, no AI Overviews” token. The Bingbot line above carries the same trade-off, since Bingbot feeds both Bing Search and Bing Copilot. Google-Extended governs training use of fetched content, not crawling, and is therefore in the training-only block above, not this one.
Allow everything, named explicitly:

    User-agent: GPTBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

This is identical in effect to an empty robots.txt, but stating it explicitly serves two purposes: it documents intent for internal review and audit tools, and it survives the unfortunate case where a downstream config layer prepends a blanket Disallow: / for *.
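The recipes above share one shape: a cluster of User-agent lines followed by the rules for that cluster. A small generator, sketched here with hypothetical names (`render_group`, `render_robots`), can keep that shape consistent when the policy is maintained in code or config. The token list is the training-only cluster from this section; everything else is illustrative.

```python
def render_group(tokens, rules):
    """Render one group: all User-agent lines first, then the rules."""
    lines = [f"User-agent: {token}" for token in tokens]
    lines += [f"{directive}: {path}" for directive, path in rules]
    return "\n".join(lines)


def render_robots(groups):
    """Render groups separated by blank lines, ending with a newline."""
    return "\n\n".join(render_group(tokens, rules) for tokens, rules in groups) + "\n"


TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"]

# Training-only opt-out: one group, one rule, nothing declared for other bots.
policy = [(TRAINING, [("Disallow", "/")])]
print(render_robots(policy), end="")
```

Keeping each category in its own list makes it harder to drop a retrieval bot into the training cluster by accident.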
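The merge gotcha, stated plainly. Because each named bot picks only its own group, the most common writer’s bug is a blanket Disallow: / under * combined with per-bot Allow: lines in named groups, written as if the two would merge. They do not: a named group replaces the * group for that bot, so a group holding only Allow: /blog/ leaves the bot unrestricted everywhere rather than limited to /blog/. The file silently fails to express the intended policy; see §6 for the canonical broken pattern.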
The product-token list each vendor publishes — and the IP ranges and verification recipes — lives in the per-bot entries: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, OAI-SearchBot, ChatGPT-User. The fresh-as-of-now, audit-grade version of the recipe — what to grep in logs, how to verify what actually reached you — is the AI Crawler Access Audit playbook’s job.
4. What each major operator documents about robots.txt compliance
Documented compliance posture varies by operator. The table summarizes what each major AI-crawler operator publishes about how its bots handle robots.txt, along with any carve-out it makes for user-initiated fetches.
| Operator | Tokens addressed | Documented robots.txt posture | Caveat |
|---|---|---|---|
| OpenAI | GPTBot · OAI-SearchBot · ChatGPT-User | GPTBot and OAI-SearchBot honor robots.txt as documented. As of late 2025, OpenAI’s bots docs state “because these actions are initiated by a user, robots.txt rules may not apply” for ChatGPT-User | The user-triggered carve-out was reframed explicitly in the December 2025 docs revision (OpenAI bots) |
| Anthropic | ClaudeBot · Claude-SearchBot · Claude-User | All three documented as honoring robots.txt. The 2025 docs introduced a granular three-bot framework with per-bot disallow recipes | Anthropic explicitly notes honoring the non-standard Crawl-delay: extension alongside standard Disallow: rules (Anthropic crawler docs) |
| Perplexity | PerplexityBot · Perplexity-User | PerplexityBot respects robots.txt — disallow it and page text is not indexed. Perplexity-User is documented as user-initiated, so robots.txt restrictions “generally do not apply” | The user-triggered carve-out is the operator’s own framing, in place since 2024 (Perplexity Crawlers; How Perplexity follows robots.txt) |
| Google | Googlebot · Google-Extended | Googlebot honors robots.txt. Google-Extended is a training-use opt-out token only: it governs whether fetched content is used to train Google’s generative AI models, not whether Googlebot crawls | Blocking Googlebot removes you from Search and AI Overviews together; there is no separate AIO opt-out (Google’s common crawlers) |
| Apple | Applebot · Applebot-Extended | Both honor robots.txt. Applebot-Extended is a training-use opt-out for Apple Intelligence and other generative AI; primary Applebot continues serving Spotlight, Siri, and Safari | Disallowing Applebot-Extended while leaving Applebot allowed preserves Spotlight/Siri visibility (About Applebot) |
| Microsoft (Bing) | Bingbot | Honors robots.txt; feeds both Bing Search and Bing Copilot | No separate AI-training opt-out token published as of 2026-05, unlike Google or Apple (Which crawlers does Bing use?) |
| Common Crawl | CCBot | Honors robots.txt and the non-standard Crawl-delay: extension | Common Crawl data feeds many third-party AI training corpora indirectly; an opt-out here propagates downstream to anyone training on Common Crawl dumps (CCBot) |
The user-triggered footnote, recurring as a documented operator pattern rather than a bug. ChatGPT-User and Perplexity-User are each described by their operators as user-initiated and generally not subject to standard robots.txt restrictions; Anthropic, per the table above, documents Claude-User as honoring robots.txt. The reasoning OpenAI and Perplexity offer is the same: a fetch driven by one human asking about one URL is a proxy for human browsing, not autonomous crawling. OpenAI surfaced the framing most explicitly in the late-2025 docs revision; Perplexity’s help page has carried the same framing since 2024. It is the cleanest counter-example to “robots.txt covers all AI access.”
The per-bot product-token strings, the IP-range JSON endpoints, and the exact disallow recipes belong to the spoke entries linked in the table. The kept-fresh, audit-grade version — what to grep in logs, how to verify what reached you — belongs to AI Crawler Access Audit.
5. Declared control vs enforcement
RFC 9309 does not present itself as access control. The standard’s own preamble names the gap that the rest of this section unpacks: declared intent is not enforcement.
RFC 9309’s own disclaimer. §1 of the standard states: “the rules in a robots.txt file are not a form of access authorization.” Later: “This document is not a substitute for valid content security measures, and information that is not meant to be accessed should be properly secured.” Both lines are in the published standard (RFC 9309) — editorial commentary did not add the disclaimer; the standards body did.
User-agent is a self-asserted claim, not an identity. “I see GPTBot in my logs” is not “OpenAI fetched this.” A spoofed UA bypasses any rule keyed on it; a non-compliant crawler ignores the file entirely. Real identity is established by published IP ranges plus forward-confirmed reverse DNS — every major AI-crawler operator publishes IP lists for exactly this purpose. The verification mechanic itself belongs to the AI Crawler Access Audit playbook.
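The two identity checks named above can be sketched roughly as follows. The CIDR ranges and hostname suffix are placeholders from documentation address space, not any operator's real values, and the function names are hypothetical; the resolver arguments are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def ip_in_ranges(ip, cidr_ranges):
    """True if ip falls inside any of the operator's published CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the hostname suffix, then confirm it resolves back to ip."""
    try:
        hostname = reverse(ip)[0].lower().rstrip(".")
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False  # PTR record points outside the operator's domain
        return ip in forward(hostname)[2]  # forward lookup must include the same IP
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Placeholder ranges (RFC 5737 documentation space), standing in for a published list.
PUBLISHED = ["192.0.2.0/24", "198.51.100.0/24"]
print(ip_in_ranges("192.0.2.10", PUBLISHED))   # True
print(ip_in_ranges("203.0.113.5", PUBLISHED))  # False
```

A request whose user-agent string says GPTBot but whose source address fails both checks is best treated as unverified, whatever the header claims.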
The observed compliance record. Documented behavior and observed behavior have an honest gap:
| What holds | The bounded reading |
|---|---|
| Major first-party AI bots publish that they honor robots.txt as documented | Honoring is a stated policy, not a technical guarantee — a vendor can change its docs (and has — see §4’s ChatGPT-User note) |
| The protocol expresses your intent unambiguously to any compliant agent | Intent ≠ enforcement; a non-compliant or spoofed agent ignores the file regardless of how cleanly it is written |
| Independent measurement of the expressed policy exists | Cloudflare’s 2025 survey found only ~14% of sampled domains’ robots.txt files target AI bots at all (Cloudflare, 2025-07-01) — most sites have no AI policy expressed in the first place |
| Documented compliance covers first-party operators | Independent reporting documents non-compliance at scale: in 2024 multiple news publishers reported a major answer engine fetching content from sites that had disallowed its crawler (TechCrunch, 2024-07-02); in 2025 Cloudflare reported an undeclared crawler impersonating a normal browser and rotating IPs to evade no-crawl directives across tens of thousands of domains (Cloudflare, 2025-08-04) |
| Enforcement is possible | But it is a network-layer problem — WAF rules, verified-bot allowlists, request-signing schemes — not a robots.txt one |
The bot-specific account of the Perplexity incidents lives in PerplexityBot; what generalizes is the principle stated here. This section covers the protocol mechanics; its counterpart, the broader access-decision account, lives in AI Crawlers §5.
The position, plainly. Write the policy in robots.txt because compliant crawlers obey it — that is a real, useful effect for the documented first-party AI bots. But verifying that the policy is holding is the network layer’s job. Do not conflate the two layers; doing so is how a site quietly bleeds traffic to a crawler that ignored the file while believing it is protected.
6. Anti-patterns
Most robots.txt mistakes for AI bots fall into four pattern classes: precedence and group-merge errors, mechanism confusion (treating the file as enforcement, or as a meta-tag, or as rate limiting), location errors, and freshness errors. Each row below names a pattern that looks right and fails.
| Anti-pattern | Why it looks right | Why it actually fails |
|---|---|---|
| `User-agent: *` with `Disallow: /`, followed by per-bot `Allow:` in named groups | “I’m setting a default then allowing the bots I trust” | Each named bot picks only its own group (RFC 9309 §2.2.1). Rules from the `*` group never reach a bot that has a named group: `Allow:` in the GPTBot group does not undo `Disallow: /` in the `*` group, because the `*` group is invisible to GPTBot. The merge does not happen, so a narrow `Allow: /blog/` written as a carve-out leaves GPTBot unrestricted everywhere |
| Blocking GPTBot “to stay out of ChatGPT” | GPTBot is OpenAI’s bot | GPTBot is training-only. Live ChatGPT search answers are served by OAI-SearchBot (indexing) and ChatGPT-User (user-triggered). The categorical asymmetry is the AI Crawlers §3 picture: a wrong-bot error |
| Treating robots.txt as enforcement against bad actors | “I disallowed it, so it can’t fetch” | The protocol is voluntary; RFC 9309 §1 says the rules “are not a form of access authorization.” Spoofed or non-compliant agents ignore the file entirely. Enforcement is a WAF / verified-bot-allowlist problem (AI Crawler Access Audit) |
| `Crawl-delay:` to rate-limit AI bots | “Rate limiting is a robots.txt directive” | `Crawl-delay:` is not in RFC 9309. Honoring is non-uniform: Common Crawl’s CCBot and Anthropic’s Claude bots document honoring it; Googlebot publicly does not. Network-layer rate limits are the reliable control across the AI-crawler set |
| `Noindex:` directives in robots.txt | “It controls indexing of my pages” | Noindex was never a robots.txt directive in any standard. Google retired support for the unofficial `Noindex:` line on 2019-09-01 (A note on unsupported rules in robots.txt). The correct mechanism is `<meta name="robots" content="noindex">` or the X-Robots-Tag HTTP header |
| robots.txt at /foo/robots.txt or on a subdomain that does not have its own | “The file exists, just in a convenient place” | RFC 9309 §2.3 requires the file at the literal /robots.txt path of each host. Subdomains and different schemes (http vs https) each need their own. A file at a non-root path is ignored |
| A static AI-bot allowlist, never revisited | “Only the bots I trust get through” | The AI-crawler population grew from ~3 named bots in 2023 to ~30 in 2026. A static allowlist silently excludes every newly shipped bot by default, including retrieval bots whose blocking costs you citation immediately |
| Disallowing the training crawler after the page already shipped | “Now my content is out of the training set” | robots.txt is forward-looking. It cannot retract content already absorbed into a prior training corpus. Whatever retraction mechanism exists is vendor-by-vendor policy (opt-out request forms, takedown channels), not the robots.txt file |
The load-bearing line. Most “robots.txt for AI” mistakes are category or precedence errors, not protocol-syntax errors. The syntax is forgiving; the consequences are not.
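Several of the patterns in the table can be caught mechanically. The sketch below is a hypothetical linter, not an existing tool: it flags `Noindex:` and `Crawl-delay:` lines, and it flags a named group that holds only narrow `Allow:` rules while the `*` group carries `Disallow: /`. By the group-selection rule, such a bot never sees the blanket rule, so its group restricts nothing. The finding labels are invented for the example.

```python
def lint_robots(text):
    """Return a list of finding labels for common robots.txt anti-patterns."""
    findings = []
    groups = {}  # user-agent token -> list of (directive, path)
    current, seen_rule = [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # rules already seen, so this line opens a new group
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current:
                groups[agent].append((field, value))
        elif field in ("noindex", "crawl-delay"):
            findings.append(field)  # not part of RFC 9309
    if ("disallow", "/") in groups.get("*", []):
        for agent, rules in groups.items():
            if agent == "*" or not rules:
                continue
            # Only narrow Allow rules: the author likely expected a carve-out.
            if all(kind == "allow" and path != "/" for kind, path in rules):
                findings.append(f"allow-only-group:{agent}")
    return findings


SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /blog/
Crawl-delay: 5
Noindex: /tmp/
"""
print(lint_robots(SAMPLE))  # ['crawl-delay', 'noindex', 'allow-only-group:gptbot']
```

A group with an explicit `Allow: /` is left alone, since that states a deliberate allow-everything policy rather than a failed carve-out.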
7. Three root files, three jobs
The single most common category error in AI publishing infrastructure is conflating three root-level files — robots.txt, sitemap.xml, llms.txt. They have non-overlapping jobs.
| File | What it controls | What it does not do |
|---|---|---|
| robots.txt | Declared access policy: may a bot fetch this path? Standardized as RFC 9309 | Does not curate, render, rank, or enforce. A request, not authorization |
| sitemap.xml | Discovery / completeness: here is everything, for indexing (Sitemap & IndexNow) | Does not curate, grant access, or signal quality. Not a “best of” list |
| llms.txt | Curation / clean rendering: read these pages first, in clean markdown (llms.txt) | Does not grant or deny access; not a ranking signal; not a discovery file |
Three files, three jobs, none substitutes for another. robots.txt is no more “AI access control” than it was “Google access control” — it is the same file doing the same job, now addressed to a wider bot population with a categorical split. The change is in the audience, not in the file’s role.
8. Why this matters for GEO + how to act
Being reachable is necessary, not sufficient: robots.txt opens the gate to crawler access, and access in turn opens the gate to citation. Skip the file and most compliant AI bots assume nothing is restricted; write it wrong and the cost can be silent disappearance from AI answers.
| Your intent | First stop |
|---|---|
| Decide the per-category access policy (training vs retrieval vs user-triggered) | AI Crawlers §2 — the categorical model |
| Write the actual robots.txt directives | This entry §3 |
| Get the per-bot UA strings, IP ranges, and verification recipes | GPTBot · ClaudeBot · PerplexityBot · Google-Extended · Applebot-Extended · OAI-SearchBot · ChatGPT-User |
| Verify what actually reached your site | AI Crawler Access Audit |
| Curate which pages an LLM should read first | llms.txt |
| Make sure the page is liftable once fetched | Citability |
| The method that ties this all together | Generative Engine Optimization |
Write the policy in robots.txt because compliant crawlers obey it; enforce it on the network layer because the rest do not. And remember: access is a gate, not a prize. Being reachable makes you a candidate — being citable is what wins the answer slot.
References
Standards & specifications:
- IETF — RFC 9309: Robots Exclusion Protocol
- Google Search Central — How Google Interprets the robots.txt Specification
- Google Search Central — A note on unsupported rules in robots.txt (the 2019 `Noindex:` retirement)
Vendor crawler documentation:
- OpenAI — Overview of OpenAI Crawlers · Publishers and Developers FAQ
- Anthropic — Does Anthropic crawl data from the web, and how can site owners block the crawler?
- Perplexity — Perplexity Crawlers · How does Perplexity follow robots.txt?
- Google Search Central — Google’s common crawlers (including Google-Extended)
- Microsoft Bing — Which crawlers does Bing use?
- Apple — About Applebot
- Common Crawl — CCBot
Independent measurement & reporting:
- Cloudflare — From Googlebot to GPTBot: who’s crawling your site in 2025
- Cloudflare — Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
- TechCrunch — News outlets are accusing Perplexity of plagiarism and unethical web scraping
- Search Engine Land — Anthropic clarifies how Claude bots crawl sites and how to block them
Sources
Primary
- RFC 9309: Robots Exclusion Protocol · IETF · 2022-09-01
- How Google Interprets the robots.txt Specification · Google Search Central
- A note on unsupported rules in robots.txt · Google Search Central · 2019-07-02
- Overview of OpenAI Crawlers (GPTBot / OAI-SearchBot / ChatGPT-User) · OpenAI
- Publishers and Developers FAQ — OpenAI Help Center · OpenAI
- Does Anthropic crawl data from the web, and how can site owners block the crawler? · Anthropic
- Perplexity Crawlers (PerplexityBot / Perplexity-User) · Perplexity AI
- How does Perplexity follow robots.txt? · Perplexity AI
- Google's common crawlers (Google-Extended) · Google Search Central
- Which crawlers does Bing use? · Microsoft Bing
- About Applebot · Apple
- Common Crawl — CCBot · Common Crawl
Secondary
- From Googlebot to GPTBot: who's crawling your site in 2025 · Cloudflare
- Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives · Cloudflare
- News outlets are accusing Perplexity of plagiarism and unethical web scraping · TechCrunch
- Anthropic clarifies how Claude bots crawl sites and how to block them · Search Engine Land