AI Crawlers
Quick facts
- What it is: the map of the crawler layer, meaning the automated agents that fetch your pages on behalf of an AI system, scoped to the roles that change AI visibility
- The load-bearing distinction: three categories with opposite consequences: training (feeds weights), retrieval (grounds a live answer and builds the answer index), user-triggered (one user's lookup)
- The access rule: per-category, not per-bot, and not binary. Blocking the training crawler does not block the retrieval crawler; blocking the retrieval crawler removes you from AI answers now
- The costliest mistake: a blanket Disallow for all AI bots, which kills citation to stop training. The default of 'block all AI' is the most expensive default, not the safe one
- Where it sits: the retrievability gate, answer-loop step 2, upstream of citability. Necessary, not sufficient: reachable does not mean liftable
1. What an AI crawler is — the boundary, not a bot directory
An AI crawler is an automated agent that fetches your pages on behalf of an AI system. The useful boundary is by role, not by user agent.
Definition (GEO Wiki working definition): an AI crawler is an automated agent that fetches your pages on behalf of an AI system — scoped to the three roles that change your AI visibility (training, retrieval, user-triggered), not “every non-human user agent.”
This entry is the map of the crawler layer — the hub. Per-bot detail (exact UA strings, IP ranges, exact block recipes) lives in the spokes GPTBot, ClaudeBot, and PerplexityBot. Hub = the map; spoke = the detail. Nothing per-bot is duplicated here.
One seam, declared up front so the rest of the page can route instead of expand:
- The protocol you control crawlers with → robots.txt / llms.txt.
- The doing — write the policy, audit the logs → the AI Crawler Access Audit playbook.
- Whether the crawler can read the page once it arrives → SSR for AI Crawlers.
This page defines the crawler layer and its access logic; for the robots.txt recipe itself, see the AI Crawler Access Audit playbook.
2. The three categories — the load-bearing distinction
This is the single highest-value disambiguation on the page: training ≠ retrieval ≠ user-triggered, and each category has an opposite access consequence. Collapsing them is where the costly errors come from.
One origin page, fetched for three reasons:
- TRAINING crawler (GPTBot, ClaudeBot, Google-Extended) → feeds/refines model weights; delayed, parametric, not attributable
- RETRIEVAL crawler (OAI-SearchBot, PerplexityBot, Claude-SearchBot) → grounds a live answer + builds the answer index; immediate, citation-bearing
- USER-TRIGGERED agent (ChatGPT-User, Claude-User, Perplexity-User) → one user asked about your URL right now
The load-bearing line, stated plainly: the access decision is per-category, not per-bot, and not binary. Blocking the training crawler does not block the retrieval crawler. Blocking the retrieval crawler removes you from AI answers now.
Where this sits in the loop: this is the retrievability gate — answer loop step 2 (be a candidate at all), strictly upstream of citability (be liftable once fetched). The loop mechanics sit in answer loop.
3. The canonical category × bot table — the site-quoted asset
This is the entry’s load-bearing asset: the categorized model the rest of the site quotes. The hub owns the map; the spokes own each cell’s fine print (UA strings, IP verification, exact block recipe). Rosters drift fast — this table is re-verified against primary docs each review.
| Category | Why it fetches | Blocking it costs you | Blocking it protects | Representative bots |
|---|---|---|---|---|
| Training | Collects content that may train/refine model weights | Future parametric “memory” of your content; effect is delayed and not attributable | Your content from entering future model weights | GPTBot · ClaudeBot · Google-Extended · CCBot · Applebot-Extended · Amazonbot · Meta-ExternalAgent · Bytespider |
| Retrieval / search | Grounds a live answer and builds the answer-engine index | Citation now — you disappear from AI answers immediately; effect is instant | Little — training largely happened or happens via other corpora | OAI-SearchBot · PerplexityBot · Claude-SearchBot · Googlebot (feeds AI Overviews — not separately blockable) · Bingbot (feeds Copilot) |
| User-triggered agent | One user pasted or asked about your URL right now | That user’s live lookup of your page failing mid-answer | Nothing automated — these fetch only on a human action | ChatGPT-User · Claude-User · Perplexity-User |
The asymmetry, named explicitly: a training block loses future parametric memory but keeps citation; a retrieval block loses citation now while training has already happened or happens via other corpora. They are not interchangeable controls.
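To make the asymmetry concrete, the table can be held as a small lookup that answers "given the bots I block, which category cost am I paying?". The following is a minimal Python sketch, not a tool from this wiki: the names `ROSTER`, `COST`, and `blocking_costs` are hypothetical, the roster is copied from the table above and will drift with it, and tokens are matched exactly (case-insensitively) as they would appear in a robots.txt `User-agent` line.

```python
# Hypothetical lookup built from the category table above; re-verify the roster
# against primary docs before relying on it.
ROSTER = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
                 "Applebot-Extended", "Amazonbot", "Meta-ExternalAgent", "Bytespider"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot",
                  "Googlebot", "Bingbot"],
    "user-triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}

# What blocking each category costs, paraphrased from the table.
COST = {
    "training": "future parametric memory (delayed, not attributable)",
    "retrieval": "citation in AI answers, effective immediately",
    "user-triggered": "one user's live lookup of your page",
}

# Reverse index: lower-cased product token -> category.
CATEGORY_BY_TOKEN = {
    token.lower(): category
    for category, tokens in ROSTER.items()
    for token in tokens
}


def blocking_costs(blocked_tokens):
    """Group blocked product tokens by category; unknown tokens are skipped."""
    grouped = {}
    for token in blocked_tokens:
        category = CATEGORY_BY_TOKEN.get(token.lower())
        if category is not None:
            grouped.setdefault(category, []).append(token)
    return grouped


if __name__ == "__main__":
    for category, tokens in blocking_costs(["GPTBot", "CCBot", "OAI-SearchBot"]).items():
        print(f"{category}: blocking {tokens} costs {COST[category]}")
```

A block list that only produces a `training` key keeps citation intact; as soon as a `retrieval` key appears, the cost is immediate.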
Which retrieval bot feeds which surface — one cell each, mechanics routed to the platform pages: ChatGPT search (OAI-SearchBot + ChatGPT-User), Perplexity (PerplexityBot + Perplexity-User), Claude (Claude-SearchBot + Claude-User), Google AI Overviews (Googlebot; Google-Extended only governs training use, not crawling — see Google’s common crawlers), Bing Copilot (Bingbot). Per-bot UA strings and IP verification sit in the individual bot entries (linked above).
4. User-agent is a claim, not an identity
The technical-depth section. A user-agent string is self-asserted and trivially spoofed: “I see GPTBot in my logs” is not “OpenAI fetched this.” A policy keyed on the bare UA both over-blocks spoofed good bots and under-protects spoofed bad ones.
Identity is actually established by published IP ranges + forward-confirmed reverse DNS (rDNS). The major operators publish machine-readable IP lists for exactly this — OpenAI exposes per-bot .json IP files (see Overview of OpenAI Crawlers); Anthropic publishes a bots IP list (see Anthropic crawler docs); Perplexity publishes per-bot IP JSON (see Perplexity Crawlers).
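The IP-range half of that check can be sketched generically in Python. This is an illustration under assumptions, not any operator's recipe: `ip_in_published_ranges` is a hypothetical helper, the CIDR values are reserved documentation ranges standing in for a real published list, and how you load and refresh the operator's list is left out because each operator's file format differs.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT any operator's real list.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_published_ranges(request_ip, cidr_ranges):
    """Return True if request_ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(request_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like: an IPv4 address never matches an IPv6 range.
        if addr.version == network.version and addr in network:
            return True
    return False


if __name__ == "__main__":
    print(ip_in_published_ranges("192.0.2.10", PLACEHOLDER_RANGES))
    print(ip_in_published_ranges("198.51.100.7", PLACEHOLDER_RANGES))
```

A request whose user agent claims a known bot but whose IP is outside that operator's current list should be treated as unverified rather than trusted on the string alone.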
The rDNS check in its minimal, generic form:
# Forward-confirmed rDNS — the principle, not a per-bot recipe
1. reverse-lookup the request IP → host name
2. forward-lookup that host name → IP
3. step-2 IP == request IP AND host in the operator's domain
→ identity confirmed; else → treat the UA as unverified
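The same three steps can be written as a short, operator-agnostic Python sketch. It is an assumption-laden illustration, not a per-bot recipe: `forward_confirmed_rdns`, `system_reverse`, and `system_forward` are hypothetical names, the operator domains are whatever the operator documents (none are supplied here), and the resolvers are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket


def system_reverse(ip):
    """Step 1 helper: reverse-lookup an IP to a host name via the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def system_forward(host):
    """Step 2 helper: forward-lookup a host name to every address it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def forward_confirmed_rdns(request_ip, operator_domains,
                           reverse=system_reverse, forward=system_forward):
    """Return True only if reverse and forward lookups agree and the host
    sits inside one of the operator's domains; any failure means unverified."""
    try:
        wanted = ipaddress.ip_address(request_ip)
        host = reverse(request_ip).rstrip(".").lower()                  # step 1
        resolved = {ipaddress.ip_address(ip) for ip in forward(host)}   # step 2
    except (OSError, ValueError):
        # Lookup failed or an address did not parse: treat the UA as unverified.
        return False
    domains = [d.lower().lstrip(".") for d in operator_domains]
    # Match the domain itself or a true subdomain, never a bare suffix.
    in_domain = any(host == d or host.endswith("." + d) for d in domains)
    return wanted in resolved and in_domain                             # step 3
```

Matching on `"." + domain` rather than a bare suffix matters: without the dot, a host such as `notbot.example.com` would pass a check for `bot.example.com`.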
The per-bot IP ranges, the exact rDNS domains, and runnable verification recipes are the spokes’ and AI Crawler Access Audit playbook’s job — deliberately not expanded here. The emerging alternative is cryptographic bot authentication — HTTP Message Signatures for automated traffic (IETF draft-meunier-web-bot-auth-architecture) — which would replace IP/UA guessing with a signed identity. It is an Internet-Draft with no formal IETF standing yet; treat it as a direction, not a control you can rely on today.
5. What the evidence says — declared control ≠ enforcement
The honesty section, carried with the same discipline as citability §5. The core fact: robots.txt is a voluntary request, not an access-control mechanism. RFC 9309 states the rules “are not a form of access authorization” and that the protocol “is not a substitute for valid content security measures” (see RFC 9309).
There is a real, documented gap between “I disallowed it” and “it stopped.” The cautionary anchor — carried bot-neutral as the principle, not as per-bot litigation: in 2024, multiple news outlets reported a major answer engine fetching content from sites that disallowed its crawler (TechCrunch, 2024-07-02); in 2025, Cloudflare reported an undeclared crawler impersonating a normal browser and rotating IPs to evade no-crawl directives across tens of thousands of domains (Cloudflare, 2025-08-04). The bot-specific account routes to PerplexityBot; the principle is what generalizes.
| What holds | The bounded reading |
|---|---|
| Major first-party bots honor robots.txt as documented | Honoring is a stated policy, not a technical guarantee |
| robots.txt expresses your intent unambiguously | Intent ≠ enforcement; a non-compliant or spoofed agent ignores it |
| Independent measurement exists | Cloudflare found only ~14% of sampled domains’ robots.txt files target AI bots at all (Cloudflare, 2025-07-01) — most sites have no policy expressed |
| Enforcement is possible | But it is a network-layer problem (WAF, verified-bot lists), not a robots.txt one |
The position, stated plainly: write the policy in robots.txt because compliant crawlers obey it — but the protocol mechanics belong to robots.txt, and verifying what actually reached you belongs to the AI Crawler Access Audit. Declared control is not enforcement.
6. Anti-patterns — when blocking backfires
Each anti-pattern below looks right and fails because it confuses a category, a mechanism, or a default.
| Anti-pattern | Why it looks right | Why it actually fails |
|---|---|---|
| Blanket Disallow: / for all AI bots | “Protect my content from AI” | Kills citation to stop training: wrong category, with visibility loss that is immediate and lasts as long as the block does. The headline mistake |
| Block GPTBot “to stay out of ChatGPT” | GPTBot is OpenAI’s bot | Live ChatGPT answers use OAI-SearchBot / ChatGPT-User; GPTBot is training only. Wrong bot |
| Treat robots.txt as enforcement against bad actors | “I disallowed it, so it can’t” | robots.txt is a voluntary request (§5); a spoofed or non-compliant agent ignores it entirely |
| Allow every crawler, but the page is CSR-only | “Access is open” | Fetched but unread, so the access win is void. Route to SSR for AI Crawlers |
| Static allowlist, never revisited | “Only the bots I trust” | New bots ship continuously (§7); an allowlist silently excludes them by default |
The load-bearing line: the default of “block all AI” is not the safe default — it is the most expensive one. Blocking is a per-category trade against citation, not a hygiene step. OpenAI states plainly that appearing and being cited in ChatGPT search requires not blocking its search crawler (see Publishers and Developers FAQ).
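The first two anti-patterns can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` to report, per category, which representative bots a given robots.txt declares off-limits. It is a rough illustration under assumptions: `blocked_by_category` and `REPRESENTATIVE_BOTS` are hypothetical names, the bot list is a short sample from the table in §3, and the result reflects Python's own rule matching, which may differ in edge cases from how each crawler interprets the file. It says nothing about whether a crawler actually complies (§5).

```python
from urllib.robotparser import RobotFileParser

# A short sample per category, taken from the table in section 3.
REPRESENTATIVE_BOTS = {
    "training": ["GPTBot", "ClaudeBot"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"],
    "user-triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}

# Two example policies: the blanket anti-pattern and a training-only block.
BLANKET = "User-agent: *\nDisallow: /\n"
TRAINING_ONLY = "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n"


def blocked_by_category(robots_txt, url="https://example.com/"):
    """Map each category to the sample bots that robots_txt disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        category: [bot for bot in bots if not parser.can_fetch(bot, url)]
        for category, bots in REPRESENTATIVE_BOTS.items()
    }


if __name__ == "__main__":
    for name, policy in [("blanket", BLANKET), ("training-only", TRAINING_ONLY)]:
        print(name, blocked_by_category(policy))
```

A non-empty `retrieval` list is the signal to look for: it means the declared policy trades away citation, whatever the intent behind it was.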
7. SEO crawlers vs AI crawlers — invariant baseline vs what changes
This section applies the shared-baseline contract from SEO vs GEO rather than re-deriving it. The crawl baseline is invariant; the access logic is what changes.
Invariant — on the “never drop” list: be reachable, return 200, don’t soft-404, keep a sane crawl budget and clean status codes. This is the same baseline Googlebot always needed; AI crawlers inherit it unchanged.
| Dimension | SEO crawler era | AI crawler era (the delta) |
|---|---|---|
| Number of agents | Effectively one that matters (Googlebot) | Many, shipping continuously |
| Access semantics | One “index or not” decision | A training / retrieval / user-triggered split with opposite consequences |
| robots.txt | Declared control and a strongly enforced norm | Still the declared control, a weaker-enforced norm (§5) |
| “Submit to the index” | Existed (sitemaps, ping) | Only partially exists — route the part that does to Sitemap & IndexNow |
The trap is reusing the SEO mental model — “one robots.txt decision, enforced” — on a layer where it is now “many bots, three categories, weakly enforced.”
8. Why this matters for GEO + how to act
Being reachable is necessary, not sufficient: it makes you a candidate. The next gate is being liftable once fetched — that is citability, a separate property this entry deliberately does not teach.
| Your intent | First stop |
|---|---|
| Verify what actually reaches my site | AI Crawler Access Audit |
| Write the access policy | robots.txt · llms.txt |
| Get the per-bot recipe (UA, IPs, block) | GPTBot · ClaudeBot · PerplexityBot |
| Make the fetched page readable | SSR for AI Crawlers |
| The next gate once retrievable | Citability |
| The method that ties it together | Generative Engine Optimization |
One line, routed not expanded: blocking is a per-category trade, verification is a network-layer job, and readability is a rendering job — this page is the map that tells you which door to take, not the door itself.
References
Official crawler documentation (as of 2026-05):
- OpenAI — Overview of OpenAI Crawlers · Publishers and Developers FAQ
- Anthropic — Does Anthropic crawl data from the web, and how can site owners block the crawler?
- Perplexity — Perplexity Crawlers
- Google Search Central — Overview of Google crawlers and fetchers · Google’s common crawlers (Google-Extended)
- Microsoft Bing — Which crawlers does Bing use?
Protocol & standards:
- IETF — RFC 9309: Robots Exclusion Protocol
- IETF — HTTP Message Signatures for automated traffic: Architecture (draft-meunier-web-bot-auth-architecture) (Internet-Draft; emerging, not yet a standard)
Independent measurement & reporting:
- Cloudflare — From Googlebot to GPTBot: who’s crawling your site in 2025
- Cloudflare — Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
- TechCrunch — News outlets are accusing Perplexity of plagiarism and unethical web scraping
- Search Engine Land — Anthropic’s Claude bots make robots.txt decisions more granular
Frequently asked questions
What is an AI crawler?
Should I block AI crawlers to protect my content?
If I block GPTBot, am I out of ChatGPT?
Does robots.txt actually stop AI crawlers?
Is seeing 'GPTBot' in my logs proof OpenAI fetched the page?
Sources
Primary
- Overview of OpenAI Crawlers (GPTBot / OAI-SearchBot / ChatGPT-User) · OpenAI
- Publishers and Developers FAQ — OpenAI Help Center · OpenAI
- Does Anthropic crawl data from the web, and how can site owners block the crawler? · Anthropic · 2026-04-07
- Perplexity Crawlers (PerplexityBot / Perplexity-User) · Perplexity AI
- Overview of Google crawlers and fetchers (user agents) · Google Search Central · 2026-02-09
- Google's common crawlers (Google-Extended) · Google Search Central · 2026-04-23
- Which crawlers does Bing use? · Microsoft Bing
- RFC 9309: Robots Exclusion Protocol · IETF · 2022-09-01
Secondary
- Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives · Cloudflare
- News outlets are accusing Perplexity of plagiarism and unethical web scraping · TechCrunch
- From Googlebot to GPTBot: who's crawling your site in 2025 · Cloudflare
- HTTP Message Signatures for automated traffic: Architecture (draft-meunier-web-bot-auth-architecture) · IETF (Internet-Draft)