Concept · Infrastructure

AI Crawlers

Quick facts

What it is: The map of the crawler layer — automated agents that fetch your pages on behalf of an AI system, scoped to the roles that change AI visibility
The load-bearing distinction: Three categories with opposite consequences: training (feeds weights), retrieval (grounds a live answer + builds the answer index), user-triggered (one user's lookup)
The access rule: Per-category, not per-bot, and not binary. Blocking the training crawler does not block the retrieval crawler; blocking the retrieval crawler removes you from AI answers now
The costliest mistake: Blanket Disallow for all AI bots — kills citation to stop training. The default of 'block all AI' is the most expensive default, not the safe one
Where it sits: The retrievability gate — answer-loop step 2, upstream of citability. Necessary, not sufficient: reachable does not mean liftable

1. What an AI crawler is — the boundary, not a bot directory

An AI crawler is an automated agent that fetches your pages on behalf of an AI system. The useful boundary is by role, not by user agent.

Definition (GEO Wiki working definition): an AI crawler is an automated agent that fetches your pages on behalf of an AI system — scoped to the three roles that change your AI visibility (training, retrieval, user-triggered), not “every non-human user agent.”

This entry is the map of the crawler layer — the hub. Per-bot detail (exact UA strings, IP ranges, exact block recipes) lives in the spokes GPTBot, ClaudeBot, and PerplexityBot. Hub = the map; spoke = the detail. Nothing per-bot is duplicated here.

Three seams, declared up front so the rest of the page can route instead of expand:

The protocol you control crawlers with → robots.txt / llms.txt.
The doing — write the policy, audit the logs → the AI Crawler Access Audit playbook.
Whether the crawler can read the page once it arrives → SSR for AI Crawlers.

This page defines the crawler layer and its access logic; for the robots.txt recipe itself, see the AI Crawler Access Audit playbook.

2. The three categories — the load-bearing distinction

The single highest-value disambiguation on the page. Training ≠ retrieval ≠ user-triggered, and each category carries a different access consequence. Collapsing them is where the costly errors come from.

                         ┌──────────────► TRAINING crawler
                         │                 (GPTBot, ClaudeBot,
                         │                  Google-Extended)
                         │                 → feeds/refines model
                         │                   weights; delayed,
                         │                   parametric, not
   one origin page ──────┤                   attributable
   fetched for           │
   three reasons         ├──────────────► RETRIEVAL crawler
                         │                 (OAI-SearchBot,
                         │                  PerplexityBot,
                         │                  Claude-SearchBot)
                         │                 → grounds a live answer
                         │                   + builds the answer
                         │                   index; immediate,
                         │                   citation-bearing
                         │
                         └──────────────► USER-TRIGGERED agent
                                           (ChatGPT-User,
                                            Claude-User,
                                            Perplexity-User)
                                          → one user asked about
                                            your URL right now

The load-bearing line, stated plainly: the access decision is per-category, not per-bot, and not binary. Blocking the training crawler does not block the retrieval crawler. Blocking the retrieval crawler removes you from AI answers now.
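To make the per-category point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body and the example.com URL are illustrative assumptions, not a recommended policy; the bot tokens are the ones named in the diagram above. It shows only how a compliant parser reads the rules, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy (an assumption, not a recommendation): disallow the
# training tokens and say nothing about retrieval or user-triggered agents.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user-agent token, whether a compliant parser may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]
RESULT = allowed_agents(ROBOTS_TXT, AGENTS, "https://example.com/pricing")

if __name__ == "__main__":
    for agent, ok in RESULT.items():
        print(f"{agent:15} {'allowed' if ok else 'disallowed'}")
```

Under this hypothetical policy the training tokens are disallowed while the retrieval and user-triggered tokens stay allowed, because no rule group names them and there is no `User-agent: *` fallback.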

Where this sits in the loop: this is the retrievability gate — answer loop step 2 (be a candidate at all), strictly upstream of citability (be liftable once fetched). The loop mechanics sit in answer loop.

3. The canonical category × bot table — the site-quoted asset

This is the entry’s load-bearing asset: the categorized model the rest of the site quotes. The hub owns the map; the spokes own each cell’s fine print (UA strings, IP verification, exact block recipe). Rosters drift fast — this table is re-verified against primary docs each review.

Category	Why it fetches	Blocking it costs you	Blocking it protects	Representative bots
Training	Collects content that may train/refine model weights	Future parametric “memory” of your content; effect is delayed and not attributable	Your content from entering future model weights	GPTBot · ClaudeBot · Google-Extended · CCBot · Applebot-Extended · Amazonbot · Meta-ExternalAgent · Bytespider
Retrieval / search	Grounds a live answer and builds the answer-engine index	Citation now — you disappear from AI answers immediately; effect is instant	Little — training largely happened or happens via other corpora	OAI-SearchBot · PerplexityBot · Claude-SearchBot · Googlebot (feeds AI Overviews — not separately blockable) · Bingbot (feeds Copilot)
User-triggered agent	One user pasted or asked about your URL right now	That user’s live lookup of your page failing mid-answer	Nothing automated — these fetch only on a human action	ChatGPT-User · Claude-User · Perplexity-User

The asymmetry, named explicitly: a training block loses future parametric memory but keeps citation; a retrieval block loses citation now while training has already happened or happens via other corpora. They are not interchangeable controls.

Which retrieval bot feeds which surface — one cell each, mechanics routed to the platform pages: ChatGPT search (OAI-SearchBot + ChatGPT-User), Perplexity (PerplexityBot + Perplexity-User), Claude (Claude-SearchBot + Claude-User), Google AI Overviews (Googlebot; Google-Extended only governs training use, not crawling — see Google’s common crawlers), Bing Copilot (Bingbot). Per-bot UA strings and IP verification sit in the individual bot entries (linked above).
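As a rough illustration of using the table as a lookup, the sketch below buckets the claimed user agents in a log sample by category. The roster is a snapshot copied from the table, the sample UA strings are simplified and hypothetical (the exact strings live in the per-bot entries), and the result describes what each request claims to be, not who actually sent it.

```python
from collections import Counter

# Snapshot of the roster in the table above; rosters drift, so re-check primary docs.
# Some tokens (e.g. Google-Extended) govern training use rather than crawling and
# may never appear as a request user agent.
CATEGORY_TOKENS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
                 "Applebot-Extended", "Amazonbot", "Meta-ExternalAgent", "Bytespider"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot",
                  "Googlebot", "Bingbot"],
    "user_triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def classify(user_agent: str) -> str:
    """Map a claimed UA string to a category by case-insensitive token match."""
    ua = user_agent.lower()
    for category, tokens in CATEGORY_TOKENS.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "other"


# Hypothetical, simplified UA strings for illustration only.
SAMPLE = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
]
COUNTS = Counter(classify(ua) for ua in SAMPLE)

if __name__ == "__main__":
    for category, n in COUNTS.items():
        print(category, n)
```

The counts are only as trustworthy as the UA strings themselves, so treat them as a first pass before any identity verification.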

4. User-agent is a claim, not an identity

The technical-depth section. A user-agent string is self-asserted and trivially spoofed: “I see GPTBot in my logs” is not “OpenAI fetched this.” A policy keyed on the bare UA fails in both directions: an allow rule admits any impostor that borrows a trusted bot’s name, and a block rule misses any agent that simply sends a different UA.

Identity is actually established by published IP ranges + forward-confirmed reverse DNS (rDNS). The major operators publish machine-readable IP lists for exactly this — OpenAI exposes per-bot .json IP files (see Overview of OpenAI Crawlers); Anthropic publishes a bots IP list (see Anthropic crawler docs); Perplexity publishes per-bot IP JSON (see Perplexity Crawlers).
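The IP-range half of that check can be sketched with Python's standard `ipaddress` module. The JSON shape, the `prefixes` key, and the documentation-only ranges below are assumptions for illustration; each operator's real file has its own structure and must be read from its own docs.

```python
import ipaddress
import json

# Hypothetical payload: a flat list of CIDR strings under a "prefixes" key.
# Real operator files differ in shape; the ranges here are reserved
# documentation blocks, not any operator's addresses.
SAMPLE_JSON = '{"prefixes": ["192.0.2.0/24", "2001:db8::/32"]}'


def load_networks(payload: str, key: str = "prefixes"):
    """Parse a JSON payload into a list of IPv4/IPv6 network objects."""
    return [ipaddress.ip_network(cidr) for cidr in json.loads(payload)[key]]


def ip_in_ranges(ip: str, networks) -> bool:
    """True if `ip` falls inside any of the published networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)


NETWORKS = load_networks(SAMPLE_JSON)

if __name__ == "__main__":
    for candidate in ("192.0.2.77", "198.51.100.1", "2001:db8::1"):
        print(candidate, ip_in_ranges(candidate, NETWORKS))
```

A request whose claimed UA matches a known bot but whose source IP falls outside the operator's published ranges is best treated as unverified.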

A minimal, generic outline of forward-confirmed rDNS (the principle, not a per-bot recipe):

# Forward-confirmed rDNS — the principle, not a per-bot recipe
1. reverse-lookup the request IP        → host name
2. forward-lookup that host name        → IP
3. step-2 IP == request IP  AND  host in the operator's domain
   → identity confirmed; else → treat the UA as unverified
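The three steps above can be sketched generically in Python with the standard `socket` module. This is an illustration of the principle under stated assumptions, not any operator's documented procedure: the function names, the injectable resolver parameters, and the `bot.example` domain are all hypothetical, and the real operator domains belong to the per-bot entries.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def reverse_lookup(ip: str) -> str:
    """Step 1: reverse (PTR) lookup of the request IP."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> set[str]:
    """Step 2: resolve the host name back to its addresses."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def forward_confirmed(
    ip: str,
    operator_domains: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], set[str]] = forward_lookup,
) -> bool:
    """Step 3: host must sit in an operator domain AND resolve back to `ip`."""
    try:
        target = ipaddress.ip_address(ip)
        host = reverse(ip).rstrip(".").lower()
        domains = [d.lower().strip(".") for d in operator_domains]
        # Match the domain itself or a true subdomain, never a bare suffix.
        in_domain = any(host == d or host.endswith("." + d) for d in domains)
        if not in_domain:
            return False
        # Drop any IPv6 zone suffix before comparing normalized addresses.
        return any(
            ipaddress.ip_address(addr.split("%")[0]) == target
            for addr in forward(host)
        )
    except (OSError, ValueError):
        # No PTR record, failed resolution, or malformed IP: UA stays unverified.
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without live DNS; in production the defaults perform real lookups, which are slow enough that results are usually cached.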

The per-bot IP ranges, the exact rDNS domains, and runnable verification recipes are the spokes’ and AI Crawler Access Audit playbook’s job — deliberately not expanded here. The emerging alternative is cryptographic bot authentication — HTTP Message Signatures for automated traffic (IETF draft-meunier-web-bot-auth-architecture) — which would replace IP/UA guessing with a signed identity. It is an Internet-Draft with no formal IETF standing yet; treat it as a direction, not a control you can rely on today.

5. What the evidence says — declared control ≠ enforcement

The honesty section, carried with the same discipline as citability §5. The core fact: robots.txt is a voluntary request, not an access-control mechanism. RFC 9309 states the rules “are not a form of access authorization” and that the protocol “is not a substitute for valid content security measures” (see RFC 9309).

There is a real, documented gap between “I disallowed it” and “it stopped.” The cautionary anchor — carried bot-neutral as the principle, not as per-bot litigation: in 2024, multiple news outlets reported a major answer engine fetching content from sites that disallowed its crawler (TechCrunch, 2024-07-02); in 2025, Cloudflare reported an undeclared crawler impersonating a normal browser and rotating IPs to evade no-crawl directives across tens of thousands of domains (Cloudflare, 2025-08-04). The bot-specific account routes to PerplexityBot; the principle is what generalizes.

What holds	The bounded reading
Major first-party bots honor robots.txt as documented	Honoring is a stated policy, not a technical guarantee
robots.txt expresses your intent unambiguously	Intent ≠ enforcement; a non-compliant or spoofed agent ignores it
Independent measurement exists	Cloudflare found only ~14% of sampled domains’ robots.txt files target AI bots at all (Cloudflare, 2025-07-01) — most sites have no policy expressed
Enforcement is possible	But it is a network-layer problem (WAF, verified-bot lists), not a robots.txt one

The position, stated plainly: write the policy in robots.txt because compliant crawlers obey it — but the protocol mechanics belong to robots.txt, and verifying what actually reached you belongs to the AI Crawler Access Audit. Declared control is not enforcement.

6. Anti-patterns — when blocking backfires

Each anti-pattern below looks right and fails because it confuses a category, a mechanism, or a default.

Anti-pattern	Why it looks right	Why it actually fails
Blanket `Disallow: /` for all AI bots	“Protect my content from AI”	Kills citation to stop training: wrong category, and an immediate visibility loss that lasts as long as the block stays in place. The headline mistake
Block GPTBot “to stay out of ChatGPT”	GPTBot is OpenAI’s bot	Live ChatGPT answers use OAI-SearchBot / ChatGPT-User; GPTBot is training only. Wrong bot
Treat robots.txt as enforcement against bad actors	“I disallowed it, so it can’t”	robots.txt is a voluntary request (§5); a spoofed or non-compliant agent ignores it entirely
Allow every crawler, but the page is CSR-only	“Access is open”	Fetched but unread, so the access win is void. Route to SSR for AI Crawlers
Static allowlist, never revisited	“Only the bots I trust”	New bots ship continuously; an allowlist silently excludes them by default

The load-bearing line: the default of “block all AI” is not the safe default — it is the most expensive one. Blocking is a per-category trade against citation, not a hygiene step. OpenAI states plainly that appearing and being cited in ChatGPT search requires not blocking its search crawler (see Publishers and Developers FAQ).
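A small lint along these lines can flag the first two anti-patterns before a policy ships. This is a sketch using Python's standard `urllib.robotparser`; the two bot lists are a partial snapshot of the roster on this page, the verdict strings and sample files are hypothetical, and it tests only the site root, so a partial `Disallow` on a subpath would go unnoticed.

```python
from urllib.robotparser import RobotFileParser

# Partial snapshot of the roster on this page; extend and re-verify as it drifts.
TRAINING = ("GPTBot", "ClaudeBot", "Google-Extended")
RETRIEVAL = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot")


def disallowed(robots_txt: str, agents, path: str = "/") -> list[str]:
    """List the agents a compliant parser would refuse for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


def audit(robots_txt: str) -> str:
    """Return a one-line verdict on which crawler category the file blocks."""
    retrieval = disallowed(robots_txt, RETRIEVAL)
    training = disallowed(robots_txt, TRAINING)
    if retrieval:
        return "retrieval blocked: citation at risk (" + ", ".join(retrieval) + ")"
    if training:
        return "training-only block: retrieval path left open"
    return "no AI-specific block found"


# Hypothetical sample files for illustration.
BLANKET = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Disallow: /
"""
GPTBOT_ONLY = "User-agent: GPTBot\nDisallow: /\n"

if __name__ == "__main__":
    print(audit(BLANKET))
    print(audit(GPTBOT_ONLY))
```

The second sample is the "block GPTBot to stay out of ChatGPT" case: the lint reports the retrieval path as still open, which is the opposite of what that policy's author intended.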

7. SEO crawlers vs AI crawlers — invariant baseline vs what changes

This section applies the shared-baseline contract from SEO vs GEO rather than re-deriving it. The crawl baseline is invariant; the access logic is what changes.

Invariant — on the “never drop” list: be reachable, return 200, don’t soft-404, keep a sane crawl budget and clean status codes. This is the same baseline Googlebot always needed; AI crawlers inherit it unchanged.

Dimension	SEO crawler era	AI crawler era (the delta)
Number of agents	Effectively one that matters (Googlebot)	Many, shipping continuously
Access semantics	One “index or not” decision	A training / retrieval / user-triggered split with opposite consequences
robots.txt	Declared control and a strongly enforced norm	Still the declared control, but a more weakly enforced norm (§5)
“Submit to the index”	Existed (sitemaps, ping)	Only partially exists — route the part that does to Sitemap & IndexNow

The trap is reusing the SEO mental model — “one robots.txt decision, enforced” — on a layer where it is now “many bots, three categories, weakly enforced.”

8. Why this matters for GEO + how to act

Being reachable is necessary, not sufficient: it makes you a candidate. The next gate is being liftable once fetched — that is citability, a separate property this entry deliberately does not teach.

Your intent	First stop
Verify what actually reaches my site	AI Crawler Access Audit
Write the access policy	robots.txt · llms.txt
Get the per-bot recipe (UA, IPs, block)	GPTBot · ClaudeBot · PerplexityBot
Make the fetched page readable	SSR for AI Crawlers
The next gate once retrievable	Citability
The method that ties it together	Generative Engine Optimization

One line, routed not expanded: blocking is a per-category trade, verification is a network-layer job, and readability is a rendering job — this page is the map that tells you which door to take, not the door itself.

References

Official crawler documentation (as of 2026-05):

OpenAI — Overview of OpenAI Crawlers · Publishers and Developers FAQ
Anthropic — Does Anthropic crawl data from the web, and how can site owners block the crawler?
Perplexity — Perplexity Crawlers
Google Search Central — Overview of Google crawlers and fetchers · Google’s common crawlers (Google-Extended)
Microsoft Bing — Which crawlers does Bing use?

Protocol & standards:

IETF — RFC 9309: Robots Exclusion Protocol
IETF — HTTP Message Signatures for automated traffic: Architecture (draft-meunier-web-bot-auth-architecture) (Internet-Draft; emerging, not yet a standard)

Independent measurement & reporting:

Cloudflare — From Googlebot to GPTBot: who’s crawling your site in 2025
Cloudflare — Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
TechCrunch — News outlets are accusing Perplexity of plagiarism and unethical web scraping
Search Engine Land — Anthropic’s Claude bots make robots.txt decisions more granular

Frequently asked questions

What is an AI crawler?

An automated agent that fetches your pages on behalf of an AI system. For GEO the useful definition is scoped to the three roles that change your AI visibility — training, retrieval, and user-triggered — not 'every non-human user agent.' This entry is the map of that layer; per-bot detail (exact user-agent strings, IP ranges, block recipes) lives in the per-bot spokes.

Should I block AI crawlers to protect my content?

Only after deciding per category, never as blanket hygiene. Blocking the training crawler (GPTBot, ClaudeBot, Google-Extended) keeps your content out of future model weights while preserving citation. Blocking the retrieval crawler (OAI-SearchBot, PerplexityBot, Claude-SearchBot) removes you from AI answers immediately. A blanket Disallow does the second to achieve the first — the most expensive default, not the safe one.

If I block GPTBot, am I out of ChatGPT?

No — that is the classic wrong-bot mistake. GPTBot is training only. Live ChatGPT answers are served by OAI-SearchBot (search indexing) and ChatGPT-User (user-triggered fetch). Blocking GPTBot keeps you out of training data while leaving ChatGPT search citation fully intact.

Does robots.txt actually stop AI crawlers?

robots.txt is a voluntary request, not access control — RFC 9309 itself states the rules 'are not a form of access authorization.' Major first-party crawlers honor it as documented, but honoring is a policy, not a guarantee, and a self-asserted user-agent string is trivially spoofed. Enforcement against non-compliant or spoofed agents is a network-layer problem, not a robots.txt one.

Is seeing 'GPTBot' in my logs proof OpenAI fetched the page?

No. A user-agent string is a self-asserted claim, not an identity. Identity is established by published IP ranges plus forward-confirmed reverse DNS; OpenAI, Anthropic, and Perplexity all publish IP lists for exactly this. A policy keyed on the unverified UA string fails in both directions: an allow rule admits any impostor that borrows a trusted bot's name, and a block rule misses any agent that simply sends a different UA.
