25% of the top 1,000 global websites block GPTBot. Many without realizing it. Here is how to decide intelligently in 2026.

Key takeaways:

  • Two families of AI bots exist: those that train models and those that retrieve information in real time to answer a user.
  • Blocking all AI bots cuts your visibility in ChatGPT, Claude and Perplexity overnight.
  • WordPress and Shopify plugins have enabled this blocking by default since 2024, without explicit warning.
  • The right reflex in 2026: block training, allow retrieval, measure the result.

Should You Allow GPTBot, ClaudeBot and PerplexityBot in robots.txt?

You should treat them separately. GPTBot and ClaudeBot are used to train OpenAI’s and Anthropic’s models. PerplexityBot indexes the web for Perplexity’s real-time answers. The decision depends on your GEO strategy and your tolerance for seeing your content used without compensation.

Here is the quick answer in a table:

Bot | Role | 2026 Recommendation
GPTBot (OpenAI) | Training GPT models | Block if you refuse free training
ClaudeBot (Anthropic) | Training Claude models | Block for the same reason
PerplexityBot (Perplexity) | Indexing + real-time retrieval | Allow to generate citations
OAI-SearchBot (OpenAI) | Real-time retrieval for ChatGPT Search | Allow absolutely
Claude-Web / Claude-SearchBot | Real-time retrieval for Claude.ai | Allow absolutely

Training Crawlers vs Retrieval Crawlers: The Critical Distinction

Most websites treat AI bots as a single block. This is the most costly strategic mistake of 2026. A Cloudflare study published in January 2026 shows that training crawlers represent 5 to 10 times more volume than retrieval crawlers, yet generate zero traffic in return.

What is a training crawler?

A training crawler ingests your content in bulk to feed the next foundation model. GPTBot, ClaudeBot, CCBot (Common Crawl) and anthropic-ai fall into this category. They pass once, copy what they can, and leave. You receive no citation, no traffic and no usage notification.

What is a retrieval crawler?

A retrieval crawler (also called an active agent) fetches information at the moment a user asks a question. OAI-SearchBot, ChatGPT-User, Claude-Web, Claude-SearchBot and PerplexityBot operate this way. They cite their sources, send traffic to your site and represent a new acquisition lever.

Why this difference changes everything in 2026

Blocking both families indiscriminately deprives you of visibility in ChatGPT, Claude and Perplexity. Allowing both exposes you to free scraping of your texts. The 2026 strategy is to surgically separate these two flows. Anthropic actually split its main bot into two agents earlier this year to enable this granularity.

The Full List of AI User-Agents to Know in 2026

Here is the up-to-date inventory of active AI bots in 2026, validated by official documentation from OpenAI, Anthropic, Perplexity and Google.

Vendor | User-agent | Type
OpenAI | GPTBot | Training
OpenAI | OAI-SearchBot | Retrieval (ChatGPT Search)
OpenAI | ChatGPT-User | User action (browsing)
Anthropic | ClaudeBot | Training
Anthropic | anthropic-ai | Bulk training
Anthropic | Claude-Web / Claude-SearchBot | Retrieval
Anthropic | Claude-User | User action
Perplexity | PerplexityBot | Retrieval + index
Perplexity | Perplexity-User | User action
Google | Google-Extended | Gemini training
Apple | Applebot-Extended | Apple Intelligence training
Meta | Meta-ExternalAgent | Training + agent
Common Crawl | CCBot | Training (public dataset)
ByteDance | Bytespider | Training (non-compliant with robots.txt)

Note two sensitive points. Bytespider and some Perplexity crawlers regularly ignore robots.txt according to independent audits published in 2025. Blocking must then happen at the server level or via Cloudflare. Second point: Google-Extended only blocks Gemini training, not classic Google Search indexing. Blocking Google-Extended therefore does not penalize your traditional SEO.
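
Where neither a web server rule nor Cloudflare is available, the same idea can be sketched at the application level. The following is a minimal Python WSGI middleware, offered as an illustration rather than a replacement for a firewall rule. The class name BlockAIBots, the demo application and the blocked list are assumptions made for this example. It only matches the declared User-Agent header, which a non-compliant crawler can spoof.

```python
# Hypothetical sketch: refuse requests whose User-Agent contains a blocked token.
BLOCKED_TOKENS = ("bytespider",)  # assumption: extend with the bots you want to refuse


class BlockAIBots:
    """WSGI middleware that answers 403 to blocked user-agents."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        # Lowercase once so the comparison below is case-insensitive.
        self.blocked_tokens = tuple(token.lower() for token in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = BlockAIBots(demo_app)
```

Because the header is self-declared, this kind of filter works best combined with the IP verification described in the FAQ.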

The Silent WordPress and Shopify Plugin Bug

This is the most discreet trap of 2024-2025. Several popular WordPress SEO plugins and Shopify apps added a “block AI bots” toggle to their settings, enabled by default and without any alert. The result: thousands of e-commerce sites and blogs cut their access to ChatGPT, Claude and Perplexity overnight without realizing it.

The typical symptom: a site that was cited in ChatGPT in 2024 disappears abruptly from responses in early 2025. AI traffic drops to zero. Marketing teams blame the algorithm when the cause lies in the robots.txt automatically generated by their tech stack.

Check these three points immediately:

  • Open your https://yoursite.com/robots.txt in a browser.
  • Search for the strings GPTBot, ClaudeBot, PerplexityBot, anthropic-ai.
  • If you see a Disallow: / under any of these user-agents, verify that it is intentional.
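
The three checks above can also be scripted. Here is a minimal sketch using Python's standard urllib.robotparser module. The function name blocked_ai_agents and the SAMPLE file are illustrative assumptions; in practice you would pass in the text of your own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The user-agent strings listed in the check above.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "anthropic-ai"]


def blocked_ai_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Return the agents that the given robots.txt text forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Hypothetical robots.txt, similar to what a plugin might generate.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(blocked_ai_agents(SAMPLE))  # ['GPTBot', 'PerplexityBot']
```

An empty list means none of the four agents is blocked at the site root. Any name in the list deserves a look to confirm the block is intentional.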

What Are the Risks of Blocking All AI Bots?

Blocking all AI bots carries three risks. The first is visibility: you lose your citations in generative answers, your brand name stops appearing when a prospect queries ChatGPT or Claude, and competitors who allowed the right bots capture the query.

The second risk concerns monitoring. If retrieval bots are blocked, a tracking tool like Cockpyt AI has no AI activity to detect on your site. It can only measure your AI Share of Voice if your pages are actually crawlable by OAI-SearchBot, Claude-Web and PerplexityBot. Otherwise the data does not exist.

The third risk is commercial. A B2B SaaS brand that blocks everything drops out of training datasets, and therefore out of spontaneous AI recommendations within the next 18 to 24 months. This is a delayed effect that is hard to reverse.

How Much Does an Unblocked AI Bot Cost on a Large Site?

For a site with fewer than 10,000 URLs, the bandwidth cost of AI bots remains negligible. For a large media outlet, documentation publisher or marketplace with hundreds of thousands of pages, the bill changes radically.

The orders of magnitude observed in 2025-2026 on large sites:

  • 1 to 10 TB of monthly bandwidth consumed by combined AI crawlers.
  • $1,000 to $10,000 per month in corresponding infrastructure cost.
  • 15 to 40% of server load devoted to AI bots on some publishers (unverified figure).

A major French media outlet shared in 2025 that its AI crawlers consumed more bandwidth than its human users during off-peak hours. The decision to block or not becomes a real economic trade-off, not a philosophical question.
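
To place your own site on this scale, a back-of-the-envelope calculation is usually enough. The sketch below is only that: the function names, the request volume, the average response size and the price per GB are all made-up inputs to replace with figures from your logs and your hosting invoice. It covers bandwidth only, not the server load that may also count toward infrastructure cost.

```python
def monthly_bandwidth_tb(requests_per_day, avg_response_kb, days=30):
    """Estimated bandwidth in decimal terabytes (1 TB = 1e9 KB)."""
    return requests_per_day * avg_response_kb * days / 1e9


def monthly_cost_usd(bandwidth_tb, price_per_gb):
    """Estimated bandwidth bill, with 1 TB counted as 1,000 GB."""
    return round(bandwidth_tb * 1000 * price_per_gb, 2)


# Made-up example: 2 million AI bot requests per day, 150 KB per response.
traffic_tb = monthly_bandwidth_tb(requests_per_day=2_000_000, avg_response_kb=150)
print(traffic_tb)  # 9.0

# Hypothetical unit price; use the rate from your own provider.
estimated_bill = monthly_cost_usd(traffic_tb, price_per_gb=0.05)
```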

How to Configure Your robots.txt in 2026: 4 Strategic Scenarios

No universal strategy exists. Your configuration depends on your revenue model and the role of content in your acquisition.

Scenario 1: E-commerce

You want to maximize citations in ChatGPT Shopping, Perplexity and Claude. You block training (GPTBot, ClaudeBot, CCBot) to protect your exclusive product descriptions. You allow all retrieval bots. You then monitor which product pages surface in AI answers.

Scenario 2: Media / Publisher

Your content is your main asset. Block all training bots without exception. Allow retrieval sparingly: OAI-SearchBot, Claude-SearchBot and PerplexityBot are enough. Watch your logs to detect non-compliant crawlers (Bytespider in particular).

Scenario 3: B2B SaaS

You want to become the default brand cited in your category. Allow broadly, including training bots, on your blog and documentation. Block training on sensitive product pages. This is the most aggressive strategy to gain AI Share of Voice quickly.

Scenario 4: Institutional or Public Site

You distribute public service or institutional information. Allow everything without restriction. Your mission is dissemination, not content monetization.
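
The four scenarios can be expressed as data and turned into a robots.txt file. This is a simplified sketch, not a definitive implementation: the names build_robots_txt and SCENARIOS are assumptions, the /product/ path in the B2B SaaS scenario is a hypothetical placeholder for your sensitive pages, and retrieval bots are simply left to the catch-all group.

```python
# Training crawlers named in the inventory above.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Google-Extended",
    "Applebot-Extended", "Meta-ExternalAgent", "Bytespider",
]


def build_robots_txt(blocked):
    """Build robots.txt text from a mapping {user-agent: [disallowed path prefixes]}."""
    groups = []
    for agent, paths in blocked.items():
        lines = [f"User-agent: {agent}"] + [f"Disallow: {path}" for path in paths]
        groups.append("\n".join(lines))
    # Catch-all group: every crawler not named above may fetch everything.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


SCENARIOS = {
    # Scenario 1: block the three training bots named for e-commerce.
    "ecommerce": {agent: ["/"] for agent in ["GPTBot", "ClaudeBot", "CCBot"]},
    # Scenario 2: block every training bot.
    "media": {agent: ["/"] for agent in TRAINING_BOTS},
    # Scenario 3: block training only on sensitive pages (hypothetical path).
    "b2b_saas": {agent: ["/product/"] for agent in TRAINING_BOTS},
    # Scenario 4: no restriction.
    "public": {},
}

if __name__ == "__main__":
    print(build_robots_txt(SCENARIOS["ecommerce"]))
```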

The 2026 robots.txt Template Ready to Copy

Here is the recommended template for most sites in 2026. Adapt it to your scenario.

# --- Training crawlers: BLOCKED ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

# --- Retrieval crawlers: ALLOWED ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Rest of the web ---
User-agent: *
Allow: /
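
One caveat when merging this template into an existing file: under the Robots Exclusion Protocol, a crawler that has its own User-agent group follows only that group and ignores the rules under User-agent: *. If your current catch-all group contains Disallow lines (an admin or cart path, for example), a bot given Allow: / no longer inherits them. A small sketch with Python's standard urllib.robotparser module shows the effect; the /admin/ path and the SomeOtherBot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical merged file: a retrieval bot with its own group,
# plus a catch-all group that protects /admin/.
MERGED = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(MERGED.splitlines())

# The named bot follows only its own group, so /admin/ is open to it.
named_bot_can_reach_admin = parser.can_fetch("OAI-SearchBot", "/admin/")
# Any other crawler falls back to the catch-all group.
other_bot_can_reach_admin = parser.can_fetch("SomeOtherBot", "/admin/")

print(named_bot_can_reach_admin, other_bot_can_reach_admin)  # True False
```

To keep the protection, repeat the relevant Disallow lines inside each named group.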

How to Verify Your Site Is Crawled by the Right Bots?

Three methods coexist in 2026.

The server log method. Search for the strings gptbot, claudebot, perplexitybot, oai-searchbot in your Nginx or Apache logs. A simple command: grep -Ei "gptbot|claudebot|perplexitybot|oai-searchbot" access.log. A well-configured site should see dozens to thousands of hits per day depending on its size.
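
The same search can be turned into a per-bot count. The sketch below assumes one request per log line with the user-agent somewhere on that line; the function name count_bot_hits and the sample lines (including their addresses and version numbers) are invented for illustration.

```python
import re
from collections import Counter

# Same four strings as the grep command, matched case-insensitively.
BOT_PATTERN = re.compile(r"gptbot|claudebot|perplexitybot|oai-searchbot", re.IGNORECASE)


def count_bot_hits(log_lines):
    """Count log lines per bot name (the first matching name on each line)."""
    hits = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[match.group(0).lower()] += 1
    return hits


# Invented sample lines in a combined-log style.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Feb/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '192.0.2.11 - - [01/Feb/2026:10:00:05 +0000] "GET /blog HTTP/1.1" 200 8300 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.12 - - [01/Feb/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 4100 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/130.0"',
    '192.0.2.10 - - [01/Feb/2026:10:01:00 +0000] "GET /docs HTTP/1.1" 200 9900 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
]

if __name__ == "__main__":
    # For a real file: with open("access.log") as f: count_bot_hits(f)
    for bot, count in count_bot_hits(SAMPLE_LINES).most_common():
        print(bot, count)
```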

The Cloudflare or CDN method. Cloudflare dashboards now offer an “AI Scrapers and Crawlers” category that aggregates visits by bot. You instantly see who is passing, at what frequency, on which pages.

The AI visibility monitoring method. A tool like Cockpyt AI simulates user queries in ChatGPT, Claude and Perplexity, detects whether your brand is cited, and correlates with your robots.txt configuration. You get the full picture: presence in datasets, presence in real-time answers and competitive comparison.

FAQ: robots.txt and AI Bots in 2026

Does blocking GPTBot remove my content already ingested by ChatGPT?

No. The block prevents future crawls, not ingestion that has already happened. Content already learned by a released model such as GPT-4 or GPT-5 stays in that model; the block only keeps it out of future training runs. OpenAI does not offer a retroactive removal mechanism via robots.txt.

Does blocking Google-Extended penalize my SEO on Google Search?

No. Google-Extended only controls the use of your content to train Gemini and Google’s AI features. The classic Googlebot remains independent and continues to index your site normally for Google Search.

Does PerplexityBot really respect robots.txt?

Not always. Several independent audits published in 2024 and 2025 documented Perplexity crawlers ignoring directives. For reliable blocking, back up the robots.txt directive with a firewall rule at the Cloudflare or CDN level.

What happens if I have no robots.txt file?

Without a robots.txt file, crawlers assume they may access your entire site. Your content can therefore be used to train any AI model and can surface in any real-time answer. This is the most permissive configuration possible.

Are ClaudeBot and Claude-Web the same bot?

No. Anthropic separated its agents in early 2026. ClaudeBot remains the training crawler. Claude-Web and Claude-SearchBot are the retrieval agents that fetch information in real time when a user queries Claude.ai. You can block the first and allow the other two.

Should I create an llms.txt file in addition to robots.txt?

The adoption of llms.txt remains marginal in 2026. The main AI models do not systematically account for it. Focus your efforts on robots.txt, which controls actual access.

How do I know if a bot visiting my site is legitimate?

Check the source IP against the official ranges published by OpenAI, Anthropic and Perplexity. Any bot that declares itself as GPTBot but comes from an IP not listed by OpenAI is probably a disguised scraper.
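
A minimal sketch of that check with Python's standard ipaddress module. The ranges below are reserved documentation ranges used as placeholders, not any vendor's real addresses: load the actual lists from each vendor's published documentation. The function name ip_in_ranges is an assumption.

```python
import ipaddress

# PLACEHOLDER ranges (reserved for documentation), not real vendor ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip, cidr_ranges):
    """Return True if `ip` falls inside at least one of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never contained in an IPv6 network, and vice versa.
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    print(ip_in_ranges("192.0.2.55", PLACEHOLDER_RANGES))    # True
    print(ip_in_ranges("198.51.100.7", PLACEHOLDER_RANGES))  # False
```

A request that claims to be GPTBot but fails this check against OpenAI's published ranges can then be treated as a disguised scraper.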

Sources and References

  • Cubitrek, “Robots.txt 2026: Managing AI Crawler Budgets”, May 2026, cubitrek.com.
  • xSeek, “GPTBot: Should You Block It or Allow It?”, April 2026, xseek.io.
  • Witscode, “Robots.txt Strategy 2026”, March 2026, witscode.com.
  • Mersel AI, “How to Block or Allow AI Bots on Your Website”, March 2026, mersel.ai.
  • Cloudflare Radar, “AI crawlers traffic analysis”, January 2026.
  • OpenAI, “GPTBot user agent documentation”, August 2023, updated 2025.
  • Anthropic, “Claude bot user agents”, official documentation 2025-2026.

Florian Zorgnotti

I’m Florian Zorgnotti, an SEO consultant based in Nice since 2016. I’ve led 300+ projects, specializing in WordPress, Shopify, and Generative Engine Optimization (GEO) to help brands grow their visibility in search and AI platforms.