Guides · By Soorya · April 15, 2025 · 9 min read

robots.txt for AI Crawlers: A Complete 2025 Guide

A comprehensive guide to configuring robots.txt for AI bots in 2025. Learn which user agents to allow, how to protect sensitive paths, and best practices for AI-friendly crawler configuration.

robots.txt has been a cornerstone of web infrastructure since 1994. Originally designed to keep crawlers from overwhelming servers, it is now the primary mechanism for telling AI systems which content they may crawl. As AI crawlers proliferate, knowing how to configure robots.txt for them is essential.

How AI Crawlers Use robots.txt

The major AI providers (OpenAI, Anthropic, Google, Microsoft, Perplexity) all deploy web crawlers to build training datasets and to power real-time browsing features. These vendors state that their crawlers honor robots.txt, though compliance policies vary by vendor and by bot. Blocking or allowing these bots in robots.txt directly affects whether your content can appear in AI-generated answers.
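As a rough illustration of what a compliant crawler does before requesting a page, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the /drafts/ path, and the is_allowed helper are hypothetical and exist only for this example; real crawlers run their own parsers, and their matching rules may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would first download
# this file from https://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given crawler token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # GPTBot has its own group, so only that group's rules apply to it.
    print(is_allowed("GPTBot", "https://example.com/drafts/post"))    # False
    print(is_allowed("GPTBot", "https://example.com/blog/post"))      # True
    # Crawlers without a dedicated group fall back to "User-agent: *".
    print(is_allowed("SomeOtherBot", "https://example.com/admin/"))   # False
```

Note that in this sample file GPTBot is not bound by the /admin/ rule, because that rule lives in the * group and GPTBot has a group of its own.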

Major AI Crawler User Agents

Here are the primary AI crawler user agents you should be aware of:

  • GPTBot: OpenAI's crawler for collecting model training data
  • ChatGPT-User: used when ChatGPT fetches a page in real time during a conversation
  • OAI-SearchBot: OpenAI's search-focused crawler
  • ClaudeBot: Anthropic's web crawler (training and research)
  • anthropic-ai: Anthropic's general crawler identifier
  • PerplexityBot: Perplexity AI's indexing crawler
  • Google-Extended: Google's robots.txt token for opting in or out of Gemini training data (a control token only; it has no separate HTTP user-agent string)
  • Googlebot: Google's main crawler (also powers AI Overviews)
  • Bingbot: Microsoft's crawler (powers Bing and Copilot, formerly Bing Chat)
  • YouBot: You.com's AI search crawler
  • cohere-ai: Cohere's training crawler
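To see which of these bots actually visit your site, you can match the tokens above against the User-Agent header in your access logs. The following is a minimal sketch: the identify_ai_crawler function and the sample header strings are illustrative assumptions, and because a User-Agent header is trivially spoofed, a match is a hint rather than proof of identity. Google-Extended is left out because it never appears in request headers.

```python
from typing import Optional

# Tokens taken from the list above. Google-Extended is omitted on purpose:
# it is a robots.txt control token, not an HTTP user agent.
AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "ClaudeBot",
    "anthropic-ai",
    "PerplexityBot",
    "Googlebot",
    "Bingbot",
    "YouBot",
    "cohere-ai",
)


def identify_ai_crawler(user_agent_header: str) -> Optional[str]:
    """Return the first known crawler token found in the header, if any."""
    header = user_agent_header.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in header:
            return token
    return None


if __name__ == "__main__":
    # Illustrative header strings, not guaranteed to match what each bot sends.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]
    for sample in samples:
        print(identify_ai_crawler(sample))
```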

The Recommended Configuration

For most websites that want to be visible in AI search, the recommended approach is to allow all AI crawlers while blocking only sensitive paths like authentication pages and internal APIs:

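# One shared group: every crawler listed here gets the same rules.
# A crawler obeys only the most specific group that names it, so giving
# each bot its own "Allow: /" group would exempt it from the Disallow lines.
User-agent: *
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /private/

Sitemap: https://yourdomain.com/sitemap.xml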
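One detail worth verifying in any configuration like this is group selection. Under RFC 9309, a crawler follows only the most specific group that names it and ignores the User-agent: * group entirely, so Disallow lines placed only under * do not bind a bot that has its own group. The sketch below is a deliberately simplified evaluator (exact token match, plain prefix rules, no wildcards), not a full parser. The two sample files and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)

# Hypothetical files: same intent, different grouping.
SPLIT_GROUPS = """\
User-agent: *
Allow: /
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""

SHARED_GROUP = """\
User-agent: *
User-agent: GPTBot
Allow: /
Disallow: /admin/
"""


def parse_groups(robots_txt: str) -> Dict[str, List[Rule]]:
    """Map each lowercase user-agent token to the rules of its group."""
    groups: Dict[str, List[Rule]] = {}
    agents: List[str] = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if value:  # an empty Disallow matches nothing
                for agent in agents:
                    groups[agent].append((field, value))
    return groups


def is_allowed(groups: Dict[str, List[Rule]], user_agent: str, path: str) -> bool:
    """Pick the agent's own group if present, else '*'; longest match wins."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best: Rule = ("allow", "")  # no matching rule means allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best[1])
        tie_allow = len(prefix) == len(best[1]) and kind == "allow"
        if longer or tie_allow:
            best = (kind, prefix)
    return best[0] == "allow"


if __name__ == "__main__":
    split, shared = parse_groups(SPLIT_GROUPS), parse_groups(SHARED_GROUP)
    print(is_allowed(split, "GPTBot", "/admin/"))   # True: own group has no Disallow
    print(is_allowed(shared, "GPTBot", "/admin/"))  # False: shared rules apply
```

Listing several User-agent lines above one set of rules, or repeating the Disallow lines in each bot's group, keeps the sensitive paths blocked for every named crawler.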

When to Block AI Crawlers

There are legitimate reasons to block AI crawlers on certain paths or entirely:

  • Paywalled content: block crawlers from accessing subscriber-only pages
  • Sensitive user data: prevent crawling of pages containing personal information
  • Duplicate content: block paginated, filtered, or parameter-heavy URLs that add noise without value
  • Training data opt-out: disallow Google-Extended if you don't want your content used for Gemini training but still want Googlebot to index it for search
  • Unstable pages: block pages under development or with placeholder content

Selective Opt-Out Example

To opt out of AI training data while keeping search visibility, use selective blocking. For example, to allow general Googlebot but opt out of Gemini training:

# Allow normal search indexing
User-agent: Googlebot
Allow: /

# Opt out of AI training data for Google
User-agent: Google-Extended
Disallow: /

# Allow AI search/answer crawlers
# (GPTBot is OpenAI's training crawler, so it is not listed here)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
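If you manage opt-outs for several tokens, generating the file from a small policy table can reduce copy-paste mistakes. The sketch below is one possible approach, not a standard tool: build_robots_txt and the POLICY mapping are hypothetical names, and the result is checked with Python's standard urllib.robotparser, whose matching is simpler than what production crawlers implement.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

# Hypothetical policy: True allows the whole site, False blocks it.
POLICY: Dict[str, bool] = {
    "Googlebot": True,         # normal search indexing
    "Google-Extended": False,  # opt out of Gemini training
    "PerplexityBot": True,     # AI search visibility
}


def build_robots_txt(policy: Dict[str, bool]) -> str:
    """Emit one group per token with a site-wide Allow or Disallow."""
    blocks = []
    for token, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        blocks.append(f"User-agent: {token}\n{rule}\n")
    return "\n".join(blocks)


def check(policy: Dict[str, bool], path: str = "/any/page") -> Dict[str, bool]:
    """Parse the generated file and report what each token may fetch."""
    parser = RobotFileParser()
    parser.parse(build_robots_txt(policy).splitlines())
    return {token: parser.can_fetch(token, path) for token in policy}


if __name__ == "__main__":
    print(build_robots_txt(POLICY))
    print(check(POLICY))
```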

Always Include a Sitemap Reference

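The Sitemap: directive is easy to forget and worth including every time. It is independent of the user-agent groups, so it can appear anywhere in the file, though the end is conventional. It tells all crawlers, including AI crawlers that may not know your URL structure, where to find a complete list of your pages. Without it, crawlers rely on link discovery alone and may miss pages that are poorly linked.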

Sitemap: https://yourdomain.com/sitemap.xml
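A quick way to confirm the directive is picked up is to parse the file and list the sitemap URLs it declares. This sketch relies on RobotFileParser.site_maps() from Python's standard library (available since Python 3.8); the find_sitemaps wrapper and the sample content are assumptions for illustration.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def find_sitemaps(robots_txt: str) -> List[str]:
    """Return every sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Allow: /\n"
        "\n"
        "Sitemap: https://yourdomain.com/sitemap.xml\n"
    )
    print(find_sitemaps(sample))
```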

Common Mistakes to Avoid

  • Using a catch-all Disallow: / under User-agent: *, which blocks every crawler that has no group of its own, AI crawlers included
  • Blocking /assets/ or /static/, since crawlers that render pages may need the CSS and JS
  • Omitting the Sitemap: directive, which leaves crawlers to rely on link discovery alone
  • Forgetting to test your robots.txt with Google Search Console or a robots.txt validator
  • Using regex patterns: robots.txt supports only a simple wildcard (*) and an end-of-URL anchor ($), not full regular expressions
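To make the last point concrete, the sketch below translates a robots.txt path pattern into a Python regular expression, treating * as "any run of characters" and a trailing $ as "end of URL", with everything else matched literally. The rule_matches helper is a hypothetical name, and individual crawlers may handle edge cases such as percent-encoding differently.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Convert a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece so characters like [ ] + ? stay literal,
    # then join the pieces with ".*" where the pattern had "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True if the pattern matches the path from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(rule_matches("/*.pdf$", "/files/report.pdf"))             # True
    print(rule_matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
    print(rule_matches("/search*", "/search?q=robots"))             # True
    # Regex syntax is not interpreted: the brackets are literal text.
    print(rule_matches("/[a-z]+/", "/abc/"))                        # False
```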

Testing Your robots.txt

After creating or updating your robots.txt, test it thoroughly. Use the AI Crawlability Test to verify the file is accessible and that key directives are present. Google Search Console also provides a robots.txt report that shows how Googlebot fetched and parsed the file.
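For a quick local check before deploying, a few of the mistakes listed earlier can be caught with a short script. The sketch below is a hypothetical linter written for this article, not the logic of any particular testing tool; it flags only a catch-all block, blocked asset folders, and a missing Sitemap: line.

```python
from typing import List


def lint_robots_txt(text: str) -> List[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    warnings: List[str] = []
    agents: List[str] = []
    in_rules = False
    has_sitemap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # rules ended, so this line opens a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            if value == "/" and "*" in agents:
                warnings.append("catch-all 'Disallow: /' under 'User-agent: *'")
            elif value.rstrip("/") in ("/assets", "/static"):
                warnings.append(f"asset path blocked: {value}")
        elif field == "allow":
            in_rules = True
        elif field == "sitemap":
            has_sitemap = True
    if not has_sitemap:
        warnings.append("no 'Sitemap:' directive found")
    return warnings


if __name__ == "__main__":
    risky = "User-agent: *\nDisallow: /\n"
    for warning in lint_robots_txt(risky):
        print("warning:", warning)
```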

Remember: robots.txt rules are a statement of intent, not an enforcement mechanism. Well-behaved crawlers respect them; malicious bots typically don't. For sensitive content, rely on authentication and access controls rather than robots.txt alone.
