robots.txt for AI Crawlers: A Complete 2025 Guide
A comprehensive guide to configuring robots.txt for AI bots in 2025. Learn which user agents to allow, how to protect sensitive paths, and best practices for AI-friendly crawler configuration.
robots.txt has been a cornerstone of web infrastructure since 1994. Originally designed to prevent crawlers from overwhelming servers, it's now the primary mechanism for controlling which AI systems can index your content. As AI crawlers proliferate, understanding how to configure robots.txt for the AI era is essential.
How AI Crawlers Use robots.txt
Every major AI vendor (OpenAI, Anthropic, Google, Microsoft, Perplexity) deploys web crawlers to build training datasets and power real-time browsing features. These vendors say their crawlers honor robots.txt, though compliance policies vary. Blocking or allowing these bots via robots.txt directly affects whether your content appears in AI-generated answers.
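As a rough illustration of how a compliant crawler consults these rules, the sketch below uses Python's standard urllib.robotparser module. The rules, the example.com URLs, and the is_allowed helper are assumptions made up for this example. Real crawlers use their own parsers, so treat this as an approximation of the behavior rather than any vendor's implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for everyone, one group naming GPTBot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, path in [
        ("GPTBot", "/drafts/post"),
        ("GPTBot", "/admin/"),
        ("SomeOtherBot", "/admin/"),
    ]:
        url = "https://example.com" + path
        print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

In this sketch GPTBot is blocked from /drafts/ but may fetch /admin/, because a crawler that finds a group naming it follows only that group and ignores the * rules. SomeOtherBot falls back to the * group and is blocked from /admin/.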
Major AI Crawler User Agents
Here are the primary AI crawler user agents you should be aware of:
- GPTBot: OpenAI's crawler for training data and ChatGPT browsing
- ChatGPT-User: Used when ChatGPT browses in real time during a conversation
- OAI-SearchBot: OpenAI's search-focused crawler
- ClaudeBot: Anthropic's web crawler (training and research)
- anthropic-ai: Anthropic's general crawler identifier
- PerplexityBot: Perplexity AI's indexing crawler
- Google-Extended: Google's opt-in/opt-out token for Gemini training data
- Googlebot: Google's main crawler (also powers AI Overviews)
- Bingbot: Microsoft's crawler (powers Copilot and Bing Chat)
- YouBot: You.com's AI search crawler
- cohere-ai: Cohere's training crawler
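If you want to see which of these crawlers actually visit your site, one option is to look for their tokens in the User-Agent header of incoming requests. The sketch below is a minimal, assumption-laden example: match_ai_crawler is a hypothetical helper, the sample header strings are illustrative rather than copied from vendor documentation, and a User-Agent header can be spoofed, so a match is a hint and not proof of identity.

```python
from typing import Optional

# Tokens taken from the list above. Google-Extended is left out on the
# assumption that it is a robots.txt control token only and is not sent
# as a request User-Agent.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Googlebot", "Bingbot", "YouBot", "cohere-ai",
]


def match_ai_crawler(user_agent_header: str) -> Optional[str]:
    """Return the first known crawler token found in the header, else None."""
    header = user_agent_header.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in header:
            return token
    return None


if __name__ == "__main__":
    # Illustrative header strings, not verbatim vendor values.
    print(match_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
    print(match_ai_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/128.0"))
```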
The Recommended Configuration
For most websites that want to be visible in AI search, the recommended approach is to allow all AI crawlers while blocking only sensitive paths like authentication pages and internal APIs:
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /private/
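# A named group replaces the * group for that crawler, so repeat the Disallow rules
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /private/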
Sitemap: https://yourdomain.com/sitemap.xml
When to Block AI Crawlers
There are legitimate reasons to block AI crawlers on certain paths or entirely:
- Paywalled content: Block crawlers from accessing subscriber-only pages
- Sensitive user data: Prevent crawling of pages containing personal information
- Duplicate content: Block paginated, filtered, or parameter-heavy URLs that add noise without value
- Training data opt-out: Use a Google-Extended block if you don't want content in Gemini training while still appearing in Googlebot results
- Unstable pages: Block pages under development or with placeholder content
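If you maintain these block rules in code rather than by hand, a small generator can keep the groups consistent. The sketch below is only one way to do it: render_group is a hypothetical helper, and the /members/ and /account/ paths are placeholder assumptions standing in for your own paywalled or personal-data paths.

```python
from typing import List


def render_group(user_agents: List[str], disallow_paths: List[str]) -> str:
    """Render one robots.txt group blocking the given paths for the given agents."""
    for path in disallow_paths:
        if not path.startswith("/"):
            raise ValueError(f"robots.txt paths must start with '/': {path!r}")
    # Stacked User-agent lines share the rule lines that follow them.
    lines = [f"User-agent: {agent}" for agent in user_agents]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder paths for subscriber-only and account pages.
    print(render_group(["GPTBot", "ClaudeBot"], ["/members/", "/account/"]))
```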
Selective Opt-Out Example
To opt out of AI training data while keeping search visibility, use selective blocking. For example, to allow general Googlebot but opt out of Gemini training:
# Allow normal search indexing
User-agent: Googlebot
Allow: /
# Opt out of AI training data for Google
User-agent: Google-Extended
Disallow: /
# Allow AI browsing/answer features
User-agent: GPTBot
Allow: /
User-agent: PerplexityBot
Allow: /
Always Include a Sitemap Reference
The Sitemap: directive at the bottom of robots.txt is crucial. It tells all crawlers — including AI crawlers that may not know your URL structure — where to find a complete list of your pages. Without it, crawlers may miss large portions of your site.
Sitemap: https://yourdomain.com/sitemap.xml
Common Mistakes to Avoid
- Using a catch-all Disallow: / under User-agent: *, which blocks all crawlers, including AI crawlers
- Blocking /assets/ or /static/, since AI crawlers may need CSS/JS context to render pages
- Omitting the Sitemap: directive, which leaves crawlers to rely on link discovery alone
- Forgetting to test your robots.txt with Google Search Console or a robots.txt validator
- Using regex patterns, when robots.txt only supports simple wildcards (*) and end-of-URL anchors ($)
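To make the wildcard point concrete, here is a minimal sketch of how a rule path containing * and a trailing $ could be matched against a URL path. The function names are hypothetical, and real parsers handle more edge cases (percent-encoding, for one), so this only illustrates the matching idea.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of
    the URL. Everything else is literal, and a pattern without '$' is a
    prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces so characters like '.' and '?' stay literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to the URL path."""
    # re.match anchors at the start, which gives the prefix behavior.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(path_matches("/*.pdf$", "/files/report.pdf"))
    print(path_matches("/*.pdf$", "/files/report.pdf?download=1"))
    print(path_matches("/private/", "/private/notes.html"))
```

Anything beyond these two special characters, such as character classes or alternation, is treated as literal text by this sketch.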
Testing Your robots.txt
After creating or updating your robots.txt, test it thoroughly. Use the AI Crawlability Test to verify the file is accessible and that key directives are present. Google Search Console also provides a robots.txt report for Googlebot-specific validation.
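For a quick local check before deploying, a script along these lines can assert the outcomes you expect. It is a sketch built on Python's standard urllib.robotparser. The check_robots helper, the sample rules, and the expectation list are all assumptions for illustration, and the result reflects how this one parser reads the file, not how every crawler will.

```python
from typing import List, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical file under test.
SAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: Google-Extended
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def check_robots(robots_txt: str, expectations: List[Tuple[str, str, bool]]) -> List[str]:
    """Return a list of human-readable failures (empty means all checks passed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, url, expected in expectations:
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            failures.append(f"{agent} {url}: expected {expected}, got {actual}")
    # site_maps() returns None when no Sitemap: line was found.
    if not parser.site_maps():
        failures.append("no Sitemap: directive found")
    return failures


if __name__ == "__main__":
    problems = check_robots(SAMPLE, [
        ("Googlebot", "https://yourdomain.com/blog/", True),
        ("Google-Extended", "https://yourdomain.com/blog/", False),
        ("GPTBot", "https://yourdomain.com/admin/users", False),
    ])
    print(problems or "all checks passed")
```

urllib.robotparser matches plain path prefixes, so rules that rely on * or $ are better checked with a validator that supports them.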
Remember: robots.txt rules are a statement of intent, not an enforcement mechanism. Well-behaved crawlers respect them; malicious bots typically don't. For sensitive content, rely on authentication and access controls rather than robots.txt alone.