How to Block AI Crawlers and Prevent Model Training on Your Content

In 2024 and 2025, a new wave of “AI crawlers” emerged: agents that automatically scan websites and collect content to train large language models (LLMs). Some (e.g., OpenAI’s GPTBot or Google-Extended) behave correctly and respect robots.txt, while others (such as various anonymous scrapers) often ignore these signals.

If you do not want your website content to end up in AI training datasets, there are several effective protection layers you can deploy.

1) First Line of Defense: `robots.txt`

The robots.txt file belongs in the web root (e.g., https://example.com/robots.txt). The configuration below asks most known AI/LLM crawlers to stay off the entire site while allowing popular search engines (Google, Bing, Seznam, DuckDuckGo) to keep indexing it.

# =====================================================================
# ROBOTS.TXT — BLOCK AI/LLM BOTS AND AI SEARCH CRAWLERS
# Last update: 2025-11-04
# =====================================================================

User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: aiGPTBot
Disallow: /
User-agent: DataForSeoBot
Disallow: /

# Allow normal indexing by major search engines
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: SeznamBot
Disallow:
User-agent: DuckDuckBot
Disallow:
User-agent: *
Disallow:

This is simple and non-invasive, but it only works if the crawler respects robots.txt, and many newer AI bots do not.
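Before publishing the file, it can help to confirm that the rules parse the way you expect. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `ROBOTS_TXT` string is an abbreviated copy of the rules above, and the helper name `allowed` and the agent name `SomeBrowser` are illustrative assumptions. It only shows how a compliant parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules above; in practice, load your real robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /

User-agent: Googlebot
Disallow:
User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "CCBot", "Googlebot", "SomeBrowser"):
        print(f"{agent:12} allowed={allowed(agent)}")
```

The AI crawlers should be reported as not allowed, while Googlebot and unlisted agents remain allowed.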

2) Second Line of Defense: `.htaccess` Blocking

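To enforce the block rather than merely request it, reject known AI crawlers at the server level using .htaccess (Apache with mod_rewrite). This takes effect immediately: matching bots receive 403 Forbidden and never reach your content. Note that it only catches bots that send an identifiable User-Agent string.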

# =====================================================================
# .HTACCESS — BLOCK AI / LLM / DATA CRAWLERS
# =====================================================================

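<IfModule mod_rewrite.c>
RewriteEngine On

# Keep robots.txt readable: a 4xx response there may be treated as "no restrictions"
RewriteCond %{REQUEST_URI} !^/robots\.txt$ [NC]
# Match crawler names anywhere in the User-Agent header (NC = case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (GPTBot|OAI-SearchBot|ChatGPT-User) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot|Claude-Web|Perplexity) [NC,OR]
# Google-Extended and Applebot-Extended are robots.txt tokens only, so they never match here.
# Blocking facebookexternalhit also disables Facebook link previews.
RewriteCond %{HTTP_USER_AGENT} (Google-Extended|Applebot-Extended|Meta-ExternalAgent|facebookexternalhit|CCBot|cohere-ai|Bytespider|Amazonbot|YouBot|aiGPTBot|DataForSeoBot|Diffbot|PhindBot|Kagi) [NC]
# F = answer with 403 Forbidden (implies L)
RewriteRule ^ - [F]

ErrorDocument 403 "Access denied. AI crawlers are not allowed on this site."
</IfModule>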

Quick test:

curl -A "GPTBot" -I https://yourdomain.com/

You should see a 403 status line, such as HTTP/1.1 403 Forbidden (or HTTP/2 403 if the server speaks HTTP/2).
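The user-agent patterns can also be sanity-checked offline before they go live. The sketch below mirrors a subset of the alternation from the rules above with Python's `re` module, using `re.IGNORECASE` in place of the `[NC]` flag. The function name `would_block` and the sample User-Agent strings are illustrative assumptions, not strings taken from real traffic.

```python
import re

# Subset of the names used in the .htaccess rules; extend it to match your file.
AI_AGENT_PATTERN = re.compile(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Perplexity|CCBot|Bytespider)",
    re.IGNORECASE,  # plays the role of the [NC] flag
)


def would_block(user_agent: str) -> bool:
    """True if the User-Agent contains one of the blocked crawler names."""
    return AI_AGENT_PATTERN.search(user_agent) is not None


# Illustrative samples (assumed formats, not captured from real requests).
SAMPLES = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/130.0",
]

if __name__ == "__main__":
    for ua in SAMPLES:
        verdict = "403" if would_block(ua) else "pass"
        print(f"{verdict:4} {ua}")
```

Because the match is a substring search, a name such as `Perplexity` also covers longer variants like `PerplexityBot` or `Perplexity-User`.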

3) Recommended Add-ons & Hardening

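Meta tag (HTML): add inside <head>. The noai and noimageai values are non-standard and purely advisory, so only cooperating crawlers honor them: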

<meta name="robots" content="noai, noimageai">

HTTP header: set via .htaccess or server config (on Apache this requires mod_headers):
```
Header set X-Robots-Tag "noai, noimageai"
```
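Cloudflare WAF custom rule: use an expression such as the following and set the rule action to Block:
```
(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot")
```
Add further user agents from the lists above as additional "or" clauses for broader coverage.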
Log monitoring: watch requests to /robots.txt, inspect User-Agent strings, and verify suspicious clients via reverse DNS. If a bot repeatedly ignores blocks, consider WAF/IP blocking or rate limiting.
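As a rough illustration of the log monitoring step, the sketch below scans access-log lines for AI crawler names and lists requests that were not answered with 403. It assumes the Apache/Nginx Combined Log Format; the function name `summarize`, the crawler list, and the sample lines (which use documentation-only IP ranges) are illustrative assumptions.

```python
import re
from collections import Counter

# Combined Log Format:
# IP - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
AI_AGENTS = re.compile(r"GPTBot|ClaudeBot|PerplexityBot|CCBot|Bytespider", re.IGNORECASE)


def summarize(lines):
    """Count AI-crawler requests per (bot, status) and collect unblocked hits."""
    counts = Counter()
    leaked = []  # AI-bot requests for content that were NOT answered with 403
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines in another format
        bot = AI_AGENTS.search(m["agent"])
        if not bot:
            continue  # not one of the listed crawlers
        status = int(m["status"])
        counts[(bot.group(0).lower(), status)] += 1
        if status != 403 and m["path"] != "/robots.txt":
            leaked.append((m["ip"], m["path"], status))
    return counts, leaked


SAMPLE = [
    '203.0.113.5 - - [04/Nov/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [04/Nov/2025:10:00:01 +0000] "GET /article HTTP/1.1" 403 199 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [04/Nov/2025:10:00:02 +0000] "GET /article HTTP/1.1" 200 8431 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '192.0.2.10 - - [04/Nov/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 Firefox/130.0"',
]

if __name__ == "__main__":
    counts, leaked = summarize(SAMPLE)
    for (bot, status), n in sorted(counts.items()):
        print(f"{bot:12} status={status} hits={n}")
    for ip, path, status in leaked:
        print(f"not blocked: {ip} fetched {path} ({status})")
```

Entries in the "not blocked" list are candidates for an additional user-agent pattern, a WAF rule, or an IP-level block.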

4) Why Bother Blocking AI Scrapers?

Protect your intellectual property: prevent your texts, images, and data from being used in model training without consent.
Reduce server load: some AI bots crawl aggressively and consume bandwidth and CPU.
Preserve SEO signals: uncontrolled copying and republishing can dilute your content’s value in search results.
Compliance: relevant for commercial sites that process personal or sensitive data.

AI changes how the web consumes content. If you prefer to keep your articles, photos, and datasets out of third-party training corpora, you’re fully entitled to restrict access.

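Using robots.txt and .htaccess together gives you two simple protection layers: one for crawlers that follow the rules and one that enforces the block for bots that identify themselves. Combined with a Cloudflare WAF, this is currently a strong practical defense, although scrapers that spoof a browser User-Agent can still get through.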

Tip: On PrestaShop, Joomla, or WordPress, place the .htaccess block before the system rewrite sections (e.g., # BEGIN WordPress or your CMS SEF block). After making changes, confirm with logs or the curl test above.
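The ordering from the tip can be checked with a few lines of Python. This is only a sketch: the helper name `block_precedes_cms` is hypothetical, and it assumes your file still contains the comment header from the .htaccess listing and, for WordPress, the standard `# BEGIN WordPress` marker.

```python
def block_precedes_cms(
    htaccess_text: str,
    block_marker: str = "BLOCK AI / LLM / DATA CRAWLERS",
    cms_marker: str = "# BEGIN WordPress",
) -> bool:
    """True if the AI-blocking section appears before the CMS rewrite section."""
    block_pos = htaccess_text.find(block_marker)
    cms_pos = htaccess_text.find(cms_marker)
    if block_pos == -1:
        return False  # the blocking rules are missing entirely
    # With no CMS section present, any position of the block is fine.
    return cms_pos == -1 or block_pos < cms_pos


if __name__ == "__main__":
    # Assumes a file named .htaccess in the current directory.
    with open(".htaccess", encoding="utf-8") as fh:
        print("order OK" if block_precedes_cms(fh.read()) else "check rule order")
```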
