Understanding AI Scraping
AI scraping involves using automated bots to collect large volumes of data from websites. This data is then used to train AI models; OpenAI's GPTBot, for example, gathers content used to train the models behind ChatGPT. While some scraping is done with good intentions, unauthorized scraping can lead to a host of issues, including intellectual property theft, server overload, and a drop in website traffic as users get information directly from AI responses without visiting the source site.
Identifying the Threats
Content gets scraped in several ways, by bots and by people, including:
- Web Crawlers/Spiders: These bots systematically browse the internet and index content. Examples include Googlebot and Bingbot.
- Shell Scripts and HTML Scrapers: These tools download content from websites by accessing the HTML structure.
- Screen Scrapers: These programs capture data by simulating human browsing behavior.
- Human Copying: Manual copying and pasting of content by people rather than bots.
Detailed Steps to Prevent AI Scraping
1. Use a robots.txt File
The robots.txt file is the first line of defense against bots. This file tells web crawlers which parts of your site they can and cannot access. Here's how to use robots.txt to block specific bots effectively:
Understanding robots.txt
The robots.txt file implements the Robots Exclusion Protocol. It is a simple text file placed at the root of your website (e.g., https://www.example.com/robots.txt) containing directives that guide the behavior of web crawlers. Each directive consists of a user-agent (which specifies the bot) and a set of instructions (which pages the bot is allowed or disallowed to access).
Basic Structure of robots.txt
A typical robots.txt file looks like this:
User-agent: *
Disallow: /private/
In this example:
- User-agent: * applies the rule to all web crawlers.
- Disallow: /private/ tells crawlers not to access the /private/ directory.
Blocking Specific Bots
To block specific bots, you need to identify their user-agent strings and use the Disallow directive. Here's an example that blocks a list of known AI scraping bots:
# Block AI and LLM bots
User-agent: AlphaAI
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: OmgiliBot
Disallow: /
User-agent: OpenAI
Disallow: /
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.diviengine.com/sitemaps.xml
- AlphaAI, anthropic-ai, Applebot-Extended, Bytespider, CCBot, Claude-Web, ClaudeBot, Diffbot, FacebookBot, Google-Extended, GPTBot, ImagesiftBot, Omgili, OmgiliBot, OpenAI: These entries block various AI scraping bots from accessing any part of your site.
- User-agent: *: This line applies rules to all other bots.
- Disallow: /wp-admin/: This line blocks access to the /wp-admin/ directory.
- Allow: /wp-admin/admin-ajax.php: This line allows access to the admin-ajax.php file within the /wp-admin/ directory, which is necessary for many WordPress functions.
- Sitemap: This line specifies the location of your sitemap.
Advanced robots.txt Techniques
- Selective Blocking: You can allow bots to access certain parts of your site while blocking others. For example, to allow access to the /public/ directory but block everything else:
User-agent: GPTBot
Allow: /public/
Disallow: /
- Crawl-Delay Directive: This directive tells bots to wait a specified number of seconds between requests, reducing server load (note that not all crawlers honor it; Googlebot, for example, ignores Crawl-delay):
User-agent: *
Crawl-delay: 10
Limitations of robots.txt
While robots.txt is useful, it's not foolproof. Well-behaved bots will respect it, but malicious bots might ignore it entirely. Therefore, combining robots.txt with other security measures is essential.
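For crawlers that ignore robots.txt, the same user-agent list can also be enforced at the application level. Below is a minimal sketch of that idea as a hypothetical WordPress must-use plugin; the file name and the exact list of user-agent substrings are assumptions to adapt to your own setup, and a few robots.txt tokens from the example above (such as Google-Extended) are robots.txt-only controls that never appear as request user agents, so they are omitted here. A server- or CDN-level rule is usually even better, since blocked requests never reach PHP at all.

```php
<?php
/**
 * Hypothetical mu-plugin, e.g. wp-content/mu-plugins/block-ai-bots.php.
 * Enforces the robots.txt blocks above for crawlers that ignore them.
 */
add_action( 'init', function () {
	// Substrings matched case-insensitively against the request's user agent.
	$blocked_agents = array(
		'GPTBot', 'CCBot', 'ClaudeBot', 'Claude-Web', 'anthropic-ai',
		'Bytespider', 'Diffbot', 'FacebookBot', 'ImagesiftBot', 'Omgili',
	);

	$user_agent = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

	if ( '' === $user_agent ) {
		return;
	}

	foreach ( $blocked_agents as $agent ) {
		if ( false !== stripos( $user_agent, $agent ) ) {
			// Send a 403 and stop WordPress from rendering the page.
			wp_die( 'Access denied.', 'Forbidden', array( 'response' => 403 ) );
		}
	}
} );
```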
2. Implement Rate Limiting
Rate limiting controls the number of requests a user can make to your server in a given timeframe. This helps prevent bots from overwhelming your server with requests. WordPress plugins like Wordfence and Sucuri can be used to set up rate limiting rules.
- Setting Up Rate Limiting in Wordfence:
- Install and activate the Wordfence plugin.
- Navigate to Wordfence > Firewall > Rate Limiting.
- Configure the rules to limit the number of requests per minute from a single IP address.
- Using Sucuri for Rate Limiting:
- Install and activate the Sucuri plugin.
- Go to Sucuri Security > Firewall (WAF) > Settings.
- Enable rate limiting and set the desired limits.
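If you want to see the mechanism behind these plugin settings, here is a deliberately simplified sketch of per-IP rate limiting using WordPress transients. It is not a substitute for Wordfence or Sucuri: it counts requests in a fixed window, trusts REMOTE_ADDR, and stores counters in the options table, and the limit and window values are arbitrary assumptions.

```php
<?php
// Minimal per-IP rate limiter sketch using WordPress transients.
add_action( 'init', function () {
	$limit  = 60;                // assumed maximum requests per window
	$window = MINUTE_IN_SECONDS; // window length in seconds (constant provided by WordPress)

	$ip  = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : 'unknown';
	$key = 'myprefix_rate_' . md5( $ip );

	$count = (int) get_transient( $key );

	if ( $count >= $limit ) {
		// Too many requests from this IP in the current window.
		wp_die( 'Too many requests, please slow down.', 'Rate limited', array( 'response' => 429 ) );
	}

	// Store the incremented count; the transient expires with the window.
	set_transient( $key, $count + 1, $window );
} );
```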
3. IP Blocking
Blocking IP addresses associated with malicious activity can be an effective measure. Use tools like Fail2Ban to monitor and block suspicious IP addresses. You can also use services like Cloudflare to filter traffic and block known bad actors.
- Cloudflare IP Blocking:
- Log in to your Cloudflare account.
- Go to Firewall > Firewall Rules.
- Create a new rule to block traffic from specific IP addresses or ranges.
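Blocking at the firewall or CDN is preferable because blocked requests never reach your server's PHP, but for completeness, here is a hypothetical application-level blocklist. The addresses are documentation examples, not real offenders, and if your site sits behind a proxy or CDN, REMOTE_ADDR may report the proxy's address rather than the visitor's.

```php
<?php
// Hypothetical application-level IP blocklist (replace the array with IPs you have identified).
add_action( 'init', function () {
	$blocked_ips = array(
		'203.0.113.10',  // example/documentation address
		'198.51.100.25', // example/documentation address
	);

	$ip = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';

	if ( in_array( $ip, $blocked_ips, true ) ) {
		wp_die( 'Access denied.', 'Forbidden', array( 'response' => 403 ) );
	}
} );
```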
4. CAPTCHA Verification
Implementing CAPTCHA challenges on your site can help distinguish between human users and bots. Plugins like Divi Form Builder let you add CAPTCHA protection to your forms, and similar protection is worth adding to your login, registration, and comment sections.
- Adding CAPTCHA Using Divi Form Builder Plugin:
- Install and activate the plugin.
- Go to your Divi Form Builder Module > Spam Protection > Google CAPTCHA.
- Configure the module to use Google CAPTCHA.
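Plugins handle the verification for you, but if you ever wire reCAPTCHA into a custom form handler, the token the browser submits must be checked server side against Google's siteverify endpoint. The sketch below assumes a reCAPTCHA secret key and a form that posts the standard g-recaptcha-response field; the function name and placeholder key are hypothetical.

```php
<?php
// Hypothetical server-side reCAPTCHA check for a custom form handler.
function myprefix_is_captcha_valid() {
	$token = isset( $_POST['g-recaptcha-response'] ) ? sanitize_text_field( wp_unslash( $_POST['g-recaptcha-response'] ) ) : '';

	if ( '' === $token ) {
		return false;
	}

	$response = wp_remote_post(
		'https://www.google.com/recaptcha/api/siteverify',
		array(
			'body' => array(
				'secret'   => 'YOUR_RECAPTCHA_SECRET_KEY', // assumption: replace with your own key
				'response' => $token,
			),
		)
	);

	if ( is_wp_error( $response ) ) {
		return false;
	}

	$result = json_decode( wp_remote_retrieve_body( $response ), true );

	return ! empty( $result['success'] );
}
```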

5. Honeypots
Honeypots are hidden fields that real users won’t see or interact with, but bots will. When a bot fills out these fields, you can detect and block it. Plugins like Divi Form Builder can help you implement honeypots on your WordPress site.
- Setting Up Honeypots with Divi Form Builder:
- Install and activate the Divi Form Builder plugin.
- Go to your Divi Form Builder Module > Spam Protection.
- Enable the honeypot option for the form.
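If you are building a form by hand rather than with a plugin, a honeypot only takes a few lines. In this hypothetical sketch the field name website_url and the helper names are arbitrary placeholders; the field is hidden from humans with CSS, so any submission that fills it in is almost certainly a bot.

```php
<?php
// 1. Output the hidden honeypot field inside your form markup.
function myprefix_honeypot_field() {
	echo '<p style="position:absolute;left:-9999px;" aria-hidden="true">';
	echo '<label>Leave this field empty<input type="text" name="website_url" value="" tabindex="-1" autocomplete="off"></label>';
	echo '</p>';
}

// 2. Reject the submission if the honeypot was filled in.
function myprefix_honeypot_passed() {
	return empty( $_POST['website_url'] );
}

// 3. In your form-processing code:
// if ( ! myprefix_honeypot_passed() ) {
//     wp_die( 'Spam detected.', 'Forbidden', array( 'response' => 403 ) );
// }
```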

6. Content Obfuscation
Obfuscating your content makes it more difficult for bots to scrape it. This can involve techniques like rendering text as images, using CSS sprites, or dynamically changing HTML element IDs. However, these methods can also affect user experience and SEO, so use them judiciously.
- Rendering Text as Images:
- Use a tool like Photoshop or an online service to convert text to images.
- Embed these images in your content instead of plain text.
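Full-page obfuscation is rarely worth the usability and SEO cost, but for small, high-value strings such as email addresses, WordPress ships a built-in helper, antispambot(), which encodes characters as HTML entities. A short sketch of that narrower technique:

```php
<?php
// antispambot() encodes an email address as HTML entities, which defeats naive
// scrapers while remaining readable and clickable for human visitors.
$email = 'hello@example.com';

echo '<a href="mailto:' . antispambot( $email ) . '">' . antispambot( $email ) . '</a>';
```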
7. Firewall Protection
A robust web application firewall (WAF) can block unwanted traffic before it reaches your site. Services like Cloudflare WAF or Astra Security offer comprehensive protection against bot traffic.
- Setting Up Cloudflare WAF:
- Log in to your Cloudflare account.
- Go to Firewall > Firewall Rules.
- Create rules to block or challenge unwanted traffic.
8. Disable REST API
The WordPress REST API can be an entry point for scrapers. If you don’t need it, consider disabling it or restricting access. Use the Disable REST API plugin to control access to the REST API.
- Using Disable REST API Plugin:
- Install and activate the plugin.
- The plugin will automatically block access to the REST API for non-logged-in users.
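If you would rather not add another plugin, a small filter can achieve a similar restriction. The sketch below requires a logged-in user for every REST request; be aware that some plugins and themes rely on public REST endpoints, so test before deploying.

```php
<?php
// Require authentication for all REST API requests.
add_filter( 'rest_authentication_errors', function ( $result ) {
	// If another handler has already errored or authenticated, respect its result.
	if ( ! empty( $result ) ) {
		return $result;
	}

	if ( ! is_user_logged_in() ) {
		return new WP_Error(
			'rest_not_logged_in',
			'The REST API on this site is restricted to authenticated users.',
			array( 'status' => 401 )
		);
	}

	return $result;
} );
```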
9. Monitor and Audit Traffic
Regularly monitor your site’s traffic to identify unusual patterns that might indicate scraping. Use plugins like WP Activity Log to keep track of user and bot activity on your site.
- Setting Up WP Activity Log:
- Install and activate the plugin.
- Go to WP Activity Log > Settings.
- Configure the plugin to monitor and log desired activities.
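Alongside a logging plugin, even a few lines of custom logging can tell you whether your robots.txt rules are being ignored. This hypothetical snippet writes a line to the PHP error log whenever a watched user agent requests a page; the watch list is an assumed subset of the bots named earlier.

```php
<?php
// Log requests from watched AI user agents so you can audit them later.
add_action( 'init', function () {
	$watch_list = array( 'GPTBot', 'CCBot', 'ClaudeBot', 'Bytespider' ); // assumed subset
	$user_agent = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
	$request    = isset( $_SERVER['REQUEST_URI'] ) ? $_SERVER['REQUEST_URI'] : '';

	foreach ( $watch_list as $agent ) {
		if ( false !== stripos( $user_agent, $agent ) ) {
			error_log( sprintf( '[ai-bot] %s requested %s', $agent, $request ) );
			break;
		}
	}
} );
```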
Advanced Measures
For those with technical expertise, additional measures can be implemented:
- Server-side Request Validation: Implementing server-side checks can help identify and block unauthorized requests. For instance, you can check for valid user sessions or implement token-based authentication for API requests (see the sketch after this list).
- Rate Limiters on API Endpoints: Apply rate limiting directly on your API endpoints to prevent excessive requests.
- Bot Mitigation Services: Services like Qrator Labs provide advanced bot mitigation solutions that can protect your site from scraping and other malicious activities.
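As one concrete flavor of token-based validation, WordPress nonces can be required on custom AJAX endpoints so that scripted requests arriving without a fresh token are rejected. Everything prefixed with myprefix_ below, including the action name, script handle, and the js/app.js path, is a hypothetical placeholder.

```php
<?php
// Print the nonce into the page so legitimate front-end code can send it back.
add_action( 'wp_enqueue_scripts', function () {
	wp_register_script( 'myprefix-app', get_stylesheet_directory_uri() . '/js/app.js', array(), '1.0', true );
	wp_localize_script( 'myprefix-app', 'myprefixData', array(
		'ajaxUrl' => admin_url( 'admin-ajax.php' ),
		'nonce'   => wp_create_nonce( 'myprefix_fetch_data' ),
	) );
	wp_enqueue_script( 'myprefix-app' );
} );

// Reject AJAX requests that do not carry a valid, recent nonce.
add_action( 'wp_ajax_nopriv_myprefix_fetch_data', 'myprefix_fetch_data' );
add_action( 'wp_ajax_myprefix_fetch_data', 'myprefix_fetch_data' );
function myprefix_fetch_data() {
	check_ajax_referer( 'myprefix_fetch_data', 'nonce' ); // dies with a 403 if the token is missing or invalid
	wp_send_json_success( array( 'message' => 'Validated request.' ) );
}
```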
Conclusion
Protecting your WordPress site from being scraped by AI bots is crucial to safeguarding your content and maintaining your site’s performance. By implementing a combination of these strategies, you can significantly reduce the risk of unauthorized scraping. Remember, while no solution is foolproof, taking proactive measures can help you stay ahead of the scrapers and protect your digital assets.