Cloudflare releases feature that blocks AI bots collecting training data all at once



The popularity of generative AI has led to a surge in demand for content used for model training and inference, and some AI companies use web scraping bots to collect that data. Cloudflare, a content delivery network (CDN) provider, has announced a feature that blocks web scraping bots used for AI training all at once.

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click
https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click



Preparing a dataset for AI training requires a huge amount of data, so some AI companies run AI bots that collect training data by scraping content such as images and photos from the internet.

For example, it drew widespread attention when the AI search engine Perplexity was found to ignore robots.txt files and to keep scraping websites that prohibited bots from crawling them.

Perplexity, a generative AI search engine, ignores robots.txt to extract information from websites - GIGAZINE



Below is a graph summarizing the number of requests per day (vertical axis) from AI bots observed by Cloudflare from 2023 to 2024. The number of requests from AI bots has been increasing sharply since the end of 2023. According to Cloudflare, the four AI bots with the highest number of requests were Bytespider by ByteDance (the operator of TikTok), Amazonbot by Amazon, ClaudeBot by Anthropic, and GPTBot by OpenAI.



The graph below is based on an analysis of the robots.txt files of the top 10,000 Internet domains and shows the number of domains (vertical axis) that prohibit access by each AI bot (horizontal axis). GPTBot was prohibited most often, while Bytespider and ClaudeBot, despite their large numbers of requests, were hardly prohibited at all.
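As a rough illustration of how such a robots.txt survey could be carried out, the following Python sketch counts how many domains disallow each AI bot. It is not Cloudflare's methodology: the domain names and robots.txt contents are made up for the example, and the function name `count_blocking_domains` is hypothetical. Only the standard library's `urllib.robotparser` is used.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# The four bots named in the article.
AI_BOTS = ["Bytespider", "Amazonbot", "ClaudeBot", "GPTBot"]

# Hypothetical robots.txt contents keyed by made-up domains (assumption:
# a real survey would download these files from each domain instead).
SAMPLE_ROBOTS = {
    "news.example": "User-agent: GPTBot\nDisallow: /\n",
    "shop.example": (
        "User-agent: GPTBot\nDisallow: /\n\n"
        "User-agent: Amazonbot\nDisallow: /\n"
    ),
    "blog.example": "User-agent: *\nAllow: /\n",
}


def count_blocking_domains(robots_by_domain, bots):
    """Count, per bot, the domains whose robots.txt disallows the site root."""
    counts = Counter({bot: 0 for bot in bots})
    for domain, text in robots_by_domain.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        for bot in bots:
            # can_fetch() returns False when the rules disallow this user agent.
            if not parser.can_fetch(bot, f"https://{domain}/"):
                counts[bot] += 1
    return counts


counts = count_blocking_domains(SAMPLE_ROBOTS, AI_BOTS)
for bot in AI_BOTS:
    print(bot, counts[bot])
```

In this made-up sample only GPTBot and Amazonbot are ever disallowed, which mirrors the pattern described above where some high-volume bots are rarely named in robots.txt at all.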



In September 2023, Cloudflare introduced a feature that lets site owners block AI bots that properly follow robots.txt. However, a prohibition written in robots.txt applies only to the user agent a bot declares, so a bot that spoofs its user agent can bypass it.

Cloudflare has therefore announced a new feature that blocks all AI bots with one click, whether or not they comply with robots.txt.
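The weakness of relying on robots.txt alone can be seen in a small Python sketch. The robots.txt content, the URL, and the helper `is_allowed` are assumptions made for illustration; the check uses the standard library's `urllib.robotparser`, which matches rules against whatever user agent string the client declares.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that prohibits two AI bots by name.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="https://site.example/article"):
    """Return True if robots.txt permits the declared user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A bot that identifies itself honestly is covered by the rules.
honest = is_allowed(ROBOTS_TXT, "GPTBot")

# The same bot claiming to be a browser matches no rule, so nothing forbids it.
spoofed = is_allowed(ROBOTS_TXT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

print("declared as GPTBot:", honest)
print("declared as a browser:", spoofed)
```

Because robots.txt is only a request that the crawler must choose to honor, blocking a non-compliant bot has to happen on the server or network side, based on signals other than the declared user agent.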



Cloudflare analyzes AI bot traffic and adjusts its AI bot detection accordingly. The AI bot blocking feature will be updated automatically whenever Cloudflare finds new indicators of malicious AI bots that it identifies as scraping the web on a wide scale. In addition, Cloudflare has set up a form for reporting suspected AI bots.

'Customers don't want AI bots visiting their websites, especially not bots that are engaging in fraudulent activities. We are concerned that some AI companies will continue to adapt to evade bot detection by trying to get around the rules to access content,' Cloudflare said.

in Software, Web Service, Posted by log1i_yk