Cloudflare's New Tool Shields Websites from AI Scraping

Cloudflare, a publicly traded cloud service provider, has introduced a new, free tool to protect websites on its platform from bots that scrape content for AI model training data.

Certain AI vendors, such as Google, OpenAI, and Apple, let website owners opt out of data scraping and model training by updating their site's robots.txt file, which tells bots which pages they may access. However, in the post announcing its bot-blocking tool, Cloudflare notes that not all AI scrapers respect this protocol.
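For example, a site owner who wants to opt out of AI training crawls can add directives like the following to robots.txt. The user-agent tokens shown here (GPTBot, Google-Extended, and Applebot-Extended) are the ones these vendors have published for their AI-related crawlers; the exact tokens a site needs to list depend on which crawlers it wants to exclude.

```
# Opt out of AI training crawls by these vendors' published crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Compliance with these directives is voluntary, which is exactly the gap Cloudflare says its tool is meant to close.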

The company writes on its official blog, “Customers don’t want AI bots visiting their websites and especially those that do so dishonestly. We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection.”

To tackle this issue, Cloudflare examined AI bot and crawler traffic to enhance its automatic bot detection models. These models consider various factors, including whether an AI bot is attempting to disguise itself as a legitimate web browser user by mimicking the appearance and behavior of a real visitor.

“When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. Based on these signals, our models [can] appropriately flag traffic from evasive AI bots as bots,” Cloudflare writes.
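The specifics of those models aren't public, but the general idea of fingerprinting scripted clients can be illustrated with a deliberately simplified, hypothetical sketch: a request that claims to come from a browser while carrying telltale signs of an automation framework gets flagged. Cloudflare's production systems rely on far richer signals (network-level fingerprints, behavioral data, global traffic patterns) than a header check like this.

```python
# Toy illustration only -- not Cloudflare's actual detection logic.
# Flags requests whose headers suggest a scripted client posing as a browser.

AUTOMATION_MARKERS = ("python-requests", "scrapy", "headlesschrome", "curl", "go-http-client")

def looks_like_evasive_crawler(headers: dict[str, str]) -> bool:
    ua = headers.get("User-Agent", "").lower()
    # Known automation frameworks identify themselves in the User-Agent string.
    if any(marker in ua for marker in AUTOMATION_MARKERS):
        return True
    # A client claiming to be a browser but missing headers real browsers
    # routinely send (e.g., Accept-Language) is another common signal.
    if "mozilla" in ua and "Accept-Language" not in headers:
        return True
    return False

# Example: a scripted client spoofing a browser User-Agent with a bare header set.
print(looks_like_evasive_crawler({"User-Agent": "Mozilla/5.0"}))  # True
```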

Cloudflare has also set up a reporting mechanism that lets hosts flag suspected AI bots and crawlers, and says it will maintain a manually curated blacklist of these bots.

The issue of AI bots has become increasingly pressing due to the surge in demand for training data, which is driven by the rapid growth of generative AI.

Numerous websites concerned about AI vendors using their content for model training without permission or compensation have chosen to block AI scrapers and crawlers. Studies indicate that around 26% of the top 1,000 websites have blocked OpenAI’s bot, and that more than 600 news publishers have done the same.

However, blocking is not a foolproof solution. Some AI vendors seemingly disregard standard bot exclusion protocols to gain a competitive edge in the AI landscape. Recent allegations suggest that Perplexity, an AI search engine, disguised its crawlers as legitimate visitors to scrape website content, while OpenAI and Anthropic have reportedly ignored robots.txt rules on occasion.

In a letter to publishers last month, TollBit, a content licensing startup, revealed that it sees “numerous AI agents” disregarding the robots.txt standard.

Cloudflare’s tool could offer a solution, but its effectiveness hinges on accurately identifying covert AI bots. It also doesn’t address the underlying issue: publishers may lose valuable referral traffic from AI-powered tools like Google’s AI Overviews, which exclude sites from their summaries if those sites block specific AI crawlers.