Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)

Hundreds of sites have put Anthropic’s old scrapers on their blocklists while leaving its new, active crawler unblocked.

Update 7/30/24: After this story was originally published, an Anthropic spokesperson told 404 Media that CLAUDEBOT will respect block requests aimed at its two older crawlers. “The 'ANTHROPIC-AI' and 'CLAUDE-WEB' user agents are no longer in use,” the spokesperson said. “We have configured ClaudeBot, our centralized user agent, to respect any existing robots.txt directives that were previously set for these deprecated user agents. This attempts to respect website owners' preferences, even if they haven't updated their robots.txt files.” The original text of this story follows below:

Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy-pasting outdated instructions into their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt files.

In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic’s real (and new) scraper bot unblocked. 

This is an example of “how much of a mess the robots.txt landscape is right now,” the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly shifting landscape of web crawlers and scrapers—many of them operated by AI companies—and helps website owners regularly update their robots.txt files to prevent specific types of scraping. The site has seen a huge increase in popularity as more people try to block AI from scraping their work.

“The ecosystem of agents is changing quickly, so it’s basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively,” they added.

Dark Visitors tracks hundreds of web crawlers and scrapers, attempts to explain what each scraper does, and lets website owners constantly update their site’s robots.txt file, which is a set of instructions that tells bots if they have permission to crawl a site. We have seen time and time again that AI companies will often find surreptitious ways of crawling sites that they aren’t supposed to, or, in some cases, they simply ignore robots.txt. This has led some sites to block all crawlers regardless of what they do, or to only specifically allow a few select ones (Reddit is now only being crawled by Google because of this). This can have the effect of blocking search engines, internet archiving tools, and academic research, even if that wasn’t the website owner’s intention. 
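For illustration (this is a hypothetical robots.txt excerpt of our own, not taken from any particular site), each User-agent line names a crawler, and the Disallow lines beneath it tell that crawler what it may not fetch:

```
# Hypothetical example: block one named crawler, leave everything else alone
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
```

A rule only applies to bots whose user agent matches the name on the User-agent line, which is why a file that lists only retired names does nothing against a crawler that has since been renamed.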

In Anthropic’s case, the robots.txt files of some popular websites, including Reuters.com and the Condé Nast family of websites, block two AI scraper bots called “ANTHROPIC-AI” and “CLAUDE-WEB,” bots that Anthropic once operated for its Claude AI chatbot. But Anthropic’s current, active crawler is called “CLAUDEBOT,” and neither Reuters nor Condé Nast, for example, blocks it. This means that these websites—and hundreds of others that have copy-pasted old blocklists—are not actually blocking Anthropic.
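As a rough sketch of the mismatch (our own illustration, not code from any of the sites involved), Python’s standard-library robots.txt parser shows how a file written against the retired names leaves the current crawler untouched; the file contents and URL below are hypothetical:

```python
# Minimal sketch: a robots.txt written against Anthropic's retired user agents
# does not restrict its current crawler, CLAUDEBOT. Hypothetical example only.
from urllib.robotparser import RobotFileParser

OUTDATED_ROBOTS_TXT = """
User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /
"""

parser = RobotFileParser()
parser.parse(OUTDATED_ROBOTS_TXT.splitlines())

for agent in ("anthropic-ai", "Claude-Web", "ClaudeBot"):
    allowed = parser.can_fetch(agent, "https://example.com/some-article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

# Output:
#   anthropic-ai: blocked
#   Claude-Web: blocked
#   ClaudeBot: allowed   <- the crawler Anthropic actually uses today
```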

Last week, repair guide site iFixit said that Anthropic’s crawlers had hit its website nearly a million times in one day, and the coding documentation deployment service Read the Docs published a blog post saying that various crawlers had hit its servers at a huge scale. One crawler, it said, accessed 10 TB worth of files in a single day and 73 TB total in May: “This cost us over $5,000 in bandwidth charges, and we had to block the crawler,” they wrote. “We are asking all AI companies to be more respectful of the sites they are crawling. They are risking many sites blocking them for abuse, irrespective of the other copyright and moral issues that are at play in the industry.” 

The Anthropic finding was published in a paper by the Data Provenance Initiative that more broadly shows the pervasive confusion content creators and website owners face when trying to keep their work from being used to train AI tools. The onus of blocking AI scrapers falls entirely on website owners, and the number of scrapers is constantly increasing. New scraper bots—often identified by their “user agent” names—are popping up all the time, AI companies sometimes ignore the stated wishes of website owners, and bots that are seemingly connected to well-known companies sometimes aren’t connected to them at all.

In its paper, the Data Provenance Initiative wrote that “the origin and reason for these unrecognized agents [ANTHROPIC-AI and CLAUDE-WEB] remains unclear—Anthropic reports no ownership of these.” Originally, the Data Provenance Initiative was unsure whether these bots were operated by Anthropic at all, and there’s not much public evidence that ANTHROPIC-AI ever existed besides the fact that it had been widely circulated on robots.txt block lists, which are often copy-pasted from site to site. 

Anthropic told 404 Media that both ANTHROPIC-AI and CLAUDE-WEB were old crawlers once used by the company but are no longer in use. Anthropic did not answer a question about whether its real agent, CLAUDEBOT, respects the robots.txt directives of sites that have blocked CLAUDE-WEB or ANTHROPIC-AI, or about when the switch was made. But the operator of Dark Visitors said that CLAUDE-WEB was in operation until very recently; they had seen it hit their test website as recently as July 12.

“These inconsistencies and omissions across AI agents suggest that a significant burden is placed on the domain creator to understand evolving agent specifications across (a growing number of) developers,” the Data Provenance Initiative report noted.

Shayne Longpre, the lead author of the study, told me that “there are many, many websites that are listing that they’re blocking the fake Anthropic agents, but they’re not listing CLAUDEBOT, the actual Anthropic agent. So this means websites are actually not blocking the crawlers that they think they are blocking.”

Robb Knight, a software developer who found that Perplexity was circumventing robots.txt to scrape websites it wasn’t supposed to, told 404 Media there are many cases where it’s hard to tell what a user agent does or who operates it. “What’s happening to people, including me, is copy-pasting lists of agents without verifying every agent is a real one,” he said. Knight added that the Wall Street Journal and many News Corp-owned websites are currently blocking a bot called “Perplexity-ai,” which may or may not even exist (Perplexity’s crawler is called “PerplexityBot”).

“We couldn't see any evidence of this agent anywhere,” he said. “My guess is at some point someone at a News Corp property added this user agent then it's been copied over to other sites under their ownership.”

Other experts I spoke to agreed that the current user agent landscape is very confusing, but said that most webmasters can and should err on the side of being aggressive about blocking suspected AI crawlers, because “blocking” an agent that doesn’t exist doesn’t cause any harm. 

“If they don’t end up existing, blocking them has no impact,” Walter Haydock, CEO of the cybersecurity company StackAware, said. “More broadly, this shows that there is a lot of confusion and uncertainty over how AI training does (and should) work. Blocking agents from AI companies relies on the companies respecting robots.txt files. And also knowing about all of the AI scraping agents out there. The combined likelihood of this happening is pretty low for most organizations, so I anticipate more creators are going to move their content behind paywalls to prevent unrestrained scraping.”

Cory Dransfeldt, a software developer who maintains an AI bot blocklist on GitHub, said that “my inclination is to err on the side of being aggressive in blocking bots given the behavior of companies like Perplexity.”

“There's absolutely a good deal of copying and pasting of [robots.txt] lists going on,” he said. “The folks I've spoken to have been frustrated by the tech industry's broad embrace of web scraping and are looking for ways to combat it.”
