Anthropic AI Scraper Hits iFixit’s Website a Million Times in a Day

“We're just the largest database of repair information in the world, no big deal if they take it all without asking and swamp our servers in the process.”
Image: Jason Koebler

The web scraper bot for Anthropic’s AI chatbot Claude hit iFixit’s website nearly a million times in a single day, despite the repair database having terms of service provisions that state “reproducing, copying or distributing any Content, materials or design elements on the Site for any other purpose, including training a machine learning or AI model, is strictly prohibited without the express prior written permission of iFixit.”

iFixit CEO Kyle Wiens tweeted Wednesday, “Hey @AnthropicAI: I get you're hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours? You're not only taking our content without paying, you're tying up our devops resources. Not cool.”

Wiens sent me server logs that showed thousands of requests per minute over a period of several hours. “We're just the largest database of repair information in the world, no big deal if they take it all without asking and swamp our servers in the process,” he told me, adding that iFixit’s website has millions of pages in total. These include repair guides and their revision histories, blog and news posts, research, forums, community-contributed repair guides, and question-and-answer sections.

This sort of scraping has become commonplace, and a recent study by the Data Provenance Initiative shows that website owners are increasingly trying to signal to AI companies that they do not want their content scraped to train commercial AI tools. Wiens said that iFixit modified its robots.txt file this week to specifically block Anthropic’s crawler bots.

This is particularly notable because, when I asked Anthropic about the fact that its bot hit iFixit a million times in a day, I was sent a blog post by the company that puts the onus on website owners to specifically block Anthropic’s crawler, called ClaudeBot. 

“As per industry standard, Anthropic uses a variety of data sources for model development, such as publicly available data from the internet gathered via a web crawler,” the blog post reads. “Our crawling should not be intrusive or disruptive. We aim for minimal disruption by being thoughtful about how quickly we crawl the same domains and respecting Crawl-delay where appropriate.”
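
To put the crawl rate in perspective, here is a minimal Python sketch of how a crawler could read a Crawl-delay directive and what request volume that delay would permit. The robots.txt contents, the 10-second value, and the helper names are assumptions for illustration; they are not taken from iFixit's file or from Anthropic's crawler.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical robots.txt asking ClaudeBot to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10
Allow: /
"""


def crawl_delay_for(robots_txt: str, user_agent: str) -> Optional[int]:
    """Return the Crawl-delay (in seconds) that applies to user_agent, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_requests_per_day(delay_seconds: int) -> int:
    """Upper bound on daily requests for a crawler that honors the delay."""
    return SECONDS_PER_DAY // delay_seconds


delay = crawl_delay_for(ROBOTS_TXT, "ClaudeBot")
if delay is not None:
    print(f"Crawl-delay: {delay}s, at most {max_requests_per_day(delay)} requests/day")
```

Under these assumed values, a crawler honoring a 10-second delay would make at most 8,640 requests in a day. A million requests in 24 hours, by contrast, would average roughly 11.6 per second.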

The post adds that “opting out of being crawled by ClaudeBot requires modifying the robots.txt file” to block its crawler. In other words, telling companies not to scrape content through terms of service alone does nothing in practice unless a website is willing to sue the AI company.
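
For a sense of what that opt-out looks like, the sketch below uses Python's standard urllib.robotparser module to check a robots.txt file that disallows ClaudeBot while leaving other crawlers alone. The file contents, the example.com URL, and the is_allowed helper are hypothetical; this is not iFixit's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block ClaudeBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/Guide"  # placeholder URL, an assumption
print("ClaudeBot allowed:", is_allowed(ROBOTS_TXT, "ClaudeBot", url))
print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

Note that robots.txt is advisory: a rule like this only has an effect if the crawler chooses to fetch the file and obey it.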

Across the board, AI companies almost never respect terms of service, which is notable because many of them have very long terms of service agreements of their own that sometimes restrict what users can do. In a paper published last week that we’ve already written about a few times, researchers at the Data Provenance Initiative found that many websites use their terms of service to request that their content not be scraped, but that these requests often have no effect.

This is a shame, lead author Shayne Longpre told me, because terms of service allow website owners to be more nuanced than robots.txt does about the types of crawlers they want to allow or block.

“The tragedy is that terms of service are specific and nuanced, but not machine readable and robots.txt is machine readable, but incredibly coarse and unspecific,” Longpre said. “With terms of service, I suspect the only ones that are being complied with are the very large companies that maybe have filed lawsuits, but they seem to be otherwise ignored.”