
The Backlash Against AI Scraping Is Real and Measurable

In the last year, the number of websites specifically restricting OpenAI and other AI scraper bots has gone through the roof.
Image: Avram Swam/Unsplash

There has been a real backlash to AI companies’ mass scraping of the internet to train their tools, one that can be measured by the number of website owners specifically blocking AI company scraper bots, according to a new analysis by researchers at the Data Provenance Initiative, a group of academics from MIT and universities around the world.

The analysis, published Friday and titled “Consent in Crisis: The Rapid Decline of the AI Data Commons,” found that, in the last year, “there has been a rapid crescendo of data restrictions from web sources” blocking web scraper bots (sometimes called “user agents”) from collecting their content for AI training.

Specifically, about 5 percent of the 14,000 websites analyzed had modified their robots.txt file to block AI scrapers. That may not seem like a lot, but 28 percent of the “most actively maintained, critical sources,” meaning websites that are regularly updated and are not dormant, have restricted AI scraping in the last year. In addition to robots.txt restrictions, an analysis of these sites’ terms of service found that many have also added AI scraping restrictions to those documents in the last year.

This change has happened almost entirely within the last year, the researchers found. In mid-2023, about 1 percent of websites in the researchers’ sample had fully restricted AI scraping. Now, they estimate about 5-7 percent do, and say that number only captures “fully restricted” domains, meaning those whose robots.txt file does not allow any AI scraping.

The study, led by Shayne Longpre of MIT and done in conjunction with a few dozen researchers at the Data Provenance Initiative, called this change an “emerging crisis” not just for commercial AI companies like OpenAI and Perplexity, but for researchers hoping to train AI for academic purposes. The New York Times said this shows that the data used to train AI is “disappearing fast.”

“I think it’s a pretty incredible rise in consent being revoked for crawlers, that we haven’t seen to this degree before and now it’s happened all in less than a year,” Longpre told 404 Media. “I think it’s a reflection of people trying to protect their own markets, economies, and livelihoods, or, in some cases, of people that are looking to commercialize their data in specific ways.” 

The analysis suggests that the backlash to AI tools from content creators and website owners who do not want their work used for AI training without permission or compensation is not only real but increasingly widespread. The analysis also highlights the limitations of robots.txt, which is essentially a text file that tells bots which parts of a website, if any, they are allowed to access. While many companies respect robots.txt instructions, some do not. Several large companies, including Perplexity, have been caught using proxies and surreptitious user agents to circumvent robots.txt.
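
As a rough illustration of what those restrictions look like, a site owner who wanted to opt out of AI training crawls while still allowing conventional search indexing might add entries like the following to robots.txt. (GPTBot, Google-Extended, and CCBot are the user agent tokens published by OpenAI, Google, and Common Crawl for crawlers whose data can feed AI training; the exact list a site would need to maintain varies and changes over time.)

    # Ask OpenAI's training crawler to stay out of the whole site
    User-agent: GPTBot
    Disallow: /

    # Ask Google's AI-training crawler to stay out (regular Googlebot search indexing is governed separately)
    User-agent: Google-Extended
    Disallow: /

    # Ask Common Crawl's crawler to stay out; its archives are widely used for AI training
    User-agent: CCBot
    Disallow: /

    # Everything else may crawl the whole site
    User-agent: *
    Disallow:

Crucially, these are requests rather than access controls: a crawler that ignores the file, or that identifies itself under a different user agent, is not technically prevented from scraping anything.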

Some websites have chosen to restrict only specific AI companies’ scraper bots. In some cases this is happening because website owners—Reddit comes to mind—have signed deals with specific AI companies that allow them to scrape their content for use in AI training while restricting others. Other times, website owners may not know the full breadth of AI companies that have web crawler bots, and so may not know which ones they should block.
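
A selective policy of that kind is expressed the same way, just per user agent. The snippet below is a hypothetical sketch rather than any particular site’s actual file: it admits one AI crawler, standing in for a licensed partner, while shutting another out entirely.

    # Hypothetical: a crawler belonging to a licensed partner is allowed everywhere
    User-agent: Google-Extended
    Disallow:

    # Hypothetical: a non-partner AI crawler is blocked from the whole site
    User-agent: GPTBot
    Disallow: /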

OpenAI’s scraper bots are the most commonly blocked, according to the study. If this trend holds, it is possible that OpenAI’s popularity and size will actually start to work against it as other upstarts fly under the radar and are not blocked.

“There’s asymmetries in what’s being blocked, because it’s a huge burden on website owners who can’t go through hundreds, sometimes thousands of crawlers to figure out which ones they want to block,” Longpre said. “So OpenAI is by far the most blocked, and other companies are flying under the radar. It’s unclear how this will play out because the big companies will maybe start to license that data. Others decide to just ignore the robots.txt anyway.”

Longpre says he doesn’t know what will happen to commercial AI companies as more websites block scraping. Some experts have theorized that AI companies will essentially run out of new text to scrape, and that we will stop seeing major improvement between generations of generative AI models. Another possibility is that different companies’ models will start to differentiate themselves based on which data they were able to license. 

“The gap between companies that can license data and everyone else who can’t is already starting to show,” he said.

Longpre worries, however, that academics creating open source models to perform research will be left in the dust because robots.txt does not allow for the easy differentiation between commercial scraping and academic scraping (it is also worth noting that some training data sets originally created for academic use have been misappropriated into commercial models). He also worries that website owners will begin to block all types of scrapers, even if they are not explicitly intended to create commercial AI tools. Search engines and web archives could, for example, begin to get blocked. 

“We would love a world where people can in detail express their preferences in a way that could be machine readable—things like commercial vs non commercial, ‘requires attribution,’” Longpre said. “In general, this access to study the web, even if it’s not for AI, is a really important thing for transparency, accountability, journalism, and even web archives, which are being hit by this. So you have this weird dilemma where prosocial uses of this information are already being impacted.” 
