
Developer Creates Infinite Maze That Traps AI Training Bots

"Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself."

A pseudonymous coder has created and released an open source “tar pit” that traps AI training web crawlers indefinitely in an endless series of randomly generated pages, wasting their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants that trap and consume their prey, can be deployed by website owners to protect their own content from being scraped, or deployed “offensively” as a honeypot to waste AI companies’ resources.

“It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself,” Aaron B, the creator of Nepenthes, told 404 Media. 

“Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time,” they added. “But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop.”
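The loop Aaron B describes can be illustrated with a short Python sketch. This is not Nepenthes's actual code: the `/maze` prefix, the function names, and the choice to seed the random generator from the requested path are all assumptions made for illustration. It shows a page generator whose links always lead back into the same maze, and a naive crawler that follows every link it sees and therefore never runs out of pages.

```python
import hashlib
import random
from collections import deque

# Hypothetical URL prefix under which the maze is served (an assumption).
MAZE_PREFIX = "/maze"


def maze_links(path: str, count: int = 8) -> list[str]:
    """Return pseudo-random links that all point back into the maze."""
    # Seeding from the path is an illustrative choice: the same URL
    # always produces the same set of links.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [f"{MAZE_PREFIX}/{rng.getrandbits(48):012x}" for _ in range(count)]


def render_page(path: str) -> str:
    """Render a minimal HTML page that is nothing but a list of maze links."""
    items = "\n".join(
        f'<li><a href="{url}">{url}</a></li>' for url in maze_links(path)
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


def naive_crawl(start: str, max_pages: int) -> int:
    """Follow every discovered link, breadth-first; return pages fetched.

    The max_pages cap exists only so the demonstration terminates.
    """
    queue = deque([start])
    seen = {start}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        for link in maze_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

Calling `naive_crawl("/maze", 200)` stops only because of the page cap: every page fetched adds more unseen links to the queue than it removes, so a crawler with no loop detection has no natural stopping point.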

Human users can see how Nepenthes works by clicking here, though I must warn that the page loads incredibly slowly (on purpose) and links endlessly to pages that load the same way. It looks like this, in practice:

[Video, 0:31]

Aaron B’s website says “THIS IS DELIBERATELY MALICIOUS CODE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN’T FULLY COMFORTABLE WITH WHAT YOU’RE DOING.” It also notes it can be deployed “defensively” to “flood out valid URLs within your site’s domain name, making it unlikely the crawler will access the real content” and “offensively” to actively trap and waste computing power: “Let's say you've got horsepower and bandwidth to burn, and just want to see these AI models burn. Nepenthes has what you need … In short, let them suck down as much bullshit as they have diskspace for and choke on it.”

We have previously written about the difficulty that website owners have had in blocking the web crawlers that train large language models. It is possible to use robots.txt to ask specific bots not to crawl a webpage, but different companies use different bots, the names of those bots often change, and some companies do not honor robots.txt requests or find ways to get around them. Nearly endless internet art projects have proven particularly difficult for bots to crawl; last year, the man who wrote The Internet for Dummies had “the world’s lamest content farm,” a website made up of billions of interconnected single-page sites, hit more than 3 million times by OpenAI’s training bot in a single day. Anthropic’s AI scraper later hit the DIY repair company iFixit more than a million times in a day.
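The robots.txt weakness mentioned above, that rules are tied to bot names and names change, can be seen in a small sketch using Python's standard `urllib.robotparser`. The bot names `ExampleAIBot` and `RenamedAIBot` and the example.com URL are made up for illustration and do not refer to any real crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one named bot and allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(agent: str, url: str) -> bool:
    """Report whether robots.txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


URL = "https://example.com/article"
blocked_bot = allowed("ExampleAIBot", URL)   # the name listed in robots.txt
renamed_bot = allowed("RenamedAIBot", URL)   # same crawler under a new name
```

Here `blocked_bot` is `False` while `renamed_bot` is `True`: the rule only matches the listed name, and even then it is a request that a well-behaved crawler chooses to honor, not something the server enforces.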

“Hearing these stories recently definitely pushed me into putting out a release,” Aaron B said. “It's also sort of an art work, just me unleashing sheer unadulterated rage at how things are going. I was just sick and tired of how the internet is evolving into a money extraction panopticon, how the world as a whole is slipping into fascism and oligarchs are calling all the shots - and it's gotten bad enough we can't boycott or vote our way out, we have to start causing real pain to those above for any change to occur.”

Since they made and deployed a proof-of-concept, Aaron B said their pages have been hit millions of times by internet-scraping bots. On a Hacker News thread, someone claiming to be an AI company CEO said a tarpit like this is easy to avoid; Aaron B told 404 Media “If that’s true, I’ve several million lines of access log that says even Google Almighty didn’t graduate” to avoiding the trap.
