AI Music Generator Suno Admits It Was Trained on ‘Essentially All Music Files on the Internet’

Aug 1, 2024 at 4:52 PM

"Suno’s training data includes essentially all music files of reasonable quality that are accessible on the open internet."

The AI music generator company Suno has admitted that its product was trained on “essentially all music files of reasonable quality that are accessible on the open internet,” which included a total of “tens of millions of recordings.”

While not surprising, the admission, which was made as part of court proceedings responding to a massive recording industry lawsuit against the company, shows yet again that many AI tools are trained on, essentially, anything that companies can get their hands on. It shows that the general goal is to consume the sum total of human knowledge and output to create AI tools that then often compete with humans. Rather than trying to argue that Suno was not trained on copyrighted songs, the company is instead making a fair use argument to say that the law should allow for AI training on copyrighted works without permission or compensation.

“It is no secret that the tens of millions of recordings that Suno’s model was trained on presumably included recordings whose rights are owned by the Plaintiffs in this case,” Suno wrote in a filing Thursday, adding that Suno was trained on “as many [recordings] as can be located … Accordingly, Suno’s training data includes essentially all music files of reasonable quality that are accessible on the open internet, abiding by paywalls, password protections, and the like, combined with similarly available text descriptions.”

Suno’s response does not explain exactly how it scraped the songs, but says that its tool was “constructed by showing the program tens of millions of instances of different kinds of recordings gathered from publicly available sources.”

“This case is not a ‘whodunnit.’ Irrespective of whether UMG’s [Universal Music Group’s] particular version of ‘Johnny B. Goode’ was in Suno’s training data, many UMG recordings probably were given the massive size of the catalog UMG has assembled through decades of M&A transactions,” the company added.

Suno also wrote that many labels’ licensing agreements do not allow for the training of AI on copyrighted songs, which the company says is an example of labels “trying to leverage their exclusive rights under copyright law to strong-arm music users into categorically avoiding artificial intelligence products.” Suno has positioned itself as a company that is trying to fight record label monopolies to “empower many more people to create music” using AI tools.

The admissions in the filing are both stunning in their scale and not at all surprising given what we know about how AI tools work and the numerous articles we and others have published showing both brazen and surreptitious attempts to train generative AI tools on the collective output of humankind. This particular lawsuit, as we wrote earlier and discussed on our podcast, will likely be one of the most important tests of whether AI companies will be able to convince a court that mass data scraping for the purpose of a for-profit, generative product that competes with humans is a protected, transformative fair use of copyrighted material.

The filing shows that Suno intends to make an argument that its product is “designed for originality … to create new songs that didn’t and often couldn’t previously exist.”

In a press release issued Thursday, the Recording Industry Association of America wrote, “After months of evading and misleading, defendants have finally admitted their massive unlicensed copying of artists’ recordings. It’s a major concession of facts they spent months trying to hide and acknowledged only when forced by a lawsuit.”

About the author

Jason is a cofounder of 404 Media. He was previously the editor-in-chief of Motherboard. He loves the Freedom of Information Act and surfing.