Music Used to Train AI: The Atlantic's Searchable Database

The Atlantic Exposes the Music Behind AI: A Searchable Database Goes Public

Artificial intelligence is transforming the music industry at a breathtaking pace — generating melodies, mimicking artists, and composing entire tracks from scratch. But one critical question has lingered in the background for years: whose music is actually being used to teach these AI systems? Thanks to investigative work by Atlantic reporter Alex Reisner, that question now has a far more concrete answer. Reisner uncovered four significant datasets of music that have been used to train AI models and, in a landmark move for transparency, made them fully searchable by the public.

This development has sent shockwaves through the music community, reigniting conversations about copyright, consent, and compensation in the age of artificial intelligence. Here is everything you need to know about what was found, who is implicated, and why it matters.

What Did The Atlantic Actually Find?

Reisner's investigation identified four distinct datasets that researchers and AI developers have been using as training material for machine learning models focused on audio and music generation. The scale of these datasets is staggering.

Dataset 1: Approximately 12 million tracks — one of the largest music training datasets ever reported publicly.
Dataset 2: Roughly 9 million tracks, making it the second-largest dataset identified in the investigation.
Datasets 3 and 4: Each containing over 100,000 songs, which, while smaller in comparison, still represent a substantial volume of copyrighted and licensed material.

Together, these four datasets account for more than 21 million individual music tracks, assuming the collections do not overlap. To put that in perspective, Spotify's entire catalog sits at around 100 million songs, so the combined training datasets are roughly a fifth the size of that catalog.
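The comparison above can be checked with quick arithmetic. The sketch below uses only the approximate figures reported in this article. It treats "over 100,000" as exactly 100,000 (a lower bound) and assumes no track appears in more than one dataset, which the reporting does not confirm. The variable names are illustrative.

```python
# Approximate track counts as stated in the article.
# Datasets 3 and 4 are "over 100,000" each, so these are lower bounds.
dataset_sizes = {
    "dataset_1": 12_000_000,
    "dataset_2": 9_000_000,
    "dataset_3": 100_000,
    "dataset_4": 100_000,
}

# Approximate Spotify catalog size cited in the article.
SPOTIFY_CATALOG = 100_000_000

# Simple sum; assumes no duplicates across datasets (an assumption).
combined_total = sum(dataset_sizes.values())
share_of_catalog = combined_total / SPOTIFY_CATALOG

print(f"Combined tracks (no deduplication): {combined_total:,}")
print(f"Relative to a 100M-song catalog: {share_of_catalog:.1%}")
```

If the datasets share tracks, the true number of distinct songs would be lower than this sum, so the ratio is best read as a rough upper estimate of scale and not as a measured share of Spotify's catalog.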

The Atlantic's searchable tool allows artists, songwriters, record labels, and curious members of the public to look up whether their music appears in any of these collections. Direct access to previously opaque data of this kind is rare, and it could prove significant for future legal and regulatory efforts.

Which AI Companies Have Used These Datasets?

One of the most consequential aspects of Reisner's reporting is that it is not purely speculative. While it is impossible to determine with certainty every organization that has downloaded and used these datasets — they have reportedly been downloaded thousands of times — two major technology companies have explicitly confirmed their use in published research papers.

Google and Stability AI have both acknowledged utilizing at least some of this training data. Google, the parent company of YouTube and a leading force in AI development, and Stability AI, known for its generative AI tools, are among the most prominent names in the industry. Their confirmation lends significant weight to the argument that large-scale, potentially unauthorized use of copyrighted music in AI training is not a fringe concern: some of the industry's largest players have done it.

The acknowledgment by these firms, even if buried in academic research papers rather than public announcements, provides ammunition for artists and rights holders who have been pursuing legal action against AI companies over intellectual property violations.

Are These Datasets Legal? The Copyright Gray Zone

The legality of using music to train AI models sits in a deeply contested gray area of intellectual property law. Some of the sources feeding into these datasets, such as the Free Music Archive, are explicitly free to stream for personal use. However, personal streaming rights are very different from the rights required to feed millions of audio files into a machine learning algorithm designed to replicate musical patterns and styles.

Under current copyright law in the United States, the doctrine of fair use may provide some protection for AI companies that argue their training constitutes transformative use. However, courts have not definitively settled this question. Several high-profile lawsuits, including cases involving visual art generators, are working their way through the legal system and could set precedents that directly affect music AI as well.

Many artists and industry advocates argue that regardless of legal technicalities, using someone's creative work to build a commercial product without their knowledge, permission, or compensation is ethically indefensible. The Atlantic's database now gives those artists a way to check whether their work has been part of this process.

Why This Matters for Artists and the Music Industry

The implications of Reisner's findings extend far beyond a single investigative article. For working musicians, producers, and songwriters — many of whom are already struggling in a streaming economy that pays fractions of a cent per play — the prospect of their life's work being harvested for free to train competitors is deeply alarming.

The searchable database gives individual artists a tool for accountability. Rather than relying entirely on class-action lawsuits or regulatory bodies to act on their behalf, a musician can now directly check whether their catalog has been included in datasets confirmed to be used by companies like Google. That information could be critical in future litigation or licensing negotiations.

For the music industry as a whole, this moment represents a potential inflection point. Labels and publishing companies have significantly more legal resources than independent artists, and now that the training data pipeline is more visible, expect the pace of legal challenges to accelerate considerably.

The Broader Conversation About AI Transparency

The Atlantic's work is part of a wider and growing demand for transparency in how AI systems are built. Across creative industries — visual art, literature, journalism, and now music — the same fundamental complaint is emerging: AI companies have industrialized the scraping of human creative output without disclosure, compensation, or consent.

Regulatory bodies in the European Union are already moving toward stricter disclosure requirements for AI training data under the EU AI Act. In the United States, the Copyright Office has been soliciting public comments on AI and copyright, signaling that federal guidance may eventually follow.

Making these datasets publicly searchable is a meaningful act of journalistic accountability. It shifts the conversation from abstract policy debates to something tangible — a list of songs, a name you recognize, a track you recorded in your bedroom that ended up teaching a machine how to sound like you.

What Comes Next?

The discovery and publication of these datasets is unlikely to be the end of the story. As AI audio tools become more sophisticated and more commercially valuable, the stakes around training data will only rise. Artists and advocacy groups will continue pushing for opt-in frameworks that require explicit consent before any music can be used for AI training. Some companies may begin offering licensing deals as a proactive measure to avoid litigation. Others may double down on fair use arguments in court.

What The Atlantic has done is ensure that the conversation can no longer happen in the dark. The music is out there — and now, so is the evidence.