The Atlantic Uncovers OpenSubtitles Data Set at the Heart of AI Training
A treasure trove of Hollywood dialogue has been repurposed to train AI systems, raising questions about transparency and creative rights
Hollywood writing has long shaped popular culture, and its role in shaping the capabilities of artificial intelligence is now hard to dispute. The Atlantic found that dialogue from movies and TV shows has been used by major tech companies, including Apple and Anthropic, to train AI systems. From classics like The Godfather to contemporary series like Breaking Bad, these works are part of extensive data sets that power AI-driven tools.
The scope of this usage is vast, and data sets derived from subtitle repositories such as OpenSubtitles.org are central to it. These subtitles, which often originate from DVDs, Blu-rays, or streaming services, are uploaded by users and offer a unique resource for AI training. Unlike scripts, subtitles capture the rhythm and nuances of spoken dialogue, which makes them valuable for developing chatbots and other generative AI systems capable of natural conversation.
A closer examination of these data sets shows how comprehensive they are. They include subtitles from over 53,000 films and 85,000 TV episodes, encompassing everything from Seinfeld and The Simpsons to the Academy Awards broadcasts. AI systems trained on such data can mimic styles and characters from across decades of storytelling. Companies including Anthropic, Meta, Nvidia, and Salesforce have used these subtitles in developing large language models (LLMs), among them Anthropic's Claude and Nvidia's NeMo Megatron. While many companies say their use is for research or open-source development, the potential for these models to replace human writers looms large.
The ethical and legal questions surrounding this practice are significant. Subtitles are generally considered derivative works, and their use without explicit permission could constitute copyright infringement. Writers, actors, and other creatives have expressed deep concerns over the lack of transparency in AI training processes. Vince Gilligan, creator of Breaking Bad, has referred to generative AI as a sophisticated form of plagiarism, encapsulating the unease felt by many in the creative community.
Legal battles are underway to address these issues, with lawsuits targeting the use of copyrighted material in AI training. Tech companies, however, argue that such practices fall under “fair use.” Courts have yet to settle this contentious debate, leaving a gray area that further complicates the relationship between AI and the creative arts.
The data set drawn from OpenSubtitles.org, the source of much of this material, was never intended for this purpose. It was initially aimed at facilitating multilingual translation and has since been co-opted for AI development. Its creator expressed a mix of surprise and resignation over its role in training LLMs, underscoring how little control artists and contributors have once their work enters the digital ecosystem.
The implications of this widespread data usage are profound. While these AI systems are undeniably innovative, their reliance on creative works raises questions about consent, attribution, and compensation. Writers and artists, whose work forms the backbone of these systems, are often left in the dark about how their contributions are being used and what they might be owed in return. As AI continues to advance, the debate over its relationship with human creativity is far from over.