Published in 2020, the YouTube Subtitles dataset includes subtitles from more than 12,000 videos that have since been deleted from YouTube; in at least one instance, a creator’s entire deleted online presence has been incorporated into numerous AI models. (Source: Image by RR)

Content from Popular Channels and Educational Institutions Used Without Permission

An investigation by Proof News has revealed that several major AI companies, including Apple, Nvidia, Anthropic, and Salesforce, have used subtitles from more than 173,000 YouTube videos, without the creators’ knowledge or consent, to train their AI models. The practice violates YouTube’s terms of service, which prohibit harvesting material from the platform without permission. As reported at proofnews.org, the dataset, called YouTube Subtitles, includes transcripts from educational channels like Khan Academy, media outlets such as the BBC and NPR, popular YouTube creators, and even channels promoting conspiracy theories.

Affected creators such as David Pakman and Dave Wiskus have voiced their frustration, noting that their work was used without compensation and could ultimately undermine their livelihoods. They argue that using their content without consent is theft and disrespectful, and that creators whose work feeds AI training deserve proper regulation and compensation. Representatives of the companies involved either declined to comment or justified their actions by saying the data was publicly available, even though collecting it violates YouTube’s terms of service.

The dataset forms part of a larger compilation called the Pile, which also draws on sources such as the European Parliament and Wikipedia. AI companies, including those with substantial financial backing, have used this data to train high-profile models, raising concerns about the ethics and legality of such datasets. The issue has already sparked lawsuits from authors whose works were similarly used without permission, and the ongoing litigation underscores the complex legal landscape surrounding AI training data.

Many creators remain deeply concerned about the future, fearing that AI could generate content similar to theirs and eventually replace them entirely. Pakman, for instance, encountered a fake Tucker Carlson video on TikTok that used his own words, down to the exact cadence, underscoring how convincingly AI can mimic human content. The episode captures a broader fear among creators: that AI-generated content could become indistinguishable from their own, leading to a proliferation of digital copycats that diminish the value and uniqueness of original work. Beyond the economic threat to their livelihoods, creators worry about losing control over their intellectual property altogether.

read more at proofnews.org