Meta secretly trained its AI models on a Russian ‘shadow library,’ court docs show

Meta’s use of the pirated-books database LibGen to train its AI language models has been revealed through court-ordered unredaction of documents, a significant development in an ongoing copyright lawsuit filed by authors.

The core revelation: Meta used Library Genesis (LibGen), a controversial database of pirated content, to train its AI models, despite internal concerns about the legality and optics of doing so.

  • Internal company discussions about using LibGen data were escalated to CEO Mark Zuckerberg
  • Meta employees expressed hesitation about accessing LibGen data from corporate laptops
  • The company’s AI team ultimately received approval to use the pirated materials

Legal context and implications: The case, Kadrey et al. v. Meta Platforms, is a pivotal test of how tech companies can legally use creative works for AI training.

  • Authors Richard Kadrey, Christopher Golden, and comedian Sarah Silverman filed the lawsuit in July 2023
  • Meta previously acknowledged using the Books3 dataset but had not disclosed its direct use of LibGen
  • The company maintains its actions fall under the “fair use” doctrine and disputes the plaintiffs’ claims
  • LibGen itself faces ongoing legal challenges, including a $30 million judgment in 2024

Court developments: Judge Vince Chhabria’s ruling against Meta’s redaction attempts highlights growing judicial scrutiny of AI companies’ transparency.

  • The judge described Meta’s redaction approach as “preposterous” and aimed at avoiding negative publicity
  • Meta was ordered to file unredacted versions of key documents
  • The court warned Meta against making further broad redaction requests
  • Plaintiffs argue Meta not only used copyrighted material without permission but also participated in its distribution through torrenting

Questions of precedent: The unfolding legal battle between content creators and tech companies raises fundamental questions about AI training practices and intellectual property rights.

  • The case could establish important precedents for how AI companies can legally access and use training data
  • The outcome may influence future AI development practices and relationships between tech companies and content creators
  • Meta’s use of pirated materials suggests potential challenges in legally sourcing comprehensive training data for AI models
