Meta’s Llama AI Model Has Nearly Memorized a Harry Potter Book, Research Reveals


A recent analysis shows that Meta’s Llama model has memorized “Harry Potter and the Sorcerer’s Stone” so thoroughly that it can reproduce verbatim passages from 42% of the text. Researchers from Stanford, Cornell, and West Virginia University examined a range of books in the Books3 dataset, a collection of pirated books used to train Meta’s Llama models. Books3 is also at the center of a copyright infringement lawsuit against Meta, Kadrey v. Meta Platforms, Inc. The findings have significant implications for AI companies facing similar lawsuits.

The study reveals that the Llama 3.1 model “retains some texts, such as Harry Potter and 1984, nearly in full.” Specifically, Llama 3.1 can reproduce verbatim excerpts from 42% of the first Harry Potter novel at least half of the time. In total, it can reproduce passages from 91% of the book, though less reliably.

“The degree of verbatim retention of works from the Books3 dataset is more pronounced than previously acknowledged,” the paper states. However, the researchers also found that “retention differs significantly from model to model and from text to text within each model, as well as fluctuating in various sections of individual texts.” For example, Llama 3.1 memorized only 0.13% of “Sandman Slim” by Richard Kadrey, a lead plaintiff in the class action copyright lawsuit against Meta.

While some of the results look damaging for Meta, they are not conclusive proof for plaintiffs in AI copyright infringement cases. “These findings provide all participants in the AI copyright discussion something to hold onto,” journalist Timothy B. Lee noted in his Understanding AI newsletter. “Contradictory findings like these may raise questions about the wisdom of grouping J.K. Rowling, Richard Kadrey, and countless other authors into a single mass lawsuit. This could be advantageous for Meta, as most authors lack the means to initiate individual legal actions.”

Why does Llama reproduce some books far more readily than others? “I believe the disparity arises from the fact that Harry Potter is an exceptionally well-known book. It’s frequently referenced, and I am certain that considerable portions of it on third-party sites ended up in the training data on the internet,” said James Grimmelmann, a professor of digital and information law at Cornell University, as referenced in the paper.

Grimmelmann added that “AI organizations can make decisions that either boost or diminish memorization. It’s not an unavoidable characteristic of AI; they possess control over it.”

Meta has been asked to comment on the study’s findings, and this article will be updated if the company responds.

Disclosure: Ziff Davis, the parent company of Mashable, filed a lawsuit against OpenAI in April, alleging that it infringed Ziff Davis copyrights in training and operating its AI systems.