It's been a thrilling (in the loosest sense) week for copyright enthusiasts, as judges in the Northern District of California have issued rulings in two separate cases brought by book authors (namely Bartz v Anthropic and Kadrey v Meta) on the question of fair use in AI training. In both cases, the relevant judge granted summary judgment (at least partially) in favour of the AI developer, but the two courts had quite different views on the topic of "market dilution".
Background
The two cases involve relatively similar fact patterns.
Bartz v Anthropic is a class-action case brought by three book authors (Andrea Bartz, Charles Graeber and Kirk Wallace Johnson) against Anthropic PBC (Anthropic), an AI software firm, regarding its Large Language Model (LLM) called Claude.
Based on evidence and submissions filed in the proceedings, it appears that, in early 2021, one of the co-founders of Anthropic downloaded Books3, an online library of nearly 200,000 books which had been assembled from pirated copies. Anthropic subsequently downloaded at least seven million further copies of books from repositories based on pirate sources (which Anthropic knew to have been pirated), including books from the plaintiff authors. Successive versions of Claude were trained on these books.
In 2024, Anthropic changed tack and spent many millions of dollars acquiring millions of print books, many of them in used condition. It stripped the books from their bindings, cut them to size and scanned them into digital form, then discarded the paper originals. These copies also included the plaintiff authors' books.
Anthropic created a "research library" from all the above sources, and planned to store everything forever, even if not used for training LLMs. Sets or subsets of the books considered most appropriate for training were copied again to create training copies, and the training copies were successively copied to be cleaned, tokenized, and "compressed" into any given trained LLM. Once trained, Claude did not output any further copies to the public - in fact, there was no dispute that additional software was placed between the user and Claude to ensure no infringing output ever reached the users.
For Kadrey v Meta, this Kat won't go through all the details given the similar facts, but in summary thirteen authors (mostly famous fiction writers) sued Meta for downloading their books from online “shadow libraries” and using the books to train Meta’s generative AI models (specifically, its LLM, Llama). It also transpired in evidence that Meta had post-trained its models to prevent them from “memorising” and outputting certain text from their training data, including copyrighted material, so Llama was unable to reproduce any significant percentage of the relevant works as outputs.
Llama-kat and Klaude
The Courts' Rulings
Bartz v Anthropic
The plaintiff authors brought their claim against Anthropic in August 2024, alleging that Anthropic had infringed their copyright by pirating copies for its library and reproducing them to train its LLMs. Anthropic moved for summary judgment on the basis of fair use.
Notably, in its motion, Anthropic argued that pirating initial copies of the authors’ books and millions of other books was justified because all those copies were at least reasonably necessary for training LLMs. The judge assessed the fair use defence by reference to the four factors set out in Section 107 of the US Copyright Act:
- the purpose and character of the use, including whether it is of a commercial nature;
- the nature of the copyright-protected work;
- the amount and substantiality of the portion used in relation to the work as a whole; and
- the effect of the use upon the potential market for or value of the work.
The second factor pointed against fair use for all of the copies, as it was broadly accepted that all of the books contained expressive elements. The third factor pointed in favour of fair use, as the use of the books was reasonably necessary, except in respect of the pirated copies, to which Anthropic had no entitlement.
As for factors one and four, the judge found the following:
In respect of the copies used to train Claude, the purpose and character of the use was "quintessentially transformative". The judge compared it to a reader aspiring to be a writer, who reads books not to replicate and supplant them, but to create something different.
On factor four, the plaintiff authors had argued that training LLMs will result in an "explosion of works competing with their works". However, the judge considered that this complaint was "no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition" (per Sega, 977 F.2d). This was consequently fair use for the purposes of Section 107.
As for the digitisation of the books purchased in print form, this was also transformative, as retaining them in a digital format eased storage and searchability. Importantly, it wasn't transformative just because it was part of the process for LLM training, which is what Anthropic had argued. The format change did not itself affect the authors' rightful entitlements, and so factor four was neutral. This use was therefore also considered to be fair under section 107.
However, the pirated copies plainly displaced demand for the authors' books, copy for copy, and the fact that the books may have been acquired with the intention of copying them for some further transformative use did not excuse the initial infringement (even if they had been immediately used and immediately discarded, which was not the case here). Even Anthropic accepted that this would "destroy the entire publishing market". The acquisition of pirated copies was therefore not fair use.
Kadrey v Meta
The judge in Kadrey, Judge Chhabria, largely reached the same conclusions as the Bartz ruling on factors one, two and three. On factor four, although he ultimately also found in Meta's favour, he noted:
"This is not a case where an original work is being compared to one secondary work. Nor is this case like the previous fair use cases involving creation of a digital tool. In those cases, like Google Books and Perfect 10, the tool could at most be used to access part or all of the original works. This case, unlike any of those cases, involves a technology that can generate literally millions of secondary works, with a miniscule fraction of the time and creativity used to create the original works it was trained on. No other use—whether it’s the creation of a single secondary work or the creation of other digital tools—has anything near the potential to flood the market with competing works the way that LLM training does. And so the concept of market dilution becomes highly relevant."
However, the judge found that the plaintiffs had failed to present any meaningful evidence on market dilution at all. Had they done so, factor four would have needed to go to a jury for decision, given the importance of market dilution in the context of AI training.
The judge was nonetheless emphatic that "this ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one."
Comment
Although on their face these appear to be favourable decisions for AI developers, it is clear from the detail of the rulings that the position is much more nuanced, and rightsholders may in fact see them as offering some hope, particularly in respect of market dilution being an important element of the analysis under factor four. It is also interesting that both Anthropic and Meta had at some point considered licensing books in bulk, but later abandoned those plans in favour of other, more cost-effective (and possibly dubious) options. The ruling in Bartz regarding the pirated copies, which were not excused even by the subsequent transformative use, may encourage developers to reassess licensing going forward, potentially leading to the growth of licensing markets.