29 Jun 2025
Two sets of authors sued Anthropic and Meta in San Francisco for copyright infringement, arguing that the companies had pirated their works to train their LLMs. Everyone agreed that a key question was whether fair use allowed it, and in both cases the courts looked at the fair use issue before dealing with other aspects of the cases. Even though the facts in the two cases were very similar, last week two judges in the same court wrote opinions coming to very different conclusions. How can that happen? Is fair use broken?

Fair use is a confusing corner of copyright law. The first case dealing with it was decided in 1841, but for over a century there was nothing in the statutes about it, just a string of court decisions. In 1976 Congress codified fair use into the law. Rather than providing a concrete definition, the law tells judges to decide using these four factors:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.

For the past half century courts have been interpreting the law, which has not clarified the situation as much as one might hope. In each case the judge looks at the four factors, using prior cases for guidance if he or she can find any, then adds up the results using whatever weights seem reasonable, and that's the result.

In the first case, Judge William Alsup issued an order deciding whether copying books to train an LLM was fair use. The case was filed against Anthropic, whose LLM is Claude. Anthropic wanted to use published books as training material, because books are generally high-quality content.
So Anthropic downloaded a bunch of giant archives, including Books3, LibGen, and PiLiMi, which everyone including Anthropic knew were pirated, and started using them to train the LLM. After a while the company became "not so gung ho about" training on pirated books "for legal reasons" (the quoted phrases are from the judge's order, quoting an Anthropic document), so it spent a lot of money to bulk purchase millions of mostly used books, sliced off the bindings, and scanned them. It then used the scans to train Claude. The authors' books were in both collections, but the judge considered the pirated collection and the scanned collection different enough to opine on them separately.

He found that the first factor, the purpose, favored Anthropic, since an LLM is not like a book, and unlike in some other cases, the authors did not claim that the LLM output large chunks of their text. The second factor, the nature of the work, favored the authors, since published books have always been covered by copyright. The third factor, the amount used, favored Anthropic, since LLMs need all the training material they can get and there was no plausible way they could have gotten similar results with only part of the books.

The fourth factor, the effect on the market for the books, was harder to analyze. The authors didn't claim that Claude produced their books or competitors to them ("knockoffs," said the judge). They did claim a vaguer kind of competition, such as summaries of the facts in the non-fiction books, and they also claimed that they should be paid for LLM training. Judge Alsup waved both of those away. For the first, he likened it to schoolchildren reading books to learn how to write well; for the second, he noted that they have no inherent right to be paid for training, and that Anthropic has said that negotiating training rights for each book would cost more than it's worth.
So overall, he agreed that LLM training is fair use, but he also said that pirating books to do so is not, and he will figure out what Anthropic owes for that later. In this case, Anthropic shot themselves in the foot. If they'd only used the pirated books, they could have argued that there was no reasonable alternative, except that they then proved there was one by buying and scanning all those books. I can't feel too sorry for them, though.

Two days later, Judge Vince Chhabria issued an order in the other case, authors vs. Meta (i.e., Facebook), whose LLM is Llama. The facts in this case are quite similar. Meta downloaded a lot of training material, including Wikipedia, Github, arXiv, Stack Exchange, and Project Gutenberg, which are OK for copyright purposes, but also several "shadow libraries" of pirated books, including LibGen and Anna's Archive, which includes LibGen and others. Meta investigated getting licenses for the books but found it difficult, and CEO Mark Zuckerberg said to use LibGen. The authors claimed that downloading their works and using them to train Llama is infringing.

The opinion starts with a 2½ page discussion of the judge's understanding of AI and LLMs. Let me just say that he appears to have quaffed deeply of the AI Kool-Aid, including an aside in which he accuses Judge Alsup of "blowing off the most important factor in the fair use analysis", not language one normally sees one judge direct at another, particularly not at one he sees in the courthouse lunchroom every day.

Meta "torrented" the shadow libraries to download them quickly. An optional but common feature of torrent programs is "leeching", passing along chunks of the downloaded material to other torrent users. The authors claim that means Meta redistributed their books, another copyright violation. But it's not clear whether Meta enabled leeching, and even if so, Meta says there's no reason to assume that the leeched bits included those particular books.
Neither side asked for summary judgment on that issue, so it will be dealt with later.

He then turns to the four-factor analysis to decide whether fair use permits what Meta did. The first factor analysis is similar to Alsup's: the LLM is not like a book, and nobody claims that the LLM can repeat significant chunks of the authors' works. He waved away some unpersuasive arguments that LLM training is the same as a human reading, so this factor favors Meta. Unlike Judge Alsup, he found that downloading the pirated copies of the books was excusable, since its purpose was to enable the LLM training. Factor two, the nature of the work, favors the plaintiffs, for the same reason Judge Alsup found: the material is published books. Factor three, the amount used, favors Meta, again for the same reason Judge Alsup found: more training material is better for LLMs.

Factor four, the effect on the potential market, I find, ah, a bit much:

"Maybe it can generate works that are similar enough (in subject matter or genre) that they will compete with the originals and thereby indirectly substitute for them. Or someone might use LLMs to generate massive amounts of text in significantly less time than it would take to write that text, and using a fraction of the creativity. People could thus use LLMs to create books and then sell them, competing with books written by human authors for sales and attention."

Or perhaps: "People might even be motivated to make those books available for free, given how easy it will presumably be to prompt an LLM to create them."

He goes on to opine about how different kinds of works might be affected differently by this. But then:

"When considering market dilution, the proper comparison isn’t to a world with no LLMs, but to a world where LLMs weren’t trained on copyrighted books. Perhaps an LLM trained only on public domain works could still be capable of quickly generating large numbers of books that could compete for sales with copyrighted books. But there is plenty of evidence in the record that training on books substantially benefits LLMs’ creativity and ability to generate long pieces of text."

He then rejects arguments that market dilution isn't relevant for the fourth factor, and goes on for more pages of speculation that I am not going to try to summarize. But then he sighs deeply, observes that the plaintiffs didn't claim this kind of market harm, so he can't consider it, and Meta wins on fair use. He concludes by saying that if training turns out not to be fair use, that doesn't mean companies can't train on books, but that they have to figure out how to pay for it. He practically begs the plaintiffs to refile their suit:

"Because the issue of market dilution is so important in this context, had the plaintiffs presented any evidence that a jury could use to find in their favor on the issue, factor four would have needed to go to a jury. Or perhaps the plaintiffs could even have made a strong enough showing to win on the fair use issue at summary judgment. But the plaintiffs presented no meaningful evidence on market dilution at all."

The judge is way out over his skis here. Yes, it is possible that LLMs can do all sorts of things, and it is true that Amazon is already awash in AI-generated slop. But it is pure speculation that an LLM can wreck the market for books written by humans, and doubly speculative to assert that an LLM with copyrighted books in its training corpus would wreck it worse than one trained without them. As far as I can tell, every major LLM has pirated books in its training material, so we have no evidence either way.

So now we have one ruling that says that LLM training is OK if you base it on books you've gotten legally, but not on pirated books. And we have another that says it's OK to train on pirated books, but if an author complains that the LLM might compete with them, then it's not. This does not give useful guidance to anyone.
These are just preliminary rulings in both cases, and it will be months if not years before all of the other issues are resolved. At that point both cases will surely be appealed, and since they're in the same judicial district (indeed the same building) the appeals will go to the Ninth Circuit, and probably will be consolidated since the issues are similar. Then the circuit will get another chance to describe a coherent rule. There are 29 judges in the circuit, appointed by presidents from Clinton to Biden. Three judges will be chosen at random for the appeals panel so it's anyone's guess who they'll be or how they'll see the issues.
© 2005-2024 John R. Levine.