08 Jan 2024
In the past few months there have been four similar suits filed in New York against OpenAI and Microsoft. All four look superficially similar, and all are likely to be heard by the same judge, but one of them is a lot stronger than the other three.
The first was filed in September by the Authors Guild on behalf of several well-known fiction writers, purporting to be a class action on behalf of every author of a work of fiction that has sold over 5,000 copies. It lays out at length the many books its authors have written, then describes in snarky detail how LLMs and GPT work. For each author, the complaint says that one can prompt GPT and get accurate summaries of the books, and that in a few cases it wrote plot outlines of sequels. While this certainly shows that GPT was trained on their books, the proper response is: so what?
There is a separate complaint that the copies of the books they used to train came from online pirate text archives. That may well be true, but again, so what? If OpenAI had bought an ebook copy of each book they used for training, the Guild would still have the identical complaint. If they object to the pirated books (which they have every right to do), they need to go after the pirates.
The Guild, to which I belonged long ago, before it sued Google over book scanning and lost, has persuaded itself that every word its members write is a priceless jewel, that any use whatsoever must be licensed and paid for, and that fair use basically doesn't exist.
This is, to put it mildly, not what copyright law says. If you couldn't write summaries, book reviews would be illegal. If someone used the plot outlines to write and publish sequels, that would be a copyright problem, but a one-off response to a query, again, so what? In the recent Andy Warhol case, the Supreme Court ruled that Warhol's art prints based on a copyrighted photograph weren't a problem, but that licensing those prints for magazine covers in competition with the photograph was. Two important parts of fair use are whether the use is "transformative," doing something different from the original, and what effect it has on the market for the original. In this case, a summary or a possible sequel plot is not the same thing as the original book, and the effect on the market is nonexistent. I expect this suit will be disposed of quickly in OpenAI's favor.
In November nonfiction writer Julian Sancton filed a very similar suit, amended the following month to include 11 other authors, this time purporting to be a class action on behalf of everyone who's ever written a nonfiction book. It makes nearly identical claims that GPT was trained on their books, and complains in slightly more detail that the copies they were trained on were pirated. This case has been assigned to the same judge and I expect it to be equally unsuccessful.
In early January journalists Nicholas Basbanes and Nicholas Gage filed yet another copycat suit, again complaining that GPT was trained on pirated copies of their books, with the class this time purporting to be every author whose works have been used to train the LLMs. I presume this will be consolidated with the other two cases since the classes overlap, and will meet the same fate.
The one case that is somewhat stronger was filed at the end of December by The New York Times. It makes all of the same complaints about GPT using their material without permission, but unlike the other three cases, they make an argument that is at least somewhat plausible that it's not fair use.
One of the attachments to the complaint shows a hundred examples where they prompted GPT-4 with the first part of an article, and it responded with the rest of the article or a close paraphrase. Given the way LLMs work, one response would be to say, well, if you give it the first half of the article, what else would you expect? But I think this also provides some support for the argument that the way OpenAI has used the Times' articles is too close to a substitute for the Times itself.
In the US, whether something is fair use is very case specific and judges have to look at four factors listed in the law, the fourth being the effect on the market for the work. If the Times can make a credible argument that people use GPT to evade their paywall, or to get their Wirecutter column's product advice without looking at the column, that would be a strong fourth factor argument against fair use.
Finally, remember that "how are newspapers supposed to make money?" is an interesting question, but not one that is particularly relevant to this case. In the U.S. the point of copyright law is to give authors an incentive to write stuff, but not to make any sort of promise that they'll be financially successful. When Craigslist destroyed the classified ad business, that was great for all of the people who can now place ads for free, financially unfortunate for the newspapers that depended on classified ads, but it was not up to Craigslist to replace the lost income. In the same vein, while there are open questions of what is allowable under fair use and what is not, "the newspapers need the money", even if true, is not part of the discussion.
All four cases name Microsoft as a co-defendant, and it is obvious that the reason they do is that Microsoft has much deeper pockets than OpenAI. Unless OpenAI and the Times settle quickly (not out of the question since they were negotiating before the suit was filed), this case looks like a long slog with a great deal of discovery about exactly what training material was used, how they used it, and dueling expert reports on what that means.
© 2005-2020 John R. Levine.