The battle over how large language models can use published works is just beginning. Copyright law isn’t ready.
Bryan Walsh is the editor of Vox’s Future Perfect section, which covers the policies, people, and forces that could make the future a better place for everyone. He worked at Time magazine for 15 years as a foreign correspondent in Asia, a climate writer, and an international editor — and wrote a book on existential risk.
Four years ago, I published my first book: End Times: A Brief Guide to the End of the World.
It did … okay? I earned a Q&A with the site you’re reading now — thanks, Dylan! — and the book eventually helped get me the job of running Future Perfect. I had one day where I went from radio hit to radio hit, trying to explain in five-minute segments to morning DJs from Philadelphia to Phoenix why we should all be more worried about the threat of human extinction and what we could do to prevent it.
But a bestseller it was not. Let’s put it this way — about every six months, I receive a letter from my publisher containing a “non-paying royalty statement,” which is sort of like getting a Christmas card from your parents, only instead of money, it just contains a note telling you how much they’ve spent raising you.
So I’ll admit that I was a bit chuffed when I received an email a couple of months ago from people at aisafety.info, who are aiming to create a centralized hub for explaining questions about AI safety and AI alignment — how to make AI accountable to human goals — to a general audience. To that end, they were building a large language model — with the delightful name “Stampy” — that could act as a chatbot, answering questions people might have about the subject. (The website was just soft launched, while Stampy is still in the prototype stage.) And they were asking permission to use my book End Times, which contains a long chapter on existential risks from AI, as part of the data Stampy would be trained on.
My first thought, like any author’s: Someone has actually read (or at least is aware of the existence of) my book! But then I had a second thought: As a writer, what does it mean to allow a chatbot to be trained on your own work? (And for free, no less.) Was I contributing to a project that could help people better understand a complex and important subject like AI safety? Or was I just speeding along the process of my own obsolescence?
Training days
These are live questions right now, with large language models like ChatGPT becoming more widespread and more capable. As my colleague Sara Morrison reported this summer, a number of class action lawsuits have already been filed against big tech firms like Google and OpenAI on behalf of writers and artists who claim that their work, including entire books, had been used to train chatbots without their permission and without remuneration. In August, a group of prominent novelists — including Game of Thrones author George R.R. Martin, who really has some other deadlines he should attend to — filed suit against ChatGPT maker OpenAI for “systematic theft on a massive scale.”
Such concerns aren’t entirely new — tech companies have long come under fire for harnessing people’s data to improve and perfect their products, often in ways that are far from transparent for the average user. But AI feels different, as attorney Ryan Clarkson, whose law firm is behind some of the class action lawsuits, told Sara. “Up until this point, tech companies have not done what they’re doing now with generative AI, which is to take everyone’s information and feed it into a product that can then contribute to people’s professional obsolescence and totally decimate their privacy in ways previously unimaginable.”
I should note here that what aisafety.info is doing is fundamentally different from the work of companies like Meta or Microsoft. For one thing, they asked me, the author, for permission before using my work. Which was very polite!
Beyond that, aisafety.info is a nonprofit research group, meaning that no one will be making money off the training data provided by my work. (A fact which, I suspect, will not surprise my publisher.) Stampy the chatbot will be an educational tool, and as someone who runs a section at Vox that cares deeply about the risk of powerful AI, I’m largely glad that my work can play some small role in making that bot smarter.
And we desperately need more reliable sources of information about AI risk. “I think the general understanding of AI alignment and safety is very poor,” Robert Miles of aisafety.info told me. “I would say that people care a lot more than they used to, but they don’t know a lot more.”
Chatbots, trained on the right source materials, can be excellent educational tools. An AI tutor can scale itself to the educational level of its student and can be kept up to date with the latest information about the subject. Plus, there’s the pleasant irony of using some of the latest breakthroughs in language models to create an educational tool designed to help people understand the potential danger of the very technology they’re using.
What’s “fair use” for AI?
Training a chatbot for nonprofit, educational purposes, with the express permission of the authors of the works on which it’s trained, seems okay to me. But do novelists like George R.R. Martin or John Grisham have a case against for-profit companies that take their work without that express permission?
The law, unfortunately, is far from clear on this question. As Harvard Law professor and First Amendment expert Rebecca Tushnet explained in an interview published in the Harvard Gazette, digital companies have generally been able to employ concepts of fair use to defend harvesting existing intellectual property. “The internet as we know it today, with Google and image search and Google Books, wouldn’t exist if it weren’t fair use to use these words for an output that was not copying” the original, she said.
One way to consider this is to think about how humans, like myself, write books. When I was researching and writing End Times, I was drawing upon and synthesizing the existing work of hundreds of different authors. Sometimes I would quote them directly, though there are specific rules about how much of an individual work another author can directly quote from under fair use. (The rough rule is 300 words when quoting from a published book, or around 200 words for a briefer article or paper.)
More often, though, what I read and processed in my research rattled around in my brain, combined with other reporting and reasoning, and came out as my own work — my work informed by my own sources. Or, in other words, informed by my own personal training dataset.
The difference, when it comes to AI, is one of scale. ChatGPT can “read” more published words in a few seconds than I could in several lifetimes and, unlike me, it doesn’t lose that data to a human-limited short-term memory that is immediately overwritten by whatever I’m thinking of next. (Playoff baseball, if I’m being honest.) Legal scholars can draw on hundreds of years of copyright law to determine what to do in human cases, but laws that can accurately and fairly govern, or even understand, what AI can do with the same material have yet to be written.
As Tushnet goes on to argue, we should be focusing less on legal questions that may not be answerable under current law, and more on shaping what we want and don’t want from language models. Chatbots trained to spread the gospel of AI safety, yes. AI-written versions of the next book in the Game of Thrones series, maybe not so much.
A version of this story originally appeared in the Future Perfect newsletter.
Source: vox.com