The $1.5 billion Anthropic settlement won’t stop AI companies from training on books

You probably heard that on Friday, Anthropic, maker of the AI large language model Claude, settled a lawsuit brought by author groups. Anthropic will pay about $3,000 per book for approximately 500,000 books, roughly $1.5 billion in total.
Does this mean that authors are safe from having their books used as AI training materials without permission?
No!
Anthropic is paying only for the most egregious piracy
Anthropic trained its model on digital libraries like LibGen. LibGen was apparently created by breaking the encryption on ebooks. Anthropic (and some other AI companies) fed LibGen into their models for training purposes.
There are two separate acts here:
1. Breaking the encryption to create a library of copyrighted books without permission.
2. Training the LLM on that library of content.
The illegal act addressed by this settlement is 1) breaking the encryption. The judge ruled that 2) training the LLM on the library of content was a transformative use, and therefore covered by the fair use exception to copyright.
The other LLM companies that trained on pirated content will likely settle on similar terms.
They’ll then find other legal ways to access the content, including machines that take physical books, remove the bindings, and scan them. That hasn’t been ruled illegal.
Here are some things that are still unknown:
- When and how will Anthropic distribute the $3,000 per book?
- If they work through publishers, which is the most efficient way to make the distribution, how will publishers pass the money along to copyright holders (the authors), and will the publishers retain any of it?
- What impact will this have on other AI companies?
- How will this affect permissions for other digital content, like web pages?
A framework for payment and permission is coming
LLMs will get access to content. The only question is how.
For web pages, the Cloudflare system for blocking or licensing content looks quite promising.
And publishers are likely to make deals with LLM companies, as HarperCollins already did, in which they license content from copyright holders who’ve given permission in exchange for payment. Publishers already have systems for paying authors; this will fit in nicely, generating a new source of revenue for both authors and publishers.
(I accepted $2,500 to license one of my books; here’s why.)
This might also be fixed through legislation that clarifies whether ingesting content for AI training is or is not a copyright violation. (Donald Trump stupidly said such licensing was too cumbersome, completely ignoring the IP considerations. Luckily, he doesn’t make copyright law or decide lawsuits between private companies.)
Your heartfelt beliefs in favor of or opposed to training AI on copyrighted content don’t matter. The law does, and so do licensing schemes and judges’ rulings.
This is still likely going to the Supreme Court. And if you think you know which way they’ll rule, you’re deluded.