
> We need to find way to make sure the access to this data the models were trained on, and the resulting models, is open and fair.

Hear, hear. Unfortunately this is more or less impossible given current copyright law.

Suppose you scrape libgen and turn it into training data, then you release the training data publicly. Since the vast majority of every book appears verbatim in the training data, is this sufficiently transformative?

I think yes, it is, because nobody is going to read those books from the training data. When I made books3, I felt it was important to render each book into high quality text. But it turns out that when you convert Jurassic Park into a text file, there's no good way to read it anymore. Good luck trying to bookmark wherever you left off -- it's all one gigantic file.

But nobody seems to agree. The Danish Rights Alliance (https://rettighedsalliancen.com/) aggressively DMCA'ed anyone who hosted books3, even going so far as to DMCA The Pile from academictorrents: https://academictorrents.com/details/0d366035664fdf51cfbe9f7... on the grounds that ~100 copyrighted books appear in the training data. Right now most of the world seems to side with them, but I'm hoping that opinion will shift as the years tick by. Surely no one can believe that a plain text document poses a serious threat of economic harm to the original author. So the question is whether the original author should be allowed to deny everyone else the right to transform their work into a form that machines can read.

For my part, I've been planning a books4 dataset, but this time similar to LAION: it's a script that spiders libgen torrents (https://libgen.gs/torrents/libgen/) and converts all the epubs into text files. That way, if LAION isn't infringing, then books4 can't be infringing either. (Of course, hosting the actual training data anywhere is pretty hard nowadays, but it should only take a few days to convert 38TB of libgen into ~2TB of plain text.)

This is the only way to create an open source competitor to ChatGPT.



I'm actually on your side because I think copyright laws should be radically nerfed and things like books3 are greatly beneficial to society, but I wouldn't buy your argument about "no good way to read plain text files".

The text file contains all the text in the book in a format that is, at the very least, machine-readable. It is perfectly feasible to write a program to put it back into ebook form or to play it as an audiobook, and at that point the text file becomes desirable for laypeople to read.


Unfortunately this is true of any encoding scheme, short of scrambling the order of the paragraphs. And although scrambling the order might seem tempting, it destroys the ability to train large context windows — 32k context tokens is enough to fit most of a book into a prompt, and this window will only grow bigger.

If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows. That’s why it seemed important to justify a plain text training format, since anything on top of it would be equivalent.

Indeed, one alternate training format would be to ship the raw html from every epub file, then process it into text at runtime. But this makes it trivial to reconstruct the original epub file and use it in an actual book reader.

It’s frustrating that we can’t share the epub files, because there are so many advantages: you can scrape the metadata, you can tweak the rendering to plain text, you can get semantic info from the images (and even OCR them — it turns out that lots of coding epubs show code examples as screenshots, because epub html rendering is so primitive, so this would be the only way to let your model learn from those).

All of that is why I’m leaning towards "make a script to spider all of libgen and cache the epub files locally". But I haven’t finished calculating how much disk space this would require.
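The spidering half is the easy part too. A hedged sketch (the index URL is the one mentioned above; the assumption that it's a plain directory listing of .torrent links is mine, and the filenames below are invented for illustration):

```python
# Hypothetical sketch: collect .torrent links from a directory-listing page.
# Assumes the index is plain HTML with href="...torrent" anchors.
import re
from urllib.request import urlopen

INDEX_URL = "https://libgen.gs/torrents/libgen/"

def torrent_links(index_html, base_url=INDEX_URL):
    """Pull every .torrent href out of a directory-listing page."""
    hrefs = re.findall(r'href="([^"]+\.torrent)"', index_html)
    return [h if h.startswith("http") else base_url + h for h in hrefs]

def fetch_index():
    """Download the index page and return the full .torrent URLs."""
    with urlopen(INDEX_URL) as resp:
        return torrent_links(resp.read().decode("utf-8", errors="ignore"))
```

Actually downloading the torrents themselves would need a BitTorrent client (e.g. driven via libtorrent), which is where the multi-day wait and the local disk space question come in.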

I’m sad that researchers will have to wait days for their training data instead of downloading it in a few hours from a high speed cache, but it seems like any such cache would be swiftly DMCA’ed, so there’s no alternative.


I don't know if this is tangential or not, but thank you. You've helped me progress my armchair understanding of how we might give the concept of copyright more finesse in the digital age.

The insight is that encoding content is not functionally the same as copying a work verbatim, which is what the concept was originally meant to prevent.

For example: if I have legally obtained a copy of e.g. https://archive.org/details/free_culture, then I am at liberty to encode it in whatever format I need to be able to feed it to a machine/tool for whatever purpose I want. I am not infringing on copyright because I legally obtained the work, and the machine is not, because machines can't infringe.

I think OpenAI has at least this much in their favor. If they can prove they legally obtained all the training material (I do think it's fair to require them to pay once for it), then I don't think there's any world in which it makes sense to allow content creators to extract further royalties from that process alone.

If a user asks an AI for a copyrighted poem, for instance, and the user goes and republishes that poem as their own, I do think it may make sense to grant the original author royalties under current law.

I really hope we can legally pick these two concepts apart and focus on each scenario independently. I see a lot of people here arguing that feeding a work you have legally obtained the rights to view into a machine model is inherently copyright infringement, because you had to copy the work, yada yada, is it fair use? I really think this is wrong, both as an interpretation of copyright and practically: ideas can't be owned, and it makes no sense to limit which ideas were used to make which commercial product (we don't do that today, and AI doesn't change that).


>If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows

Only if the training data is under copyright! How about data that is in the public domain, or licensed specifically for training? Mozilla's Common Voice is an example.

Maybe someday there will be rulings that do for ML training datasets what earlier ones did to legalize cleanroom design.


Sadly there just isn’t enough public domain data. Relying on it alone would make it impossible to catch up to ChatGPT.

One way to see this is to imagine a Midjourney competitor trained solely on public domain images. The visual quality of the model will always be worse.

As for licensing, I agree for commercial entities, but there should be exceptions. If a model is open source, it benefits everyone, and so it shouldn’t need to have been licensed. There are a few reasons why this is pretty important, but the main one is that without it, the open source community has no chance whatsoever of creating cutting edge models.


I think you can reasonably imagine that if content is available in a library, then everyone could organize an effort to check out the content and add it to the model. At that point the checkout effort sounds like useless theatre, so just let the commons keep an `all-booksN.zip` corpus around for the purpose.

You know, you could probably even argue traditionally that taking all the books in the world and adding them to a corpus for the purpose of creating an LLM would be a transformative work since it doesn't compete with or detract from any of the originals…


Why can't I read or bookmark a huge text file? I can and have. Books3 (though this is the first time I'm hearing of it) shouldn't be fair use because it's inconvenient to read; it should be fair use because reading is not the intent. The intent is to train computers with it.


It does contain copies of copyrighted works.


Well, yes, otherwise you wouldn't need fair use, you'd just use it.


I would like to get in touch with you about books4. Do you happen to have Discord? Or would Twitter be OK?

There are currently multiple attempts at creating what you describe as books4.


Please do! Twitter DM is the most reliable. Or if you put some contact info in your profile I can reach out.


Added, and reached out on Twitter.




