
> We need to find way to make sure the access to this data the models were trained on, and the resulting models, is open and fair.

Hear, hear. Unfortunately this is more or less impossible given current copyright law.

Suppose you scrape libgen and turn it into training data, then you release the training data publicly. Since the vast majority of every book appears verbatim in the training data, is this sufficiently transformative?

I think yes, it is, because nobody is going to read those books from the training data. When I made books3, I felt it was important to render each book into high quality text. But it turns out that when you convert Jurassic Park into a text file, there's no good way to read it anymore. Good luck trying to bookmark wherever you left off -- it's all one gigantic file.

But nobody seems to agree. The Danish Rights Alliance (https://rettighedsalliancen.com/) aggressively DMCA'ed anyone who hosted books3, even going so far as to DMCA The Pile from academictorrents: https://academictorrents.com/details/0d366035664fdf51cfbe9f7... on the grounds that ~100 copyrighted books appear in the training data. Right now most of the world seems to side with them, but I'm hoping that opinion will shift as the years tick by. Surely no one can believe that a plain text document poses a serious threat of economic harm to the original author. So the question is whether the original author should be allowed to deny everyone else the right to transform their work into a form that machines can read.

For my part, I've been planning a books4 dataset, but this time similar to LAION: it's a script that spiders libgen torrents (https://libgen.gs/torrents/libgen/) and converts all the epubs into text files. That way, if LAION isn't infringing, then books4 can't be infringing either. (Of course, hosting the actual training data anywhere is pretty hard nowadays, but it should only take a few days to convert 38TB of libgen into ~2TB of plain text.)

This is the only way to create an open source competitor to ChatGPT.



I'm actually on your side because I think copyright laws should be radically nerfed and things like books3 are greatly beneficial to society, but I wouldn't buy your argument about "no good way to read plain text files".

The text file contains all the text in the book in a format that is, at the very least, machine-readable. It is perfectly feasible to write a program to put it back into ebook form or to play it as an audiobook, and at that point the text file becomes desirable for laypeople to read.


Unfortunately this is true of any encoding scheme, short of scrambling the order of the paragraphs. And although scrambling the order might seem tempting, it destroys the ability to train large context windows — 32k context tokens is enough to fit most of a book into a prompt, and this window will only grow bigger.

If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows. That’s why it seemed important to justify a plain text training format, since anything on top of it would be equivalent.

Indeed, one alternate training format would be to ship the raw html from every epub file, then process it into text at runtime. But this makes it trivial to reconstruct the original epub file and use it in an actual book reader.

It’s frustrating that we can’t share the epub files, because there are so many advantages: you can scrape the metadata, you can tweak the rendering to plain text, you can get semantic info from the images (and even OCR them — it turns out that lots of coding epubs show code examples as screenshots, because epub html rendering is so primitive, so this would be the only way to let your model learn from those).

All of that is why I’m leaning towards "make a script to spider all of libgen and cache the epub files locally". But I haven’t finished calculating how much disk space this would require.
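The spidering half is the easy part too. A hedged sketch (the index URL is the one mentioned above; the assumption that it's a plain directory listing of .torrent links is mine, and the filenames below are invented for illustration):

```python
# Hypothetical sketch: collect .torrent links from a directory-listing page.
# Assumes the index is plain HTML with href="...torrent" anchors.
import re
from urllib.request import urlopen

INDEX_URL = "https://libgen.gs/torrents/libgen/"

def torrent_links(index_html, base_url=INDEX_URL):
    """Pull every .torrent href out of a directory-listing page."""
    hrefs = re.findall(r'href="([^"]+\.torrent)"', index_html)
    return [h if h.startswith("http") else base_url + h for h in hrefs]

def fetch_index():
    """Download the index page and return the full .torrent URLs."""
    with urlopen(INDEX_URL) as resp:
        return torrent_links(resp.read().decode("utf-8", errors="ignore"))
```

Actually downloading the torrents themselves would need a BitTorrent client (e.g. driven via libtorrent), which is where the multi-day wait and the local disk space question come in.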

I’m sad that researchers will have to wait days for their training data instead of downloading it in a few hours from a high speed cache, but it seems like any such cache would be swiftly DMCA’ed, so there’s no alternative.


I don't know if this is tangential or not, but thank you. You've helped me progress my armchair understanding of how we might give the concept of copyright more finesse in the digital age.

The insight is that encoding content is not functionally the same as copying a work verbatim, which is what the concept was originally meant to prevent.

For example: if I have legally obtained a copy of e.g. https://archive.org/details/free_culture, then I am at liberty to encode it in whatever format I need to be able to feed it to a machine/tool for whatever purpose I want. I am not infringing on copyright because I legally obtained the work, and the machine is not, because machines can't infringe.

I think OpenAI has at least this much in their favor. If they can prove they legally obtained all the training material (I do think it's fair to require them to pay once for it), then I don't think there's any world in which it makes sense to allow content creators to extract further royalties from that process alone.

If a user asks an AI for a copyrighted poem, for instance, and the user goes and republishes that poem as their own, I do think it may make sense to grant the original author royalties under current law.

I really hope we can legally pick these two concepts apart and focus on each scenario independently. I see a lot of people here arguing that feeding a work you have legally obtained the rights to view into a machine model is inherently copyright infringement, because you had to copy the work, yada yada, is it fair use? I really think this is wrong, both as an interpretation of copyright and practically: ideas can't be owned, and it makes no sense to limit which ideas were used to make which commercial product (we don't do that today, and AI doesn't change that).


>If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows

Only if the training data is under copyright! How about data that is in the public domain, or licensed specifically for training? Mozilla's Common Voice is an example.

Maybe someday there will be rulings that do for ML training datasets what earlier ones did to legalize cleanroom design.


Sadly there just isn’t enough public domain data. Relying on it alone would make it impossible to catch up to ChatGPT.

One way to see this is to imagine a Midjourney competitor trained solely on public domain images. The visual quality of the model will always be worse.

As for licensing, I agree for commercial entities, but there should be exceptions. If a model is open source, it benefits everyone, and so it shouldn’t need to have been licensed. There are a few reasons why this is pretty important, but the main one is that without it, the open source community has no chance whatsoever of creating cutting edge models.


I think you can reasonably imagine that if content is available in a library, then everyone could organize an effort to check out the content and add it to the model. At that point the checkout effort sounds like useless theatre, so just let the commons keep an `all-booksN.zip` corpus around for the purpose.

You know, you could probably even argue traditionally that taking all the books in the world and adding them to a corpus for the purpose of creating an LLM would be a transformative work since it doesn't compete with or detract from any of the originals…


Why can't I read or bookmark a huge text file? I can and have. Books3 (though this is the first time I'm hearing of it) shouldn't be fair use because it's inconvenient to read; it should be fair use because reading is not the intent. The intent is to train computers with it.


It does contain copies of copyrighted works.


Well, yes, otherwise you wouldn't need fair use, you'd just use it.


I would like to get in touch with you about books4. Do you happen to have Discord? Or would Twitter be OK?

There are currently multiple attempts at creating what you describe as books4.


Please do! Twitter DM is the most reliable. Or if you put some contact info in your profile I can reach out.


Added, and reached out on Twitter.




