Hacker News
AI Books4 Dataset for training LLMs further (reddit.com)
46 points by ImperiumOfMan 14 days ago | 48 comments



Very cool. Guess you can just pirate 400K books, stealing millions from authors, as long as it's a "dataset for LLMs" instead of a regular old piracy dump. No doubt there are books in there that took decades of experience to write and knowledge that took a lifetime to learn. Stealing is just OK as long as it's for AI lol, and as we devalue writing we guarantee the quality of writing will continue to degrade.


Probably copyright infringement in some jurisdictions, but it's not 'stealing'. Let's not confuse terms with specific meaning.

I don't think you understand what stealing means if you are unable to recognize that distributing free digital copies of paid books, without agreed-upon remuneration to the owners, is stealing. It may or may not also be copyright infringement in a legal sense, but it sure is stealing in a common sense.

Hmm... but the whole internet is also indexed by Google and they make a lot of money out of the content.

Should they pay too?

If this was enforced would there be search engines?


One of the reasons behind the backlash against Google's AMP was that they would control the full experience. You are on Google's OS (at least a significant number of users are), in a Google browser, using a Google search service, and from there you go to Google's servers to get the content. The main difference is the last step.


It's different for websites because they are intentionally served up for free. Google had a very hard time getting a license to even index book contents. And there's an ongoing tension with paywalled websites. They do pay for some news sites, in the Google News Showcase.


What do you think the market is for people who want to read books or articles condensed into lines of text suitable for model training?

I think it's 0. Maybe a handful if pirate libraries of readable ebooks and papers didn't already exist, but even a handful of copyright violators wouldn't be a serious commercial threat to the publishers.

So far, governments have been unable to do anything about pirate libraries. Complaining about AI datasets, poorly formatted for human consumption, seems misplaced.

Criticizing this as "stealing" or "piracy" is a vote for a future where only big tech, or the major publishers themselves, can train models on good datasets. Nobody else will have the money or market power to license that much material.


The market for, say, the entire contents of OpenAI's codebase, mangled and corrupted in parts, might be 0 as well; that doesn't make it not stealing. When NVIDIA's or Samsung's IP was leaked, including binary firmware that is hard to read and understood by very few, that didn't magically mean stealing and releasing it became legal.

It doesn't magically mean anything, but you're comparing a case where there's a plausible argument for fair use (no commercial impact from the training set sharing, not done for profit, transformative), against... what specific instance are you talking about with NV or Samsung? Theft of trade secrets? Samsung leaking their own IP to chatgpt, through stupidity, isn't illegal.

This doesn't even affect the major AI companies that are doing training, does it? Don't you think Google, OpenAI, Anthropic all have their own datasets by now? If they wanted to use bulk content from pirate ebook and academic paper libraries, they would've already mirrored most of those sites a long time ago.


I don't know if there is a lot of confusion, deliberate deception, or both, but I cannot grasp the argument for "training on copyrighted data is a violation of copyright".

What I can grasp is "output of copyrighted data is a violation of copyright", and the fix seems to be a straightforward, dumb software content-ID filter on the output.
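A naive version of the kind of output filter being suggested could be sketched like this: flag any model output that shares a long verbatim run of words with a protected corpus. The n-gram size, corpus, and function names here are all hypothetical; real fingerprinting systems (like YouTube's Content ID) are far more sophisticated and robust to paraphrase.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_verbatim_overlap(output, protected_texts, n=8):
    """Flag an output that shares any n-word run with a protected text."""
    out_grams = ngrams(output, n)
    return any(out_grams & ngrams(t, n) for t in protected_texts)

# Hypothetical protected corpus for illustration.
corpus = ["it was the best of times it was the worst of times"]

print(flag_verbatim_overlap(
    "It was the best of times it was the worst of times, she wrote.", corpus))  # True
print(flag_verbatim_overlap(
    "A completely original sentence about model training.", corpus))  # False
```

A filter this simple would of course be trivial to evade with light paraphrasing, which is part of why the question keeps coming back to courts rather than code.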


The argument is "An artist getting inspired by a work and creating something similar is allowed, but photocopying, photographing a painting or statue, or using samples in music-making can be fraught with legal problems.

One of the major dividing lines between legal and illegal is whether it's a purely mechanical process, or done by human hands. The training of LLMs and diffusion models is a purely mechanical process, more like photocopying or photographing than an artist gaining inspiration"


The training process is not purely mechanical. The amount of time that goes into selecting, cleaning, reformatting, and otherwise preparing the training data is significant, not to mention all of the other work involved in the actual training. If your measure is simply the amount of work done by humans, then we should simply reclassify ML models as art pieces and dismiss all of the criticism as gatekeeping by people who don't understand hypermodern art.

>but photocopying, photographing a painting or statue, or using samples in music-making can be fraught with legal problems.

That's part of where I think a lot of confusion is - LLM training doesn't do any of that. There is no coherent corpus of data in these models.


This isn't a model trained on the books, it is the complete text of the books themselves with the label "dataset" slapped on it. Calling it a dataset doesn't make it not piracy.

The confusion seems to be on your part, because that's not the argument being made here.

The argument is that the 400k books being shared here are themselves copyrighted works that the poster probably does not have the right to be distributing, and that is not very nice of them.

"It's just a dataset, bro!" being used to bulldoze people's rights to their own work might have negative consequences.


So if I break into one of these LLM shops and download their models' weights, they will totally be cool with me distributing them publicly?

The Pirate Bay should consider rebranding as The Dataset Bay.

I am sure there are wild times ahead. At some point it will become outlawed in the way regular piracy is. Not because of book authors or publishers, but either because of Hollywood or because of AI safety being used by big tech to limit competition.


Wait until you find out about libraries. Free books that humans are using to train the LLMs in their heads.

How many books can a human read, understand, tokenize, and memorize in 15 minutes, 15 days, 15 months, 15 years?

What's the rate of forgetting for the ingested material in the human brain?


Libraries pay authors for their copies.

> [removed]

https://web.archive.org/web/20240519104217/https://old.reddi...

magnet:?xt=urn:btih:a904e660355c49006b2e7d43893d31bf3c2be9cc&dn=libstc2.jsonl.zst&tr=udp://tracker.opentrackr.org:1337/announce&tr=https://tracker1.ctix.cn:443/announce&tr=udp://open.demonii....

> More than 400,000 fiction and non-fiction book full-texts. Multiple languages, curated, deduplicated.

> More than 6,000,000 scholarly publications, magazines, and manuals full-texts. Multiple languages, curated, deduplicated.

> 150,000,000 metadata records


It's kind of a grey area. Is sci-hub allowing people to read articles for free a bad thing? Is it legal?

There's not always going to be a correspondence between what is wrong from a legal pov and what is wrong from a moral pov.


How many books can you read and remember exactly as is, plus reproduce or remix the text as is?

What's this number for an LLM?


That’s some really nice intellectual property they have there.

Be a shame if someone thought about it.


The copyright situation around all this is very... interesting. It's pretty clear that this dataset is not legal, but what about the resulting models? What if the texts actually were bought 'properly'?


Buying a copy of the book would not give you any copyright license. You could only make copies for personal use.


If you are in a jurisdiction with TDM exceptions, buying a personal copy does allow you to train on it.


The race is on to figure out a way to get LLMs to produce content to be used for training other LLMs in a satisfactory way. Eventually the dataset question will get figured out in the courts but if there’s a technique to generate more training data in an automated way then the court decision doesn’t matter.

Edit: also, I don’t believe court decisions can be enforced retroactively so existing LLMs would be safe but I’m most definitely not a lawyer.


If you steal a PC and use it to build a very successful app, would that app be legal? Would the use of the said app by third parties be legal?


It's strange that you could actually use these datasets to poison other people's LLMs - since there's no way you could go through everything and vet all the data in the dataset.

This is simply piracy. This reads exactly like a scene release of any pirated material.

Scene releases are in human-consumable formats.

Books as lines of text are not human-consumable (by reasonable humans).

There's a strong case for sharing this being fair use, even if distributing hundreds of thousands of ebooks would ordinarily violate copyright.

I don't know what OpenAI's training corpus is or how they got it, but in case anyone has forgotten, Google has its own license-free collection of scanned books. It had already run that through OCR (and probably periodically re-OCRs) to generate its google books indexes, so there is no copyright argument against Google's book-trained AI models (assuming they train on books).

An argument that Books4 is piracy is an argument for an oligopoly on good AI models.


> There's a strong case for sharing this being fair use, even if distributing hundreds of thousands of ebooks would ordinarily violate copyright.

If you're using the resulting model trained with this for (academic) research (independently or at a university), that's true. If you're earning money with the model (cough ChatGPT, Bard, Gemini, et al. cough), that's false.

Edit: Original version of the above paragraph started as "If you're doing...", making the below comment true. Edit is made to clarify the point further.


If you're complaining about the training, you're confusing the entity doing the sharing in this case, with the entities doing the training.

ETA: If you're complaining about use of the model, this becomes a dispute over how close to an original work an output can get before it becomes a copyright violation. That seems unresolvable. Look at court cases where musicians sue over short sequences of notes. Nobody knows where the threshold should be for that, for poetry, for flash fiction, for academic papers, for novels, for paintings, or for voice samples. Courts just go with whatever they think feels right in a particular case. That is an untenable legal situation in a world with AI models like these.

Also, to reiterate, Google has its own fulltext, license-free corpus of books, which it undoubtedly has used to train models. What's your view on that? Should Google be the only entity allowed to train on large books datasets because it happened to get one through a loophole of collaborating with libraries to OCR their collections?


> You're confusing the entity doing the sharing...

Yeah, you're right. I fixed my comment to clarify my point, thanks.

> Also, to reiterate, Google has its own fulltext, license-free corpus of books, which...

Loopholes don't invalidate or override morals. I find training any model on a corpus without consent from its respective authors immoral, regardless of the laws around it.

Same for code, images, sound, text, whatever. These systems can leak their training data, or recreate them verbatim with the correct prompts. Furthermore, "claimed clean" datasets like "The Stack" are not clean by any means because the dataset is huge and tools are not good enough. Plus, even if the licenses allow this, there's still morality/consent aspect to it.

BTW, fair use explicitly requires that the use be non-profit. If you profit from the resulting model, it's not fair use by definition.

Any large artistic work which can be reproduced by these systems should be ingested with consent, period.

To be clear, I use none of the AI systems available on the market today.


I understand your moral position. But you must realize that's not what copyright does. That's wishful thinking, the sort often manifest in modern written works on the copyright page where the publisher helpfully writes, "The moral right of the author has been asserted". Begging the question: What moral right, other than the legal copyright that was already asserted separately?

Note the irony, too: It's the publisher writing about morality on behalf of the author. It's the publisher that would give consent for any AI-training use in your concept of the proper legal and moral order of things. The author probably doesn't know much about copyright law, but wants the work monetized as much as possible, because they like having a house and eating food, and sure, creating more content sometimes. To that end, they've assigned copyright to their publisher. Or independent creators would have to create a union and assign licensing power to the union leadership. But again, only licensing to big tech, because nobody else could afford it.

Also see the ETA: paragraph in GP, regarding the difficulty of judging whether simple works are copyrightable, or whether complex works have actually been reproduced in a copyright-violating way.

This feels like an argument over angels on the head of a pin. Look at the markets. They're all-in on AI. Big tech (and big startup) AI models will move forward, no matter what licensing agreements between publishers and AI behemoths, or copyright exceptions, end up being necessary. The only question is whether others who don't have that power or money should be able to try to train their own models (admittedly probably much lower-parameter, because they don't have datacenters full of H100s) on datasets that are at least in the same ballpark.


> *Begging the question: What moral right, other than the legal copyright that was already asserted separately?*

Some countries have a legal concept that is called that, distinct from the usual commercial rights.

https://cyber.harvard.edu/property/library/moralprimer.html


Where did this come from? Even if assembled from multiple sources the full text for 400k books and 6M other publications would be very hard to get.

Did someone hack an e-book store? Or somehow extract data from Google books?


From all the e-books available on the webs... Check annas-archive.org for example.

Wow so this is actually just a tiny subset of what was already openly available. I'm still interested in how all of this was gathered together though - looks like Anna's Archive got the books from other large datasets, but I'm still confused about the ultimate origin of the data.

From my experience as an author of a tech book, the ebook is pirated even before the official publication date; the authors only get access to the ebooks a few days after the publication date. That's why I'll never write a book again.

That's sad and I can see how it must be incredibly frustrating.

Perhaps this is why so many AI books are made available for free online by their authors? For example, Sutton and Barto's and Goodfellow's books are both available online from their authors.

Perversely I am much more likely to buy a book if the authors make it available for free. I find a hard copy more useful to study and I prefer to pick a book after sampling it (as you would in a bookshop I guess). I wouldn't go as far as getting an illegal copy to preview so those books without an easy way to review are a lot less likely to be bought (by me).


Yes, I agree with you on that. However, not every author can negotiate with the publisher to get permission to offer the ebook for free on their own website (I got rejected on the idea).

I guess if you're a famous one then you can probably get that, but what about average people?

This is really frustrating, because as an author you can't offer your own book on your own website for free, while on some third-party websites it's already been downloaded thousands of times.


Origin of the data:

- LibGen collections gathered over decades

- Sci-Hub

- Additional fresh scholar papers and books collected by Library STC


The main ebook sources:

- bought/library-borrowed books with DRM removed and shared

- scanned & OCRed books from archive.org & related projects

- non-DRM books, bought & shared


Can anyone tell me how big this is in GB?


libstc2.jsonl.zst 231 GB

Around 250 GB compressed; several terabytes once uncompressed, I assume.



