Exclusive access for LLM companies to largest Chinese nonfiction book collection (annas-blog.org)
235 points by sillysaurusx 3 months ago | 141 comments

This collection has been on the internet for quite a while; it likely started around 2015. It is heavily duplicated, and I suspect the true total is around 4 million books. Still a lot.

The source is a company named DuXiu, previously SuperStar. They have collaborated with libraries around China, scanning their collections since the early 2000s. Before that, judging by the quality of the early samples, I think they just bought junk books from recycling stations.

Many of the books are translated versions of Western (most likely US) textbooks, and many are pure political propaganda junk. There is also some literature and history, published back when censorship wasn't so extreme.

Many Chinese tech companies presumably have access to this collection (Baidu especially, for sure), but the books were not censored to today's standards, so I doubt any of them would openly use them, not only because of the copyright issues but also the political risks.

The guy running this site is undoubtedly intelligent, but also foolhardy enough that it's going to be hard to feel sorry for him when his door gets broken down. He started this project after seeing how dangerous a situation zlib's spotlight put them in, yet he tried to replicate their success. He succeeded, but I recommend he not take being a free man for granted.

Others will carry the torch. Silk Road was proof of that. Of course, it’s little consolation to Ulbricht, but he deserved to be imprisoned for ordering hits. (Incidentally, I was skeptical of that claim till https://youtu.be/GpMP6Nh3FvU?si=q0KXzP2FNJDin2tW.)

The benefit of Anna’s Archive is that they have goodwill on their side. They are trying to contribute to academic research directly, whereas libgen seems to prioritize profit (and has to, to survive).

Anna’s Archive does that too. The "donations" are actually paying to access features. But by proactively reaching out to the academic community like this, they’ll make a lot of powerful allies.

EDIT: on second thought, how does libgen monetize? I thought they ran ads, but I just realized I didn’t verify that.

Whoever runs a library like this for altruistic reasons stands to win. The desire to get rich is most of the problem with doing shady things scalably. One can argue that it’s the risk/reward tradeoff, but I suspect as technology makes everything more accessible, an operation like this only costs a few thousand a month. That’s achievable while flying under the radar long enough to hopefully avoid getting in a lot of trouble.

Of course, as with Death Note, everything comes down to: don't get caught.

> how does libgen monetize

The original LibGen, under anonymous and altruistic founder bookwarrior, did not aim to monetize but was built purely on volunteer effort. Some of the LibGen domains, however, are forks that put up advertisements.

That’s so cool.

It's hard to know if they are running it for altruistic reasons or money or both. Probably both. You can make a lot of money. I ran a torrent site in the 00s and it made >$10m in donations.

Did you offer incentives for those donations?

Yes: "VIP" forums and exemption from having to keep a ratio.

Interesting choice of pronouns to refer to the operator of Anna's Archive

It seems naive to me to think that Anna is one person and not a team of people.

So "The guy running this site" should be "The people running this site" then.

Yeah exactly, they were both being naive about it.

"Hi, I’m Anna. I created Anna’s Archive, the world’s largest shadow library. This is my personal blog, in which I and my teammates write about piracy, digital preservation, and more."

What do we do now?

As I said, I think you're naive to think that there is actually a member of the Pirate Library Mirror (PiLiMi) team (the team behind Anna's archive) named Anna. They know what is at stake, so they wouldn't risk sharing personal information.

Possibly. It seems at least as naive as thinking a team member sharing an extremely common first name is going to matter in this context.

It matters a lot, so no. If you have a long list of suspects, even if the name is common, it will focus attention on a small group of people.

Oh my gosh. Duxiu is an absolutely incredible resource. The prospect that it will be fully searchable in a year is incredible!!

It really is

Please elaborate on Duxiu!

The Chinese system where all books are in a central database gives them huge advantages in training AI.

I just realized that Western companies are screwed because of the people who wanted to make a buck by creating artificial scarcity.

I have been predicting for decades that IP laws would bring about the end of our civilization. I did not know how it would happen. Now I know.

If that dooms the West, fine; the rest of the world will simply do better than they have for the last three centuries. It will level the playing field and dismantle the current tier-like system.

We are going back to a time when GDP was just a matter of how many farmers your country had. But in the 2000s it will be how many STEM people your country has and how much IP it can create and license. Tech IP is like a global rent collected from everyone who consumes tech.

This collection is 40 times larger than books3. No one seems to care at all about copyright. It’s tempting to leave the English speaking world behind and focus on someplace that supports academic work.

Will people call this theft? No one has so far.

In the meantime, simply abandoning European and American knowledge is a realistic option. LLMs will learn in Chinese.

On the other hand, it would suck much more to be thrown in a Chinese prison. So probably don’t do this if you plan on visiting China. (If China reacts by fully endorsing this, that would be the uno reverse of the century.)

Replying to a deleted comment, because I think this is worth underscoring:

By explicitly supporting this, China could signal to AI researchers worldwide that China is a safe haven. It’s the materialization of what everyone said would happen: when copyright enthusiasts shut down AI work in America, companies will simply go elsewhere.

Imagine if the epicenter of AI research shifted away from the US. This may seem unlikely, but lack of data is the biggest problem that open-source LLMs face. Mistral had to refuse to disclose their training sources entirely just to release a decent model.

Personally, I'm tempted to focus on this full time. It's what I've always wanted to do. I never gave a thought to legal consequences, and with this collection there appear to be none.

> China is a safe haven

Heh. It’s a safe haven until your LLM says something the CCP doesn’t like.

This issue is overstated. Given the material is in Chinese, it stands to reason the bulk (but of course not all) of it will not be in violation of whatever policies. Furthermore, there have been a series of open-weights Chinese models that give reasonable answers to sensitive (in China) questions. Unless you plan to release something customer facing in China, it's not something to stress about.

The idea that overly burdensome regulations around open source models in the US shifts the global center of mass of LLMs to China is not implausible.

One argument that resonates among government officials against open-source LLMs is keeping them out of China's and Russia's hands. But among the very best open-weight embedding models, ~13Bs, and perhaps 34Bs are the Chinese ones. The recent DeepSeekCoder also tops EvalPlus: https://evalplus.github.io/leaderboard.html. For those concerned about over-training, ChatGPT and GPT-4 are almost certainly contaminated as well, and there have been independent confirmations of DeepSeekCoder's strength.

> Given the material is in Chinese, it stands to reason the bulk (but of course not all) of it will not be in violation of whatever policies.

Especially since there is no freedom of the press in China (you need a permit for each published book), the books are all already censored.

I still don't think China could be a safe haven, though. They haven't shut down open-source models, not because they are nice, but because they are incompetent.

> Especially since there is no freedom of the press in China (you need a permit for each published book), the books are all already censored.

Even in a fully censored environment, you can still accidentally run afoul of state censors.


No. LLMs will evolve emergent behavior: capabilities beyond the training data, including saying things about China. You just need to prompt it correctly.

"Safe" and "haven" are indeed bold words when applied to the current Chinese regime...

While this is true, imagine having to run your entire dataset past the censors, to make sure your LLM doesn't turn out to have unpalatable opinions.

In Soviet China, LLM censors you.

Sort of joking, but some people seem to be using the dual LLM approach where the 2nd LLM censors the output of the first. Take the Gandalf game as an example.

And while you don't need an LLM to censor an entire internet at scale, I can see how it would be a useful tool for the censors in China or anywhere else. "Hey CensorGPT, does the following Weibo post history make fun of dear leader, take note of any Winnie the Pooh themes, or touch on any other criticism of the CPC?" etc.
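As a sketch, the dual-LLM pattern mentioned a few comments up is just a filter pass over the first model's draft. Here `generate` and `classify` are hypothetical stand-ins for real model calls (a naive keyword check fakes the censor model):

```python
# Dual-LLM moderation sketch: a second "censor" model reviews the first
# model's draft before anything reaches the user.
def generate(prompt: str) -> str:
    return f"Draft answer to: {prompt}"  # stand-in for the primary LLM


def classify(text: str, banned_topics: list[str]) -> bool:
    # Stand-in for the censor LLM; here, just a naive keyword check.
    return any(topic.lower() in text.lower() for topic in banned_topics)


def guarded_reply(prompt: str, banned_topics: list[str]) -> str:
    draft = generate(prompt)
    return "I can't help with that." if classify(draft, banned_topics) else draft
```

The Gandalf game works roughly like this: tricking the first model isn't enough if the second one catches the leak on the way out.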

Only if the model serves the mainland. Chinese authorities don't care what you serve others, I think.

On a technical level it's really not that much harder than all of the RLHF companies have to do in the West.

I guess it wouldn't be, since it's for the exact same purpose. The west is just censoring different things than the east.

Well, there are some things different and some things the same. Every society censors some information, it is just part of state power. Examples of things that both Western nations and China censor are things like material support for their designated terrorist groups, fraudulent commercial communications, bomb making instructions, information tools that could be too helpful for producing bioweapons, the identities of personnel in intelligence, the identities of undercover police, the precise security arrangements for the head of state, etc.

Also worth noting that censorship serves a dual purpose. Censorship can be bypassed both inside China and outside it, but because it's overwhelmingly administered through the internet the intelligence services will get a bunch of pings if you're trying to access information they care about. Censorship serves a secondary purpose of making you much more likely to generate intelligence if you go searching for XYZ restricted information.

I am not familiar with how enforcement is done in China, but in the US it's common for the intelligence agencies to detect someone through search patterns (e.g. "what is the schedule of Joe Biden", "where can I buy a pressure cooker", "purchase micronised aluminium" will absolutely get you flagged somewhere if you google them and actually click on the links/engage normally), and then either go straight to entrapment if it can get through court, or engage in parallel construction if it can't (the most common method is convincing a cop's CI to give a tip about the suspect, which can trigger digital searches with a warrant) and then move to entrapment. I imagine in China they don't bother with parallel construction; there is no legal fiction that they're not surveilling you, so they can just convict you for their equivalent of conspiracy based on the digital evidence as presented to the judge.

They both censor what they don't like

Maybe they will need you to give them access to the weights, much like how tech companies need to share IP with them.

Anna's Archive is already blocked by the Great Firewall https://blocky.greatfire.org/detail/535678/https%3A%2F%2Fann... , chance of endorsement is nil.

Is there a way to find out why? It could be anything from "most new websites get blocked by default, or automatically" to "a censor didn’t like a specific thing."

I’m not a fan of bowing to censorship (to say the least), but at this point I am even less a fan of being verbally spit on, called a thief, and being included in lawsuits. I’d take the lesser of two evils, since there’s no way to even do any work in the first place in America.

(I’ve been depressed for months that the inevitable outcome of all the LLM lawsuits is that book authors will demand licensing fees, which completely kills open source AI models for all but the most funded institutions. I.e. my original work would’ve been completely impossible.)

Most websites do not get blocked. When it happens, the censors won't tell you, but if there's a search bar, it's usually easy to find content that explains the block: https://annas-archive.org/search?q=Prisoner+of+the+State+Zha...

Thank you. I have no way of knowing anything about China, so this kind of thing is super helpful. I wouldn’t have even known what to search for to trigger the censors.

In principle it's easy to explain: as long as you have uncensored content on your website, it's going to be blocked. For example, HN is blocked.

If it didn't happen, then it's because you are not big enough.

A few cases:

- FB, Instagram, Reddit, Twitter, and YouTube are all blocked: user-generated content without an outrageous censorship effort (Douyin has 100k employees dedicated to censorship, its top cost center; Weibo 50k, IIRC) makes blocking very predictable. Just assume such things get blocked.

- Google is blocked. They announced in 2010 that they no longer wanted to censor results.

- A censored version of Bing is not blocked.

- Dropbox is blocked: the same reason.

- Notably, GitHub is NOT blocked, though GitHub Pages and Gists are. This is because GitHub does censor repos upon government request, plus, when it was once blocked, scientists spent effort and managed to reverse the decision.

- WhatsApp and Signal are blocked. SMS from Signal is also blocked for all +86 numbers so people simply can't register.

- iMessage is not blocked. This is, IMO, an outlier. My guess is it's because Apple + no group chat + incompetence.

Didn't Apple agree to store data inside China for their citizens in a separate server system, thus making it subject to scrutiny on a national level? If that has been the case and people think otherwise...well, it shows how hard it is to navigate the abstractions of networked systems.

They did. But iMessage is end-to-end encrypted. As long as you avoid the iCloud caveat (iCloud backups are by default not E2E encrypted) you get an actual uncensored E2EE messaging app in China.

Sharing of information is more important to me personally than business interests

If the AI is accessible to the rest of the world, it's not a problem: the model can query its knowledge and answer in any language since it already is very good at translating things.

The problem is rather that, given the economic war between the US and China, the chip ban, the Taiwan issue, and the GFC, we may not have access to it.

China may find it attractive to have a model trained on their languages and cultures.

It would send a message of power, and ensure that local tech companies will use a model with their influence rather than western corporate influence.

Not saying they should or would endorse, but it wouldn't be _that_ surprising.

> simply abandoning European and American knowledge is a realistic option.

Haha, OK, yeah.

Chinese languages have a higher information density per character, which makes me hypothesize that tokenization works better for them than for Romance languages, which could lead to more efficient LLMs.

Not particularly? The number of "words"/distinct concepts is not that much different between languages. Having more characters just makes the words look shorter...

That said, Chinese is an isolating language, so there is little information within a word to hint at the rest of the sentence (no plurality, tense etc.). Which could be better or worse for language models. Or it might not matter in the end given modern tokenization also uses word "fragments" for Western languages.

Right now English language LLMs favour more tokens over fewer tokens and more unique tokens. For example in GPT4, “ Altogether now” is four tokens.

I don’t think visual information density is much help.

I'm just an observer of the space, but I imagine each Chinese character (or each word) being a token would be pretty effective, especially as inputting/outputting more tokens requires more VRAM. Additionally, Chinese doesn't have conjugations, which I imagine makes tokenization easier.

I think it's easier to connect semantic meaning between these tokens/characters, because you can relate a character to other characters by its presence in words. Most words are two-character compounds, though sometimes words AB and BC mean the same thing; in conversation people will just say A or C, and the full meaning (AB or BC) is inferred from context. So in a semantic-meaning database you would want connections from A to AB, and from BC to C. You could also have a dimension for each radical that isn't a phonetic component. Radicals link vaguely related words together, and I think this is a powerful language feature, missing in English, that could be exploited in the NLP field.

I'm not a fluent or proficient speaker, so maybe I'm not the best person to explain things, but before OpenAI's Whisper, Mandarin speech-to-text in WeChat was way better than any English speech-to-text I've ever used. The structure of the Chinese language and its large user base would make me bet long on Chinese NLP progress over the West's.
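A toy sketch of that shared-character idea (the example words and the plain set-overlap score are my own assumptions for illustration, not an established NLP method): relate compound words by the characters they share.

```python
# Toy sketch: score how related two Chinese compound words are by the
# characters they share. "电脑" (computer) and "电话" (telephone) share
# "电" (electric), so they get a nonzero similarity.
def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the character sets of two words."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


print(char_overlap("电脑", "电话"))  # shares one of three distinct characters
```

A real system would use learned embeddings (and radical decomposition), but even this crude overlap links 电脑/电话 while leaving unrelated words like 火车 (train) at zero.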

Interestingly, BaiChuan2 didn't just tokenize individual characters.


Text tokens are, by construction, more like Chinese ideographic characters, so no, there would not be a big difference between tokenized English and Chinese. The number of unique tokens is decided upfront (something like 50K), so it's much larger than a basic vocabulary; English tokens can even be two-word phrases.

Look up Byte-Pair Encoding (BPE) for a fascinating and simple algorithm that explains the process of choosing the tokens. It shows why it doesn't matter how token dense the source language is.
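The core of BPE fits in a few lines. A minimal sketch (the word list and merge count are made up for illustration; a real tokenizer works on a huge corpus and learns tens of thousands of merges):

```python
from collections import Counter


def merge_pair(symbols, a, b):
    """Replace every adjacent (a, b) pair in `symbols` with the fused token a+b."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def bpe_merges(words, num_merges):
    """Learn `num_merges` BPE merges: repeatedly fuse the most frequent adjacent pair."""
    vocab = [list(w) for w in words]  # start from individual characters/bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        merges.append(a + b)
        vocab = [merge_pair(w, a, b) for w in vocab]
    return merges


print(bpe_merges(["aab", "aab", "abc"], 2))  # ['ab', 'aab']
```

Because frequent pairs get fused regardless of script, a BPE vocabulary trained on Chinese ends up with multi-character tokens the same way an English one ends up with multi-letter (even multi-word) tokens, which is why the source script's visual density mostly washes out.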

Imagine losing WW3 because your language is less efficient to store and query than the enemy, so your war-llms need more resources and are less effective.

My understanding is that all spoken languages are roughly equally efficient at data transmission per unit of time. I'm not sure how long it takes someone to read a text in Chinese over a text in English, though.

Per unit of time, that is true, because languages with lower information density are spoken at a faster rate to compensate (see: Spanish speakers speaking way faster than English speakers). But LLMs don't ingest data per unit of time; they ingest text.
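A crude illustration of that density difference in text form: the same sentence rendered in both scripts (the Chinese translation is my own assumption, and raw character counts are only a rough proxy for token counts, which depend on the tokenizer):

```python
# Same sentence, two scripts: Chinese needs far fewer characters, though
# what that buys you after tokenization depends on the tokenizer's corpus.
en = "I am going to the library tomorrow."
zh = "我明天去图书馆。"  # assumed translation, for illustration only
print(len(en), len(zh))  # 35 vs 8 characters
```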

Sure, but, depending on your representation, you can only have so many different tokens, and if you use, say, pinyin, then you don't have much of an advantage over English.

It'll be interesting to see how LLMs do when trained on concepts (ie the Chinese alphabet), rather than sounds, though.

> Sure, but, depending on your representation, you can only have so many different tokens

From my understanding (which only comes from reading HN), unique token count isn't an issue that LLMs run into. If it is, then that would be a bummer for the possibilities of exploiting all the features of the Chinese language.

> and if you use, say, pinyin

Well, yeah, pinyin would be inefficient because you're stripping away all the "built in" semantics of a Chinese character to focus only on the pronunciation.

I imagine you can always expand your token count, but then is that very different from syllabic alphabets (that encode multiple syllables in one character)?

Granted, that doesn't apply to Chinese, which encodes concepts, so that will be interesting to see in LLMs.

I think so, because even if you're encoding multiple syllables, that's still just the pronunciation.

I've learned languages with alphabets and with glyph scripts like Chinese. What I don't like about alphabet languages like English is that the letters themselves provide little context for a word's meaning, although you could learn the Latin roots and guess from there. You know how to say a word, but you don't know what it means; with Chinese, you might guess what it means, but you don't know how to say it. The characters have more meaning in them, although this isn't the case for every word, which is something non-speakers assume.

In short, in my experience learning Chinese was far easier than learning Vietnamese, even though the two have historical ties: Vietnamese used a fork of the Chinese script until the early 1900s, when it switched to the Latin script to improve literacy rates. Sure, that improved literacy, because to read you only need to know how to pronounce; but it doesn't mean glyph languages are harder to learn in terms of meaning and semantics.

VRAM usage depends mostly on the parameter count, not on the number of input/output tokens, if I remember correctly.

Also, in the case of the English language, it's roughly one token per word, so it's the same as Chinese, assuming both LLM tokenisers were geared towards their native language.

The only issue is when you have a tokeniser geared towards Western languages and you try to use it on a different group of languages; then a single word in a language foreign to the tokeniser becomes multiple tokens.

But that has nothing to do with the underlying structure of the language.

In other words, you wouldn't really see a difference between an input in Chinese and one in English after the text gets tokenised. It's roughly the same number of tokens, and the underlying parameter count would also be similar.

I wouldn't be sure of that:

- if by efficiency you mean the speed of operation, then it's the network size that determines it

- the input token / vector size is not dependent on the language

Do LLMs really encode knowledge regardless of language, and tap that knowledge irrespective of which language you query in?

My mental model of LLMs being next-word predictors with a long context window suggests they don't. Are there any papers on this?

Anthropic has a recent paper that talks about this [1]. See figure 16 in particular. The larger the model, the more it generalizes between languages.

On the other hand, no matter how large the model they struggle to generalize between sentences of the form "George Washington was the first US president" and "The first US president was George Washington".

Clearly generalization behavior is unintuitive, you can come up with a post hoc explanation (weights are independent between early and late layers) but I doubt anyone would have thought ahead of time that models would have an easier time generalizing between languages than between slightly different ways of saying the same thing in English.

[1] https://arxiv.org/pdf/2308.03296.pdf

A big area of research in NLP is translation between natural languages, I’d go so far as to say that “translation tool” is a better mental model than “token predictor”. For example, think of summarisation as translation between verbose language and concise language, or code generation as translation between English and Python.

They are called "transformers", after all.

There was a nice paper recently showing that even people encode the knowledge differently depending on which language they use.

According to that paper, people made decisions more emotionally in their native languages compared to their second language - where they tend to be more logical.

In case of LLMs there were some nice cases where GPT gave different replies depending on the language of the query. Factual information was roughly the same, but in some cases the model gave totally different replies depending on the language of the query.

As a bilingual speaker - my experience is that the replies are very similar regardless of whether I speak in Polish or English, or mix them both within the same sentence.

SGD in high dimensions is good at finding efficient, generalizing models. Naturally, it is more efficient to learn a representation of the abstract concept of a table than to learn separate representations for the hundreds of different words for tables. I don't have a reference, but to me it feels very intuitive that the former is what happens.

You say that like predicting the next word is an easy task. Calculating the permutations with statistics alone would probably require more compute than the entire universe raised to the power of the entire universe. And there's just not enough data; even the entire internet won't be enough. We abandoned those methods long ago.

Yes, LLMs really encode knowledge. This is not up for debate, and it's obvious to anyone working in AI. Words go in and words come out, yes, but the magic is what happens in between. The words are broken down immediately at the first layer. Then hundreds of billions of parameters do god knows what. But it's safe to assume they encode base reality as best they can. It is what I call a 'language-induced reality model'.

The task of next-word prediction over a humongous dataset, it turns out, is best solved by building an inner representation of the real world. It's still not a perfect model, since it's only induced by language, so it obviously has limitations. But it really is modeling the world, and with higher fidelity than people imagined possible through words alone.

Predicting the next word is hard. To do it more efficiently than a dumb lookup table (the Chinese room thought experiment) or a Markov chain, you basically need to build a world model. This model sits in the deeper layers of the LLM (the "deep" in deep learning); turning it into English or Chinese only happens closer to the output layers.


Standard large transformers trained on corpora of multiple languages will generally perform next-word prediction in language A by using information that was only seen in training data in language B, demonstrating that they have managed to implicitly learn a capability for translation and/or multilingual perception.

This is fairly easy to test for yourself. What did you try?

I don't think it's easy at all without access to the training data. I could ask about some information I find obscure in my native language but I can't be confident someone didn't write about it in English on e.g. reddit.

How exactly are you proposing he tests this without access to hundreds of thousands of dollars worth of compute? Toy models don't work for this kind of thing, small language models behave qualitatively differently from large ones.

What is a better way of predicting the next word if not knowledge encoding?

You may disabuse yourself of this mental model by learning about universal function approximators and the possibility that intelligence itself can be approximated.

I’m guessing one of the big Chinese firms will outbid everyone. They’re obviously trying to catch up with OpenAI and throwing whatever money necessary at this makes sense as a result

That’s ok. It’s the expected result in a market economy.

America is throwing out their chances of making open source LLMs. Copyright holders are demanding license fees for training. That’s analogous to someone demanding licensing fees before you make a YouTube video. No one would be able to do it for free. Whereas it was completely possible to train a high grade LLM for free (clusters are surprisingly accessible for researchers) and it wasn’t till recently that you had to worry about being sued for it.

Net result: open source LLMs die, except for companies that can open source their smaller (lamer) models as an upsell for the real ones that anybody cares about. That’s not a world where open source makes a big impact. That’s a world where (metaphorically) GPL software is subservient to business interests for the rest of eternity. Say what you will about whether that’s true, but no business has influence over Emacs, and it’s fantastic, powerful software. No one will be able to make the equivalent open source fantastic LLM in America at this rate.

Doesn't have to be made in America, you can download a LLaMA or Mistral in minutes anywhere. A Chinese base model can be fine-tuned anywhere.

What would hinder an open source pirate LLM?

A bunch of factors. One is that you’d have to keep your identity private, but credibility is how you get access to resources. And resources are necessary to train anything.

Take TRC for example. They give people access to TPUs in exchange for being cited. But if they were cited as facilitating large scale piracy, they probably wouldn’t be happy. It could even lead to a lawsuit on the grounds of facilitating copyright infringement, which will likely be the charge against me if someone gets mad enough to sue me directly. (I never distributed anything, but that doesn’t matter if they can prove facilitation.) And TRC is an even juicier target for lawsuits since Google is a giant loot box of money for them.

The coordination problems faced by all illegal entities.

Piracy generally means taking someone else's full-fledged product and using it. In your description, though, you're taking the data illegally and then doing all the compute on it yourself. The amount of compute needed in this case is staggering, so you are back to coordinating with other people, and any one of those individuals turning against the group would likely doom it. Hence it's a high-risk operation.

This is something I've been thinking about: what kind of compute would it take to train an LLM on, say, a 100GB torrent of books from Anna's Archive?

I'm not sure I follow. You're saying that no company (except the small ones) will be able to open-source an LLM, but then you cite software created outside the traditional company structure as an example of what we'll never have in the LLM space. Doesn't your example negate the premise?

Not at all. Maybe it’s lost to time, but most of the important models were created by academics, not companies, up till recently. GPT-J for example was trained by one person acting alone (Ben Wang).

I fine-tuned GPT-2 1.5B on chess games. AI Dungeon fine-tuned on fantasy novels. All of this type of work will become impossible with the specter of lawsuits hovering overhead.

EDIT: also, most impactful older models were by one person (e.g. YOLO). What I like about ML is that lone wolves can have a big impact.

Chinese firms can’t afford to hire the number of censors required to satisfy the very proud leaders of the CCP.

I wonder if LLMs will now make it much easier for us to access information in Chinese, by acting as an interface.

The Chinese have a very different cultural response to copying, and often see it as a form of honoring.

They would be more than happy if everyone on earth had every Chinese work in existence.

That seems essentialist. Chinese incentives are aligned with a lack of IP protection right now. For instance, IIRC, bunny’s explanation of gongkai (https://www.bunniestudios.com/blog/?p=4297) explains that Chinese companies freely trade chip designs because they make money off of manufacturing, regardless of who designed it.

> Chinese incentives are aligned with a lack of IP protection right now.

Even that is essentialist. In some areas, sharing IP is beneficial to the originators and incentives are aligned. In other areas, not so much.

Duxiu makes money selling access to their collection of scanned books. Their incentives are not aligned with having others use those scans for free. Currently, such cases seem to be mostly handled under the most general provisions of the Anti-Unfair Competition Law (反不正当竞争法), but new amendments are likely to make it more explicitly illegal https://www.whitecase.com/insight-alert/china-releases-draft... ("Improperly obtaining or using another business operator's commercial data")

This seems to me a much more moral way of doing things. Actually making the stuff is what's important.

Patents and copyright are intended to encourage the publishing of information rather than keeping it as a trade secret (and to make individual creators profitable enough to establish themselves, unlocking more potential investments for the world). That doesn't mean they always do their job, but neither will there always be people running a company or a nation who can maintain such sharing in the face of competitive pressures.

Oh no, there’s a huge difference between 山寨(copy) and 致敬(pay tribute)…

AI is a technology that is going to bring ridiculous profits, billions of dollars. Why don't AI researchers want to pay for the books and for the work to write them? They want to train their models for free, and then sell access for money like OpenAI.

Imagine saying that someone should pay for the right to make open source software, before you’ve made anything, or proven any kind of market value. In a world with power law distributions, the vast majority of people will not come close to even millions in profit. At one point I was hoping just to run my own GPT API to self fund my measly $7k/mo burn rate. Then I could live life happy as a clam doing research all day in my little shell.

Such a dream is completely impossible when rightsholders demand everyone pay for training data, rather than sharing in the profits of the result. It’s backwards. The actual market profit is what matters. But authors are frothing at the mouth, upset that their work is being used at all without their explicit consent. And some of those authors, like Sarah Silverman and more recently former governor Mike Huckabee, are going after LLM companies and demanding injunctions preventing them from using books at all, under any circumstance, unless they go through authors. Good luck going through 100,000 different authors.

The net result is that we’re headed for a world where the YouTube of LLMs will dominate the space, and no one else will be able to run their own platforms. I’ve done everything humanly possible to try to steer us away from that outcome — you like using stable diffusion yourself? Imagine if no one released stable diffusion in the first place. But things are not looking good.

Lawsuit against Eleuther for books3, courtesy of Huckabee: https://news.ycombinator.com/item?id=37962244

Example of authors being furious, though thankfully with an exception for academic curiosity: https://news.ycombinator.com/item?id=37949012

Twitter is completely merciless. Digging up some examples to illustrate the point…

Here’s one: look through the replies of someone who released a 3D dataset. https://x.com/mattdeitke/status/1678855859089326080?s=61&t=j... "Eat shit Matt" (50 likes) "Hope you enjoy the storm" "you make me sick" and on and on. I can tell you firsthand that this isn’t unusual; it’s the default reaction now.

> Such a dream is completely impossible when rightsholders demand everyone pay for training data, rather than sharing in the profits of the result. It’s backwards. The actual market profit is what matters

I like the way you think. Can I also get free books at the store if I'm just reading them for learning and will only make money from the knowledge later, or if I promise that I won't put it into practice?

I did a similar deal in university but had to return books to the library every week so I like your plan better.

One big difference with your university analogy is that the university _did_ pay for the resource, often way more than you would as a private citizen, since it is for sharing purposes.

Even worse, for e-books, libraries are often limited in how often they can lend a book out. The rationale is that physical books "degrade" over time and so have a limited lending life, and the idea is that the same should hold true for digital books as well…

I get the point behind your sarcasm of the first sentence, since that is the essence of the problem.

Well yeah, people sometimes read whole books in a store. Though this use case is done better in facilities that specifically support it - libraries.

If you really need to take the books back with you, that's what the library copy machine is for.

The good thing about LLMs (at least when compared to YouTube) is that "all" you need to train an LLM is processing power, so it's easier now (and will be much easier in the future) to do it illegally. Much easier than running a YouTube competitor illegally for any length of time.

Why do you deserve to reap the benefits of their labour while not giving them a cent? Just saying "but open source!" isn't a valid argument; everyone who wants access to someone else's work should be forced to pay, and that includes the current big players that have so far been circumventing paying people for their work.

It's attitudes like these from the AI sycophants that make people despise the technology and anyone involved in it. I don't give a shit how difficult not having free, unfettered access to everyone else's work makes things for you. I don't care that we're supposedly holding the world back (from infinite machine-generated spam made for the sole purpose of making everyone else's lives maximally miserable).

In general, the goal of labor is to enrich the society you live in. Whether it’s capitalism or socialism, the theory is the same: the richer your society, the happier you’ll be, even if you’re poor. Objectively, the poorest in America live far better than the poorest in poor countries.

Then there is the harm argument. What is the harm to authors of letting AI models do this? Models aren’t replacing authors any time soon. And even if they were, you won’t be able to stop it; not once in history have people successfully stood against a wave of new technology. Technology wins 10 out of 10 times in the long run.

I am skeptical that authors deserve anything for merely having a transformative use of their work. But I’m also sympathetic. The key is for authors to get a slice of the actual market value. Right now authors are shutting everyone down merely for using their work at all, with no consideration of whether anyone is even profiting.

Yet even when authors deserve a cent, how do you decide which authors get how many cents? Do all profits go to copyright holders, as with YouTube? Then there’s no room to run a service. Is half fair? Before or after company expenses? And whatever is allocated, is it distributed evenly among all 100,000 authors, whether they’re Stephen King or just an enthusiastic contributor to AO3? $1M in profits distributed to 100,000 authors is a whopping $10 per author. That’ll buy them a Happy Meal and fries. It’s far more lucrative to make up imaginary damages and sue them in court.

It’s immensely complicated, and it’s not at all clear that it’s evil or even exploitative to train AI models on other people’s work.


> And? Why would I give a shit about some silicon valley asshole's company not being able to extract all the money on the planet for himself on the back of others?

That logic works both ways. Why should AI creators give a shit about your wishes? If you’re not going to meet them halfway, they won’t either, and it’s a race to the bottom for everyone.

You can be as angry as you want. It won’t change a thing. And if you won’t be reasonable and have a good faith debate on the topic, there’s little choice but to just ignore you.

> Not really, just respect the creator's wishes.

Why? Please justify this from a moral standpoint. You being the creator does not mean you get to control all aspects of your work. That’s not even how copyright works in general, let alone this case.


If you’re going to refuse to answer the question, we’re pretty much done here. You won’t justify why you feel entitled to control your work merely because you made it. That wasn’t the point of copyright in the first place.

Bits are not property. Property is stuff, like a chair. You’re like the person who came up with "you wouldn’t download a car." Everyone would in fact download a car if they could, and frankly that’s a world that we should all be pushing for.

I don’t know what to do with you. I don’t do well against people who just yell, since I tend to meet their emotion in kind. It’s a personal flaw. So, as one human to another, allow me to raise my middle finger at you for saying that I don’t care at all about morality when I lose sleep over this very topic. Enjoy your anger; you seem to revel in it.

We also have a higher duty here, which is to foster good conversation on HN. If you can’t muster up some intellectual curiosity on this topic, we should both just end it — for the audience, if not for ourselves.


Both of your premises fundamentally differ. You're trying to argue that it's wrong for others to use your work for their benefit, without compensation for you.

They're arguing that there is no such thing as "your" work in the first place, that society would be better off with unrestricted access of information for everyone. There is no such thing as "stealing" data. In fact, the notion of even being able to restrict other people from building on previous works is unethical in the first place: https://en.wikipedia.org/wiki/Free-culture_movement

It's pretty much impossible to reconcile this in a comment thread.

For the record, I find your comments about the future to be incredibly pessimistic. The advent of AI has allowed a lot more people to be creative without needing to dedicate hundreds of hours to it. It's not all about money. A lot of community models out there, for example, are free in both meanings of the word.

> For the record, I find your comments about the future to be incredibly pessimistic. The advent of AI has allowed a lot more people to be creative without needing to dedicate hundreds of hours to it. It's not all about money. A lot of community models out there, for example, are free in both meanings of the word.

And I read this whole portion as incredibly naive and idealistic. Wow, you can click a button and get some shitty artwork and some robotic copy text, incredible. And you know how this is being used right now, which is only going to get worse as this tech matures? Spammers have an easier time than ever flooding the frontpage of every search engine with machine generated garbage whose sole purpose is to trick people into interacting with it. Scammers can generate infinite fake but convincing material with which to target everyone and trick them. Oh but it's okay, you can click a button and get some pictures!

Let's also just ignore for a moment what these AI shenanigans will actually lead to down the line: the mass replacement of anyone and everyone possible, for the sake of lining the pockets of some psychopathic C-suite whose one and only worry in the world is how to make the maximum possible amount of money. We're already living in a world where some shitty language model examines every letter of everything you post and decides whether it's a no-no for the corporation controlling it before banning you with no recourse. This is only going to get worse and worse, and soon enough it will start flooding into real life, where the course of entire lives will be decided by some unaccountable black box. All with zero recourse, and zero consequences for the sociopaths who implemented these things in the first place.

Yeah that's some real hopeful future we're looking at, all so that talentless people can click on a button and get a picture out of it. Such "progress" indeed.

> Yeah that's some real hopeful future we're looking at, all so that talentless people can click on a button and get a picture out of it. Such "progress" indeed.

You could say that about photographers too.

> What happens if tomorrow, every creator just stops creating free shit for your AI to absorb?

Then nobody will have free shit to absorb. Not young aspiring authors, and not AI models.

> Hilarious statement, since obviously many people (especially the people who's actual work you're stealing) disagree with that.

It's not theft when a human author reads the work of another author. And it's not theft when a model reads the work of another author. It's exactly the same.

> Then nobody will have free shit to absorb. Not young aspiring authors, and not AI models.

Difference being, of course, the fact that the young aspiring authors are fully capable of coming up with novel ideas without having to copy the contents of 90 billion textbooks cover-to-cover.

And I actually asked about a situation in which the content is no longer allowed for the AI specifically to absorb. What happens then, when these noble warriors of progress that are AI sycophants lose access to their as-of-now treasure trove of free shit that other people worked hard on?

> It's not theft when a human author reads the work of another author. And it's not theft when a model reads the work of another author. It's exactly the same.

Will this disingenuous line of "reasoning" ever die? A computer != A human, no amount of anthropomorphizing will ever make them equivalent.

Edit: Also, if we do take this argument at face value, then humans already have to get proper access to the works of other authors.

>Difference being, of course, the fact that the young aspiring authors are fully capable of coming up with novel ideas without having to copy the contents of 90 billion textbooks cover-to-cover.

So only scale matters? The fact is individuals almost never come up with novel ideas. Can you imagine having to pay someone for that smiling sun you drew in kindergarten?

> then humans already have to get proper access to the works of other authors.

What the hell does that even mean? I mean, please go read "The Right to Read" by RMS. In your mind, if I go to my friend's house and read a book, I've committed a crime. If I lend that book to someone, I've committed another crime.

You've embraced the right to make a profit and put it far above the needed idealism of sharing information as what makes societies grow. But in your desire to profit you are going to create the framework of greed that will allow massive corporations to crush us just as effectively.

> Difference being, of course, the fact that the young aspiring authors are fully capable of coming up with novel ideas without having to copy the contents of 90 billion textbooks cover-to-cover.

Every human was taught their language by other humans. It would be interesting to see how creative an aspiring author would be with zero literary exposure ever in their life. And that's all AI is asking: to have the same access to reading material as anyone else, and not to be starved arbitrarily of the same access to the world as a human.

> Will this disingenuous line of "reasoning" ever die? A computer != A human

Who cares? Stop being so compuphobic. Artificial intelligences should have as much access to reading material as any human.

> ... humans already have to get proper access to the works of other authors.

Of course. On this we agree. AI should have no extra access to written material and should have to respect the same laws as any human.

> ... And that's all AI is asking ...

No, the AI itself isn't asking anything because it's not a sentient being capable of rational, independent thought. The people making the AIs are asking everyone else to blindly believe that using the word "learning" a bunch makes a computer and a human equivalent entities.

> Who cares?

Quite a lot of people, as can be shown in literally any discussion where AI comes up.

> Artificial intelligences should have as much access to reading material as any human.

Why? Other than AI creators wanting to make infinite money by generating spam en masse, of course.

> No, the AI itself isn't asking anything because it's not a sentient being capable

That is really just your human ego and sense of superiority speaking. Lesser beings might not have the cognitive capabilities you do, but they still should not be discriminated against.

> Quite a lot of people.

A lot of people are racist too. Doesn't make it right.

> Why? Other than AI creators wanting to make infinite money

That's just cynical. There are a lot of people working on AI that have no such motivation.

> They want to train their models for free, and then sell access for money like OpenAI.

Counterpoint: someone who teaches himself how to do something entirely via books freely borrowed from the library, and then makes a living from those learned skills, is not normally considered to be stealing.

But… how do you think the book got into the library? The state buys the book from the author’s publisher. The author gets paid their royalty. This happens with your tax dollars. This is a good thing, the library system allows authors to receive payment to provide a public good, that can educate citizens who can then make a living. Stealing their work to train an LLM to make a living is not the same thing.

Okay, then let's have libraries train LLMs on the whole collection of books that they have legitimately obtained. I'd assume that for this particular collection of Chinese printed works, the National Library of China would also own a copy of every one of them, so the end result would literally be the same.

And a computer isn't a human luckily, so this argument doesn't work here.

People will write complex moral and legal reasoning.

I'll tell you the simple answer. They don't want to pay for the books because they can get them for free. Simple as that.

> Why don't AI researchers want to pay for the books and for the work to write them?

You get paid in exchange: you give them your text, they give you their AI. Have you tried Mistral-7B? It can do amazing feats on a 5-year-old GPU. I never imagined my old GPUs would get so smart someday.

LLMs by their nature give every user what they need, adapting and customizing the repository of human knowledge to a particular situation. That's how they pay back, they adapt to serve our needs.

Now authors can use AI to help themselves: brainstorm ideas, critique, first drafts, etc. But to benefit, they have to use it first.

Searle's finally going to get his Chinese room.

Does "Chinese nonfiction" include Mao's collected works and "Xi Jinping Thought"?

Do you realize how all-consuming China's censorship apparatus is, and that this censorship introduces all sorts of biases in their literary corpus which you now want to be used as training data for LLMs?

That's why you fine tune models, right? To make sure the undesirable things in their data set don't lead to undesirable output.

Fine-tuning will never get rid of all biases introduced by the training data, some of which are subtle enough that you won't even know that you should apply fine-tuning techniques to try to mitigate them.

For example, did you know that there is a trend for physics textbooks to be dedicated to Xi Jinping? If you didn't know this, because who has time to sift through millions of books to find undesirable biases, isn't it clear that your LLM will develop subtle but favourable biases towards the CCP?

The biggest theft in history, with governments as accomplices.

With the most valuable haul of all: not gold or coin, but knowledge. How terrible.

But no, I don’t think governments are accomplices. The smartest move would be for China to wait and see how this plays out. It could significantly shift the epicenter of AI work to China. The single biggest limiting factor for open source LLMs is data, and no one in America or Europe wants to be sued.

For what it's worth, we also have millions of previously unreleased (in bulk) books in English, mostly non-fiction, that are available for torrent on our website.

High-speed access available for anyone who can do at-scale text extraction, or who can supply us with new collections.

Anna! I just want to say, I love you. Everything about what you’re doing is heroic. Whoever and wherever you are, thank you.

Please focus on your opsec. The more visible you become, the progressively angrier people will get. Don’t do anything silly like edit your Wikipedia page from your house.

With that out of the way, someone I know happens to have the original books3 epub files. I think they can be convinced to send them to you. It’s only 200,000 books, but that could theoretically grow your collection by 10% or so. I don’t know whether that would be helpful to you (you’ve far surpassed books3 at this point), but if so, let me know.

Given the legal risks, the best course of action for AI companies is probably to ignore English and European books entirely. There is plenty of Chinese data, and the models would learn all the same concepts without exposing anyone to lawsuits.

Since you're pretty knowledgeable about these things, I think I should ask here: I've made a fairly simple design for a program based on BitTorrent, that will allow people to "donate" their disk space to organizations like archive.org, Anna's Archive, and anything else that needs data hosted.

Basically, you download a client, say "allocate 2 TB of my disks to whatever archive.org/donate/disk.rss says", and the server/client combination ensures you download and seed the rarest 2 TB of the collection.

This design is also open, in the sense that the server can share the database of torrents it contains, and anyone can use it to fetch any of the files in the dataset from the swarm.
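The "seed the rarest data" idea above could be sketched roughly like this. Everything here is hypothetical (the `Torrent` record and `pick_for_donor` function are made-up names, and a real client would get seed counts from tracker scrapes rather than a static catalog):

```python
# Sketch of the server-side allocation: fill a donor's storage budget with
# the least-replicated torrents first, so rare data gets mirrored soonest.
from dataclasses import dataclass

@dataclass
class Torrent:
    info_hash: str
    size_bytes: int
    seed_count: int  # how many donors already hold this torrent

def pick_for_donor(catalog: list[Torrent], budget_bytes: int) -> list[Torrent]:
    """Greedily choose the rarest torrents that fit in the donor's budget."""
    chosen = []
    for t in sorted(catalog, key=lambda t: t.seed_count):
        if t.size_bytes <= budget_bytes:
            chosen.append(t)
            budget_bytes -= t.size_bytes
    return chosen
```

A real implementation would also randomize ties, so that many donors running the same algorithm don't all pile onto the same torrent at once.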

Would something like this be at all useful? I've emailed a few archivists, but I got no response, and the one person I've managed to talk to about this said there have been a few attempts on this, but they always fail for one reason or another.

You are literally building what I’ve been slowly working towards on my own. This seems like a very good sign. Multiple simultaneous discovery is a common occurrence in the sciences.

The hard part is that those who donate their space have authority over that space. It’s the Byzantine fault tolerance problem: imagine if 4chan donated their space, then started serving CSAM instead of the expected data. You can use hashes to verify integrity, but then the question becomes who gets to decide which hashes are ok. And hashing makes it impossible to edit large files, which is a frequent occurrence in LLM work. You’re constantly tweaking your datasets and spitting out new blobs.
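The hash-verification part is the easy half; a minimal sketch, assuming a manifest of trusted hashes published by whoever curates the collection (the function name is illustrative, not from any real protocol):

```python
# Donors hold the data, but a trusted manifest of content hashes decides
# what counts as valid; anything served that doesn't match is rejected.
import hashlib

def verify_piece(data: bytes, expected_sha256: str) -> bool:
    """True iff the downloaded bytes match the manifest's hash."""
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

The hard question the comment raises remains: who signs the manifest, and how do you update it when a dataset changes.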

Direct answer: yes, you’re doing good work, and you should keep doing it. I would personally use this for storing books3 transformations.

The other hard part is that you’ll want at least some redundancy — see 6.824 distributed systems, or the GFS paper. It’s why I’ve been implementing Raft and toying with some kind of distributed consensus without a blockchain. (Such consensus is still possible if the researchers were granted authority over what can be stored — which is the whole reason people are donating their disk space in the first place.)

Another issue is sudden bandwidth loss. Data storage is one part of it; the other is rapid transfer. By replicating the data, you can pull it from multiple replicas at once (i.e. there are more seeders). This also protects against someone suddenly getting throttled, or just having a power outage. The protocol should prioritize donors with high bandwidth over vast storage space.

Feel free to DM me on Twitter if you’d like to toss around some design ideas more seriously, and thank you for trying to build this.

> The protocol should prioritize donors with high bandwidth over vast storage space.

If you're doing this in bittorrent then you might want a client that's configured to optimize for a different goal than most torrent clients.

Potential goals, somewhat conflicting:

A) Keep data with a low mirroring degree available. Either this needs to be centrally coordinated, or you need some sort of randomized algorithm where clients pick underseeded torrents, but not everyone picks the same ones.

B) Bandwidth matching. To not consume more resources than it provides, a client should maybe only download one piece of data for every N times it has uploaded a piece. This is much less greedy than what a normal torrent client does, but it ensures that caches themselves don't take up much bandwidth compared to users who actually want to download something. Otherwise a misconfigured cache (e.g. behind NAT) could accidentally always download data without ever giving much back.
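Goal (B) could be sketched as a small rate-matching counter in the client; the class and method names below are made up for illustration:

```python
# Sketch of bandwidth matching: a cache client permits at most one piece
# download per n pieces it has uploaded, so caches never take more
# bandwidth from the swarm than they contribute.
class BandwidthMatcher:
    def __init__(self, n: int):
        self.n = n
        self.uploaded = 0
        self.downloaded = 0

    def record_upload(self) -> None:
        self.uploaded += 1

    def may_download(self) -> bool:
        # Allow a new download only once enough uploads have accrued.
        return self.downloaded < self.uploaded // self.n

    def record_download(self) -> None:
        self.downloaded += 1
```

In practice you would count bytes rather than pieces, but the 1-per-N accounting is the same.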

Thanks, this is exactly what I wanted! I'll DM you on Twitter now!

Edit: Looks like Twitter wants me to pay to DM, I'll email you instead.

Wait what? My DMs are open. But people do occasionally say they have trouble DMing me.

Thanks! It said I need to be verified to DM people who aren't following me.

Thank you, that’s helpful to know. And frustrating. I see why Twitter did that, because bots, but I was willing to wade through the crap to find the gems. Which do get sent.

I responded with some telegram info if that helps.

Haven't you described IPFS minus crypto shenanigans?

Not exactly, IPFS doesn't tell you what to download (you select what to download) and thus can't push the rarest material to you. There are many similarities, but this is much more suited to making large datasets more resilient/accessible.

Did some rabbit hole spelunking on IPFS yesterday...

One could easily use IPFS as the storage layer and add $magic_sauce to manage the distribution of the books within the sub-network kind of like git-annex does to manage people's porn collections. There's one project I saw that does this (using the IPFS-daemon's RPC API) to run a cluster of IPFS nodes for whatever reasons people would want to do something like this.

Yes please, put us in touch by email. Or feel free to email me yourself and we can set up more secure comms from there. Thanks so much for everything you are doing as well!

So fund it with crypto and release it on 4chan.

Any entity should be allowed to read, process and update its internal state using published media. Only your productions should be judged based on copyright law.

The biggest theft in history was you being taught the alphabet, numbers, and words without having to pay a creator for them. This mistake will be reversed. All knowledge will be licensed, all words will be owned.

Short summary of The Right to Read, and evidently the world you want.
