
I had the exact same idea after seeing Google point out that you can[0] get ChatGPT to regurgitate verbatim training data by asking it to repeat the same word over and over again[1]. I'm glad to see someone else actually bring it to fruition.

This, of course, brings two additional questions:

1. Is this "AI, hold the AI" approach more energy-efficient than having gradient descent backpropagation compress a bunch of training data into a model that can then be run on specialized AI coprocessors?

2. Will this result wind up being evidence in the ongoing lawsuits against OpenAI and Stability AI?

[0] Could. OpenAI now blocks generation if you fill the context window with a single word.

[1] https://arxiv.org/abs/2311.17035




This approach cannot possibly be more efficient than running the original model, because it relies on running the original model to get the activations, then searching the text corpus for strings with similar activations in order to compute the next-token statistics. You don't get to skip many steps, and you end up having to do a bunch of extra work.

I'd be surprised if doing this with two completely separate corpora, one for training the model and one to search for strings with similar activations, didn't lead to much the same results, because the hard part is constructing similar activations for strings with similar next-token statistics in the first place.

Note that in the per-layer weights [0.01, 0.01, 0.1, 1.5, 6, 0.01] the penultimate layer is the most important, i.e. the point where the input has already been transformed a lot. So you can't expect to use this to replace a transformer with a simple grep over the training data. (My guess as to why the penultimate layer has a much higher weight than the final one is induction heads https://transformer-circuits.pub/2021/framework/index.html, which implement copying of repeated strings from the input, with the penultimate layer determining what to look for and the final layer doing the copying.)
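For anyone who wants a concrete picture of what that lookup involves, here's a rough sketch of the kind of weighted activation search being described. The names, shapes, and the k-nearest-neighbour averaging are my own guesses for illustration, not the post's actual code:

    import numpy as np

    # Per-layer weights quoted above; the penultimate layer dominates.
    LAYER_WEIGHTS = np.array([0.01, 0.01, 0.1, 1.5, 6.0, 0.01])

    def similarity(query_acts, corpus_acts, weights=LAYER_WEIGHTS):
        # Weighted sum of per-layer cosine similarities between the query
        # context's activations and one stored corpus context's activations.
        total = 0.0
        for w, q, c in zip(weights, query_acts, corpus_acts):
            total += w * np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-9)
        return total

    def next_token_distribution(query_acts, corpus_index, k=50):
        # corpus_index: list of (per_layer_activations, next_token) pairs built by
        # running the *original* model over the training text, which is exactly
        # why this can't be cheaper than the model itself.
        nearest = sorted(corpus_index, key=lambda item: -similarity(query_acts, item[0]))[:k]
        counts = {}
        for _, token in nearest:
            counts[token] = counts.get(token, 0) + 1
        total_n = len(nearest)
        return {token: n / total_n for token, n in counts.items()}

Note that both the query activations and the corpus index require forward passes through the original model, which is the point made above about not saving any work.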


I'm confused: you had the exact same idea, that LLM output is based on the probability of the next token, which is in turn based on the training data?

If that's the case, then no, it's unlikely this result will end up becoming evidence; that much is well known and fundamental.

The author's contribution to the discussion is showing this to a technical audience by writing their own GPT; as they note, most "how do I implement this?" material focuses on transformers.


Much of the sales hype and other literature surrounding LLMs specifically obfuscates the role that training data plays in the model. Training data is "learned from", but that phrasing implies the data goes away after the training process ends, leaving a model that's solely composed of uncopyrightable knowledge about how to write or draw. If the models are actually retaining training data, and we have a way to extract that data, then the models didn't learn; legally speaking[0], they copied training set data.

The idea I had wasn't "LLMs are based on probabilities", it was "what if you benchmarked an LLM against a traditional search index over the training corpus". The linked blog post doesn't rip out the LLM entirely, just the feed-forward layers, but the result is what I thought would happen: an attention-augmented search index that produces nearly identical probability distributions to the 66% of the model that was removed.
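For what it's worth, "nearly identical probability distributions" can be checked with something as simple as a KL divergence between the two predicted next-token distributions. The numbers below are invented purely for illustration; the post's actual evaluation may use a different metric:

    import math

    def kl_divergence(p, q, eps=1e-12):
        # KL(p || q) over the union of both vocabularies, in nats.
        tokens = set(p) | set(q)
        return sum(p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps)) for t in tokens)

    full_model = {"the": 0.62, "a": 0.30, "his": 0.08}      # made-up numbers
    search_variant = {"the": 0.60, "a": 0.31, "his": 0.09}  # made-up numbers
    print(kl_divergence(full_model, search_variant))        # ~0.001, i.e. nearly identical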

[0] Programmers talking about copyright usually get tripped up on this, so I'll spell it out: copyright is a matter of data provenance, not bit-exactness. Just because the weights are harder to inspect does not mean no copyright infringement has occurred. Compression does not launder copyright.


Something to establish up front: people know this, and it's not controversial. It isn't known only to a few; it's how these models work.

Also note that this example purposefully minimizes the training data down to an absurdity, so that it is possible to correlate the next letter's probabilities 1:1 with the input. The key to the rest of this comment, and the discussions you reference, is the observation that this becomes vastly harder once the training data is measured in terabytes, to the point that the question becomes interesting.
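To illustrate what "minimized down to an absurdity" means, here's a toy sketch (my own, not the post's code) of next-letter statistics over a few bytes of "training data"; every probability can be traced straight back to a position in the input:

    from collections import Counter, defaultdict

    corpus = "abcabd"  # an absurdly small training set
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    probs = {c: {n: k / sum(ctr.values()) for n, k in ctr.items()}
             for c, ctr in counts.items()}
    print(probs)  # {'a': {'b': 1.0}, 'b': {'c': 0.5, 'd': 0.5}, 'c': {'a': 1.0}}
    # Scale the corpus to terabytes and this direct correlation disappears.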

As for the argument you're referring to: the people you think are speaking literally are speaking figuratively. They know it reproduces _some_ training data; e.g. 2+2=4 was surely in the training data. Or cf. NY Times v. OpenAI, where they were able to get it to complete an article given the first ~5 paragraphs of it.

The unsettled question, in US legal parlance, is if LLMs are sufficiently transformative of the training data that it becomes fair use.

Eschewing US legal parlance: where, exactly, on the spectrum from "completely original" to "photocopier with perfect recall" do LLMs fall, given we know they sit at neither extreme? What responsibility does that give someone operating an LLM commercially toward the entities who originated the training data?


In my experience, before they blocked it, it would hallucinate something that looks like training data: a GitHub readme that under closer inspection doesn't actually exist and is incoherent, some informational brochure about nothing, a random dialogue.


I found it interesting that in the arXiv paper you linked, they talk about an attack, ethics, and responsible disclosure.

But when it comes to scraping the entirety of the internet to train such models, that's never referred to as an attack.


Scraping the whole web isn't considered an attack because, well, that's just how search engines work. That being said, there are all sorts of norms (e.g. robots.txt) qualifying what kinds of scraping are accepted.

As far as I can tell, AI researchers assumed they could just piggyback on top of those norms to get access to large amounts of training data. The problem is that it's difficult to call copying an attack unless you go full MAFIAA-brain[0] and argue that monopoly rents on creative works are the only functional backstop to the 1st Amendment. Hell, even if you do, the EU and Japan[1] both have a statutory copyright exception explicitly legalizing AI training on other people's text. It's not even accepted dogma among copyright holders that this is an attack.

[0] Music And Film Industry Association of America, a fictional industry association purported to be the merger of the MPAA and RIAA announced on April 1st, 2006: http://mafiaa.org/

[1] Yes, the same country whose copyright laws infamously have no Fair Use equivalent. In Japan, it is illegal to review or parody a copyrighted work without a license, but it is legal to train an AI on it.


Alternatively, you can just believe in standard copyright law, which says you need a license to distribute most content. Most file-sharing cases were decided in favor of that.

The AI companies have been bundling and distributing copyrighted works for pretraining. They engage in illegal activity just to build the AIs. That's before considering the models regurgitating training data or generating derivative works. So there's a lot of risk that they're simply ignoring for money.


I don't want to have copyright law as it currently exists. It is a badly-negotiated bargain. The public gets very little out of it, the artists get very little protection out of it, and the only people who win are intermediaries and fraudsters.

Keep in mind, this is the same copyright that gave us Prenda Law, an extortion scheme that bilked millions of dollars in bullshit settlements. Prenda Law would create shell companies that created porn, post it on BitTorrent, then have the shell companies sue anyone who downloaded it. Prenda Law would even post all their ongoing litigation on their website with the express purpose of making sure everyone Googling for your name saw the porn, just to embarrass you into settling faster.

This scheme was remarkably profitable, and it only stopped being profitable because Prenda slipped up and revealed the fraud[0]. Still, the amount of fraud you have to commit is minuscule compared to the settlements you can extract out of people for doing this, and there's been no legal reform to try to cut off these sorts of extortion suits. Prenda isn't even the only entity that tried this; Strike 3 Holdings did the same thing.

[0] If you upload your own content to BitTorrent, the defense could argue that this is implied license. Prenda's shell companies would lie about having uploaded the content themselves.


re: 2... if you copyright a work, then surely you also hold rights to a zip file of that work. So why not also the probability distribution of letters in that work?


To be precise, you don’t hold rights to a zip file, copyright doesn’t know anything about files. You hold rights to a work, an abstract legal concept. Your rights to the work allow you to control the reproduction of that work, and distributing a zip file is an instance of reproducing the work.

Probability distributions don’t contain enough information to reproduce a work (since they don’t preserve order). They are not copyrightable in and of themselves, and distributing a probability distribution of a work doesn’t amount to reproduction.
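A trivial illustration of the order point: two different strings can share exactly the same letter distribution, so the distribution alone can't get you back to either one.

    from collections import Counter

    print(Counter("stressed") == Counter("desserts"))  # True: same distribution, different text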


If the probability distribution is enough to reproduce a copyrighted work to the level of substantial similarity, then yes, a copy has legally been made.

However, that's not the only question involved in a copyright lawsuit[0].

So far most of the evidence of copying has been circumstantial: a regurgitated Quake lighting function here, a Getty Images watermark there, but everything else has looked like wholly original output. We know from how these models are trained that copyrighted work is involved somewhere, but a court could just say it's Fair Use to scrape data and train a model on it. However, that defense is way harder to make if we can actually open up a model and show "ok, this is where and how it's storing copied training set data". At a minimum, it takes the "how much was used" Fair Use factor from "a few watermarks" to "your honor, the entire fucking Internet".

[0] As usual we will assume jurisdiction in US court


Just to play devil's advocate, if I were on Microsoft's team, I'd say: You just proved our case. Having the "whole fucking internet" == having general knowledge. Synthesizing general knowledge is fair use. If it was trained only on the works of a certain artist, it would be plagiarism. But if it's trained on all artists, it's not really trained on anyone.





