“AI” demystified: a decompiler for “artificial neural networks” (tesio.it)
24 points by Shamar on Sept 2, 2021 | 11 comments



So there's some arguing over terminology, and then the main technical point seems to be that you can reverse-engineer a training dataset from the "virtual machine" built by training a neural network.

The decompilation process doesn't just use the neural network, though: if I understand correctly, it also uses logs from the final training epoch that include error and weight-update data. Does this somehow smuggle the training dataset back into the VM? To me, if you're making a statement about the nature of existing ML systems, the claim to "reconstruct the source dataset from the cryptic matrices that constitute the software executed by them" would imply that this is possible from the trained network alone.


Beyond the global output and error of each sample from the last epoch, the log also includes the weight update of one single (fully connected) node for each layer.
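
For concreteness, here is a minimal sketch (PyTorch, every name hypothetical, not the author's actual code) of the kind of log described above: per-sample output and error, plus the SGD update of a single node in each fully connected layer:

    import torch

    def log_final_epoch(model, dataset, loss_fn, lr=0.01):
        """Hypothetical sketch: for each sample, record the global output and
        error plus the SGD update of the first node of every Linear layer."""
        log = []
        for x, target in dataset:                  # plain per-sample SGD
            model.zero_grad()
            output = model(x)
            error = loss_fn(output, target)
            error.backward()
            entry = {"output": output.detach().clone(),
                     "error": error.item(),
                     "node_updates": []}
            with torch.no_grad():
                for layer in model:                # assumes an nn.Sequential
                    if isinstance(layer, torch.nn.Linear):
                        entry["node_updates"].append(-lr * layer.weight.grad[0])
                        layer.weight -= lr * layer.weight.grad
                        layer.bias -= lr * layer.bias.grad
            log.append(entry)
        return log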

During the compilation phase, the training dataset is projected onto a complex vector space constituted by both the "model" of the "neural network" and these logs.

It's just like casting a shadow onto a two-dimensional surface: if you discard the data pertaining to one dimension, you have no hope of guessing what projected it; you need both dimensions.

The logs preserved in the compilation process are the part of the vector space that is usually discarded during the "training".

But discarding the "model" would have exactly the same effect: you cannot get back the source dataset from those logs alone. That's why this does not "smuggle the training dataset back".

Indeed, the fact that the source dataset is obtainable from the pair "these logs" + "final model", but neither from the logs alone nor from the final model alone, proves that a substantial portion of the source dataset is always embedded in the "model", which thus becomes a derivative work of the sources.


The last iteration (or epoch) of SGD is not shipped with the trained model. The point just does not stand. There are other (better) arguments for why such models are derivative works.

Basically the argument starts with a claim (you can reconstruct the training set of model X from its weights alone) and then shows something totally different. Of course you can reconstruct from the gradient updates plus the weights—that's not interesting, nor does it support the claim.


This does not prove that the source dataset is embedded in the model. You could do this with a random model and get the same result...


I strongly encourage you to prove your statement with a script that uses the saved logs and a random "model" and gets back the exact source dataset.


Right. While I appreciate the author's skepticism and diction (there is a lot of misleading terminology thrown around by the ML community), his points don't land.

In particular, he argues that there's no learning going on, but then says that there is "absorption" of statistical patterns going on. That's just nitpicking over semantics—to people in the field, the two phrases mean the same thing. The only difference is whether you anthropomorphize a piece of software.

The second place the author stumbles is that he makes the (quite grave) mistake you pointed out. The title insinuates that the network contains the "source dataset" itself. He has shown nothing of the sort by including the training logs in his "decompilation". That's like suggesting you have a Swift decompiler that can recover the exact source code from an optimized binary, but you actually require access to the pre-optimized LLVM IR.


The term "absorbed" was not for the people in the field, but for people who don't know what folding means.

IMHO it's a better metaphor than "learning", because learning is a _subjective_ experience that everyone has, and using that term leads inevitably to anthropomorphisation.

"Absorb" match the insight of filters and pipelines, that can be easily understood from any CS student, any "ML expert", any lawyer and any other citizen.

____

As for the network, my argument is simple: if I get back the source dataset from the executable, I think we can agree that such a dataset is projected onto the numerical matrices that the executable records.

Now where is the dataset?

You might argue that it is recorded _only_ in the gradients logged there (the gradients applied to a single "neuron" in each "layer"), but if that were so, you could reconstruct the source dataset from the logs alone, and in fact you cannot. You need both the "model" and those gradients in the correct order (and the encodings of inputs and outputs, obviously).

You might ask: "fine, but how much of the source dataset is projected into the gradients and how much is projected into the model?"

To answer, we need to consider that

- the vector space that constitutes the executable is non-linear (the "model" part) and hierarchical (the gradient vectors are independent neither between layers nor between samples)

- (initialization aside) all the values (and the operative value) that the "model" contains come from the source dataset

Thus I argue that a substantial portion of the source dataset is contained in the "model".

This does not exclude that another substantial portion of the source dataset is also contained in the few logged gradients!

And in fact I've never stated that the "model" contained the whole source dataset.

But if the portion contained in the "model" were negligible, you would be able to get back the sources from those logged gradients alone, with negligible error.

AFAIK, it is not possible, but if you can, please teach me how! I'm always more than happy to be proven wrong if I can learn how to do something that I previously thought impossible!


> And in fact I've never stated that the "model" contained the whole source dataset.

Apologies for seeming rude, but I feel that the abstract is disingenuous (I'm assuming you're the author of the article).

The abstract states:

> we provide a ... decompiler that reconstruct the source dataset from the cryptic matrices that constitute the software executed by them.

But that's not what's happening here. Instead, what's happening (correct me if I'm wrong) is that the decompiler uses the gradient information along with the network itself (which is very close to the penultimate network) to reconstruct the input. If we consider, for instance, MSE loss, that reconstruction appears trivial given all the information available. As I said before, this reconstruction (while interesting) does not show copyright violation, because the training process information is not available once the network is deployed. Obviously the model contains information about its training set. If it didn't, it would be useless.
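
To make the "trivial with MSE" remark concrete: for a single linear node with squared-error loss, the per-sample weight gradient is just the error times the input, so dividing the logged weight gradient by the logged bias gradient returns the input exactly. A toy numpy sketch (not the article's decompiler, all names mine):

    import numpy as np

    # One linear node y = w.x + b trained with squared-error loss L = (y - t)^2.
    rng = np.random.default_rng(0)
    x = rng.normal(size=5)              # the "secret" training sample
    w, b, t = rng.normal(size=5), 0.3, 1.0

    y = w @ x + b
    grad_w = 2 * (y - t) * x            # dL/dw: what ends up in the logged weight update
    grad_b = 2 * (y - t)                # dL/db

    x_recovered = grad_w / grad_b       # the input falls straight out of the gradients
    assert np.allclose(x, x_recovered)

With several layers and non-linear activations the recovery is less direct, which is presumably why the decompiler needs the model as well as the logs.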

I'm not saying there aren't obvious copyright issues, and I'm also not saying that the approach to recover the training set is not interesting. I'm just saying that the overall copyright argument has a major gap (and there are more direct alternative arguments).


> Does this somehow smuggle the training dataset back into the VM?

Turns out you were right about this: http://www.tesio.it/2021/09/01/a_decompiler_for_artificial_n...

Obviously I was not aware of this, so the whole decompilation process was a waste of computation time, but it neither proves nor disproves anything about the "model"'s relation to the source dataset.


I think of neural networks as "smooth compression": something that compresses a key-value database into a continuous non-linear function. Daniel Holden previously wrote about something similar on his blog (https://theorangeduck.com/page/machine-learning-kolmogorov-c...)

I agree with the author's sentiment that "learning" might be a misleading term. Neural networks can be seen as specialized programs that compress certain kinds of data extremely well (at the cost of extremely high pre-computation). But it's still a very useful technique for tackling previously intractable problems.
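
A toy illustration of that "smooth compression" view (PyTorch, sizes and hyper-parameters chosen arbitrarily): overfit a small MLP until it reproduces a 16-entry key/value table, after which a lookup is just a function evaluation:

    import torch
    from torch import nn

    keys = torch.linspace(0, 1, 16).unsqueeze(1)    # 16 scalar keys in [0, 1]
    values = torch.rand(16, 1)                      # arbitrary stored values

    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)

    for _ in range(5000):                           # the "extremely high pre-computation"
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(keys), values)
        loss.backward()
        opt.step()

    print(loss.item())                              # ~0: the table now lives in the weights
    print(net(keys[3:4]).item(), values[3].item())  # "lookup" = evaluate the function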


Such a delightful read! I'm a bit disappointed to see the word "learning" thrown around as if we actually understand what it means. If you can explain its workings in simple mechanical terms, does it become less valuable?



