I always wondered where to start learning reverse engineering. Most people will say learn assembly first. But from there on, there doesn't seem to be much concrete information online.
Do people just figure it out slowly by trial & error, picking up common patterns in x86 / ARM / arcade platforms?
I can't really find much discussion on details online.
It's like debugging. I'm sure you must have worked on an unfamiliar code base at some point and had to figure it out. Instead of having the source, you have the binary, and with tools like Ghidra you can start to piece the source back together. But you'll still need to reason over it the very same way you did on that unfamiliar codebase, and this time there are no comments at all (which isn't uncommon in a lot of source-available projects, mind you).
So you're probably already halfway there. Being familiar with assembly code helps, of course.
I personally learned a lot by messing around in Cheat Engine. It is way more capable than I thought, especially because I mostly used it as a kid and never looked back.
It is a great tool to get started with assembly, in my opinion, because the disassembler is good enough and you can write what they call 'assembly scripts', which provide the foundation for doing memory patches in x86 asm. From there you can start writing your own utilities to patch games at will.
You can do crazy cheats by patching the game just with Cheat Engine!
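If you want to go beyond Cheat Engine itself, roughly the same patch can be applied from your own tool. A minimal Python sketch (Windows-only; the pid, address, and patch bytes are made-up placeholders, and in practice you'd take the address from your Cheat Engine session):

    # Sketch of a standalone patcher: write new bytes over an instruction in a
    # running process, which is what a Cheat Engine assembly script does for you.
    import ctypes

    PROCESS_ALL_ACCESS = 0x1F0FFF
    PAGE_EXECUTE_READWRITE = 0x40
    k32 = ctypes.windll.kernel32

    pid = 1234                    # hypothetical target process id
    address = 0x00401A2B          # hypothetical address found with Cheat Engine
    patch = bytes([0x90] * 6)     # replace the instruction with NOPs

    handle = k32.OpenProcess(PROCESS_ALL_ACCESS, False, pid)
    old_protect = ctypes.c_ulong(0)
    # code pages are normally not writable, so loosen the protection first
    k32.VirtualProtectEx(handle, ctypes.c_void_p(address), len(patch),
                         PAGE_EXECUTE_READWRITE, ctypes.byref(old_protect))
    written = ctypes.c_size_t(0)
    k32.WriteProcessMemory(handle, ctypes.c_void_p(address), patch,
                           len(patch), ctypes.byref(written))
    k32.VirtualProtectEx(handle, ctypes.c_void_p(address), len(patch),
                         old_protect, ctypes.byref(old_protect))
    k32.CloseHandle(handle)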
I knew some from school, but stepping through a debugger with a video game that I remembered from childhood was a better education on computer engineering than anything I got in class.
Not completely related: does anyone know where I can find articles / papers that discuss why transformers, while acting as merely "next token predictors", can handle questions with:
1. Unknown words (or subwords/tokens) that are not seen in the training dataset.
Example: Create a table with "sdsfs_ff", "fsdf_value" as columns in pandas.
2. Examples created on the spot (unseen in the training dataset), where you tell the LLM to produce similar output.
I have a feeling it should be a common question, but I just can't find the keyword to search.
PS. If anyone has links with a thorough discussion of positional embeddings, that would be great. I never got a satisfying answer about the use of sine / cosine and multiplication vs. addition.
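For context, the construction I mean is the sinusoidal encoding from "Attention Is All You Need": interleaved sines and cosines of geometrically spaced frequencies, added to the token embeddings rather than multiplied. A rough numpy sketch, assuming an even d_model and made-up sizes:

    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
        two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
        angles = pos / np.power(10000.0, two_i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # even dimensions
        pe[:, 1::2] = np.cos(angles)                   # odd dimensions
        return pe

    token_embeddings = np.random.randn(16, 64)         # placeholder embeddings
    x = token_embeddings + sinusoidal_positional_encoding(16, 64)  # added, not multiplied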
If I had to guess, single characters can be encoded as tokens, but more of the model's "bandwidth" is dedicated to handling them, and less semantic meaning is encoded in them "natively" compared to tokens for concrete words. If it decides to, it can recreate unknown sequences by copying over the single-letter tokens, or create them if it makes sense.
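You can see this with any off-the-shelf BPE tokenizer: a made-up identifier never maps to an "unknown" token, it just gets split into smaller pieces the vocabulary does contain. A small sketch, assuming the tiktoken package is installed (the exact splits depend on the vocabulary and are only illustrative):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["sdsfs_ff", "fsdf_value", "table"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces)
    # the made-up names come out as several short subword pieces,
    # while a common word like "table" is typically a single token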
I think some earlier NLP applications had something called an "unknown token", which replaced any unseen word. But I don't think recent implementations use it anymore.
It still baffles me why such a stochastic parrot / next-token predictor will recognize these unseen combinations of tokens and reuse them in its response.
Everything falls into place once you understand that LLMs are indeed learning hierarchical concepts inherent in the structured data they have been trained on. These concepts exist in a high-dimensional latent space. Within this space is the concept of nonsense/gibberish/placeholder, which your sequence of unseen tokens maps to. The model then combines this with the concept of SQL tables, resulting in (hopefully) the intended answer.
That is to say: having a correct conditional probability distribution over the next token, conditional on the previous tokens, produces a correct probability distribution over sequences of tokens.
And a "correct probability distribution over sequences of tokens" (or a "correct conditional probability distribution over sequences of tokens, conditional on whatever") can be... well, you can describe pretty much any kind of input/output behavior in those terms.
So, “it works by predicting the next token” is, at least in principle, not much of a constraint on what kinds of input/output behavior it can have?
So, whatever impressive thing it does is not really in conflict with its output being produced from the probability distribution P(X_{n+1}=x_{n+1} | X_1=x_1, ..., X_n=x_n) ("predicting the next token").
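The formal step behind that is just the chain rule of probability: the joint distribution over a whole sequence factorizes exactly into next-token conditionals, so getting every conditional right is the same as getting the sequence distribution right. In LaTeX:

    P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})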
Why do the embeddings have linear properties such that you can use functions like cosine similarity to compare them? It seems that after the signal has gone through so many non-linear activation layers, the linear properties should have been broken down / there should be no guarantees.
Because neural networks use dot products, which are just un-normalized cosine similarities, as the main way to compare and transform embeddings in their hidden layers. Therefore, it makes sense that the most important signals in the data are arranged in latent space such that they are amenable to manipulations based on dot products.
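Concretely, cosine similarity is just the dot product of the two vectors after normalizing them to unit length, so anything the network does with dot products is (up to scale) the same comparison. A tiny numpy sketch with made-up vectors:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product of the unit-normalized vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])   # made-up embedding vectors
    b = np.array([2.0, 3.0, 5.0])
    print(np.dot(a, b))             # un-normalized similarity: 23.0
    print(cosine_similarity(a, b))  # same comparison, scale-invariant (~0.997)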
For what it's worth, I wonder the same thing and think it's not as obvious as others suggest. E.g., if you have an autoencoder for a one-hot encoding, you're essentially learning a pair of nonlinear maps that approximately invert each other, and that map some high-dimensional space to a low-dimensional one. You could imagine that it could instead learn something like a dense bit packing with a QAM gray code [0]. In a one-hot encoding the dot product for similar tokens is zero, so your transformations can't be learning to preserve it.
Somewhat naively, I might speculate that for e.g. sequence prediction, even if you had some efficient packing of space like that to try to maximally separate individual tokens, it's still advantageous to learn an encoding in which synonyms are clustered, so that if there is an error, it doesn't cause mispredictions for the rest of the sequence.
I suppose then the point is that the structure exists in the latent space of language itself, and your coordinate maps pull it back to your naive encoding rather than preserving a structure that exists a priori on the naive encoding. i.e. you can't do dot products on the two spaces and expect them to be related. You need to map forward into latent space and do the dot product there, and that defines a (nonlinear) measure of similarity on your original space. Then the question is why latent space has geometry, and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy. So perhaps it is obvious after all!
I think my comment was not worded properly. I was thinking "geometric properties = linear properties"; what I really should have said is:
Why does the latent space have geometric properties such that we can use functions like cosine similarity to compare?
So during training, the signal is mapped to a latent space that minimizes the error of the objective function as much as possible.
Many applications already use a cosine similarity function at the end of the network, so it is obvious why those work. I reviewed other cost functions such as triplet loss; they use Euclidean distances, so I guess it makes sense why the geometric properties exist there too.
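For reference, the triplet loss I mean pulls an anchor toward a positive example and pushes it away from a negative one in Euclidean distance, which is exactly what imposes geometry on the space. A small numpy sketch with made-up 2-d embeddings (one common, non-squared variant):

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=1.0):
        # hinge on the gap between anchor-positive and anchor-negative distances
        d_pos = np.linalg.norm(anchor - positive)
        d_neg = np.linalg.norm(anchor - negative)
        return max(0.0, d_pos - d_neg + margin)

    a = np.array([0.1, 0.9])
    p = np.array([0.2, 0.8])
    n = np.array([0.9, 0.1])
    print(triplet_loss(a, p, n))  # ~0.01: the positive is already much closer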
For "and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy", what does "maximally information dense" means, I still don't quite get it.
LLM vectors do have decent linear properties already. But for document embedding purposes they are often further trained for retrieval via cosine similarity, which enhances this; e.g. see Table 1 in [1], where average retrieval performance using BERT goes up from 54 to 76 after fine-tuning for embeddings.
Cosine similarity is not inherently better suited for "linear properties", whatever that means; it's just the cosine of the angle between two vectors. If the vectors are unit length, then it's just the projection of one onto the other.
This is not the first time something has been fishy. Back in the early stages of the repo, they were advertising on the front page that they were achieving similar mAP to the original C++ version, only for it to come out that they hadn't trained and tested it on the COCO dataset.