This reminds me of an interesting experiment I did earlier this year with ChatGPT.
First, I came upon this reddit post [1] which describes being able to convert text into some ridiculous symbol soup that makes sense to ChatGPT.
Then, I considered the structure of my TypeScript type files, e.g. [2], which are pretty straightforward and uniform, all things considered.
Playing around with the reddit compression prompt, I realized it performed poorly just passing in my type structures. So I made a simple script which essentially turned my types into a story.
Given a type definition:
type IUserProfile = {
  name: string;
  age: number;
}
It's somewhat trivial to make a script to turn these into sentence structures, given the type is simple enough:
"IUserProfile contains: name which is a string; age which is a number; .... IUserProfiles contains: users which is an array of IUserProfile" and so on.
Passing this into the compression prompt was much more effective, and I ended up with a compressed version of my type system [3].
For all the variability in the exercise, I can definitely say the prompt was able to generate some sensible components which more or less correctly implemented my type system when asked to, with some massaging. Not scalable, but interesting.
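For what it's worth, the type-to-sentence script doesn't need to be anything fancy. A minimal sketch of the idea (not my actual script; it only handles flat object types and leans on naive regexes rather than a real parser):

// Sketch only: turns flat `type X = { ... }` declarations into sentences.
// A real version would use the TypeScript compiler API instead of regexes.
const source = `
type IUserProfile = {
  name: string;
  age: number;
}
`;

function typesToStory(src: string): string {
  const sentences: string[] = [];
  // Match `type Name = { ...fields... }` blocks (no nesting, no generics).
  for (const [, typeName, body] of src.matchAll(/type\s+(\w+)\s*=\s*{([^}]*)}/g)) {
    const fields = body
      .split(";")
      .map((f) => f.trim())
      .filter(Boolean)
      .map((f) => {
        const [name, fieldType] = f.split(":").map((s) => s.trim());
        return fieldType.endsWith("[]")
          ? `${name} which is an array of ${fieldType.slice(0, -2)}`
          : `${name} which is a ${fieldType}`;
      });
    sentences.push(`${typeName} contains: ${fields.join("; ")}.`);
  }
  return sentences.join(" ");
}

console.log(typesToStory(source));
// -> "IUserProfile contains: name which is a string; age which is a number."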
I’m curious, did you actually run it through the tokenizer and see if it used fewer tokens than the uncompressed version? I have seen a lot of people try these “compression” schemes and the token usage can actually end up higher.
It's definitely fewer tokens, at least in my contrived case. Looking at the compressed text, I can make out what is what, and see that it's just minimizing words to their root parts.

Uncompressed:

IAssist contains: id which is a string; prompt which is a string; promptResult which is an array of strings.
Compressed (13 tokens):
IAsst{id,prompt,promptR}
And again I'll just call this interesting, because is it really going to know that promptResult is an array of strings in most cases? Definitely not, unless maybe it gets some help in the component description.
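For anyone who wants to check the counts themselves, here's a rough sketch using the js-tiktoken package and the cl100k_base encoding as one example; exact numbers vary by tokenizer and model:

// Compare token counts of the two forms; js-tiktoken is just one option.
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

const uncompressed =
  "IAssist contains: id which is a string; prompt which is a string; " +
  "promptResult which is an array of strings.";
const compressed = "IAsst{id,prompt,promptR}";

console.log(enc.encode(uncompressed).length); // the verbose form...
console.log(enc.encode(compressed).length);   // ...vs the compressed form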
and thought about a question I'd had for a while: extracting facts from that sort of document. One notable thing is how certain named entities (say "Cave of Ordeal") appear over and over throughout the document, and how both attention and compression-based approaches can draw a line between those occurrences.
Actually a neural network is just that: lossily compressed data. A transformer makes multiple queries to a large, lossy, stochastically compressed database to determine the next token to generate. The PAQ archiver is famous for being exactly that: a neural network that predicts the next symbol.
The compressor idea is really clever, but wouldn't it be nice to have 100% direct control over everything?
This got me thinking about the possibility of building a series of simple context/token probability tables in SQLite and running the show that way. Assuming we don't require massive context windows, what would prevent this from working?
It's not like we need to touch every row in the database all at the same time or load everything into RAM. Prediction is just an iterative query over a basic table: you could have a simple key-value pair of context and the next most likely token for that context. All manner of normalization and database trickery is available for abuse here. Clearly a shitload of rows, but I've seen some 10TB+ databases still satisfy queries in seconds. You could even store additional statistics per token/context for online learning scenarios (i.e. query-time calculation of token probabilities). You could keep multiple tokenization schemes online at the same time and combine them with various weightings.
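To make that concrete, here's a toy sketch of the table and the prediction loop, assuming better-sqlite3 as the driver and a fixed-order n-gram count table; a real attempt would need backoff/smoothing and a proper tokenizer:

// Toy next-token table in SQLite; better-sqlite3 is just an example driver.
import Database from "better-sqlite3";

const db = new Database("ngrams.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS ngram (
    context TEXT NOT NULL,   -- preceding N tokens, space-joined
    token   TEXT NOT NULL,   -- candidate next token
    count   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (context, token)
  )
`);

// "Online learning": bump the count for an observed (context, token) pair.
const observe = db.prepare(`
  INSERT INTO ngram (context, token, count) VALUES (?, ?, 1)
  ON CONFLICT (context, token) DO UPDATE SET count = count + 1
`);

// Prediction is a lookup: most frequent next token for a given context.
const mostLikely = db.prepare(
  "SELECT token FROM ngram WHERE context = ? ORDER BY count DESC LIMIT 1"
);

function generate(seed: string[], steps: number, order = 3): string[] {
  const out = [...seed];
  for (let i = 0; i < steps; i++) {
    const context = out.slice(-order).join(" ");
    const row = mostLikely.get(context) as { token: string } | undefined;
    if (!row) break; // unseen context: stop (no backoff in this sketch)
    out.push(row.token);
  }
  return out;
}

// Example: record one observation, then greedily extend a seed.
observe.run("the cat sat", "on");
console.log(generate(["the", "cat", "sat"], 5));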
What would be more efficient or cheaper than this, if we could make it fit? Wouldn't it be easier to iterate on basic tables of data and some SQL queries than to trip over Python ML toolchains and GPU drivers all day?
and you can still combine them for tasks that require strict output control (e.g. alphanumeric sequence recognition, noisy keyword spotting, strict grammars, etc).
- So, how do you build ChatGPT with data compression?
ChatGPT is already built with data compression: the training loss is cross-entropy, which means the explicit goal of training is to compress the training dataset into the fewest possible bits.
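A tiny illustration of that equivalence (just a sketch of the arithmetic, not actual training code): a token the model assigns probability p costs about -log2(p) bits under an ideal entropy coder, which is exactly the per-token cross-entropy term being minimized.

// Cross-entropy (base 2) is the number of bits an ideal coder would spend.
function bitsToEncode(predicted: Record<string, number>, actual: string): number {
  return -Math.log2(predicted[actual]);
}

// A confident, correct prediction is cheap to encode...
console.log(bitsToEncode({ the: 0.9, a: 0.1 }, "the")); // ~0.15 bits
// ...while a surprising token costs more bits (= higher loss).
console.log(bitsToEncode({ the: 0.9, a: 0.1 }, "a"));   // ~3.32 bits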
[1] https://www.reddit.com/r/ChatGPT/comments/12cvx9l/compressio...
[2] https://github.com/jcmccormick/wc/blob/c222aa577038fb55156b4...
[3] https://github.com/keybittech/wizapp/blob/f75e12dc3cc2da3a41...