Bad numbers in the “gzip beats BERT” paper? (kenschutte.com)
381 points by ks2048 10 months ago | 130 comments



Probably disappointing to the authors but an excellent rebuttal.

This is the sort of mistake that's awfully easy to make in ML. The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science. You can easily make a system slightly worse or slightly better by contaminating your training set with bad data or accidentally leaking some information about your target and the ML system will take it in stride (but with slightly contaminated results).

This result makes sense to me because as much as I would like it to be true, applying existing compression algorithms to ML feels like too much of a "free lunch". If there was any special magic happening in compression algorithms we'd use compression algorithms as encoders instead of using transformers as compressors.


> This is the sort of mistake that's awfully easy to make in ML.

It is important to remember this! Mistakes are common because they are easy to make. Science is a noisy process, but there is signal there and what we see here is exactly what peer review is about. I tend to argue that open publications are a better form of peer review than conferences/journals because of exactly this. Peer review is about your peers reviewing your work, less about whatever random and noisy standard a conference/journal puts forward. Remember that this was the way things happened for most of our history and that our modern notion of peer review is very recent (mid 70's). Older journals were more about accomplishing the mission that arxiv accomplishes today: disseminating works.

https://mitcommlab.mit.edu/broad/commkit/peer-review-a-histo...

[side note] another reason I'd advocate for the abolition of conferences/journals is that it would let us actively push for reproduction papers, failure papers, and many other important kinds of work, since we would not be held to the "novelty" criterion (almost everything is incremental). "Publishing" is about communicating your work to your peers and having them validate or invalidate your results.

[edit] I think conferences are good in that they bring people together, which encourages collaboration. That's great. But I should clarify that I'm specifically talking about using these platforms as a means to judge the validity of works. If a conference system wants to just invite works and the community, then I'm totally cool with that. I do also like journals in theory, given that there's a conversation happening between authors and reviewers, but I believe this could just as easily be accomplished through arxiv + github or OpenReview (preferred).


Those are used. Search for the minimum description length principle and entropy-based classifiers. The performance is poor, but it is definitely there and really easy to deploy. I have seen gzip being used for plagiarism detection, as similar text tends to compress better. Use the compression ratio as weights on a spring model for visualisation. Also works with network communication metadata ...


It's true in many experiments. The desire to get the result you want can often overwhelm the need to validate what you are getting.

Especially true when the results confirm any pre-existing thinking you may have.


One particular example that I remember from an introductory particle physics class is the History Plots section[1] of the biennial review of experimental data.

Knowing these quantities is important, but their particular values largely aren’t; nobody’s funding or career really depended on them being equal to one thing or another. Yet look at all the jumps, where the measurements after the initial very rough ones got stuck in the completely wrong place until the jump to the right value—when it happened—was of a completely implausible magnitude, like four, six, or ten sigma.

[1] https://pdg.lbl.gov/2023/reviews/rpp2022-rev-history-plots.p...


What's also good to see here is that the post-'90 numbers usually don't even fall within the error bars of the pre-'90 numbers. While uncertainty is great, it isn't the be-all and end-all. I think a lot of people forget how difficult evaluation actually is. Usually we just look at one or two metrics and judge based on that, but such an evaluation is incredibly naive. Metrics and measures are only guides; they do not provide certainty nor targets.


Yep, confirmation bias. Luckily helped with peer review!


Hasn’t this paper made it through peer review?


Yeah it was published at ACL ( https://aclanthology.org/2023.findings-acl.426/ ) which is one of the most prestigious conferences in NLP. So kinda disappointing.

But paper reviewers are usually not supposed to look at the actual source code of the papers, and definitely don't try to reproduce the results. They just read the paper itself, which of course doesn't talk about the error.

Not sure what the best solution is, other than having the most "hyped" papers double verified by researchers on Twitter.


It's customary to use OSF (https://osf.io/) on papers this "groundbreaking," as it encourages scientists to validate and replicate the work.

It's also weird that at this stage there are not validation checks in place, exactly like those the author performed. There was so much talk of needing this post-"replication crisis."


I don't think it's "customary" to use OSF in machine learning. At least I had never heard of it. It looks great though, I'm glad other fields are figuring this out. Hopefully it will be taken up by the ML field too.


Really? Perhaps it was just my department (interdisciplinary). We were strongly encouraged to put our work up for review via OSF, including ML. This promotes transparency, collaboration, and verifiable replication.

Why? You can't trust "peer review" from a journal or a conference alone for the verified stamp of "correctness." Why? Bias and ego; humans are human. So, open it up to the hive, encourage people to swarm.


Yeah, it’s not (entirely) the students’ faults that this slipped through peer review. I don’t envy the whiplash they’re going to experience over the next few weeks.

If I was the graduate chair of their department I might schedule a meeting with their supervisor to sort out how this happened.


What about the difference in CPU cost, RAM cost, and GPU training hours, though? What about the comparative Big-E's of the models?

Great topic: Model minification and algorithmic complexity


These are good research topics, but then you really need to be comparing to other models in the same class.

The only other super cheap model they compare with is FastText, and FastText beat them quite substantially.


Gzip fails at audio and video compression IIRC?

Are time-domain signals better compressed with A/V codecs (compression-decompression); and does this gzip finding suggest that other compression algorithms are likely to outperform LLMs on certain tasks, at least in terms of computational complexity?


> paper reviewers are usually not supposed to look at the actual source code of the papers

Wait what? I haven't reviewed for ACL but most conferences don't say "don't look at the source code." They will say that reviewers are not required to look at it (as well as the appendix). But generally it just isn't uploaded. I do always look at the main method when it is there but talking to my peers and advisor, this is very uncommon[0]. My experience is that most reviewers do not spend more than an hour on a work and make an opinion within 15 minutes.

> Not sure what the best solution is, other than having the most "hyped" papers double verified by researchers on Twitter.

I'd say (as a start):

1) Get rid of the conference system. A zero-shot (maybe 1-shot if "rebuttal" is allowed) zero-sum system is just disastrous, especially at scale. There are high incentives to actually reject the works you review. A conference system has a binary outcome and the purpose is to reject 80% of papers based on a rather noisy metric of "top tier." A journal system is a back and forth where reviewers are trying to improve the paper. The reviewers' purpose there is to determine whether the idea is indeed good and whether the paper meets the requirements, and they must explicitly state what needs to be changed for acceptance.

1.5) An actual rebuttal system could help alleviate some of these issues. Using OpenReview for a conversation between authors and reviewers is critical. A single 1-page response (the norm) is not adequate to respond to 4 different people whose critiques often have little in common. Reviewers, meanwhile, can get away with responding in one sentence (though it breaks guidelines).

2) ACs need to do a better job at validating reviewers. The number of inane and absolutely unacceptable reviews I have gotten is astounding (>25%). I've also seen reviewers often break guidelines and have nothing happen. Examples are comments such as those claiming lack of novelty with no explanation, or asking authors to compare to concurrent works (I've had this happen for a work that was put out _after_ submission deadlines. Not mine, but here's an example[1] of this being done publicly). If the reviewer is pushed to update their comment, then the authors have no ability to respond to the update without the conversation aspect. If there is high variance in responses -- not just scores, but what the critiques are about -- then the ACs need to look closer, as something is going wrong. We're in a crisis of reviewer quantity, but we also have an undisclosed crisis of reviewer quality. Benchmarkism is on the rise, but benchmarks are extremely limiting for evaluation. There's a certain irony given our frequent discussion of Goodhart's Law or Reward Hacking. I'll even make the claim that the quality crisis feeds the quantity crisis, as I have seen many peers stop reviewing because it isn't worth their time and they aren't getting a fair shot in return. On a personal note, there is a journal I will no longer review for because of inactionable and unreasonable responses, but I also won't submit to them either.

3) Either get rid of double-blind, or actually do it. Everything is published on arxiv these days, which in general is great for the community as it allows things to move fast. But with this it is incredibly easy to de-anonymize authors. Though for big labs, they de-anonymize themselves actively[2]. In a very noisy process even a very slight edge becomes a significant edge[3]. These biases can even come unconsciously given that we're all reading arxiv papers constantly and it isn't unlikely that we come across some of the works we end up reviewing (yet to knowingly happen to me fwiw). But certain labs do have keywords that they use that can be identified.

I think one of the major problems comes down to this: in a small community we have a certain level of accountability, as we all end up knowing one another through minimal connections. But in a large community there is little to no accountability and what depends on good faith can no longer be trusted. This encourages bad actors, especially when the system is highly competitive (see 1)), and creates bad science/evaluation creep. (e.g. now standard to tune HPs on test data results -- this is information leakage. If you don't, you likely can't compete).

======

[0] Here's a prominent researcher explicitly saying they don't read the appendix, calling it trash, and a poll showing most people don't look at it https://twitter.com/david_picard/status/1660293648796340226

[1] Here's a prominent researcher criticizing a paper for "not citing his work". I linked the top response which is telling him the submission date was 2 months prior to his arxiv release. This is someone who published >250 papers vs someone with <50. For added reference, paper 2 (prominent researcher) was _published_ June 26th in TMLR, but they did cite the other work (gotta give credit for that) https://twitter.com/RinonGal/status/1667943354670170118

[2] We have 2 scenarios here: 1) reviewers do not know Chinchilla == DeepMind, in which case I'd argue that they are unfit for reviewing given the prominence of that model, or 2) they do know, and thus know this is a DeepMind work, and we have an ethics problem. Neither sounds great. https://openreview.net/forum?id=OpzV3lp3IMC&noteId=HXmrWV3ln...

[3] The conclusion of this analysis of the consistency experiment is that even a small amount of inconsistency leads to a lot of noise given a highly selective standard. Which means that paper acceptance itself is highly stochastic: (2014 experiment) https://inverseprobability.com/talks/notes/the-neurips-exper...

[3.1] A shorter version: https://blog.mrtz.org/2014/12/15/the-nips-experiment.html

[3.2] A follow-up on the 2014 experiment. tl;dr: reviewers are good at identifying bad papers, but not good at identifying good papers (i.e. bias to reject): https://arxiv.org/abs/2109.09774

[3.3] A follow-up 2021 experiment (consistent with 2014 experiment): https://blog.neurips.cc/2021/12/08/the-neurips-2021-consiste...

[3.4] Video form https://www.youtube.com/watch?v=19Q-vMd9bYg


I try not to submit to conferences if I can avoid it. It's like you say, reviewers are looking for a reason to reject. I don't understand what makes the difference since it's usually the same people reviewing in both conferences and journals, but somehow journal reviewers do a much better job. Some journals have a fast turnaround even, and still the quality of reviewing is considerably better.

My second journal paper got rejected with encouragement to resubmit. Half the reason for that was because the reviewer had, I think, genuinely misunderstood the description of an experiment, so I re-wrote it in painstaking detail. I had a long section where I hammered out a proof of complexity spanning three pages, with four lemmas and a theorem, and the reviewer waded through all that like a hero, and caught errors and made recommendations for improvement. They made a new round of recommendations when I resubmitted. That paper took three rounds of revisions to publish (reject, resubmit, accept with minor revisions) but it got 80% better every time I had to revise it. I wish there was another couple of rounds! It was exhausting, and I bet much more so to the reviewer, but it was 100% worth it.

And yeah, I absolutely do my best to review like that myself. Even in conferences, which probably seems really weird to authors. But, hey, be the change you want to see.


Yeah, honestly the only reason I submit to conferences now is because my advisor asks me to. If it was up to me I would submit exclusively to journals or just to arxiv/open review directly. I think I'll do this when I graduate (soon).

As for the reason why it happens in conferences, I think it may actually be a different set of reviewers. While journal reviewers are going to be conference reviewers, I don't think the other way around is true. I think conferences tend to just have a larger number of shitty reviewers (as well as more shitty submissions). And as you note, it is quite easy to misunderstand a work, doubly so when you're reading under a time constraint. It just makes for a noisy process, especially when reviewers view their job as to reject (not improve). I just think it is a bad system with a bad premise that can't really be fixed. For conference reviewing, I always try to write what would change my mind and if I think the authors should resubmit to another venue. But even reviewing I don't feel authors get a fair shot at responding. They can't address all my comments while addressing others in a single page.

Edit: I saw your bio. I actually have a SOTA work that is rejected (twice). Good performance jump with large parameter drop. But just couldn't tune or run enough datasets because compute limited. Conferences are fun.


>> paper reviewers are usually not supposed to look at the actual source code of the papers

> Wait what? I haven't reviewed for ACL but most conferences don't say "don't look at the source code." They will say that reviewers are not required to look at it (as well as the appendix). But generally it just isn't uploaded.

Sorry, I formulated that badly. I meant what you say: reviewers are usually not presented with the source code, and aren't expected to go hunting for it online. If they do anyway, they are going above and beyond.


If they do go hunting for it, then that would be an ethics violation, as it breaks double blind. But it's not like double blind really exists, at least for big labs.


In terms of “getting rid of the conference system”, I would suggest what I see in maths, which is split into two types of talks:

* talks about already peer-reviewed conference papers.

* talks about active research. Here you only submit an extended abstract, and you don’t get to “claim credit” for giving the talk. People tend to only talk about things where they genuinely want to hear feedback, maybe even new collaborators.


To be clear, ACL is the top conference in NLP.


I suspect GP commenter meant "replication study" rather than "peer review".

;-)

(Peer review doesn't check if your data is correct. They check your data collection methods make sense given the hypothesis you're testing, and that your conclusions are supported by the data you collected.)


> The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science.

I've been doing a lot of studying in the ML field lately, and I'm seeing this a lot. It is just another thing that feels like the polar opposite of everything else that I've done as a software engineer.

Miss a semicolon? Instant error.

Miscalculate some grads on one out of three layers? Every now and then it might even work! But the results will be weird.


How about this one: tune your hyper-parameters based on the results on your test data.

This is pervasive, even the norm, but it is a form of information leakage: you're passing information about the test dataset to the model. The solution is to use 3 partitions: train, validation, test. Validation is for HP tuning (you can do cross-validation, btw) and test is a single shot.
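To make that concrete, here's a minimal sketch of the leak-free protocol with scikit-learn (X, y and the model/grid are placeholders, not anything from the paper):

    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import LinearSVC

    # Hold out the test set first and never touch it while tuning.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Hyper-parameter tuning via cross-validation on the dev portion only
    # (this plays the role of the validation partition).
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
    search.fit(X_dev, y_dev)

    # One shot on the test set, reported once.
    print(search.best_params_, search.score(X_test, y_test))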


Yep, I've been guilty of that one lately. That and solving problems by simply overfitting a neural net to the data in the problem domain.

I mean, it works, but the result is less interesting than what I should have done. :)


Definitely. The problem is that doing this helps you get published, not hurts. I think this is why there's often confusion when industry tries to use academic models, as they don't generalize well due to this overfitting. But also, evaluation is fucking hard, and there's just no way around that. Trying to make it easy (i.e. benchmarkism) just ends up creating more noise instead of the intended decrease.


What about: add more dropout or noise layers and train an ensemble of models. Submit the best one. Is this considered dirty?


It is unclear what you are actually saying. An ensemble of models combines the prediction of multiple models, so I'm not sure how you submit "the best one." But what is standard practice is to do some hyper-parameter search and submit the best one. It is status quo that "the best one" is determined by test performance rather than validation performance (proper, no information leakage). These hyper-parameters do also include the number of dropout layers and the dropout percentage. For "noise layer" I'm also not sure what you mean. Noise injection? I don't actually see this common in practice though it definitely is a useful training technique.

But if in general, you're talking about trying multiple configurations and parameters and submitting the best one, then no that's not dirty and it is standard practice. Realistically, I'm not sure how else you could do this because that's indistinguishable from having a new architecture anyways. People should put variance on values but this can also be computationally expensive and so I definitely understand why it doesn't happen.


They banged cross validation into our heads in school and then no one in NLP uses it and I just can’t even understand why not.


Not only that, but I've argued with people substantially, where people claim that it isn't information leakage. The other thing I've significantly argued about is random sampling. You wonder why "random samples" in generative model papers are so good, especially compared to samples you get? Because a significant number of people believe that as long as you don't hand select individual images it is a "random sample." Like they generate a batch of samples, don't like it, so generate a new batch until they do. That's definitely not a random sample. You're just re-rolling the dice until you get a good outcome. But if you don't do this, and do an actual random sample, reviewers will criticize you on this even if your curated ones are good and all your benchmarks beat others. Ask me how I know...


The convention should be to use 1337 as your seed, and disclose that in your publication.


This is never going to happen given the need to SOTA chase. Need, because reviewers care a lot about this (rather, they are looking for any reason to reject and lack of SOTA is a common reason). Fwiw, the checkpoints I release with my works include a substantial amount of information, including the random seed and rng_state. My current project tracks a lot more. The reason I do this is both selfish and for promoting good science, because it is not uncommon to forget what arguments or parameters you had in a particular run, and the checkpoint serves as a great place to store that information, ensuring it is always tied to the model and can never be lost.

You could also use the deterministic mode in pytorch to create a standard. But I actually don't believe that we should do this. The solution space is quite large and it would be unsurprising that certain seeds make certain models perform really well while it causes others to fail. Ironically a standardization of seeds can increase the noise in our ability to evaluate! Instead I think we should evaluate multiple times and place some standard error (variance) on the results. This depends on the metric of course, but especially metrics that take subsets (such as FID or other sampling based measurements) should have these. Unfortunately, it is not standard to report these and doing so can also result in reviewer critique. It can also be computationally expensive (especially if we're talking about training parameters) so I wouldn't require this but it is definitely helpful. Everyone reports the best result, just like they tend to show off the best samples. I don't think there is inherently a problem showing off the best samples because in practice we would select these, but I think it is problematic that reviewers make substantial evaluations based on these as it is highly biased.
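For what it's worth, a rough sketch of the kind of multi-seed reporting I mean (`train_and_eval` is a hypothetical experiment runner returning a single metric; the PyTorch calls are the standard seeding/determinism knobs):

    import random
    import numpy as np
    import torch

    def set_seed(seed):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)  # PyTorch's deterministic mode

    scores = []
    for seed in range(5):
        set_seed(seed)
        scores.append(train_and_eval(seed=seed))  # hypothetical experiment runner

    # Report variance across seeds instead of the single best run.
    print(f"{np.mean(scores):.3f} +/- {np.std(scores):.3f} over {len(scores)} seeds")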

Noise is inherent to ML, and rather than getting rid of it, I'd rather we embrace it. It is good to know how stable your network is to random seeds. It is good to know how good it can get. It is good to have metrics. But all these evaluations are guides, not measures. Evaluation is fucking hard, and there's no way around this. Getting lazy in evaluation just results in reward hacking/Goodhart's Law. The irony is that the laziness is built on the over-reliance of metrics, in an attempt to create meritocratic and less subjective evaluations, but this ends up actually introducing more noise than if we had just used metrics as a guide. There is no absolute metric, all metrics have biases and limitations, and metrics are not always perfectly aligned with our goals.

We should be clear that starting seed isn't related to what I'm talking about in the previous comment. The previous comment was exclusively about sampling and not training.


Academic research code is largely dogshit written as quickly as possible by amateurs, barely tested whatsoever, and the primary intended output of all such code is accumulating paper citations.

A world with half as many scientific papers and twice as much care would produce far more value but the whole enterprise is hopelessly gamified.


Having worked in other sciences (neuroscience for me), I’m not sure what catastrophic obvious errors you’re used to seeing. The vast majority IME are like this, except with even longer feedback loops (on the order of several months).


Well, biology is probably a lot closer to ML in this way. My experience in chemistry or material science is that 99% of the time if you do anything wrong it totally doesn't work at all.

This is fairly typical in software as well.


now shift fields such that the subtle methodological errors don't come to light in 20 years.

which field are you in now? economics!? hahaha


Hi, that's my blog post. I'm pretty sure about what I wrote here, but may need the authors to chime in in case I am missing something. I just submitted an issue on github: https://github.com/bazingagin/npc_gzip/issues/3


You may want to consider adding a note to the top. Seems like a lot of people are lazily skimming/reading the headline and see it as "gzip paper full of beans, gzip approach sucks" when really I see this as "gzip approach not better than dnn models but mostly competes and much cheaper to run". The paper is still solid.


>gzip approach not better than dnn models but mostly competes and much cheaper to run

Does it? It looks to do worse than FastText in all benchmarks and kNN is not a cheap algorithm to run so it might actually be slower than FastText.

edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers data set while the gzip approach took them 6 days. So definitely not faster.


I'm not familiar with most of these models in detail, but training time is generally less interesting than inference time to me. I don't care if it takes a month to train on $10k of gpu rentals if it can be deployed and run on a raspberry pi. I should definitely look into fasttext though.


As described in the paper, it didn't look like the gzip classifier trained at all. Inference involved reading the entire training set.

One could surely speed this up by preprocessing the training set and snapshotting the resulting gzip state, but that wouldn't affect the asymptotic complexity. In effect, the number of parameters is equal to the size of the entire training set. (Of course, lots of fancy models scale roughly like this, too, so this isn't necessarily a loss.)


The gzip approach is much slower at inference time because you need to compute the gzip representation of the concatenated strings (query + target). Intuitively, this should be significantly more than a dot product of two embedding vectors.


The latter depends very strongly on how much computation is needed to compute those embedding vectors.

If you run a GPT-3.5-sized model to compute that embedding (which would be a bit absurd, but if you really want GPT-3.5-quality classification, you may well be doing something like this), you're looking through quite a few tens of billions of parameters and doing a correspondingly large number of FLOPs, which could be just as expensive as running gzip over your whole (small, private) training set.


no, because the compute intensity scales with the number of classes which you wish to classify to. if you have n classes, you need to do n gzip compressions at inference time. in the embedding world, you only call the embedding model once on insert, and only need to dot product at inference time.

the same logic extends to using a self-hosted embedding model, which tend to be as good as Ada on most benchmarks, and yes, can be finetuned over your private data.


>The latter depends very strongly on how much computation is needed to compute those embedding vectors.

Sure but the gzip metrics are worse than FastText which computes the embeddings in essentially no time. Tokenize, lookup embeddings by token id, and then do some averaging. So compared to that the gzip approach is very slow.


FastText isn't a LLM, it's a token embedding model with a simple classifier on top.


Sure, but its existence means the statement is really "gzip approach not better than dnn models, and doesn't compete with, or run cheaper than, previous models like FastText." That's not a very meaningful value statement for the approach (although why gzip is even half-decent might be a very interesting research question).


I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, really the two options you should consider are a best in class linear model like confidence weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...) or a much more expensive LLMs.


Do you happen to be familiar with audio classification? There's been a ton of research on text classification and prediction but not many good papers I've seen for general audio classification. I'm talking more feature extraction, not speech recognition. There are a lot of speech recognition papers. So far I've been stuck on an FFT + image-processing pipeline, but I haven't gotten great results in real-world tests, only on nice test datasets.

Personally I don't have much experience working beyond mlp/rnn/lstm/cnn models.


Don't look at it as a suggestion to use gzip in production, but an invitation to reconsider the unassailable superiority of BERT over simpler, tailored solutions.


I don't think anyone actually doing NLP research has thought that BERT is always better than simpler methods. Linear classifiers with ngrams, or even better, large margin linear classifiers, are well known to be competitive with things like BERT on a variety of tasks, with orders of magnitude better runtime.

In contrast, this gzip technique is considered a cute application of information theory, but even in academia is rarely included in studies because there are simpler and better techniques for NLP.

Yes, if you are chasing the ultimate accuracy, then using a LLM (not necessarily BERT either) is going to be the best. But for a practical system trading some accuracy for vastly improved runtime is usually a very good trade-off. And again, it depends on your domain. Topic classification, stick with a linear model. Sentiment analysis? Ok, here a LLM actually gives substantially better results so it's worth the extra cost if sentiment is crucial to your application.

I personally like the CW algorithm I mentioned because it's relatively easy to implement and has excellent qualities. But if I were a dev looking for a ready to go already implemented production system I'd go for vowpal wabbit and move up to a LLM if I'm not getting the accuracy I need for my application.


Is it really an invitation? The paper shows that the current models are worse for some marginalized languages that are used as OOD datasets. I am not really that surprised that the models don't speak those, and I don't know anybody who would use BERT like that.


But FastText (2015) already exists and beats this gzip approach on all criteria. So the invitation has already existed before BERT (2018) and continues to exist.


If this story is true the paper is not solid.

Claims in the abstract and claim 3 in the paper, as well as much of the publicity around the paper, are just wrong.

It takes gzip from being great out of domain to being middling at best. It goes from something really interesting to a "meh" model. The main part that was intellectually interesting is how robust gzip is out of domain, if that's gone, there isn't much here.

If I was the reviewer for this paper, this would take the paper from an accept to a "submit to a workshop".

Also, kNN methods are slow O(n^2).


kNN methods are broadly not O(n^2)[1], especially in practice where approximate methods are used.

[1]: https://en.wikipedia.org/wiki/Nearest_neighbor_search


how would you build an index over the gzip encoded data? seems quite different from building indices over vector embeddings.


Hi, I'm the first author of the paper and I read your blog post. The reason I chose k=2 is because it's recommended to use n^{1/2} and I wanted to pick a k that's compatible with 5-shot setting. But you are right this is a bit odd. As I mentioned in both the paper and the twitter, different values of k will affect the result and I reported the _max_ result we can get so it does mean an ideal situation when the guess is always right. I also use this strategy for W2V and SentBERT. But it doesn't make the result top2 accuracy. As far as I know, top2 accuracy means in the top2 predicted classes, either one is correct, we score. However, as you mentioned, in knn when k=2, there is a situation when the 2 nearest neighbours point to the same class - we are missing another class candidate if reporting top2 accuracy. But I would love to add results for different strategies and different k values when I upload newest version to arxiv when I have time. The strategy of "decrement" you mentioned in the blog is really nice and I would love to add it to the repo if you want. I apologize for the short and late reply; I haven't checked the repo yet. I'm preparing for tomorrow's thesis defence and will reply & resolve the issue when I finish.
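For readers trying to follow the disagreement, here's a minimal sketch of the two scoring rules under discussion (hypothetical labels, not the repo's actual code):

    from collections import Counter

    def knn_predict(neighbor_labels):
        # Proper kNN: majority vote over the k nearest labels,
        # falling back to the single nearest neighbor on a tie.
        counts = Counter(neighbor_labels)
        label, freq = counts.most_common(1)[0]
        if list(counts.values()).count(freq) > 1:
            return neighbor_labels[0]  # tie-break by the closest neighbor
        return label

    def top2_hit(neighbor_labels, true_label):
        # The "max"/ideal strategy: count a hit if either of the
        # 2 nearest neighbors carries the true label.
        return true_label in neighbor_labels[:2]

    neighbors = ["sports", "business"]        # labels sorted by distance to the query
    print(knn_predict(neighbors))             # -> "sports" (one committed prediction)
    print(top2_hit(neighbors, "business"))    # -> True (optimistic scoring)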


Just wanted to say, thanks for your work debugging this.

You have probably saved other researchers an unfathomable amount of time


Thanks for the replication, this is important.

One question, did you try to replicate the other result table (Table 3)?

If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but it will differ from "normal" accuracy less and less as the number of classes increases (on average). So this shouldn't change the results for table 3 thaaat much as the datasets have large amounts of classes (see table 1).

In any case, top-2 accuracy of 0.685 for the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings and all the nice stuff that those of us working on NLP have been devoting years to.

[1] In my understanding of gzip, it considers only bit sequences, which are not necessarily aligned with words (aka. bytes).


I haven't yet replicated Table 3 because most of those datasets are much larger and it will take a while to run (they said the YahooAnswers dataset took them 6 days).

Also, I have only tried the "gzip" row because that is all that is in the github repo they referenced.

Yeah, you're right, the more classes there are, probably the lower the effect this will have.


Did you try contacting the authors before you went public with your discovery?


We're adult enough to have discussions like this in public. They are healthy to have. People make mistakes. Kudos to the original authors for releasing the source code so people could inspect and replicate their results.


I agree, and just want to add: nowadays it's super common for researchers to widely publicize their new work on social media. The blog post here even mentions "Table 5 from the paper was often included in tweets".

In this context of sharing your results very publicly, it seems only fair that rebuttals would be very public, too. Otherwise researchers would be highly incentivized to very publicly publish weak results because they would get a lot of positive publicity when they share the results, but not much negative publicity when the weaknesses are found and shared.


It isn't a security issue and doesn't warrant responsible disclosure so why would op be expected to?


I did not, but I see why that could be a better approach. I mainly am trying to be more open with little side projects I do, so I want to start blogging about what I'm working on. Also, this paper was being widely discussed, so I thought this would be one more entry in that discussion.


Whoa, I just read your post and saw this:

> tldr: it is reporting a top-2 accuracy rather than a kNN(k=2) accuracy

If the accuracy figures shown for other models are top-1, that's a pretty significant mistake, hopefully an innocent one.

Thank you for doing this and sharing your findings!

---

Previous discussion on HN: https://news.ycombinator.com/item?id=36707193


Also:

>k=2 is a bit of an odd choice for kNN classification

That's an understatement. Choosing an even number for any sort of voting algorithm doesn't make much sense, choosing 2 specifically probably makes the least sense of all.


Yeah that is a big red flag - as the OP mentions, there is basically no way of making k=2 statistically different from k=1, that's why nobody uses it.

I suppose the authors just tried many different k and selected k=2 because it performed surprisingly well (likely due to the bug the OP found out). But if the results were significantly better than k=1 or k=3, it's a bit weird the authors never double checked why that was the case. I guess it can be one of those things you settle on early in the overall process with a few experiments, and just take for granted afterwards, never checking it again, but still, it sounds like something that should pop out at some point while writing the paper...?


Yeah, I think this is one part where a reviewer or advisor could have focused questions.

There is a sentence in Appendix C: "We set k = 2 for all the methods on all the datasets and we report the maximum possible accuracy getting from the experiments for each method."

I'm not sure what the second part of that means exactly.


True. You want to always use an odd number so there are no ties.

I'm guessing they were trying a parameter sweep, and found that (thanks to the bug) they got the best results for K=2.

This too is problematic in its own sense.


Yes, agreed. One small point: for the multi-class case (more than just two classes), which include all the datasets here, you can still get ties for odd k. e.g. k=3, you can get 1 vote each for 3 different classes, etc.


Multi-class is trickier. Maybe we can break down an N-class problem into N binary-classification problems?


When we did search relevance experimentation at Shopify we made lots of mistakes. I can empathize with the authors. I’ve had a lot of my own public screw ups.

At the end of my time at Shopify I learned good science requires good software engineering. It’s really easy to make mistakes at so many places in the stack.

We spent a lot of time on creating rigorous, heavily tested and high quality software for our experiments so we could trust our numbers and reproduce each others experiments. We tried to discourage one-off evaluation methods, but if we created a new one, to add it to our suite and test the metric to understand what it meant.

It seems obvious, but it's sadly not as common as I wish it were, in my experience with this kind of experimentation. Companies want velocity, and thinking deeply about statistics and building internal tools is not in the interest of most higher-ups.


> At the end of my time at Shopify I learned good science requires good software engineering.

This is a positive side of industry research. First, you tend to have more software engineering expertise available, and secondly you have more of an incentive to not exaggerate your claims, since if you say it works, you'll be expected to put it into production.


I appreciate that this blog post was made.

This is like so, so many little projects that I do (even specifically showing problems in papers like this) that never see the light of day. I usually just make a small noise, and then it sits on my hard drive and that's it.

So thank you for putting this out there.


I've started using Twitter as a "lower effort blogging" platform. After I spend a day on something like this, I'm usually too tired to actually write a blog post about it, which feels like a waste. But then writing a small Twitter thread is usually within my capabilities.


Really happy to see this: KNN + classification task + doing classification that's based on pure text similarity is a recipe for stacked results.

Schadenfreude responses to this paper misunderstand that the natural language stuff is crucially important for embeddings: sure, phrases that share words will classify well and GZIP well, so GZIP can be used as ersatz classification.

The miracle of BERT / embeddings is _not_ having to share words: for instance, "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but not "my jewelry is stored safely in the safe".

This is also an important thing to consider with LLMs: people are using embeddings intended to do text similarity, whereas you want to use an SBERT model: that is trained to correlate a question to a document that will answer it.

https://www.sbert.net/ for the full rabbit hole.

Previously: Should you use OpenAI's embeddings? Probably not, and here's why. https://iamnotarobot.substack.com/p/should-you-use-openais-e....

HN discussion: https://news.ycombinator.com/item?id=35377935


> The miracle of BERT / embeddings is _not_ having to share words

To be fair, the original task is specifically chosen so that something like knn+compression has a chance of being good: i.e. out of domain + low resource.

Under these conditions the training inputs could be too sparse for a highly parameterized model to learn good embeddings from.

In traditional in domain + big data classification settings there's no chance that non-parametric methods like compression would beat a learned representation.


It wasn't obvious to me why the authors chose kNN for the classifier. Since they produce a distance matrix, they could have used multi-dimensional scaling to convert the matrix to factors, and then used a tree algorithm such as xgboost which would likely make use of more information than kNN and produce significantly better results. They could have also used a PAQ compression algorithm which are much better than the LZ compressors -- all of which could have significantly improved the results and possibly delivered on their original conclusions.

What I liked about the subject paper is that the compression algorithm is abstracted away, and it led me to consider what else one could do with compression via the p(x) ~ K^(-|x|) relationship, where K is the alphabet size and |x| is the length of the string x, assuming optimal coding.

For example it occurred to me that one could do traditional classification by packing the factors for each response into separate documents and then proceeding as the paper does to classify which document best compresses the next sample in order to determine its class: a sort of supervised classification using compression algorithms. The closer the compressor is to an optimal code for the dataset, the better it will work.
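Something like this sketch, for the supervised variant (gzip here; `train_texts` is a hypothetical dict from class label to a list of training texts):

    import gzip

    def clen(text):
        return len(gzip.compress(text.encode("utf-8")))

    def classify(sample, train_texts):
        best_label, best_extra = None, float("inf")
        for label, texts in train_texts.items():
            doc = "\n".join(texts)
            # Extra compressed bytes needed for the sample given the class document;
            # a class whose document "explains" the sample well adds fewer bytes.
            extra = clen(doc + "\n" + sample) - clen(doc)
            if extra < best_extra:
                best_label, best_extra = label, extra
        return best_label

(With gzip's 32 KB window this only really works for small class documents; a PAQ-style compressor, as you say, would track the idea more faithfully.)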

Schemes for sequence prediction are equally straightforward to implement.

I found it to be a pleasant surprise.


Can anyone explain to me how a compression algorithm can beat an LLM at anything? Isn’t that like saying horses are better than graffiti?

I’m sure the answer is in there somewhere but I’m not well versed in AI and I simply can’t figure it out.


Generally, compression = model + entropy coding.

The model's job is to predict what comes next. The entropy coder's job is to encode the difference between the prediction and what actually comes next so that the most likely outcome uses as few bits as possible. The more accurate the model is, the less the difference between reality and prediction, the less bits the entropy coder needs and the better the compression.

Simple compression algorithms have simple models, like "if I see the same byte 10 times, the 11th is likely to be the same". But you can also use a LLM as your model, as completing text with the most likely word is what LLMs do.

Here they did the opposite. Instead of using a model for compression, by using a few tricks, they used a compression algorithm as a model: the most likely outcome is when the compression algorithm uses less bits to encode the result. And the original authors have shown that, in some tasks, the simple model that can be extracted out of gzip beats much more complex LLMs.


I almost feel like compression and embeddings are duals of each other, but I can't quite articulate it.

Embeddings use fixed-size vectors to minimize the dot product between vectors of similar inputs. Compressors use a variable-length encoding to minimize the overall stored size.


They are, in a way. Both encode a representation of a larger amount of information in a smaller footprint.


Generally, compression algorithms try to give structured data a distribution more similar to random data.

If any byte sequence is a correct file (unlikely, mostly because compression algorithms try to be robust against corruption), then this is easy to reverse: you just generate a random sequence of bytes and then decompress it.

Basically you can turn a compression algorithm into a probability distribution by inserting random bytes wherever the decompression algorithm tries to read one, but sometimes not all bytes are allowed.

You can then reason about this probability distribution and see what it's properties are. Typically something with a probability of 'p' will require -log(p)/log(2) bits.


Why gzip, and not a more complex compression algorithm like 'xz'? Or if you want it to be fast then 'zstd' or 'lz4'. Gzip is an odd choice: is it neither the highest compression ratio, nor the fastest.


A language model estimates the probability of a sequence of words P(w_1, ..., w_n) or equivalently P(word | context).

For compression, word sequences that have higher probability should be encoded with shorter codes, so there is a direct relationship. A well known method to construct such codes based on probabilities is Huffman coding.

This works whether you use a statistical language model using word frequencies or an LLM to estimate probabilities. The better your language model (lower perplexity) the shorter the compressed output will be.

Conversely, you can probably argue that a compression algorithm implicitly defines a language model by the code lengths, e.g., it assumes duplicate strings are more likely than random noise.
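The relationship is literally code length = -log2(probability). A tiny sketch, with `lm_prob` standing in for any model that returns P(word | context):

    import math

    def ideal_bits(words, lm_prob):
        # An optimal coder (Huffman/arithmetic) spends about -log2 P(w | context)
        # bits per word, so total compressed size tracks the model's cross-entropy.
        return sum(-math.log2(lm_prob(w, words[:i])) for i, w in enumerate(words))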


The intuition about how the gzip method works goes like so:

If you compress `ABC`, it will be X bytes. If you then compress `ABCABC`, it will not take 2x bytes. The more similar the two strings that you concatenate, the less bytes it will take. `ABCABD` will take more than `ABCABC`, but less than `ABCXYZ`.
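You can see this directly in a couple of lines (exact byte counts depend on the gzip level, but the ordering should hold):

    import gzip

    def gz(s):
        return len(gzip.compress(s.encode()))

    a = "the cat sat on the mat"
    b = "the cat sat on a mat"        # nearly identical to a
    c = "quantum chromodynamics"      # unrelated to a

    print(gz(a + " " + b) - gz(a))    # small delta: b reuses a's substrings
    print(gz(a + " " + c) - gz(a))    # larger delta: little shared material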

BERT is, by today's standards, a very small LLM, which we know has weaker performance than the billion-param scale models most of us are interacting with today.


> very small LLM

Heh. So does that make it a MLM (medium)?

I've always found it funny that we've settled on a term for a class of models that has a size claim... Especially given how fast things are evolving...


Compression is equivalent to intelligence:

https://mattmahoney.net/dc/rationale.html


It's a very limited task: take a document and classify it into one of (for example) 10 or so categories. Things like detecting certain words can do pretty well in some cases. Things that compress well have the occurrence of common substrings.


LLMs and essentially all neural networks can be viewed as learning compression algorithms where the behavior of the compression algorithm is learned and subject to potential constraints beyond mere file reconstruction.

Highly recommend reading Ted Chiang's "ChatGPT Is a Blurry JPEG of the Web"[0] to get a better sense of this.

Keeping this fact in your mental model of neural networks can also go a long way toward demystifying them.

0. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...


(The human brain is also, in part, a blurry JPEG of the world.)


We currently have no reason to believe this, and information we do have seems to suggest that is very unlikely to be the case. I'm also guessing from my username you can infer that I don't think we even know enough to concretely say what is this "world" you are referencing.


I don't know what exactly blurry jpeg means to you, but we have every reason to believe we operate on shortcuts of reality, not reality. Nearly all your brain does with sense data is warp it to conform to internal predictions, in numerous ways.

Memories are always part fabrications. You can't return to previous mental states (you only think you do), and we have no real clue what really informs decisions, i.e. preferences shape choices just as much as choices shape preferences.

Your brain will happily fabricate rationales for decisions, rationales you sincerely believe but that couldn't possibly be true, as in the split-brain experiments.


I for one massively compress my experience. I remember things on autocomplete. I have memories where different time periods are mixed together: my recollection of a room will have furniture in it that was only added later, for instance.


One way to interpret/understand language models is as quite involved compression algorithms.


Many other replies here are wrong - the primary reason is that the LLMs were used on completely out of distribution data (e.g. trained on English, evaluated on completely different language that shared some characters). The points about compression's relatedness to understanding are valid, but they are not the primary reason for LLMs underperforming relative to naive compression.


Other reply is great, more in-depth on details from me here: https://news.ycombinator.com/item?id=36758681

Plain english TL;DR:

- if you limit your task to binning snippets of text

- and the snippets are very well-defined (ex. code vs. Filipino text)

- the snippets' _bytes_ can be compared and score well, no text understanding needed

- size delta of a GZIP after adding one more sentence acts as an ersatz way to compare sets of bytes to eachother (ex. you can imagine a GZIP containing 0xFFABCDEF that has 0xFFABCDEF added to it will have a size delta of 0)


Did you read the recent Douglas Hofstadter article or do you just always use the word ersatz?


I was homeschooled till 12 and mostly left to my own devices as long as it was reading - I believe that has caused a lifelong issue where I sound like a tryhard unintentionally :( (TL;DR I use it but IDK when; didn't see the Hofstadter article but now I'm looking forward to it)


The word ersatz is great, and conveys the notion that the replacement is simpler and possibly inferior when compared across all features. “Substitute” doesn’t cut it. Human language (ironically w.r.t. TFA) isn’t a collection of redundant symbols, the synonyms carry all sorts of useful nuance.


You can substitute the word "substitute" as an ersatz "ersatz".


It's honestly weird and annoying and I'd give it up in a second.

There's two issues:

- I don't have an ear for what's simple vocabulary versus tryhard, I go into a mad loop when I try

- even if I actively notice it, substitution can seem very far away from intent. Simple wouldn't have occurred to me - I wanted to say something more akin to sloppy / stunt and ersatz is much closer to "hacky" in meaning than simple. Think MacGyver.

But I should do the exercise of at least scanning for words more often and aim for wider audience - I would have known ersatz was an outlier and I shouldn't feel it's condescending or diluting meaning, it's broadening the audience who can parse it


Why would you apologize for your vocabulary and try to sound like someone less well-read than you are? Just get over it and be yourself.


Eh, it's fine, it doesn't sound tryhard to me, just a bit hard to read.


It doesn't sound tryhard; it sounds literate.


https://www.inference.org.uk/itprnn/book.pdf is a classic text on this connection.


If this is true, I'm looking forward to seeing how all the people who made grandiose statements about that paper now quietly scrub them.

LinkedIn and Twitter influencers, I'm looking at you in particular.

If it's not true, I guess I'll be the one looking stupid—I only skimmed the article.


Gzip as a classifier is surprisingly good and you should use it as a benchmark for your neural network.


Just a note: your blog seems to be stuck in 2022. Date of post is 17 July 2022


Thanks, should be fixed in a minute. That's what I get for writing dates by hand...


What's disturbing to me, if this is true, is how many "top voices" in ML on ML-Twitter were head over heels with this paper. How many of them actually read it at all?


I had some difficulty reproducing results that were better than word2vec embeddings + cosine similarity. See my comment on the announcement thread here. https://news.ycombinator.com/item?id=36706078


Great job replicating and fixing. So easy to accidentally create results that are statistical artifacts.


IMO their results have nothing to do with compression:

When I replace gzip with lz4 (lower compression ratio, just does the "substring replacement" part of gzip), I get the same performance.

When explicitly just comparing the substrings, I still get the same performance.

All experiments were done in their 100-shot setting. Code: https://github.com/mattminder/npc_gzip (for more details, see the readme)


My take on this:

https://twitter.com/antonosika/status/1679423272541192196?s=...

Regardless, great work digging into the code, and great work by authors publishing the code.


Great. So a researcher publishes their code, we can replicate it, find flaws, and correct it. If only every paper were like this. Congrats to the authors for making it easily accessible and the OP for trying to replicate and fix the problem.


So, couldn't it be that the authors of the paper ran the model with a random tie break and got lucky? This blog post seems to assume they had the "rand" flag deactivated. Please correct me if I am wrong.


From what I understand in the post getting lucky enough to see that big of a change in this situation would be like getting 1000 head flips in a row. It’s not luck you could expect to ever get.


I see


The code of the paper is all on Github so you can verify that line of thinking if you want to.


One thing many people are missing: the simple gzip-for-text-classification hack is not the contribution of this paper. (They reference the standard intro AI textbook for that hack.) The contribution is to use the gzip numbers together with k-nearest-neighbors.

In section 6.2 they compare gzip-distance+kNN vs. gzip-distance on its own on four problems: it was better on two, and worse on two others.

Another bit of background I guess is worth saying: language models are pretrained with a compression objective. That is, the loss function in pretraining is the cross entropy of the input text, which means "minimize the compressed length of this input if you fed it to this LM driving an arithmetic coder".


In current times *zip mostly cripples things like the web and Docker. To find out it is the best at something in 2023 is not very likely.

Nice find btw.


This is interesting. However, why not first discuss with the authors directly to make sure that you're right?



One hour ago?


Yes, that could be a better idea. I am mainly trying something new to work more "in the open" and write blogs about things as a do them. I could be wrong and that would be pretty embarrassing for me. I just published the code I used to double-check things, now linked on the page near the bottom.


I have hacked together a JavaScript port: https://gist.github.com/netdur/a777f75fb70e0abc19c407c2ff7f9...

and it seems to work!!! regardless

Best matched answer for 'How do I start a new project?' is 'To create a new project, go to the dashboard and click on 'New Project'. Then, fill out the details and click 'Save'.'

Best matched answer for 'How can I delegate tasks to others in my team?' is 'To invite team members to your project, open the project and click on 'Invite Members'. Enter their email addresses and click 'Send Invites'.'

Best matched answer for 'What do I do when I finish a task?' is 'When a task is completed, it should be marked as 'Complete'. It will then be moved to the 'Completed Tasks' section.'



