Hacker News
Decoding the ACL Paper: Gzip and KNN Rival Bert in Text Classification (codeconfessions.substack.com)
34 points by abhi9u on July 20, 2023 | 10 comments



The paper has recently been called into question for overestimating their performance relative to BERT: https://news.ycombinator.com/item?id=36758433. Might be good for the blog's author to take this into account in their explainer. The author's perspective sounds a bit too positive (and borderline salesmanlike).


The second to last section "some potential issues with the paper" discusses the top-2 finding.


Yes - it's mentioned, but doesn't the framing below make it sound like they're still advocating for this paper?

> In essence, it's advisable to take the paper’s reported figures with a grain of salt, particularly as they cannot be precisely reproduced as described. Nonetheless, this approach continues to deliver unexpectedly well.

A "grain of salt" is different from "critical evaluation flaw," and if the reproduction's results are true, then the method doesn't after all "deliver unexpectedly well".


I take your point that it could have been more strongly worded. The reason I say it "delivers unexpectedly well" is that the whole concept of using gzip for classification is unintuitive, and even after fixing the flaw it still manages to get decent accuracy (granted, it no longer beats state-of-the-art models).


Further analysis shows that it doesn’t perform well at all—successes are tied to things like test set leakage.

https://kenschutte.com/gzip-knn-paper2/

This paper isn’t a surprisingly effective result. It’s thoroughly shoddy scholarship that the authors should feel embarrassed by.


Thank you for reading :-)

I mentioned it towards the end, in the second-to-last paragraph. Those issues in the evaluation do bring its accuracy down a bit, but even then it performs better than expected, considering it is doing kNN on compressed data.


Well…one that peeks at the test set labels.

https://kenschutte.com/gzip-knn-paper2/
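
A rough Python sketch of why the scoring rule matters with k=2. The two functions and the example labels are hypothetical illustrations, not the paper's code: the lenient rule counts a prediction as correct if the true label appears anywhere among the k nearest neighbors (effectively top-k accuracy), while the strict rule picks one label without ever looking at the true label.

```python
from collections import Counter

def lenient_correct(neighbor_labels, true_label):
    # Optimistic scoring: a hit if ANY of the k nearest labels matches.
    # This uses the true label to decide between candidates.
    return true_label in neighbor_labels

def strict_correct(neighbor_labels, true_label):
    # Majority vote over the k nearest labels (ordered nearest first).
    # Ties go to the nearest neighbor among the tied labels, so the
    # prediction is fixed before the true label is consulted.
    votes = Counter(neighbor_labels)
    best = max(votes.values())
    for label in neighbor_labels:
        if votes[label] == best:
            return label == true_label
    return False  # only reached when neighbor_labels is empty

# Hypothetical case: the two nearest neighbors disagree.
neighbors = ["sports", "politics"]
print(lenient_correct(neighbors, "politics"))
print(strict_correct(neighbors, "politics"))
```

With k=2 every disagreement between the two neighbors is a tie, so the gap between the two rules can be large.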


In addition to the evaluation issues, it looks like several of their training sets have significant overlap with the test sets [1]. Especially for a compression-based technique, having exact duplicates is going to help a lot.

[1] https://github.com/bazingagin/npc_gzip/issues/13
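
A quick way to check a dataset for this kind of leakage is to count the test texts that also appear verbatim in the training split. This is only a minimal sketch: the function name `overlap_fraction` and the whitespace stripping are my own assumptions, and it catches exact duplicates only, not near-duplicates.

```python
def overlap_fraction(train_texts, test_texts):
    """Return the fraction of test texts that appear verbatim in the training texts."""
    if not test_texts:
        return 0.0
    # Strip surrounding whitespace so trivial formatting differences do not hide duplicates.
    train_set = {t.strip() for t in train_texts}
    duplicates = sum(1 for t in test_texts if t.strip() in train_set)
    return duplicates / len(test_texts)

# Toy example with made-up data: one of the two test texts is also in the training split.
train = ["goal scored late", "budget bill passed", "new phone released"]
test = ["goal scored late", "storm hits coast"]
print(overlap_fraction(train, test))
```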


In such a scheme, wouldn't synonyms of the same word be no closer to each other than any other random string?


Yes, that's true. This approach will not work as well as techniques that build a semantic model of the data, such as embeddings.

For example, if you want to do positive vs. negative sentiment classification of movie reviews, this will perform about as well as flipping a coin. (I tried this.)

It works well for multiclass classification. For example, let's say I have an article about soccer. I am bound to have words like "goals" and "scores", and I will find a good number of texts in the corpus from the same class with the same vocabulary. Thus their NCD (normalized compression distance) will be lower than for texts from other classes. But this isn't really doing anything related to semantics.
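
To make the mechanism concrete, here is a minimal Python sketch of NCD-based kNN. It uses the standard NCD formula with gzip as the compressor. The function names (`clen`, `ncd`, `knn_predict`), joining the two texts with a space, and the plain majority vote are assumptions for illustration, not the paper's exact code.

```python
import gzip
from collections import Counter

def clen(text: str) -> int:
    # Length in bytes of the gzip-compressed text.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance:
    # (C(ab) - min(C(a), C(b))) / max(C(a), C(b))
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_predict(query: str, train, k: int = 3) -> str:
    # train is a list of (text, label) pairs.
    # Take the k training texts with the smallest NCD to the query
    # and return the most common label among them.
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Texts that share vocabulary compress better together, so their NCD is lower. Nothing here knows that two different words mean the same thing.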

I am going to be writing another article on this, where I will try to explain why gzip is helping it do this.



