
Yes, that's true. This approach will not work as well as techniques that build a semantic model of the data, such as embeddings.

For example, if you want to do positive vs. negative sentiment classification of movie reviews, this will perform just as well as flipping a coin. (I tried this.)

It works well for multiclass classification. For example, say I have an article about soccer: it is bound to contain words like "goals" and "scores", and I will find a good number of texts in the corpus from the same class with the same vocabulary. Their NCD will therefore be lower than their NCD against texts from other classes. But this isn't really doing anything related to semantics.
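The vocabulary-overlap effect is easy to see with a minimal sketch of NCD using Python's standard-library gzip module (the example texts below are invented, not from any real corpus):

```python
import gzip

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: if x and y share vocabulary,
    # gzip compresses their concatenation better, lowering the score.
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two soccer-like texts share many words; the third does not.
soccer_a = "the striker scored three goals and the team celebrated the win after the match"
soccer_b = "the team scored late goals in the match and the striker secured the win"
finance = "central banks raised interest rates again to curb persistent inflation pressures"

same = ncd(soccer_a, soccer_b)   # same-class pair: more shared substrings
diff = ncd(soccer_a, finance)    # cross-class pair: little shared vocabulary
print(f"same-class NCD: {same:.3f}, cross-class NCD: {diff:.3f}")
```

A classifier built on this just compares a new text's NCD against labeled examples (e.g. k-nearest neighbors on the NCD values) and picks the class with the smallest distances; no semantic understanding is involved, only substring reuse found by DEFLATE.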

I am going to write another article on this, where I will try to explain why gzip helps it do this.



