FastText is the best text classification library for a quick baseline (rolisz.com)
143 points by rolisz 7 months ago | 11 comments

Past related threads:

Using aligned word vectors for instant translations with Python and Rust - https://news.ycombinator.com/item?id=27465287 - June 2021 (35 comments)

FastText embeddings of field headers to improve NLP - https://news.ycombinator.com/item?id=23405965 - June 2020 (1 comment)

Fast and accurate language identification using fastText - https://news.ycombinator.com/item?id=15393518 - Oct 2017 (12 comments)

Multilingual word vectors in 78 languages - https://news.ycombinator.com/item?id=14167539 - April 2017 (23 comments)

Facebook releases 300-dimensional pretrained Fasttext vectors for 90 languages - https://news.ycombinator.com/item?id=13771292 - March 2017 (70 comments)

Fasttext and Torch: A fasttext implementation based on Torch - https://news.ycombinator.com/item?id=12862541 - Nov 2016 (1 comment)

Facebook AI Research Open Sources fastText - https://news.ycombinator.com/item?id=12329094 - Aug 2016 (5 comments)

FastText – Library for fast text representation and classification - https://news.ycombinator.com/item?id=12226988 - Aug 2016 (52 comments)

The built-in classifier in fastText is abysmal. Putting a default XGBoost model on the embeddings is just as easy, and once you're using fastText only for embeddings, switching to a better embedding makes more sense.
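A minimal sketch of that recipe, embed each text and hand the vectors to an off-the-shelf classifier. Assumptions so the snippet is self-contained: a toy hashing embedder stands in for a loaded fastText model's get_sentence_vector, and sklearn's GradientBoostingClassifier stands in for XGBoost.

```python
import zlib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

DIM = 16  # toy embedding size; real fastText vectors are typically 100-300d

def toy_sentence_vector(text: str) -> np.ndarray:
    # Stand-in for fasttext_model.get_sentence_vector(text): the average
    # of deterministic pseudo-random per-word vectors.
    vecs = [np.random.RandomState(zlib.crc32(w.encode()) % (2**31)).randn(DIM)
            for w in text.lower().split()]
    return np.mean(vecs, axis=0)

texts = ["great movie loved it", "wonderful acting great fun",
         "terrible plot hated it", "awful boring waste of time"]
labels = np.array([1, 1, 0, 0])

# Stack sentence vectors into a feature matrix and fit a boosted model on top.
X = np.stack([toy_sentence_vector(t) for t in texts])
clf = GradientBoostingClassifier(n_estimators=20).fit(X, labels)

print(clf.predict(X))  # fits the tiny training set
```

With a real model you would replace toy_sentence_vector with model.get_sentence_vector and swap in xgboost.XGBClassifier; the pipeline shape stays the same.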

Do you have an example of this?

Correct. This has been my experience as well.

I agree, and not just for text classification. It's a good baseline for anything that looks like text (if you train a model from scratch), e.g., URLs with some basic preprocessing. This is because fastText uses character n-grams, so substring similarities are exploited.
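A quick sketch of why: fastText builds word vectors from character n-grams (defaults minn=3, maxn=6, with "<" and ">" marking word boundaries; the whole word is also kept as an extra token, omitted here for brevity), so strings that share substrings share vector components:

```python
def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> list:
    # Wrap the word in boundary markers, then slide windows of each length.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

# "url" and "urls" share several n-grams, so their vectors (sums of
# n-gram vectors) end up close even if one of them is rare or unseen.
shared = set(char_ngrams("url")) & set(char_ngrams("urls"))
print(sorted(shared))  # ['<ur', '<url', 'url']
```

This is also why fastText can produce a sensible vector for an out-of-vocabulary token: it just sums the n-gram vectors it does know.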

Some things to note: (1) FastText comes with a classifier, but I have often seen good results using a different classifier like LightGBM or an SVM. (2) If you want other word embeddings to compare against as well, magnitude [1] is an easy-to-use library. (3) FastText now supports multiple languages [2].

[1] https://github.com/plasticityai/magnitude#pre-converted-magn... [2] https://fasttext.cc/docs/en/crawl-vectors.html#models

For more text classification baselines (CRNN, NRTR, RobustScanner, SAR, SegOCR), check out https://github.com/open-mmlab/mmocr They are reproducible and customizable.

Did you mean to link to something else? Those all seem geared for OCR tasks as opposed to text classification (which is what the article is talking about).

I created a Python module that leverages fastText to train text classifiers from just a labeled dataset.
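For reference, fastText's supervised mode expects a plain-text file with one example per line, label first, marked with the __label__ prefix (the prefix itself is configurable), e.g.:

```
__label__positive loved every minute of it
__label__negative the plot made no sense
```

A labeled dataset in any tabular format can be dumped to this layout in a few lines, which is most of what such a wrapper module needs to do.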


Not only for classification: fastText is also really good (and fast) at language identification [0].

To use fastText in Python in a scikit-learn style (including inside sklearn pipelines), I'd recommend trying skift [1].

[0] https://ricardoanderegg.com/posts/python-fast-language-ident... [1] https://github.com/shaypal5/skift
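For context, the pretrained language-identification model returns labels as "__label__xx" codes. A small helper (top_language is a hypothetical name; the stub below only mimics the (labels, probabilities) return shape of fasttext's model.predict, so the demo runs without the model file) keeps the call site tidy:

```python
def top_language(predict_fn, text: str):
    # predict_fn mirrors fasttext's model.predict(text), which returns a
    # tuple of (labels, probabilities) with labels like "__label__en".
    labels, probs = predict_fn(text)
    return labels[0].replace("__label__", ""), float(probs[0])

# With the real model (an assumption: lid.176.ftz downloaded locally):
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   top_language(model.predict, "Bonjour tout le monde")

# Stub with the same return shape, for a self-contained demo:
fake_predict = lambda text: (("__label__fr",), (0.98,))
print(top_language(fake_predict, "Bonjour"))  # ('fr', 0.98)
```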

How does it compare to vowpal wabbit today?

I feel like using HuggingFace Transformers is just as easy right now, and it will probably have better performance. Of course, a GPU machine is needed for training.
