
FastText – Library for fast text representation and classification - Dawny33
https://github.com/facebookresearch/fastText
======
kough
Links to the relevant papers:

Bag of Tricks for Efficient Text Classification:
[https://arxiv.org/abs/1607.01759v2](https://arxiv.org/abs/1607.01759v2)

Enriching Word Vectors with Subword Information:
[https://arxiv.org/abs/1607.04606](https://arxiv.org/abs/1607.04606)

Both fantastic papers. For those who aren't aware, Mikolov also helped create
word2vec.

One curious thing: this seems to use hierarchical softmax instead of the
"negative sampling" described in their earlier paper
[http://arxiv.org/abs/1310.4546](http://arxiv.org/abs/1310.4546), despite that
paper reporting that "negative sampling" is more computationally efficient and
of similar quality. Anyone know why that might be?

~~~
exgrv
It is possible to choose between negative sampling (ns), softmax, or
hierarchical softmax (hs) by using the -loss option.

~~~
kough
Cool, thank you!

------
samfisher83
What exactly does it do?

It says this: fastText is a library for efficient learning of word
representations and sentence classification.

What does that mean? Is it for sentiment analysis?

~~~
onewaystreet
Read the "Example use cases" section

~~~
lucb1e
(Not OP) I did and it's still rather vague. I totally see where s/he's coming
from with this question.

------
slig
I noticed that the C++ code has no comments whatsoever. Why would they do
that? Is the code clear enough that you can read the papers to figure it out,
or do they clean up comments before releasing internal code to the public?

~~~
bdcravens
I suspect it's the latter, since code not initially OSS likely has some
references to IP, or org structure, some crudeness, etc. Probably easier to
remove it all than rewrite.

Adding comments back in would be a great start to contributing to OSS.

~~~
michael_storm
I think open-sourcing the code in the first place was a great start to
contributing to OSS. Facebook isn't a newcomer to the community.

~~~
whafro
I took bdcravens' comment to mean it'd be a great project for someone who
wanted a way to start contributing to OSS, not a suggestion that Facebook
wasn't contributing.

~~~
michael_storm
Oh, you're right. Whoops.

------
misiti3780
The classification format is a bit confusing to me. Given a file that looks
like this:

Help - how do I format blocks of code/bash output in this editor?

    fastText josephmisiti$ cat train.tsv | head -n 2
    1   1   A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .   1
    2   1   A series of escapades demonstrating the adage that what is good for the goose   2

Are they saying to reformat it like this:

    cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'

giving me

    fastText josephmisiti$ cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'
    __label__1  A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
    __label__2  A series of escapades demonstrating the adage that what is good for the goose
    __label__2  A series
    __label__2  A
    __label__2  series
    __label__2  of escapades demonstrating the adage that what is good for the goose
    __label__2  of
    __label__2  escapades demonstrating the adage that what is good for the goose
    __label__2  escapades
    __label__2  demonstrating the adage that what is good for the goose
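
For anyone who'd rather not use awk, the same reshaping can be sketched in
Python. Column positions are taken from the one-liner above (awk's $3 = text,
$4 = label); nothing else here comes from fastText itself.

```python
# Sketch: convert tab-separated rows into fastText's
# "__label__X <text>" training format, mirroring the awk one-liner.
# Column positions are assumed from the snippet above, not from any spec.

def to_fasttext(tsv_lines):
    out = []
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # awk's $4 is the label, $3 the text (awk columns are 1-indexed)
        label, text = cols[3], cols[2]
        out.append("__label__%s\t%s" % (label, text))
    return out

rows = ["1\t1\tA series of escapades\t1",
        "2\t1\tA series\t2"]
print("\n".join(to_fasttext(rows)))
```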

~~~
throwanem
HN doesn't support Markdown. Indent each line of a block by four spaces for
fixed-width markup.

------
haddr
For supervised classification, this tool is suitable when your dataset is
large enough. I performed some tests with binary classification (Twitter
sentiment) on a corpus with ~7,000 samples and the result is not impressive
(~0.77). Vowpal Wabbit performs slightly better here, with almost the same
training time.

I'm looking forward to trying it on some bigger datasets.

I also wonder whether it is possible to use a separately trained word vector
model for the supervised task?

~~~
exgrv
Thanks for pointing this out. We designed this library on large datasets, and
some static variables may not be well tuned for smaller ones. For example, the
learning rate is only updated every 10k words. We are fixing that now; could
you please tell us which dataset you were testing on? We would like to see
whether we have solved this.

~~~
haddr
Sure, how can I send it to you?

~~~
exgrv
If the dataset is public, could you post a link? Otherwise, could you please
send me an email? (My address can be found on the github README). Thanks!

~~~
haddr
I've sent you an email.

------
jgraham
This might be a naïve question, but does anyone know if this is suitable for
online classification tasks? All the examples in the paper ([2] in the readme)
seemed to be for offline classification. I'm not terribly well versed in this
area so I don't know if the techniques used here allow the model to be updated
incrementally.

~~~
plusepsilon
If it's trained with stochastic gradient descent (it should be) on batches of
data, you can apply it to online learning.
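
As a sketch of what that looks like, here's a toy logistic-regression learner
updated one example at a time (illustrative only, not fastText's actual API):

```python
import math
import random

# Toy online logistic regression: each call to update() consumes one
# example, so the model can keep learning as new data streams in.
class OnlineLogReg:
    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):          # y in {0, 1}
        err = self.predict(x) - y    # gradient of the log loss
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi

model = OnlineLogReg(dim=2)
random.seed(0)
for _ in range(2000):                # a stream of examples, one at a time
    x = [random.random(), random.random()]
    y = 1 if x[0] > x[1] else 0
    model.update(x, y)
print(model.predict([0.9, 0.1]))     # well inside the class-1 region
```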

------
mendeza
Can this be used to do automatic summarization? I have been really interested
in that topic, and I've played with TextRank and LexRank, but they don't
provide as meaningful summaries as I would want.

~~~
samfisher83
No, I don't think it can do that. It can classify text. So suppose you have a
bunch of sentences that describe cars, cats, etc. If you feed in data, it can
tell you whether the data is about a car or a cat.

~~~
mendeza
Thanks for the input! Text classification and semantic analysis seemed vague
to me, so the clarification helped :). Maybe classifying text can help improve
automatic summarization, since the sentences that best include or describe the
main topic should be in the summary.

------
Smerity
Just to mirror what was said on the thread a month ago when the paper came
out[1], if you're interested in FastText I'd strongly recommend checking out
Vowpal Wabbit[2] and BIDMach[3].

My main issue is that the FastText paper [7] only compares to other intensive
deep methods and not to comparable performance focused systems like Vowpal
Wabbit or BIDMach.

Many of the features implemented in FastText have existed in Vowpal Wabbit
(VW) for many years. Vowpal Wabbit also serves as a test bed for many other
interesting, but still highly performant, ideas, and has reasonably strong
documentation. The command line interface is highly intuitive and it will burn
through your datasets quickly. You can recreate FastText in VW with a few
command line options[6].

BIDMach is focused on "rooflining", or working out the exact performance
characteristics of the hardware and aiming to maximize those[4]. While VW
doesn't have word2vec, BIDMach does[5], and more generally word2vec isn't
going to be a major slow point in your systems as word2vec is actually pretty
speedy.

To quote from my last comment in [1] regarding features:

Behind the speed of both methods [VW and FastText] is the use of ngrams^, the
feature hashing trick (think Bloom filter, except for features) that has been
the basis of VW since it began, hierarchical softmax (think finding an item in
O(log n) using a balanced binary tree instead of an O(n) array traversal), and
the use of a shallow instead of a deep model.

^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat
on", "on the", "the mat". You lose complex positional and ordering
information, but for many text classification tasks that's fine.
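
A minimal sketch of those two tricks together, word bigrams fed through a
fixed-size hash table (the bucket count here is arbitrary; real systems use
far more):

```python
# Sketch of word bigrams plus the feature hashing trick: each ngram is
# mapped into one of a fixed number of buckets, so the feature vector
# has bounded size no matter how large the vocabulary grows.
BUCKETS = 1 << 10  # 1024 buckets, chosen arbitrarily for the sketch

def bigrams(text):
    words = text.split()
    return list(zip(words, words[1:]))

def hashed_features(text):
    vec = [0] * BUCKETS
    for ng in bigrams(text):
        vec[hash(ng) % BUCKETS] += 1   # hash collisions are tolerated
    return vec

print(bigrams("the cat sat on the mat"))
```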

[1]:
[https://news.ycombinator.com/item?id=12063296](https://news.ycombinator.com/item?id=12063296)

[2]:
[https://github.com/JohnLangford/vowpal_wabbit](https://github.com/JohnLangford/vowpal_wabbit)

[3]: [https://github.com/BIDData/BIDMach](https://github.com/BIDData/BIDMach)

[4]:
[https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_D...](https://github.com/BIDData/BIDMach/wiki/Benchmarks#Reuters_Data)

[5]:
[https://github.com/BIDData/BIDMach/blob/master/src/main/scal...](https://github.com/BIDData/BIDMach/blob/master/src/main/scala/BIDMach/networks/Word2Vec.scala)

[6]:
[https://twitter.com/haldaume3/status/751208719145328640](https://twitter.com/haldaume3/status/751208719145328640)

[7]: [https://arxiv.org/abs/1607.01759](https://arxiv.org/abs/1607.01759)

~~~
SergeyHack
Sounds interesting. Can these tools work on character n-grams as FastText
does?

~~~
tensor
In principle, if you just put a space between each character it would, though
it would also make ngrams across word boundaries, which you might not want.
Edit: that's for VW; maybe the other lib has special support for character
ngrams with word boundaries.
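
Both variants are easy to sketch; the `<`/`>` boundary markers below follow
the subword paper's notation, the rest is purely illustrative:

```python
# Variant 1: insert spaces so a word-ngram engine (e.g. VW) sees each
# character as a "word". As noted above, applied to whole sentences it
# would also form ngrams across the original word boundaries.
def spaced(word):
    return " ".join(word)

# Variant 2: fastText-style character ngrams with explicit boundary
# markers, e.g. "where" -> <wh, whe, her, ere, re> for n = 3.
def char_ngrams(word, n=3):
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(spaced("cat"))
print(char_ngrams("where"))
```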

------
jjuliano
I coded something like this before for personal use. It lets me evaluate my
Facebook/Twitter status before posting online and classify it as "negative,
sarcastic, positive, helpful" so that I can be careful about what I'm posting.
I use Bayesian filtering with trained words I gathered for negative,
sarcastic, positive, and helpful, then I use scoring to work out what the
sentence actually means.
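
That sort of Bayesian word scoring can be sketched in a few lines (the word
lists here are made up purely for illustration):

```python
import math
from collections import Counter

# Toy naive Bayes scorer: score each class by summing smoothed log
# probabilities of the words, then pick the highest-scoring class.
TRAIN = {
    "positive": "great helpful love nice great".split(),
    "negative": "awful bad hate bad terrible".split(),
}

def classify(text, smoothing=1.0):
    words = text.lower().split()
    vocab = len(set(w for ws in TRAIN.values() for w in ws))
    best, best_score = None, float("-inf")
    for label, seen in TRAIN.items():
        counts = Counter(seen)
        total = len(seen)
        score = sum(math.log((counts[w] + smoothing) /
                             (total + smoothing * vocab))
                    for w in words)
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("great and helpful"))
```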

------
tcamp
How does this work with, or replace, other NLP solutions on the market? Is it
only for training models, or for actual prediction as well?

------
merrellb
The simultaneous training of word representations and a classifier seems like
it ignores the typically much larger unsupervised portion of the corpus. Is
there a way to train the word representations on the full corpus and then
apply them to the smaller classification training set?

~~~
eefic
You probably meant to initialize the input->hidden weight matrix with the
result of unsupervised training on the full corpus. A little tweak to how
these weights are initialized would do:
[https://github.com/facebookresearch/fastText/blob/master/src...](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc#L358)

I was a bit curious why they did not offer this by default. It seems quite
useful.
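
A sketch of that tweak, using plain lists in place of the real weight matrix
(the names, dimensions, and fallback init range are assumptions for
illustration, not fastText's actual code):

```python
import random

# Sketch: initialize an input->hidden weight matrix from pretrained
# vectors where available, falling back to small random values for
# words without a pretrained vector.
DIM = 4
pretrained = {                       # hypothetical pretrained vectors
    "cat": [0.1, 0.2, 0.3, 0.4],
    "dog": [0.4, 0.3, 0.2, 0.1],
}
vocab = ["cat", "dog", "xyzzy"]      # "xyzzy" has no pretrained vector

random.seed(0)
input_matrix = []
for word in vocab:
    if word in pretrained:
        input_matrix.append(list(pretrained[word]))
    else:
        input_matrix.append([random.uniform(-0.5 / DIM, 0.5 / DIM)
                             for _ in range(DIM)])

print(input_matrix[0])               # the pretrained "cat" vector
```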

------
himavarsha
This might be a naive question, but what should be the format of the
training/test data? Is it like "__label__1 John" and "__label__2 Ram"?

------
d0100
As a side note, are datasets that have already been classified available for
free anywhere?

------
riyadparvez
Did they release any trained model like Google did for word2vec?

~~~
rspeer
Conceptnet Numberbatch ([https://github.com/LuminosoInsight/conceptnet-
numberbatch](https://github.com/LuminosoInsight/conceptnet-numberbatch)) is a
pre-trained model that outperforms the results reported in this paper (and of
course far outperforms the pre-trained word2vec models, which are quite
dated).

Here are the almost-comparable evaluations:

    
    
                  fastText    Numberbatch
        en:RW          .46           .601
        en:ws353       .73           .802
        fr:rg65        .67           .789
    

The difference actually should be larger: Numberbatch considers missing
vocabulary to be a problem, and takes a loss of accuracy accordingly, while
FastText just dropped their out-of-vocabulary words and reported them as a
separate statistic.

I'm using their Table 3 here. I don't know how Table 2 relates, or why their
French score goes down with more data in that table.

What's the trick? Prior knowledge, and not expecting one neural net to learn
everything. Numberbatch knows a lot of things about a lot of words because of
ConceptNet, it knows which words are forms of the same word because it uses a
lemmatizer, _and_ it uses distributional information from word2vec and GloVe.

------
kwrobel
Is it multi-label text classification or only multi-class?

~~~
exgrv
At train time, the code supports multiple labels by sampling one of the k
labels at random. At test time, it only predicts the most probable label for
each example.

We will add more functionality for multi-label classification in the future
(predicting the top k labels, etc.).
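
In other words, as a sketch (illustrative, not the actual fastText code):

```python
import random

# At train time, a multi-label example is reduced to a single-label
# one by picking one of its labels uniformly at random.
def sample_label(labels, rng=random):
    return rng.choice(labels)

random.seed(1)
example = ("some document text", ["__label__sports", "__label__news"])
text, labels = example
print(sample_label(labels))   # the label actually used for this update
```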

------
aantix
Was hoping for Java bindings, as I'd like to try it out on a long-running
Map/Reduce classification job.

~~~
rabidsnail
Then write some. There's also not that much code here, so you could port it to
Java in a day or two.

------
drstrangevibes
How fast is it? Does it outperform TensorFlow or torch-rnn?

~~~
smhx
Link to the paper:
[https://arxiv.org/abs/1607.01759](https://arxiv.org/abs/1607.01759)

Quotes from the paper:

Both char-CNN and VDCNN are trained on a NVIDIA Tesla K40 GPU, while our
models are trained on a CPU using 20 threads.

Table 2 shows that methods using convolutions are several orders of magnitude
slower than fastText.

Our speed-up compared to CNN based methods increases with the size of the
dataset, going up to at least a 15,000× speed-up.

Table 2 shows the speedups of:

ConvNets: 2 to 5 days on GPUs

FastText: 52 seconds on CPU

