Fasttext is also available in the popular NLP Python library gensim, with a good demo notebook: https://radimrehurek.com/gensim/models/fasttext.html
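If it helps anyone, here's roughly what that looks like in gensim (a sketch only: the toy corpus is made up, and the `vector_size`/`epochs` parameter names assume the gensim 4.x API):

    from gensim.models import FastText

    # Toy corpus: gensim expects an iterable of tokenized sentences.
    sentences = [
        ["the", "quick", "brown", "fox"],
        ["jumped", "over", "the", "lazy", "dog"],
    ]

    # Train subword-aware embeddings (gensim 4.x API).
    model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

    # Because fastText uses character n-grams, even out-of-vocabulary
    # words get a vector composed from their subwords.
    print(model.wv["foxes"])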
And of course, if you have a GPU, recurrent neural networks (or other deep learning architectures) are the endgame for the remaining 10% of problems (a good example is spaCy's DL implementation: https://spacy.io/). Or use those libraries to incorporate fasttext for text encoding, which has worked well in my use cases.
If a simple solution gets 77% accuracy and a complex solution yields 80%, I would wager that most of the time you should just stick with the simple one.
One specific example that comes to mind is sentiment analysis, where you can achieve "sufficiently" high accuracy with well-tailored Bayesian approaches. They are super fast to train and reason with. If a customer/consumer wants to know exactly why a piece of text is positive or negative, the n-gram probability matrix is extremely easy to inspect. Subsequent re-training and fixing is also much easier than with a large neural network, SVM, etc.
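To illustrate the inspectability point, here's a rough sketch with scikit-learn's naive Bayes (the toy data is made up; `get_feature_names_out` assumes scikit-learn >= 1.0):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy training data; real use would have many labeled examples.
    texts = ["great product, loved it", "terrible, broke in a day",
             "loved the service", "terrible support, never again"]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

    # Unigrams + bigrams, so the learned probabilities are per n-gram.
    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(texts)
    clf = MultinomialNB().fit(X, labels)

    # The "probability matrix" is right there to inspect: log P(ngram | class).
    for ngram, log_p_neg, log_p_pos in zip(vec.get_feature_names_out(),
                                           clf.feature_log_prob_[0],
                                           clf.feature_log_prob_[1]):
        print(f"{ngram!r}: log P(w|neg)={log_p_neg:.2f}, log P(w|pos)={log_p_pos:.2f}")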
"This library can also be used to train supervised text classifiers" -
Looks like you can use it for more than just embeddings.
Also, AFAIK fasttext is just a faster way to build embeddings that are mostly equivalent to word2vec.
All in all, I'm not sure you're enough of an expert to be commenting on this.
2) FastText is also a classification algorithm, a command line classifier and a set of pretrained embeddings.
3) Yes, he does. I’ve learnt from his notebooks before, and I am confident I’m qualified to judge. I don’t completely agree with his statement here, but he is well qualified to comment.
Yeah, fasttext/spacy/gensim are some of the biggest open-source NLP libraries these days. However, I've found that they aren't very performant: I've done prototyping in them before, but wouldn't use them in a finished product. I'm sure most big tech companies use custom implementations to do their heavy lifting.
For what it's worth, when I was at a large company recently, I did use Gensim too - I was originally referring more to the case where you take a prototype for an NLP application and ask "ok, how do we scale this across millions of users?"
FastText itself (the command line program and the pretrained embeddings together) is a decent starting point for classification though.
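E.g., with the official `fasttext` Python bindings it's a few lines (a sketch only: `train.txt` is a placeholder file in the `__label__`-prefixed format the fastText docs describe):

    import fasttext

    # train.txt is a placeholder: one example per line, e.g.
    # "__label__positive loved this movie"
    model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)

    # Predict the top label (and its probability) for a new sentence.
    labels, probs = model.predict("what a waste of time", k=1)
    print(labels[0], probs[0])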
You can code something incredibly complex that works great without understanding any of the math underneath. Understanding the math arguably makes you a better engineer overall, but isn't required to solve many of these problems.
I think it's pretty cool, but I'm sure a lot of people have a big issue with the "just TRUST the library!" approach.
A fair introduction to a library would be like this: "This library lets you make X, Y and Z from A, B and C. It does so using mathematical methods 1, 2, 3 - therefore, it will be great for this-and-that type of A-B-C, but will not perform well for some-other-type." Such a description will tell you where the limits of applicability are, and what to look for if you want to understand more.
(Also, neural network libs should come with a big, bold caveat: "this is magic, only 10 people on the planet know how the whole stack works; the rest of us just perform a ritual on a cluster of GPUs and pray that thus summoned entity will do sorta ok-ish job with our problem (and if it doesn't, you're on your own)".)
In the broad sense, isn't this an example of how we trust?
I don't think machine learning and NLP have reached that point, largely because the field is based on probabilities and tunable parameters, which are hard to expose without complexity and hard to trust when they aren't exposed.
We build layers on top of layers on top of layers.
Occasionally this causes problems. Often it solves them.
But can we still call it computer science? It seems to me more like linguistics / mathematics / statistics, with CS there only to handle the low-level computation and accounting of data.
Of course, if you build a global-scale efficient search engine with it, the role of CS may become bigger.
But we will make the shitty engineering jobs as easy as being a cashier.
And life goes on...
Example of the wisdom herein:
> Remove words that are not relevant, such as “@” twitter mentions or urls
I could quite easily say "I'm going to @chicago this weekend to see @taylor_swift". Now see if this sentence makes any sense: "I'm going to this weekend to see". Nope. What you need to do is translate them, like any other word. Sure it's not an English word, but you wouldn't ignore the word "hola" just because it's not an English word. Now, if your NLP application doesn't rely on this data, sure, throw it away. But if you're looking at Twitter and throwing away any mentions, you're not really processing natural language, are you?
Sure, it's really hard to translate. Am I talking about Chicago the city or Chicago the band? Maybe Chicago the movie?
Well that's why NLP is one of the most challenging areas of research, and nothing in this article will help solve even 0.009% of those challenges.
(edit) Maybe confused isn't the right word... they may just have other priorities for their use case. However, that doesn't excuse the title.
The data is natural language, and your system is processing it. NLP doesn't mean "using computation to model exactly and completely the entire human language faculty". NLP is distinct from text processing in that often text files may contain forms of unstructured information other than natural language. The data being processed here is natural language.
For the example task in the article (classifying whether a tweet is about disasters), it would be genuinely surprising if `@` mentions were meaningful. Sure, this would be something you would investigate, but the general idea of `removing words that are not relevant` as a pre-processing step is definitely not bad advice.
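Either way, a cheap middle ground between deleting mentions/URLs and keeping them verbatim is to normalize them to placeholder tokens, so you can test both choices. A rough sketch (the `<mention>`/`<url>` token names are arbitrary):

    import re

    def normalize_tweet(text):
        # Replace URLs and @mentions with placeholder tokens instead of
        # deleting them, so "going to @chicago" keeps its structure.
        text = re.sub(r"https?://\S+", "<url>", text)
        text = re.sub(r"@\w+", "<mention>", text)
        return text

    print(normalize_tweet("I'm going to @chicago this weekend to see @taylor_swift"))
    # -> "I'm going to <mention> this weekend to see <mention>"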
Author here, happy to answer questions and share our vision of it. Many problems require more complex approaches, and we definitely have Fellows tackle some of those (https://blog.insightdatascience.com/entity2vec-dad368c5b830).
That being said, when it comes to the volume of practical applications that come up for the many teams that we work with, the vast majority can be solved by the techniques outlined in the post. These techniques are simpler, but often overlooked. We believe that they should often be a starting point, and most of the time they end up being good enough for the job.
But the article is a very nice introduction to text mining (alas, not NLP)!
I think a large share of the text-based problems that companies actually have can be solved with the approaches in this article. I'd actually be surprised if many need to go all the way to the end, which gets up to word vectors and convolutional neural networks.
Where did the 90% even come from?
For NLP tasks, it looks like what it does is selectively delete words from the input and check the classifier output. This way it determines which words have the biggest effect on the output without needing to know anything about how your model works.
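The leave-one-word-out version of that idea fits in a few lines (a naive sketch, not LIME's actual locally-weighted sampling; `predict_proba` here is a stand-in for whatever your classifier exposes):

    def word_importances(text, predict_proba, target_class=1):
        # Score each word by how much the predicted probability for the
        # target class drops when that word is removed from the input.
        words = text.split()
        base = predict_proba(text)[target_class]
        importances = []
        for i in range(len(words)):
            perturbed = " ".join(words[:i] + words[i + 1:])
            importances.append((words[i], base - predict_proba(perturbed)[target_class]))
        # Most influential words first.
        return sorted(importances, key=lambda t: t[1], reverse=True)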
I suppose if you were Google and had all their data, perhaps. Putting different data types together (like text, pictures, locations) adds a lot of difficulty.
Shouldn't this:

    def sanitize_characters(raw, clean):
        for line in input_file:
            out = line

be:

    def sanitize_characters(raw, clean):
        for line in raw:  # iterate over the 'raw' argument, not a global 'input_file'
            out = line

Or am I mistaken?