Curious how it'll perform compared to fasttext when used as an encoding network in larger tasks. I can't help but notice the trend of going back to simpler models with smarter optimizations and regularization to achieve better results.
This is a frequent question of mine, which I ask everyone using RNNs: what do you think of the idea that CNNs will be able to replace RNNs for sequence tasks? CNNs are less computationally expensive too, so there's a definite benefit to switching to them if the performance is on par.
As to whether CNNs can replace RNNs in general, the jury is still out. Over the last couple of years there have been some sequence tasks where CNNs are state of the art, some where RNNs are. Note that with stuff like QRNNs the assumption that CNNs are less computationally expensive is no longer necessarily true: https://github.com/salesforce/pytorch-qrnn
I'd be surprised if CNNs win out in the end for tasks that require long-term state (like sentiment analysis on large docs), since RNNs are specifically designed to be stateful - especially with the addition of an attention layer.
Seeing as fasttext accuracy is 90%+, does this mean your method achieves 180%?
I'm nitpicking of course, but lately I've seen claims like "20% improvement in accuracy", where on closer inspection, the authors mean error rate dropped from 5% to 4%.
Which is not bad of course, but in the grand scheme of things, a 1% absolute improvement may not be such a game-changer, especially if it comes at the cost of other relevant metrics like model complexity, developer sanity or performance.
(haven't read your paper yet, just a general sigh/rant)
Especially for more well defined problems, going from 98.5% to 99.5% is "just" a 1pp absolute improvement, but the fact that you make three times fewer mistakes can well justify a more complex model that requires ten times more hardware. The metric that you'd actually care about would often be something like "number of hours required to correct the mistakes" or "number of lost sales due to mistakes", both of which scale with the relative percentage change.
Your note on "more well defined problems" is spot on. Chasing single percent improvements and SOTA is indeed the name of the game there.
But defining the problem in the first place, figuring out the cost matrix and solution constraints, is typically the bigger challenge in highly innovative projects. Once you know what to chase, 80% of the job is done.
Disclosure: building commercial ML systems for the past 11 years, using deep learning and otherwise. What you call "metric you care about" is often not the metric you care about. This is why people coming from academia are sometimes taken by surprise that logistic regression, linear models, or heck, even rule-based systems (!) are still so popular. Model simplicity, developer sanity and performance do matter, too.
I realize that outside of Silicon Valley and other technology centers, most established companies are far -- far -- from adopting deep learning for any application of importance, due partly to the current unavailability of developers with AI expertise, and partly to deep learning's so-called "unexplainability" (i.e., the inability of many corporate executives and machine learning practitioners to reason about it, and their resulting discomfort with it). But it's only a matter of time before Corporate America starts following the lead of companies like Google and Facebook, which today are aggressively using state-of-the-art AI in lots of important applications.
Why not get ahead of this multi-decade trend?
PS. For those who don't know, Radim is the creator of gensim, a popular, friendly Python library for text classification and topic modeling.[a]
[a] https://radimrehurek.com/gensim | https://github.com/RaRe-Technologies/gensim
fasttext makes errors about 10% of the time, and our approach makes errors about 5% of the time. It's certainly fair to say (although nitpicky) that "accuracy" isn't quite the right term here (I should have said "half the error").
But as for your general sigh/rant... absolute improvement is very rarely the interesting measure. Relative improvement tells you how much your existing systems will change. So if your error goes from 5% to 4% then you have 20% fewer errors to deal with than you used to.
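A quick sketch of that arithmetic in Python, using the error rates mentioned in this thread (the function name is just illustrative):

```python
def relative_error_reduction(old_error, new_error):
    """Fraction of the old errors that the new model eliminates."""
    return (old_error - new_error) / old_error


# Error rate dropping from 5% to 4%: 1 point absolute, 20% relative.
print(round(relative_error_reduction(0.05, 0.04), 4))  # 0.2

# Error rate dropping from 10% to 5%: half the error.
print(round(relative_error_reduction(0.10, 0.05), 4))  # 0.5
```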
An interesting example: the Kaggle Carvana segmentation competition had a lot of competitors complaining that the simple baseline models were so accurate that the competition was pointless (it was very easy to get 99% accuracy). The competition administrator explained however that the purpose of the segmentation model was to do automatic image pasting into new backgrounds, where every mis-classified pixel would lead to image problems (and in a million+ pixels, that's a low error rate!)
Oh wow, didn't realize that these were multi-layer pre-trained models.
Also, started going through the QRNNs, they mention they've updated the AWD-LSTM Language model to use QRNNs, which is what your paper uses!
The only point of reference between the two papers I see is the CoVe models, which you guys beat pretty handily, but the ELMo model also beats the CoVe model handily, just on different datasets, so not clear how they stack up.
Any chance you could do some more direct comparisons? You do say that it's a more complex architecture, and the tokenized char convolution stuff is a bit of a pain to do, but if that actually helps, it's not that bad to do once.
From an engineering perspective, not changing the LM weights is kind of nice because then you can train multiple separate models on top of the embeddings without needing to retrain everything (and deal with the associated "noise" when retraining models) and it gives some nice modularity. It would be nice to know how much of the performance is lost from having embeddings that can be shared across a lot of tasks.
Random note: it seems like in Table 7, you have bolded "Freez + discr + stlr" in the IMDb column which has a value of 5.00, whereas "Full + discr" has a value of 4.57, and so should probably be the bolded number.
Using char tokens can definitely be helpful, as can sub-words. It's something we've been working on too, and hope to show results of this in the future.
I mainly disagree with your view of end-to-end training. In computer vision we pretty much gave up on trying to re-use hyper-columns without fine-tuning, because the fine-tuning just helps so much. It's really no trouble doing the fine-tuning - in fact the consistency of using a single model form across so many different datasets is really convenient and helpful for doing additional levels of transfer learning.
Thanks for the note about table 7 - it's actually an error (it should be 5.57, not 4.57; Sebastian is in the process of uploading a corrected version).
Perhaps even more interesting than comparison would be modifications to ULMFit to incorporate good ideas from the AllenNLP ELMo paper.
The learned weighting of representation layers seems like a decent candidate, as does giving the model flexibility to use something other than a concatenated [mean / max / last state] representation of final LSTM output layer (as is the case in some of ELMo's task models). I'm personally curious about using an attention mechanism in conjunction with something like ELMo's gamma task parameter (regularizer) for learning a weighted combination of outputs but haven't been able to get things to function well in practice.
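To make those two ideas concrete, here is a rough pure-Python sketch. It assumes the learned layer weighting is a softmax over per-layer scalars, scaled by a single gamma, and that the pooled representation is the concatenation of mean, max and last state over time steps. In a real model the layer scalars and gamma would be learned parameters; here they are fixed numbers, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, layer_logits, gamma=1.0):
    """Weighted sum of per-layer vectors for one token.

    Computes gamma * sum_j s_j * h_j with s = softmax(layer_logits).
    """
    weights = softmax(layer_logits)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
        for d in range(dim)
    ]


def concat_pool(states):
    """Concatenate [mean, max, last state] over a list of per-timestep vectors."""
    dim = len(states[0])
    mean = [sum(s[d] for s in states) / len(states) for d in range(dim)]
    mx = [max(s[d] for s in states) for d in range(dim)]
    return mean + mx + list(states[-1])


# Three layers, equal logits: the mix is just the average of the layers.
mixed = scalar_mix([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 0.0, 0.0])

# Two time steps, two hidden units each.
pooled = concat_pool([[1.0, 4.0], [3.0, 2.0]])
print(pooled)  # [2.0, 3.0, 3.0, 4.0, 3.0, 2.0]
```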
The dataset the ELMo model is trained on might also be preferable to WIKI 103 for practical English tasks, although you lose the nice multilingual benefits you get from working with WIKI 103.
In general it seems like the format described in the ELMo paper is simply not designed to work at very low N because the weights of the (often complex) task models used in ELMo's benchmarks are learned entirely from scratch. That's not possible without a decent amount of labeled training data.
Anyhow, thought the paper was very well put together, definitely an enjoyable read. Hope yourself and Sebastian collaborate on future papers, as good things certainly came of this one!
The obvious answer is that I should just train a single joint model.
That's great, but when you retrain a model, even if you get similar accuracy, your actual predictions change. It's basically why same model ensembles help.
So if I am trying to improve predictions for a single task, but I have a joint model, then I have to deal with a whole pile of churn that I wouldn't if I had separate models.
This doesn't show up in academic metrics, but people care when things that used to work stop working for no real reason, even if an equal amount of new things started working.
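A minimal sketch of what I mean by churn, with made-up toy predictions (the names and data are purely illustrative): two models with identical accuracy can still disagree on a large share of examples.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def prediction_churn(old_preds, new_preds):
    """Fraction of examples whose prediction changed between two models."""
    if len(old_preds) != len(new_preds):
        raise ValueError("prediction lists must have the same length")
    changed = sum(1 for o, n in zip(old_preds, new_preds) if o != n)
    return changed / len(old_preds)


# Toy data: both models get 8 of 10 right, but on different examples.
labels    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
old_preds = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # wrong on the last two
new_preds = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]  # wrong on the first two

print(accuracy(old_preds, labels))             # 0.8
print(accuracy(new_preds, labels))             # 0.8
print(prediction_churn(old_preds, new_preds))  # 0.4
```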
So, I'm not saying we shouldn't fine tune things; it's that I have a set of engineering challenges that make fine tuning less ideal, and I am curious how much we can get away with sharing. There are plenty of CV papers which indicate that the very first layers basically don't benefit from fine tuning because they are so general. Is that true for NLP as well, or are word embeddings already quite domain specific?
I'm starting a new project where I'm given many recipes and I need to take in a free form text of recipe ingredients (e.g. "1/2 cup diced onions", "two potatoes, cut into 1-inch cubes", etc.) and build a program that identifies the ingredient (e.g. onion, potato), as well as the quantity (e.g. 0.5 cup, 2.0 units). Could I use something like Fast.ai to tackle this problem?
How does your API handle ingredients with multiple options (e.g. "1 1/2 cups seedless red or green grapes")?
Don't hesitate to try the API out by pasting some examples to the white box on the site and pressing the "Try it out!" button, it's interactive :)
Sweet, I didn't realize it was interactive. I'll give it a try!
I haven't read things in depth, but am curious - how do these models cope with out of vocabulary terms?
See here: https://github.com/fastai/fastai/blob/master/courses/dl2/imd....
1. In the introduction, there seem to be contradictory statements about the status of Inductive Transfer for NLP. It is first stated that it "has had a large impact in practice", then in the next paragraph it is stated that "it has been unsuccessful for NLP". How can it have a large impact and be unsuccessful at the same time?
2. In the introduction, it is stated that "Research in NLP focused mostly on transductive transfer". Perhaps this statement was valid back in 2007, but it seems outdated to me. Recently, most transfer learning in NLP is related to using pre-trained embeddings in an inductive transfer setting.
3. In the beginning of the "2 Related Work" section, in the excerpt "Features in deep neural networks in CV have been observed to transition from task-specific to general from the first to the last layer", I believe the order "first to the last" should read "last to the first", since the last layers have the more task-specific features and the first layers have the more general.
I noticed the "NLP classification page" links are broken.
I assume they should go to: http://nlp.fast.ai/category/classification.html
It is totally correct and in no way misleading to say we need only 100 labeled examples. Anyone can get similar results on their own datasets without even needing to train their own wikitext model, since we've made the pre-trained model available.
(BTW, I see you work at a company that sells something that claims to "categorize SKUs to a standard taxonomy using neural networks." This seems like something you maybe could have mentioned.)
Also, I don't understand the need to be so defensive, or what my employer has to do with my post.
His response on your employer was likely driven by an assumption that you viewed this as free, open source competition to your product, and thus the negative comment.
To the OP:
I've done a lot of NLP, and this is phenomenal work.