
How Quid uses deep learning with small data - Nimsical
https://quid.com/feed/how-quid-uses-deep-learning-with-small-data
======
rspeer
The baseline I'd like to see this compared to is the not-very-deep-learning
"bag of tricks" that's conveniently implemented in fastText [1].

[1] [https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText)

~~~
bowlesbe
Great point! I considered using fastText as a baseline; however, in practice
fastText really didn't work well at all with the small data set - much worse
than the tfidf baseline. I think fastText's classification approach might not
work well with such a small dataset. I'm not sure, but I suspect it's because
it tries to learn embeddings - and there just isn't anywhere near enough data
for that. I'd love an outside perspective on this.

~~~
rspeer
Fair enough. That's a useful comparison to know about.

But I'm wondering how you get around that with the neural net. In the post,
you said there are only a few hundred labeled examples, right? How can a
neural net with hundreds of parameters set those parameters to anything
reasonable, and not overfit, when there are about as many parameters as
examples?

~~~
bowlesbe
Great question, and I share your intuition, but I think it all comes down to
properly regularizing your model. For neural networks, dropout works really
darn well as a regularization strategy. I could have tried to see whether
performance dropped significantly without dropout.
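
Roughly what that looks like in a small Keras text classifier - a sketch with placeholder hyperparameters, not our exact architecture:

```python
from tensorflow.keras import layers, models

vocab_size, embedding_dim = 5000, 100  # placeholder hyperparameters

model = models.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.Conv1D(64, 3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                   # zero out half the units each step
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),                   # again before the output layer
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```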

------
prajit
If you're interested in Sequence to Sequence tasks (e.g. neural machine
translation or abstractive summarization) with small data, check out our
recent paper from Google Brain tackling this problem (disclaimer, I'm the
first author):
[https://arxiv.org/abs/1611.02683](https://arxiv.org/abs/1611.02683)

~~~
bowlesbe
Thanks! I'll check it out. I have also been reading about abstractive
summarization - hard problem it seems!

------
jackschultz
Curious how you guys got training data for this. Did someone have to go
through and rate whether or not a sentence was quality? And how many
training examples did you use? You say it was "difficult to develop a large
set" but I'm curious how large that set actually was.

Edit: Also, do you think more data or a "better" or "more sophisticated" model
would make the results better? I would guess more data would trump better
model, but not sure.

~~~
bowlesbe
Thanks for the comment! There were a few hundred sentences of each, collected
internally from a wide number of descriptions.

Yes, I'd definitely agree - more data is what we need here for further model
improvements.

~~~
jackschultz
I actually had this issue recently when trying to get training data for a
project of mine as well [0], so I built an app [1] as a way to more easily
classify documents.

Basically I have simpler interfaces and the ability for multiple people to
quickly answer questions like this on a set of data. Easily exportable in the
end as well. If you're interested in using that to get some more data on
sentences, let me know. I'm really curious how much better the results get
with more data, and this could help.

[0] [https://bigishdata.com/2016/11/01/classifying-country-music-...](https://bigishdata.com/2016/11/01/classifying-country-music-songs-is-an-art-getting-training-data/)

[1] [https://fierce-mountain-21498.herokuapp.com/](https://fierce-mountain-21498.herokuapp.com/)

~~~
bowlesbe
This is a really great idea. Actually, if there is something you can share along
these lines, that would be amazing. I know CrowdFlower has a great "internal
only" tool, which is kind of similar to what you are designing, but you have
to pay for it. Actually I think there is a huge need for a generic tool along
the lines of what you have started to build.

~~~
jackschultz
Haven't heard of CrowdFlower but yeah this is along those lines. Pretty
similar. But I could definitely make something quick to fit this specifically.
I've been looking for other uses along with what I'm doing and this fits
exactly. Shoot me an email at the address listed on my profile and I can get
going.

------
pilooch
The comparison between tfidf on words and a char CNN is flawed. You should use
char ngrams along with LR, and this will beat all your classifiers with high
probability. This is because your char CNN does not have enough data to learn
all the useful char ngrams. Doing it as preprocessing and passing it to LR is
in practice always better on small datasets. You can go one step further and
add layers and test an MLP on your char ngrams.
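
Something along these lines (a rough sketch with toy placeholder sentences, not tuned):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder data: 0 = generic/boilerplate, 1 = informative.
sentences = [
    "a leading provider of innovative enterprise solutions",
    "we sell refurbished laptops to schools in kenya",
    "a world-class platform for synergistic growth",
    "our app matches dog owners with vetted local walkers",
]
labels = [0, 1, 0, 1]

char_lr = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # char ngrams, 2-5 chars
    LogisticRegression(max_iter=1000),
)
char_lr.fit(sentences, labels)
print(char_lr.predict(["a premier provider of cutting-edge solutions"]))
```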

~~~
bowlesbe
Thanks, would you mind expanding? I also played around with some char CNNs.
They had similar performance.

------
YeGoblynQueenne
It's worth keeping in mind that learning from few examples is not such a big
deal. What is really hard to do (and a long-standing problem in machine
learning) is learning a model that _generalises well to unseen data_.

So the question is: does the OP really show good generalisation?

It's hard to see how one would even begin to test this, in the case of the OP.
The OP describes an experiment where a few hundred instances were drawn from a
set of 50K, and used both for training and testing (by holding out a few,
rather than cross-validating, if I got that right).

I guess one way to go about it is to use the trained model to label your
unseen data (the rest of the 50k) and then go through that model-labelled data
by hand, and try to figure out how well the model did.

We're talking here about natural language, however, where the domain is so
vast that even the full 50k instances are very few to learn well. That doesn't
have anything to do with the model being trained, deep or shallow. It has
everything to do with the fact that you can say the same thing in 100k
different ways, and still not exhaust all the ways to say that one thing. So
50k examples are either not enough examples of different ways to say the same
thing, or not enough examples of the different things you can say, or, most
probably, both.

It's also worth remembering that deep nets can overfit much worse than other
methods, exactly because they are so good at memorising training data. It's
very hard to figure out what a deep net is really learning, but it would not
be at all surprising to find out that your "powerful" model is just a very
expensive alternative to Ctrl + C.

It's just memorised your examples, see?

------
tadkar
There's something strange about the ROC curve here. It seems that the feature
engineered and logistic regression methods can pick out some examples very
easily (20% true positive rate at a very low false positive rate) but the CNN
seems to not be able to make many predictions at a low false positive rate. It
then catches up later. It's almost like it can't pick out the easy examples,
but does just as good a job on the harder ones.

~~~
bowlesbe
This is a great point; it would be worth further investigation. And I agree with
your general interpretation. It would be interesting to look further at where
the CNN is failing to detect bad ones and where the feature-engineered one picks
them up.

------
kmike84
The link to LIME looks a bit out of place - LIME is an algorithm for explaining
classifier decisions, which is most useful in cases when you can't inspect
weights and map them back to features. For TF*IDF + Logistic Regression there
is no need to use LIME; one can just use the weights and feature names directly.
LIME is more helpful for other models (there are a lot of caveats though),
not for the basic tfidf + linear classifier model.
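
For instance, with an already-fitted TfidfVectorizer and LogisticRegression (placeholder names `vec` and `clf` below, not the post's code), the learned weights map straight back to feature names:

```python
import numpy as np

# vec / clf are assumed to be a fitted TfidfVectorizer and LogisticRegression.
feature_names = np.array(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("lowest-weight features (push toward class 0): ", feature_names[order[:10]])
print("highest-weight features (push toward class 1):", feature_names[order[-10:]])
```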

~~~
bowlesbe
This is actually a great point. Thanks for sharing. I should maybe consider
removing LIME in that context or changing the wording.

------
SubiculumCode
I can't seem to find where the sample size is mentioned. It mentions that Quid
has 50,000 company descriptions, but is n=50,000 tiny in the ML/deep learning
world?

I do neuroscience research, and where I'm coming from I have maybe n=150-200
per class. And that is not generally regarded as a tiny sample.

~~~
ryanschneider
The issue is those 50,000 descriptions aren't labeled good/bad. Someone had to
pick a subset of them and label them, so my guess is they did this for maybe
100 or 200 descriptions.

~~~
bowlesbe
This is correct. We had 300-400 examples of each.

------
gwenzek
> A downfall of CNNs for text is that unlike for images, the input sequences
> are varying sizes (i.e., varying size sentences), which means most text
> inputs must be “padded” with some number of 0’s, so that all inputs are the
> same size.

Actually Kim's model you're using doesn't require padding because it uses
k-Max over time pooling.

Also, kudos for NOT updating your word embeddings during training! A lot of
people do it, but IMHO it's a mistake most of the time.
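
For anyone curious, a minimal tf.keras sketch of that frozen-embedding setup (not the post's code; `embedding_matrix` stands in for the pretrained word vectors):

```python
from tensorflow.keras import layers, initializers

# `embedding_matrix`: a (vocab_size, dim) array of pretrained vectors, assumed
# to be loaded already (e.g. from word2vec or GloVe), one row per vocab word.
frozen_embedding = layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=initializers.Constant(embedding_matrix),
    trainable=False,  # the tiny labeled set only trains the layers above this
)
```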

~~~
bowlesbe
Are you sure about the padding? On page 1746, bottom right it says "padded as
necessary". And intuitively it makes sense that all your inputs need to be the
same size for a CNN.

------
dmichulke
About that detection of _generic text that conveys little information_:

Can I have that for my email? (Seriously.) And as a browser plugin? Oh, and on
the telephone, TV, radio, and in real life would also be nice.

It's probably also a nice predictor of startup success, developer quality and
sales guy effectiveness.

I just wonder if I would ever read or hear a politician again.

Very inspiring...

~~~
bowlesbe
Haha, absolutely. It takes a lot of intelligence to detect non-informativeness.
You might enjoy:
[http://journal.sjdm.org/15/15923a/jdm15923a.pdf](http://journal.sjdm.org/15/15923a/jdm15923a.pdf)

------
zump
What's the difference between softmax with categorical loss ([0, 1]) and
sigmoid binary loss? ([0/1])

------
sriku
Since word embeddings were the starting point, I'm wondering what the impact
would be if they'd stretched the vector sequences to the same length using
linear (or whatever) interpolation, as opposed to zero-padding the sentences.

~~~
bowlesbe
Could you elaborate? I'm not sure I follow.

------
xiamx
Why don't you have a dev dataset? Grid search over your test dataset is bound
to overfit.
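
A sketch of the usual fix, with `sentences`/`labels` standing in for the labeled data:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% as dev (for grid search) and 20% as test (touched only once).
X_train, X_rest, y_train, y_rest = train_test_split(
    sentences, labels, test_size=0.4, random_state=0, stratify=labels)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)
# Tune hyperparameters against (X_dev, y_dev); report once on (X_test, y_test).
```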

------
cocktailpeanuts
Wait, isn't "deep learning with small data" just machine learning, after all
the buzzwords cancel themselves out?

I thought the whole point of "deep learning" is its approach to using data.

~~~
IanCal
The point of deep learning is using a deep graph, like a neural network with a
lot of layers, not the amount of data.

However, picking millions of parameters with small amounts of data is unlikely
to work well.

~~~
bunderbunder
It seems, at first blush, like using a very complex model to fit a very small
amount of data is a recipe for some serious overfitting.

------
lukaslalinsky
Is all machine learning called "deep learning" now? Where is the line between
"normal" machine learning algorithms and deep learning?

~~~
bowlesbe
I think deep learning can be seen as a class of more flexible machine learning
techniques that use neural networks (usually with quite a few layers).

------
GFK_of_xmaspast
I used to know some Quid people back in the day, good to see them here (and
that they're international now, congrats to them).

------
itschekkers
Really nice article -- easy to follow, sensible steps, clean code. I haven't
done too much text ML and this was a nice piece to follow - thanks!

~~~
bowlesbe
I appreciate this!

------
deepnotderp
What about FastText?

------
thinkr42
This post is a joke. Seriously, it amazes me that the entire industry seems
fixated on a handful of techniques, just like they were with random forests 10
years ago, just like they were with SVMs ten years before that, and just like
they were with basic neural networks before that. There's a simpler way, nature almost
requires it.

~~~
GFK_of_xmaspast
> There's a simpler way, nature almost requires it

This is a normative statement; do you have empirical evidence?

