
The Unreasonable Effectiveness of Deep Feature Extraction - hiphipjorge
http://www.basilica.ai/blog/the-unreasonable-effectiveness-of-deep-feature-extraction/
======
asavinov
Deep feature extraction is important not only for image analysis but also in
other areas, where specialized tools can be useful, such as those listed below:

o [https://github.com/Featuretools/featuretools](https://github.com/Featuretools/featuretools) \- Automated feature engineering with a main focus on relational structures and deep feature synthesis

o [https://github.com/blue-yonder/tsfresh](https://github.com/blue-yonder/tsfresh) \- Automatic extraction of relevant features from time series (see the sketch after this list)

o [https://github.com/machinalis/featureforge](https://github.com/machinalis/featureforge) \- Creating and testing machine learning features, with a scikit-learn compatible API

o [https://github.com/asavinov/lambdo](https://github.com/asavinov/lambdo) \- Feature engineering and machine learning: together at last! The workflow engine allows for integrating feature training and data wrangling tasks with conventional ML

o [https://github.com/xiaoganghan/awesome-feature-engineering](https://github.com/xiaoganghan/awesome-feature-engineering) \- Other resources related to feature engineering (video, audio, text)
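
In case it helps, here is a minimal sketch of one of these (tsfresh) extracting features from a long-format time-series DataFrame; the toy data is made up:

    import pandas as pd
    from tsfresh import extract_features

    # Long format: each row is one observation of one series.
    df = pd.DataFrame({
        "id":    [1, 1, 1, 2, 2, 2],        # which series each row belongs to
        "time":  [0, 1, 2, 0, 1, 2],        # ordering within each series
        "value": [1.0, 2.0, 3.0, 2.0, 2.0, 2.0],
    })

    # One row of extracted features per series id (hundreds of statistics).
    features = extract_features(df, column_id="id", column_sort="time")
    print(features.shape)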

~~~
mlucy
Definitely. There's been a lot of exciting work recently for text in
particular, like
[https://arxiv.org/pdf/1810.04805.pdf](https://arxiv.org/pdf/1810.04805.pdf).

~~~
nl
Or from today, OpenAI's response to BERT: [https://blog.openai.com/better-language-models/](https://blog.openai.com/better-language-models/)

Breaks 70% accuracy on the Winograd schema for the first time! (a lazy 7%
improvement in performance....)

------
kieckerjan
As the author acknowledges, we might be living in a window of opportunity
where big data firms are giving something away for free that may yet turn out
to be a big part of their future IP. Grab it while you can.

On a tangent, I really like the tone of voice in this article. Wide eyed,
optimistic and forward looking while at the same time knowledgeable and
practical. (Thanks!)

~~~
gmac
_big data firms are giving something away for free_

On that note, does anyone know if state-of-the-art models trained on billions
of images (such as Facebook's model trained via Instagram tags/images,
mentioned in the post) are publicly available and, if so, where?

Everything I turn up with a brief Google seems to have been trained on
ImageNet, which the post leads me to believe is now small and sub-par ...

~~~
hamilyon2
Have you found anything?

~~~
gmac
Afraid not — I was hoping for some replies here!

------
bobosha
This is very interesting and timely for my work. I had been struggling to
train a MobileNet CNN for classification of human emotions ("in the wild")
and couldn't get the model to converge. I tried moving from multiclass to
binary models (e.g. angry|not_angry) but couldn't get past the 60-70%
accuracy range.

I switched to extracting features from an ImageNet-pretrained net and trained
an xgboost binary classifier, and boom... right out of the box I'm seeing
~88% accuracy.

Also, the author's points about speed of training and flexibility are a major
plus for my work. Hope this helps others.
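
In case it's useful to others, a minimal sketch of that recipe with torchvision and xgboost (the path lists and labels are placeholders for your own data):

    import numpy as np
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image
    import xgboost as xgb

    # Pretrained MobileNet with the classification head removed; it now
    # outputs a 1280-d feature vector per image.
    backbone = models.mobilenet_v2(pretrained=True)
    backbone.classifier = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])

    def extract(paths):
        # One feature vector per image path.
        with torch.no_grad():
            return np.stack([
                backbone(preprocess(Image.open(p).convert("RGB"))
                         .unsqueeze(0)).squeeze(0).numpy()
                for p in paths
            ])

    # train_paths / test_paths are hypothetical lists of image files;
    # train_labels / test_labels are 0/1 (angry / not_angry).
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=4)
    clf.fit(extract(train_paths), train_labels)
    print("accuracy:", clf.score(extract(test_paths), test_labels))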

~~~
mlucy
Yeah, I think this pattern is pretty common. (Basilica's main business is an
API that does deep feature extraction as a service, so we end up talking to a
lot of people with tasks like yours -- and there are a _lot_ of them.)

We're actually working on an image model specialized for human faces right
now, since it's such a common problem and people usually don't have huge
datasets.

------
fouc
>But in the future, I think ML will look more like a tower of transfer
learning. You'll have a sequence of models, each of which specializes the
previous model, which was trained on a more general task with more data
available.

He's almost describing a future where we might buy/license pre-trained models
from Google/Facebook/etc that are trained on huge datasets, and then extend
that with more specific training from other sources of data in order to end up
with a model suited to the problem being solved.

It also sounds like we can feed the model's learnings back into new models
with new architectures as we discover better approaches later.

~~~
XuMiao
What do you think of a life-long learning scenario where models are trained
incrementally forever? For example, I train a model with 1000 examples and it
sucks. The next person picks it up and trains a new one by putting a
regularizer over mine. It might still suck. But after maybe 1000 people, the
model begins to get significantly better. Now I pick up where I left off and
improve it by leveraging the current best. This continues forever. Imagine
that this community is supported by a blockchain. Eventually we won't be
relying on big companies any more.
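
A minimal sketch of the "regularizer over mine" step in PyTorch (roughly an L2-SP-style penalty; `prev_model`, `model`, and `loader` are hypothetical):

    import torch
    import torch.nn.functional as F

    # Start from the previous contributor's weights.
    prev_state = {n: p.clone().detach()
                  for n, p in prev_model.state_dict().items()}
    model.load_state_dict(prev_state)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    lam = 0.01  # how strongly to pull toward the previous model

    for x, y in loader:
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        # Penalize drifting from the previous model's weights.
        reg = sum(((p - prev_state[n]) ** 2).sum()
                  for n, p in model.named_parameters())
        (loss + lam * reg).backward()
        opt.step()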

~~~
jacquesm
What is it about the word 'blockchain' that makes people toss it into
otherwise completely unrelated text?

~~~
oehpr
Nothing; they're describing a series of content-addressable blocks that link
back to their ancestors, which is a good application of a blockchain. Think
IPFS.

It's not cryptocurrency, though cryptocurrency definitely popularized the
technique.

~~~
fwip
IPFS isn't a blockchain just like git isn't a blockchain. "Blockchain" has
semantic meaning that "a chain of blocks" does not.

------
stared
A few caveats here:

\- It works (that well) only for vision (for language it sort-of-works only at the word level: [http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html))

\- "Do Better ImageNet Models Transfer Better?" [https://arxiv.org/abs/1805.08974](https://arxiv.org/abs/1805.08974)

And if you want to play with transfer learning, here is a tutorial with a working notebook: [https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning/](https://deepsense.ai/keras-vs-pytorch-avp-transfer-learning/)
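
For the word-level case, here's a minimal sketch of that kind of transfer with pretrained GloVe vectors via gensim (the particular vector set is an arbitrary choice):

    import gensim.downloader as api

    # Pretrained word vectors; downloaded on first use.
    wv = api.load("glove-wiki-gigaword-100")

    # The classic analogy from the linked post.
    print(wv.most_similar(positive=["king", "woman"],
                          negative=["man"], topn=1))
    # -> [('queen', ...)] with these vectors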

~~~
mlucy
There's actually been a lot of really good work recently around textual
transfer learning. Google's BERT paper does sentence-level pretraining and
transfer to get state of the art results on a bunch of problems:
[https://arxiv.org/pdf/1810.04805.pdf](https://arxiv.org/pdf/1810.04805.pdf)

~~~
stared
Thanks for this reference, I will look it up. Though, from my experience,
people in NLP still (by default) train from scratch, with some exceptions for
tasks on the same dataset:

\- [https://blog.openai.com/unsupervised-sentiment-neuron/](https://blog.openai.com/unsupervised-sentiment-neuron/)

\- [http://ruder.io/nlp-imagenet/](http://ruder.io/nlp-imagenet/)

~~~
samcodes
This is true, but rapidly changing. In addition to fine-tuneable language
models, you can do deep feature extraction with something like
bert-as-service [0]... You can even fine-tune BERT on your data, then use the
fine-tuned model as a feature extractor.

[0] [https://github.com/hanxiao/bert-as-service](https://github.com/hanxiao/bert-as-service)
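
A minimal sketch of the extraction side, assuming a bert-as-service server is already running with a downloaded BERT checkpoint (per the repo's README):

    from bert_serving.client import BertClient

    bc = BertClient()  # connects to a locally running bert-serving-start
    vecs = bc.encode(["the movie was great", "the movie was terrible"])
    print(vecs.shape)  # (2, 768) for BERT-Base: one fixed vector per sentence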

------
mlucy
Hi everyone! Author here. Let me know if you have any questions, this is one
of my favorite subjects in the world to talk about.

~~~
fouc
What do you think are the most interesting types of problems to solve with
this?

~~~
mlucy
If you have a small to medium-sized dataset of images or text, deep feature
extraction would be the first thing I'd try.

I'm not sure what the most interesting problems with that property are. Maybe
making specialized classifiers for people based on personal labeling? I've
always wanted e.g. a twitter filter that excludes specifically the tweets that
I don't want to read from my stream.

~~~
fouc
One problem that intrigues me is Chinese-to-English machine translation.
Specifically, for a subset of Chinese martial arts novels (especially given
there are plenty of human-translated versions to work with).

So Google/Bing/etc have their own pre-trained models for translations.

How would I access that in order to develop my own refinement w/ the domain
specific dataset I put together?

~~~
mlucy
I don't think you could get access to the actual models that are being used to
run e.g. Google Translate, but if you just want a big pretrained model as a
starting point, their research departments release things pretty frequently.

For example, [https://github.com/google-research/bert](https://github.com/google-research/bert) (the multilingual
model) might be a pretty good starting point for a translator. It will
probably still be a lot of work to get it hooked up to a decoder and trained,
though.
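
As one way to poke at that multilingual model, here's a minimal sketch using the Hugging Face port of those weights (that port is an assumption on my part; the sentences are toy examples):

    from transformers import AutoModel, AutoTokenizer

    # Hugging Face port of the google-research/bert multilingual checkpoint.
    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

    batch = tok(["这是一个例子", "This is an example"],
                padding=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state  # (2, seq_len, 768) encoder states
    # A translator would still need a decoder trained on top of these.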

There's probably a better pretrained model out there specifically for
translation, but I'm not sure where you'd find it.

------
jfries
Very interesting article! It answered some questions I've had for a long time.

I'm curious about how this works in practice. Is it always good enough to take
the outputs of the next-to-last layer as features? When doing quick
iterations, I assume the images in the data set have been run through the big
net as a preparation step? And the inputs to the net you're training are the
features? Does the new net always only need one layer?

What are some examples of where this worked well (except for the flowers
mentioned in the article)?

~~~
mlucy
> Is it always good enough to take the outputs of the next-to-last layer as
> features?

It usually doesn't matter all that much whether you take the next-to-last or
the third-from-last layer; it all performs pretty similarly. If you're doing
transfer to a task that's very dissimilar from the pretraining task, I think
it can sometimes be helpful to take the first dense layer after the
convolutional layers instead, but I can't seem to find the paper where I
remember reading that, so take it with a grain of salt.

> When doing quick iterations, I assume the images in the data set have been
> run through the big net as a preparation step?

Yep. (And, crucially, you don't have to run them through again every
iteration.)

> And the inputs to the net you're training are the features? Does the new net
> always only need one layer?

Yeah, you take the activations of a late layer of the pretrained net and use
them as the input features to the new model you're training. The new model
can be as complicated as you like, but usually a simple linear model performs great.
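
A minimal sketch of that workflow, with a Keras backbone and a scikit-learn linear model (the `images` and `labels` arrays are hypothetical and assumed already preprocessed):

    import numpy as np
    from tensorflow.keras.applications import ResNet50
    from sklearn.linear_model import LogisticRegression

    # Pretrained net, classification head removed; global average pooling
    # gives one 2048-d feature vector per image.
    extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

    # One-time preparation step: `images` is an (N, 224, 224, 3) array,
    # already preprocessed for ResNet50.
    np.save("features.npy", extractor.predict(images))

    # Every later iteration only touches the cached features.
    X = np.load("features.npy")
    clf = LogisticRegression(max_iter=1000).fit(X, labels)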

> What are some examples of where this worked well (except for the flowers
> mentioned in the article)?

The first paper in the post
([https://arxiv.org/abs/1403.6382](https://arxiv.org/abs/1403.6382)) covers
about a dozen different tasks.

------
mikekchar
It's hard to ask my question without sounding a bit naive :-) Back in the
early nineties I did some work with convolutional neural nets, except that at
that time we didn't call them "convolutional". They were just the neural nets
that were not provably uninteresting :-) My biggest problem was that I didn't
have enough hardware, and so I put that kind of stuff on a shelf waiting for
hardware to improve (which it did, but I never got back to that shelf).

What I find a bit strange is the excitement that's going on. I find a lot of
these results pretty expected. Or at least this is what _I_ and anybody I
talked to at the time seemed to think would happen. Of course, the thing about
science is that sometimes you have to do the boring work of seeing if it does,
indeed, work like that. So while I've been glancing sidelong at the ML work
going on, it's been mostly a checklist of "Oh cool. So it _does_ work. I'm
glad".

The excitement has really been catching me off guard, though. It's as if
nobody else expected it to work like this. This in turn makes me wonder if I'm
being stupidly naive. Normally I find when somebody thinks, "Oh it was
obvious" it's because they had an oversimplified view of it and it just
happened to superficially match with reality. I suspect that's the case with
me :-)

For those doing research in the area (and I know there are some people here),
what have been the biggest discoveries/hurdles that we've overcome in the last
20 or 30 years? In retrospect, what were the biggest worries you had in terms
of wondering if it would work the way you thought it might? Going forward,
what are the most obvious hurdles that, if they don't work out, might slow
down or halt our progression?

~~~
aabajian
If you haven't, you should take a few moments to read the original AlexNet
paper (only 11 pages):

[https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)

What you're saying is true, it _should_ have worked in theory, but it just
_wasn't_ working for decades. The AlexNet team made several critical
optimizations to get it to work: (a) a big network, (b) training on GPUs, and
(c) using ReLU instead of tanh(x).

In the end, it was the hardware that made it possible, but up until their
paper it really wasn't for sure. A good analogy is the invention of the
airplane. You can speculate all you want about the curvature of a bird's wing
and lift, but until you actually build a wing that flies, it's all speculation.

------
al2o3cr
Contrast this with a similar writeup on some interesting observations about
solving ImageNet with a network that only sees small patches (the largest is
33px on a side):

[https://medium.com/bethgelab/neural-networks-seem-to-follow-a-puzzlingly-simple-strategy-to-classify-images-f4229317261f](https://medium.com/bethgelab/neural-networks-seem-to-follow-a-puzzlingly-simple-strategy-to-classify-images-f4229317261f)

------
purplezooey
The question to me is: can you do this with, e.g., a Random Forest too, or is
it specific to NNs?

------
gdubs
This is probably naive, but I’m imagining something like the US Library of
Congress providing these models in the future. E.g., some federally funded
program to procure / create enormous data sets / train.

~~~
rsfern
I don’t think it’s that naive. NIST is actively getting into this space:
[https://www.nist.gov/topics/artificial-intelligence](https://www.nist.gov/topics/artificial-intelligence)

------
CMCDragonkai
I'm wondering how this compares to transfer learning applied to the same
model. That is, compare deep feature extraction plus a linear model at the
end vs. just transferring the weights to the same model and retraining on
your specific dataset.
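
A minimal sketch of the two setups being compared, in PyTorch (`num_classes` is hypothetical; both start from the same pretrained weights):

    import torch
    import torchvision.models as models

    # (1) Deep feature extraction: freeze the backbone, train a new head.
    m1 = models.resnet18(pretrained=True)
    for p in m1.parameters():
        p.requires_grad = False
    m1.fc = torch.nn.Linear(m1.fc.in_features, num_classes)  # only this trains

    # (2) Full transfer learning: same initialization, every weight updates.
    m2 = models.resnet18(pretrained=True)
    m2.fc = torch.nn.Linear(m2.fc.in_features, num_classes)
    opt2 = torch.optim.SGD(m2.parameters(), lr=1e-3)  # retrain everything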

------
zackmorris
From the article:

 _Where are things headed?

There's a growing consensus that deep learning is going to be a centralizing
technology rather than a decentralizing one. We seem to be headed toward a
world where the only people with enough data and compute to train truly state-
of-the-art networks are a handful of large tech companies._

This is terrifying, but the same conclusion that I've come to.

I'm starting to feel more and more dread that this isn't how the future was
supposed to be. I used to be so passionate about technology, especially about
AI as the last solution in computer science.

But these days, the most likely scenario I see for myself is moving out into
the desert like Obi-Wan Kenobi. I'm just, so weary. So unbelievably weary,
day by day, in ever increasing ways.

~~~
coffeemug
Hey, I hope you don't take it the wrong way -- I'm coming from a place where I
hope you start feeling better -- but what you're experiencing might be
depression/mood affiliation. I.e. you feel weary and bleak, so the world seems
weary and bleak.

There are enormous problems for humanity to solve, but that has _always_ been
the case. From plagues and famines, to world wars, to now climate change, AI
risk, and maybe technology centralization. We've solved massive problems
before at unbelievable odds, and I want to think we'll do it again. And if
not, what of it? What else is there to do but work tirelessly at attempting to
solve them?

I hope you feel better, and find help if you need it -- don't mean to presume
too much. My e-mail is in my profile if you (or anyone else) needs someone to
talk to.

