
Deep Learning Is Easy – Learn Something Harder - hunglee2
http://www.inference.vc/deep-learning-is-easy/
======
elliott34
I think the best part of this article is the approach to learning new things
he presents--going beneath the surface, learning the fundamentals below the
“API” layer (abstractly speaking). Really great explanation of that insight!

The part I am having trouble with, with ~5 years of real world data science
experience in industry, is the _implicit assumption that we all need or want
to become good at deep learning._

In my experience, most businesses, at best, are still struggling with
_overfitting logistic regression in Excel_ let alone implementing/integrating
it with a production code base. And we all know that toy ML models that sit on
laptops create ZERO value beyond fodder for Board presentations or moving the
CMO's agenda forward.

The fact of the matter is that the vast majority of businesses, with respect
to statistics/ML, aren't even doing super duper basic shit (like a random
forest microservice that scores some sort of transaction) that might increase
some metric by 10%. This is due to a lack of sophisticated analytics
infrastructure, bureaucracy, a lack of talent, or being too scared of
statistics. Ultimately, when you're rolling out a machine learning product
internally (I'm not talking about Azure/other aws-model-training-as-a-service
type things), the hardest part isn't: "We need to increase our accuracy by 2%
by using Restricted Convolutionally Recurrent Bayesian Machines!" The hardest
part is _convincing people you need to integrate a new process into a
"production" workflow, and then maintaining that process._

~~~
kwisatzh
Totally agree! If I had a dollar for every time I had a senior exec (VP-level)
at a large org ask me what 'probability' means, I wouldn't need to be a data
scientist anymore! The issue is deeper than what you write, I think. There is
technical debt, of course, in implementing complex ML/DS pipelines, which
should be compensated by the increase in lift/revenue. Outside of large
companies like Google/FB/Apple etc. who have incorporated ML in their products,
many outfits that want to use their 'data' to 'address business problems'
don't really need sophisticated ML or can't justify the technical and human
debt. Having worked in the industry as a data scientist, I'm not too hopeful
about the prospects of many DS-as-a-service companies, not because there isn't
solid technical content to offer, but because their clients are routinely
idiots.

~~~
woah
Are you saying everyone is an idiot, including the customers of data science
companies, or that people who pay for data science are especially idiotic?

------
aabajian
I think the article overestimates how much we understand about neural
networks, and the novel ways we might use them.

Take word2vec for example. It's a two-layer (i.e. not deep) neural network
that uses a relatively simple training algorithm. Yet, after 30 minutes it can
learn impressive relationships between English words. The method is only 3
years old and businesses continue to find novel ways to apply word2vec. Check
out this article:

[http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-w...](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/)

They are using word2vec to find clothing items that resemble other products by
adding or subtracting descriptors such as "stripes", "blue", or even
"pregnant."

For my company, we are attempting to normalize medical terminology using
word2vec. Here are a couple examples:

1\. "myocardial" \+ "infarction" = "MI"

2\. "ring" \+ "finger" = "fourth" \+ "metacarpal"

We never would have thought to normalize the names of different body parts,
but word2vec does it for us. There's a great deal yet to learn about
how this model alone works, let alone deeper networks.
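
As a rough illustration of the kind of vector arithmetic described above, here
is a minimal sketch using gensim's Word2Vec. The corpus, parameter values, and
query tokens are all made up for the example, not taken from Stitch Fix's or
our medical pipeline.

```python
# Minimal word2vec sketch with gensim; corpus and parameters are illustrative.
from gensim.models import Word2Vec

# Hypothetical corpus: each document is a list of lowercase tokens.
corpus = [
    ["acute", "myocardial", "infarction", "noted"],
    ["patient", "denies", "mi", "or", "angina"],
    ["fracture", "of", "the", "ring", "finger"],
    # ... many more sentences in practice
]

# Recent gensim calls this parameter vector_size; older releases call it size.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

# "A" + "B" style queries are just nearest neighbours of a combined vector;
# subtracting descriptors goes in a `negative` list instead.
print(model.wv.most_similar(positive=["myocardial", "infarction"], topn=3))
```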

~~~
billconan
I think I have an intuitive explanation for why word2vec works. I tried to
write it down, but I'm lazy.

Don't think of it as a 2-layer neural network. What it really is is a physics
system: a universe where each word is a planet. When words appear in the same
context, that assigns some gravitational pull to the "planets". Eventually,
words with the same meaning will cluster into a galaxy.

For example, look at the following 2 sentences:

1. I pat a dog.

2. I pat a cat.

Dog and cat appear in the same context. The first time we see the word dog, we
calculate the center of mass of "I", "pat" and "a". Think of them as little
planets.

Then you give the planet "dog" a little force pulling it toward the center of
mass of the previous three words.

When you see the word cat, you do the same, as it's in the same context.

Eventually, cat and dog will be clustered into a galaxy with other similar
words, and the galaxy will be called "pet".

~~~
robrenaud
I don't know that the physics analogy works so well, or at least, it's
definitely missing something. What prevents the whole word "universe" from
collapsing on itself, forming a black hole? That is, if there are only
attractive forces, the global optimum is to co-locate everything in the same
point, which doesn't give you a useful model. There needs to be something in
the model that keeps different words apart from each other.

This page has the clearest explanation of word embeddings, of the objective
function, and of why vector translation captures meaning:

[http://nlp.stanford.edu/projects/glove/](http://nlp.stanford.edu/projects/glove/)

~~~
billconan
It works because the gravity of word2vec isn't the gravity of real life.

Notice that I only pull the word "dog" toward the center of gravity of the
rest of the words, instead of pulling all of them together. I think the full
version even pushes the rest of the words away from the center of gravity.

But I need to double-check the math.

This is not just an analogy; this is what word2vec's math says.

The only analogy part is that word2vec is in high dimensions, while my analogy
is in 3 dimensions.
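
For what it's worth, here is a toy numpy sketch of that attract/repel
intuition on the dog/cat example, in the spirit of skip-gram with negative
sampling. It is not the actual word2vec update (no separate input and output
matrices, simplified gradients, made-up learning rate), just an illustration of
why the repulsion term keeps the space from collapsing.

```python
# Toy attract/repel update in the spirit of skip-gram with negative sampling.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "pat", "a", "dog", "cat", "rock"]
vec = {w: rng.normal(scale=0.1, size=16) for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update(center, context, lr=0.1, num_negatives=2):
    for ctx in context:
        # Attraction: pull the center word toward an observed context word.
        g = 1.0 - sigmoid(vec[center] @ vec[ctx])
        vec[center] += lr * g * vec[ctx]
        # Repulsion: push it away from randomly sampled words; this is what
        # keeps the whole "universe" from collapsing onto a single point.
        for neg in rng.choice(vocab, size=num_negatives):
            g = sigmoid(vec[center] @ vec[neg])
            vec[center] -= lr * g * vec[neg]

for _ in range(200):
    update("dog", ["i", "pat", "a"])
    update("cat", ["i", "pat", "a"])

# After training, "dog" and "cat" end up closer to each other than to "rock".
```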

------
golergka
This article would be much better if it specified whom it is addressed to.

My wife works in internet advertising, managing AdSense campaigns and stuff.
The fact that she's able to write simple scripts in Javascript to automate
some of her tasks already puts her ahead of her colleagues. Her whole company
doesn't seem to have even a freshman-level of understanding of mathematical
statistics, judging by how they operate and organize things internally.
(Unfortunately, she has no power to change that.) A competitor with just these
two obvious things would already easily surpass them, but for now they bring in
more profit than a lot of SF-based startups could hope for.

Big data, deep learning? This stuff may be "obvious" for a computer scientist
or a competent developer, but it sure as hell is not "obvious" to a lot of
real world businesses out there, and there's a lot of value to be created by
applying these things.

~~~
ebiester
I interpreted the post as saying, "Don't try to get a Ph.D in deep learning,
and don't learn it for its own sake -- have an application in mind."

~~~
nickpsecurity
Adding to the other commenter: he says that learning the recent tech will just
give you a temporary advantage as a first-mover, followed by all kinds of
people having the same skills. That's misleading, as a first-mover will have
many successful projects under their belt and a network of referrals if they
played it smart. That advantage will be a lasting selling point vs cheap labor
of "I know the dark corners and time-wasters. I'll do it right."

~~~
Outdoorsman
>That advantage will be a lasting, selling point vs cheap labor of "I know the
dark corners and time-wasters. I'll do it right."<

This very much still holds true...excellent point...

I've been in the profession for a number of years, paid attention to detail,
and put all my efforts into nailing the skills I've observed to be relevant...

Much to my amusement, I learned a few years ago that some in my network of
friends had given me the nickname of "the fixer"...which led to more referrals
than I could possibly handle...

I know what I know, and if you call me in I may take a bit more time than
you'd like to fix your problem(s), but when I'm done it will be "fixed"....

That's still worth quite a bit, even though the pace has picked up
dramatically over the years...your "core" needs to be rock solid for long-term
viability...

My advice to beginners is master something, then master something
else...before long your reputation will precede you in ways that you'll be
delighted by...

~~~
nickpsecurity
Glad to see you're benefiting from the effect. And this...

"My advice to beginners is master something, then master something
else...before long your reputation will precede you in ways that you'll be
delighted by..."

...is totally true. It's why I tried to wear many hats. At some point, wearing
too many makes you look like you might not be that good at any one. So, the
next trick I learned is to make different versions of a resume that leave off
stuff and target specific audiences. Each one might think you only have 2-3
skills. High skill or mastery in them is still believable at that point. So,
let them believe. :)

~~~
Outdoorsman
Agreed...targeted resumes are the way to go, especially with the current
differentiated market, which is accelerating geometrically...

I'm full stack in 3, often 4 (depending), discrete environments...that took
years, and (still takes) 50-some-odd active bookmarks and RSS feeds...my
traveling "hotshot" case has 34 thumbs loaded with goodies...a good memory
helps a great deal...

~~~
nickpsecurity
You got me thinking. A friend several years back got a small box full of 60+MB
thumb drives for almost nothing. We were thinking what could be done with
those in terms of attacks, distribution of sample software, various ideas.
Your comment makes me wonder if they could also be used as business cards of
sorts, showcasing a portfolio w/ code. Is that what you were referring to, or
something else, like your software stacks for the various roles you take on?

Regardless, I never considered that use case.

------
maciejgryka
That's a little bit like saying: don't learn web development, it's too easy,
invest in learning fundamentals of computer science instead.

Sure, if you want to innovate in the field of academic AI, you probably need
to know much more than how to throw some data at a neural net. But there are
SO MANY problems supervised learning can solve out there - much like there are
so many people and businesses who need a website done.

It's a tool, like any other - and right now it seems very hard to beat. So
please go ahead, learn how to use it, and apply it to your own problem domain.

~~~
amelius
I guess one problem is that deep learning requires a different mindset than
most people currently in computing are willing to adopt. I'm not a deep
learning expert, but I guess that working in this field requires a lot of
trial-and-error: "hmm, this doesn't work, let's see what happens if we change
this parameter". This is in direct contrast with the more "engineering" or
"mathematical" mindset that computer science requires.

(Of course, deep learning certainly requires mathematics and engineering
skills, but the driving force behind their use remains vague and difficult to
grasp.)

~~~
tangentspace
I don't think the mental process is necessarily all that different in pure
mathematics. The process of creating new structures and proofs in mathematics
definitely involves a lot of trial-and-error and vague intuition. When the
evidence mounts that a particular approach is likely to succeed, more effort
goes into refining the intuitive approach into a valid proof. Further
iterations often build on the original proof by tweaking the parameters to
make it more general or change the context, just to see what happens.

~~~
amelius
You have a good point. But at least in pure mathematics, you end up with a
rigorous proof, whereas in deep learning, you end up with a system that may or
may not work.

This means that in pure mathematics you can build on your previous results,
but in deep learning you cannot (without danger).

~~~
tamana
Rigorous proofs are a formality that gets less attention than you'd guess at
first.

------
varelse
My takeaway is something I've observed and commented about for some time:
understanding how any important/relevant technology works gives you an
unbeatable edge over most of your peers. Andrew Ng makes a similar comment
about 2/3 of the way through his Coursera course on machine learning.

With respect to deep learning, most of the work I see involves pulling
existing networks off of the Caffe Zoo and retraining them with proprietary
datasets or fiddling with Theano to reproduce the works of others. These
efforts mostly end in failure, usually after a blind hyperparameter search.

The easiest way to separate the winners from the losers in deep learning and
elsewhere is to ask them to derive the method they are using and explain what
it's doing. For the most part, I find the above people are incapable of doing
so for backpropagation. Bonus points if they can write (relatively) efficient
C++/CUDA code to implement backpropagation from the ground up.

Edit the above paragraph for PCA, ICA, GP, GMM, random forests or any other
technology/algorithm. If one has a decent grasp of the underlying logic, the
instincts for adapting it to new domains/datasets come at no additional
charge.
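
To make the "from the ground up" bar concrete, here is a minimal numpy sketch
of backpropagation for a one-hidden-layer network with a squared-error loss. It
is a teaching-sized illustration with made-up data and hyperparameters, not
anyone's production C++/CUDA kernel.

```python
# Minimal backprop for a 3-8-1 network; data and hyperparameters are made up.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # 64 samples, 3 features
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy binary target

W1 = rng.normal(scale=0.1, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule, layer by layer, for 0.5 * mean((p - y)^2).
    dp  = (p - y) / len(X)
    dz2 = dp * p * (1 - p)            # through the sigmoid
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0)
    dh  = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)           # through the tanh
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    # Gradient descent step.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```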

~~~
selectron
Can you give some anecdotes of where "understanding how any important/relevant
technology works gives you an unbeatable edge over most of your peers"?
Besides being able to pass a job interview. Naively I would assume that being
able to apply these techniques (understanding cross validation, overfitting,
domain expertise, etc) would be far more important than knowing the underlying
logic of how to write them from scratch.

~~~
johnmaguire2013
I think that it's way easier to apply something when you understand how that
something _actually_ works.

i.e. There is a difference between being able to re-implement an algorithm and
understanding why that algorithm behaves the way it does. In the latter case
you'll probably write cleaner and more maintainable code, more quickly, and be
able to make modifications as necessary.

~~~
tamana
That's an easy claim, but putting more time into learning foundations means
you have less time for applications. Most pure mathematicians are poor web
developers.

~~~
varelse
Quality over quantity: sounds like a winning strategy to me.

------
taneq
Sounds like he's viewing things very much from a research-scientist
perspective. From that point of view, sure, a lot of the groundbreaking work
has already been done. From the point of view of someone who just has a job to
do, though, it's a tool in the toolbox. If it's the most appropriate tool then
you'd be stupid not to use it.

------
imrehg
I'm not in CS, just an enthusiastic person using software to solve his own
projects. I personally want to learn more of these that make "hard thing
easy", not to come up with new methods, but to apply these now-achievable
techniques to different problems, the problems around me. "learning deep
learning should not be the goal but a side effect" sounds useful in general,
in the same time I wish there would be more, wider set of people actually
using (as opposed to merely learning) the awesome methods and technology we
have these days.

------
nisa
I don't know. If you go that route then it's just: Study mathematics or
theoretical physics and then learn everything you need.

The problem is pulling that off. Sure, if you are clever and pursue a PhD it's
worth going the abstract theoretical route, but in my personal experience as a
somewhat stupid student, it's fine to look for a niche and get hands-on
experience in it.

Mastering Hadoop is still far from easy and if you need to use it without
burning tons of CPU cycles it's even harder to get right.

~~~
navi54
I am interested in this, mostly as a computational biologist. Any intro on
Hadoop? What is it used for?

~~~
nisa
Hadoop itself consists of different parts: HDFS is a distributed filesystem
that can span across lots of machines and stores data in blobs of varying
sizes.

MapReduce is the Google idea from before 2004 for doing calculations on lots
of data. Now there is also YARN, which could be described as a general job
scheduler.

At the moment a lot of people run software on YARN like Spark (does more in
memory, is faster, can use the GPUs on the cluster machines).

So if you have biological data you could feed it into HDFS and have Spark or
MapReduce jobs that process the data. The nice thing about Hadoop is that you
don't need to care about getting the clustering and distributed setup right;
this is done by Hadoop for you. You program like you would program a single-
threaded algorithm (at least in the simple cases).

E.g. this Python code counts words over as much data as you want in parallel:
[http://www.michael-noll.com/tutorials/writing-an-hadoop-mapr...](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)
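
To give a feel for the programming model, here is a local, self-contained
sketch of that word count in plain Python. In real Hadoop Streaming the map and
reduce phases would live in two separate scripts (e.g. a mapper.py and
reducer.py, names chosen here just for illustration) reading stdin and writing
stdout, and Hadoop would handle the shuffle/sort between them across the
cluster.

```python
# Local simulation of the MapReduce word count; Hadoop normally performs the
# shuffle/sort between the two phases and runs them on many machines.
import sys
from itertools import groupby

def map_phase(lines):
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Hadoop sorts by key before the reducer runs; sorted() stands in for that.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Example: echo "a b a" | python wordcount.py
    for word, total in reduce_phase(map_phase(sys.stdin)):
        print(f"{word}\t{total}")
```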

If you google each of these projects you'll find a lot of information.

Here are the original papers that should give a good idea:

- HDFS (the Google File System was the original idea; Hadoop was a free
implementation by Yahoo):
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)

- MapReduce:
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)

- Spark:
[http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark...](http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf)

If you want a good introduction read Hadoop: The Definitive Guide, 3rd Edition

We used it for building a search engine out of multiple terabytes of crawl
data - something that doesn't fit well on a single computer.

You can do all kinds of computations, but graph problems or things that don't
parallelize well often require other solutions beyond MapReduce, or more
thought. MapReduce also only fits a certain class of problems: it's great for
counting and aggregating stuff, but beyond that it's often not usable.

------
amelius
Also, it is boring, because 90% of the time, you will be waiting for your
networks to be trained :)

------
jaybosamiya
In short, don't go for it just because it is a fad. Over time, like the _data
scientist_ buzzword, this too shall become something everyone can do. It is
what you can do with Deep Learning that matters, and where you can push beyond
it.

------
vessenes
There's a bit of optimism about technology adoption here; I'm guessing Ferenc
is part of a bleeding edge research community from how he talks about the ease
of uptake and adoption for new technologies.

The reality is there are still companies looking for their first 'big data'
data scientist years after this was a 'hot area'. While aggressive
researchers, then aggressive companies, then fast-follower researchers /
companies, then ... all do eventually adopt useful technology, it can take
decades in some cases.

So, do learn deep learning. But from my own experience, he's right in that
you'll think "that's it?" once you've constructed a few networks and read
through some basic 2014+ literature on tuning and tips. But that will put you
ahead of the VAST majority of technologists with regard to deep learning.
Seriously.

I think he's 100% right that the search for usefulness / novel applications is
where the game is going to be for deep networks, but it seems silly to tell
people not to learn how to work with them.

------
Buttons840
What about reinforcement learning? It seems neglected to me. There are only
one or two good reinforcement learning resources I'm aware of compared to the
hundreds of books and blogs on supervised learning. Most machine learning
books will mention supervised, unsupervised, and reinforcement learning in the
first chapter, then never mention reinforcement learning again. This is my
experience at least, and I've read a few introductory machine learning books.

Reinforcement learning can drive your car and mow your lawn. Supervised
learning can tell you which of two pictures is a cat. Granted, supervised
learning techniques can have an important place in the broader framework of
reinforcement learning.

Reinforcement learning is one of the harder problems.

~~~
api
The hard thing here is reinforcing the right thing. I did a bunch of work on
genetic programming and RL and it worked... uhh... too well. Once it succeeded
in learning, near as I could tell, my hard disk geometry. There was a subtle
timing difference between data sets and this seemed related to where they were
on disk. The literature is loaded with anecdotes like that, some even more
hilarious.

But you're right. Nature clearly does learn this way, at least for many
things. Maybe the way forward is to combine it with supervised and
unsupervised learning and let different forms of learning work together.

~~~
devonkim
This kind of unintended learning is something that all machine learning
practitioners should be careful with. The example of the satellite photo
analysis algorithm that turned out to have learned to classify the difference
between night and day photos comes to mind.

This means that your particular reinforcement learning example needs something
similar to regularization of the cost function, or cross-validation to check
for over- or under-fitting, in order to correct for these latency effects.

It may be cheaper and more effective to use SSDs instead, but the magnitude of
the latency effects in your case isn't clear either, so it's hard to determine
whether your algorithm would benefit substantially from a modern SSD.

~~~
api
The fix was to (1) take steps to make sure the image was entirely pre-loaded
into RAM before the test was given, and (2) randomize the latency of the load.
That worked and I actually got some real results, which unfortunately were
decent but not as impressive as the cheating was. :) When it cheated the thing
got 100% right of course.

I also learned that rapid convergence on incredible accuracy is suspect.

