Hacker News
Deep Learning Is Easy – Learn Something Harder (inference.vc)
244 points by hunglee2 on Jan 31, 2016 | 59 comments

I think the best part of this article is the approach to learning new things he presents--going beneath the surface, learning the fundamentals below the “API” layer (abstractly speaking). Really great explanation of that insight!

The part I am having trouble with, with ~5 years of real world data science experience in industry, is the implicit assumption that we all need or want to become good at deep learning.

In my experience, most businesses, at best, are still struggling with overfitting logistic regression in Excel let alone implementing/integrating it with a production code base. And we all know that toy ML models that sit on laptops create ZERO value beyond fodder for Board presentations or moving the CMO's agenda forward.

The fact of the matter is that the vast majority of businesses, with respect to statistics/ML, aren’t even doing super duper basic shit (like a random forest microservice that scores some sort of transaction) that might increase some metric 10%. This is due to a lack of sophisticated analytics infrastructure, bureaucracy, a lack of talent, or being too scared of statistics. Ultimately when you’re rolling out a machine learning product internally (I’m not talking about Azure/other AWS-model-training-as-a-service type things), the hardest part isn’t: “We need to increase our accuracy by 2% by using Restricted Convolutionally Recurrent Bayesian Machines!” The hardest part is convincing people you need to integrate a new process into a “production” workflow, and then maintaining that process.

>the hardest part isn’t: “We need to increase our accuracy by 2% by using Restricted Convolutionally Recurrent Bayesian Machines!” The hardest part is convincing people you need to integrate a new process into a “production” workflow, and then maintaining that process.

Completely agreed. My timeline for a project usually goes like:

A. 2-4 weeks: deeply understand the problem, talk to stakeholders, gather requirements, plan out the project.

B. 2 weeks: explore the problem and the data. Build and tweak models, build a functional prototype.

C. 8-24 weeks: put the system into production on top of the company's tech stack, either myself or working closely with engineers.

D. 4-12 weeks: sell the system internally, prove that it's a superior solution, get buy-in that it should replace existing processes.

So yeah, in a typical 6 month project I only spend about 5-10% of my time on actual data and modeling. This % has gone sharply down as my career has progressed.

Totally agree! If I had a dollar for every time a senior exec (VP-level) at a large org asked me what 'probability' means, I wouldn't need to be a data scientist anymore! The issue is deeper than what you write, I think. There is technical debt of course in implementing complex ML/DS pipelines, and it should be compensated by the increase in lift/revenue. Outside of large companies like Google/FB/Apple etc. who have incorporated ML in their products, many outfits that want to use their 'data' to 'address business problems' don't really need sophisticated ML and can't justify the technical and human debt. Having worked in the industry as a data scientist, I'm not too hopeful about the prospects of many DS-as-a-service companies, not because there isn't solid technical content to offer, but because their clients are routinely idiots.

Are you saying everyone is an idiot, including the customers of data science companies, or that people who pay for data science are especially idiotic?

I think the article overestimates how much we understand about neural networks, and the novel ways we might use them.

Take word2vec for example. It's a two-layer (i.e. not deep) neural network that uses a relatively simple training algorithm. Yet, after 30 minutes it can learn impressive relationships between English words. The method is only 3 years old and businesses continue to find novel ways to apply word2vec. Check out this article:


They are using word2vec to find clothing items that resemble other products by adding or subtracting descriptors such as "stripes", "blue", or even "pregnant."

For my company, we are attempting to normalize medical terminology using word2vec. Here are a couple examples:

1. "myocardial" + "infarction" = "MI"

2. "ring" + "finger" = "fourth" + "metacarpal"

We never would have thought to normalize the names of different body parts, but with word2vec, it does it for us. There's a great deal yet to learn about how this model alone works, let alone deeper networks.
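The additive composition in the two examples above can be sketched with toy vectors. This is only an illustration: the 3-d embeddings below are hand-picked and hypothetical (real word2vec vectors are learned from a corpus and have far more dimensions), and `add`, `cosine`, and `nearest` are helper names made up for this sketch, not part of any word2vec library.

```python
import math

# Hypothetical hand-picked embeddings, chosen so the sums line up.
# Real vectors would be learned, not written by hand.
vectors = {
    "myocardial": [0.9, 0.1, 0.0],
    "infarction": [0.8, 0.3, 0.1],
    "MI":         [1.7, 0.4, 0.1],
    "ring":       [0.0, 0.9, 0.2],
    "finger":     [0.1, 0.8, 0.9],
    "fourth":     [0.0, 0.7, 0.3],
    "metacarpal": [0.1, 1.0, 0.8],
}

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, exclude=()):
    """Vocabulary word whose vector is most similar to the query vector."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# "myocardial" + "infarction" should land next to "MI".
mi_query = add(vectors["myocardial"], vectors["infarction"])
best = nearest(mi_query, exclude=("myocardial", "infarction"))

# "ring" + "finger" compared with "fourth" + "metacarpal".
finger_similarity = cosine(
    add(vectors["ring"], vectors["finger"]),
    add(vectors["fourth"], vectors["metacarpal"]),
)
```

With learned vectors the match is never exact, so in practice one would rank the nearest neighbours of the summed vector rather than expect a single clean answer.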

word2vec has been shown to be equivalent to a much older method, so it's not as mysterious or new as one usually believes at first.

Source (great paper): http://u.cs.biu.ac.il/~nlp/wp-content/uploads/Neural-Word-Em...

I think I have an intuitive explanation why word2vec works. I tried to write it down, but I'm lazy.

Don't think of it as a 2 layer neural network. What it really is is a physics system, like a universe where each word is a planet. When words appear in the same context, that will assign some gravitational pull to the "planets". Eventually, words with the same meaning will cluster into a galaxy.

For example, look at the following 2 sentences:

1. I pat a dog.

2. I pat a cat.

Dog and cat appear in the same context. The first time we see the word dog, we calculate the center of mass of "I", "pat" and "a". Think of them as little planets.

Then you give the planet "dog" a little force to pull it toward the center of mass of the other three words.

When you see the word cat, you do the same, as it's in the same context.

Eventually, cat and dog will be clustered and form a galaxy with other similar words, and the galaxy will be called "pet".
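A minimal sketch of the attraction-only picture described above, under stated assumptions: positions are random 3-d points, the context words "I", "pat", "a" stay fixed, and `pull`, `center_of_mass`, and the step size `rate` are names invented for this illustration. It is a toy of the analogy, not word2vec's actual update rule.

```python
import random

random.seed(0)
DIM = 3  # the analogy uses 3 dimensions; real embeddings use many more
words = ["I", "pat", "a", "dog", "cat"]

# Every word starts as a "planet" at a random position.
pos = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in words}

def center_of_mass(context):
    """Average position of the context words."""
    return [sum(pos[w][i] for w in context) / len(context) for i in range(DIM)]

def pull(word, context, rate=0.1):
    """Move `word` a small step toward the center of mass of its context."""
    c = center_of_mass(context)
    pos[word] = [p + rate * (ci - p) for p, ci in zip(pos[word], c)]

def distance(a, b):
    """Euclidean distance between two words' positions."""
    return sum((x - y) ** 2 for x, y in zip(pos[a], pos[b])) ** 0.5

before = distance("dog", "cat")
for _ in range(50):
    pull("dog", ["I", "pat", "a"])  # "I pat a dog."
    pull("cat", ["I", "pat", "a"])  # "I pat a cat."
after = distance("dog", "cat")
```

Because both words are pulled toward the same point, `after` ends up much smaller than `before`. Only the target word moves here; nothing in this toy pushes anything apart.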

I don't know that the physics analogy works so well, or at least, it's definitely missing something. What prevents the whole word "universe" from collapsing on itself, forming a black hole? That is, if there are only attractive forces, the global optimum is to co-locate everything in the same point, which doesn't give you a useful model. There needs to be something in the model that keeps different words apart from each other.

This page has the clearest explanation of word embeddings and the relationship between the objective function and why vector translation captures meaning.


In the real world, gravitational force acts on every object.

In the word2vec world, it only acts on words found in similar contexts. That's one major difference.

I think there is also an anti-gravitational force that pushes words away, but again I need to double check.

It works because the gravity of word2vec isn't the gravity of real life.

Notice that I only pull the word "dog" to the center of gravity of the rest of the words, instead of pulling all of them together. I think the full version even pushes the rest of the words away from the center of gravity.

But I need to double check the math.
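One way to see a repulsive term is the skip-gram-with-negative-sampling style update, sketched below on a toy vocabulary. Assumptions are loud here: the four (word, context) pairs, the 2-d vectors, the learning rate, and the use of "every other context word" as the negative (instead of random sampling) are all simplifications chosen for this sketch. It shows the general shape of an attract/repel update, not a verified account of the picture in the comment above.

```python
import math
import random

random.seed(1)
DIM = 2
targets = ["dog", "cat", "car", "bus"]
contexts = ["pat", "drive"]

# Separate vectors for target words (u) and context words (v), small random init.
u = {w: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for w in targets}
v = {c: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for c in contexts}

# Observed (word, context) pairs in the toy corpus.
pairs = [("dog", "pat"), ("cat", "pat"), ("car", "drive"), ("bus", "drive")]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update(word, ctx, label, rate=0.1):
    """label=1 attracts word and context; label=0 repels them."""
    g = rate * (label - sigmoid(dot(u[word], v[ctx])))
    new_u = [a + g * b for a, b in zip(u[word], v[ctx])]
    new_v = [b + g * a for a, b in zip(u[word], v[ctx])]
    u[word], v[ctx] = new_u, new_v

for _ in range(200):
    for word, ctx in pairs:
        update(word, ctx, 1.0)            # pull toward the observed context
        for other in contexts:
            if other != ctx:
                update(word, other, 0.0)  # push away from the unobserved one

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

same_context = cosine(u["dog"], u["cat"])
different_context = cosine(u["dog"], u["car"])
```

The repulsive updates are what keep "dog" and "car" apart, so `same_context` comes out higher than `different_context` instead of everything collapsing to one point.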

This is not just an analogy. This is what word2vec's math says.

The only analogy part is that word2vec is in high dimensions, while my analogy is in 3 dimensions.

This approach works IMO only when all of the terms have sufficient representation in the dataset.

I have noticed that Google seems to have recently incorporated something like this as one of its scoring factors. That's fine if you're searching for the latest Kardashian faux pas (and such searches dominate the use of Google IMO hence the results of any A/B test I suspect), but my searches for obscure technical terms have been returning much less relevant results than they once did.

Has anyone else noticed this?

I suspect the indexers/spiders don't work as well as they once did. Routinely I've been getting results for sketchy mirrors of StackOverflow or the MSDN forums, which link back to the original MSDN or SO post, but that original is nowhere to be found in the Google results.

Those examples aren't vectors, they are synonyms. Does your system find "ring toes"?

This article would be much better if it specified whom it is addressed to.

My wife works in internet advertising, managing AdSense campaigns and stuff. The fact that she's able to write simple scripts in Javascript to automate some of her tasks already puts her ahead of her colleagues. Her whole company doesn't seem to have even a freshman-level understanding of mathematical statistics, judging by how they operate and organize things internally. (Unfortunately, she has no power to change that). A competitor that had just these two obvious things would easily surpass them, but for now, they are bringing in more profit than a lot of SF-based startups could hope for.

Big data, deep learning? This stuff may be "obvious" for a computer scientist or a competent developer, but it sure as hell is not "obvious" to a lot of real world businesses out there, and there's a lot of value to be created by applying these things.

I interpreted the post as saying, "Don't try to get a Ph.D in deep learning, and don't learn it for its own sake -- have an application in mind."

Adding to the other commenter, he says that learning the recent tech will just give you a temporary advantage as a first-mover, followed by all kinds of people having the same skills. That's misleading, as a first-mover will have many successful projects under their belt and a network of referrals if they played it smart. That advantage will be a lasting selling point vs cheap labor: "I know the dark corners and time-wasters. I'll do it right."

>That advantage will be a lasting, selling point vs cheap labor of "I know the dark corners and time-wasters. I'll do it right."<

This very much still holds true...excellent point...

I've been in the profession for a number of years, paid attention to detail, and put all my efforts into nailing the skills I've observed to be relevant...

Much to my amusement, I learned a few years ago that some in my network of friends had given me the nickname of "the fixer"...which led to more referrals than I could possibly handle...

I know what I know, and if you call me in I may take a bit more time than you'd like to fix your problem(s), but when I'm done it will be "fixed"....

That's still worth quite a bit, even though the pace has picked up dramatically over the years...your "core" needs to be rock solid for long-term viability...

My advice to beginners is master something, then master something else...before long your reputation will precede you in ways that you'll be delighted by...

Glad to see you're benefiting from the effect. And this...

"My advice to beginners is master something, then master something else...before long your reputation will precede you in ways that you'll be delighted by..."

...is totally true. It's why I tried to wear many hats. At some point, wearing too many makes you look like you might not be that good at any one. So, the next trick I learned is to make different versions of a resume that leave off stuff and target specific audiences. Each one might think you only have 2-3 skills. High skill or mastery in them is still believable at that point. So, let them believe. :)

Agreed...targeted resumes are the way to go, especially with the current differentiated market, which is accelerating geometrically...

I'm full stack in 3, often 4 (depending), discrete environments...that took years, and (still takes) 50-some-odd active bookmarks and rss feeds...my traveling "hotshot" case has 34 thumbs loaded with goodies...a good memory helps a great deal...

You got me thinking. A friend several years back got a small box full of 60+MB thumb drives for almost nothing. We were thinking what could be done with those in terms of attacks, distribution of sample software, various ideas. Your comment makes me wonder if they could also be used as business cards of sorts showcasing a portfolio w/ code. Is that what you were referring to or something else like your software stacks for various roles you take on?

Regardless, I never considered that use case.

But he's explicitly saying that it's not about PhD in the very beginning, no?

> Clearly, at the speed things are evolving, there seems to be no time for a PhD.

I agree. In my experience at a large company (in a business org) just knowing SQL giveth magical powers.

That's a little bit like saying: don't learn web development, it's too easy, invest in learning fundamentals of computer science instead.

Sure, if you want to innovate in the field of academic AI, you probably need to know much more than how to throw some data at a neural net. But there are SO MANY problems supervised learning can solve out there - much like there are so many people and businesses who need a website done.

It's a tool, like any other - and right now it seems very hard to beat. So please go ahead, learn how to use it, and apply it to your own problem domain.

> That's a little bit like saying: don't learn web development, it's too easy, invest in learning fundamentals of computer science instead.

But that's very solid advice, not just for an academic. A lot of the web development jobs are either outsourced or marginalized by newer and more powerful frameworks. It's an unfortunate side effect of the fast moving field of informatics - your skills become either redundant or invaluable very quickly.

That really depends on what you want to do - I bet there are still people making very decent living after learning how to make WordPress themes in 2005. Sure, you need to keep up to date and constantly learn, but you don't necessarily need to know details about how Apache works if you're just making a portfolio page.

I don't want to discourage anyone from learning hard topics - it can be very rewarding and useful. I just object to the sentiment that learning how to use a tool has no value if you don't have an intimate understanding of it. It still gives you the power to solve new problems.

If you learn web development and apply for a job as computer scientist, there's a bit of a mismatch.

I think the author is assuming that "data scientist" means scientist more than developer.

But I agree, there are more than enough applications coming out of this for developers.

Now it gets much more fuzzy, but for the sake of the argument (after all why are we here? ;)) I'm willing to bet that there are lots of data science jobs that could be done well given a good knowledge of how to use deep learning and no understanding of how EM works.

I'm not saying that knowing fundamental (and/or difficult) topics is not useful - it absolutely is! It's just a matter of prioritizing what you learn about. I think if you want to maximize your impact, it makes sense to invest in learning the currently-most-promising tool before going for things with lower reward/effort ratio.

The problem with doing that is you keep reinventing the wheel. Use deep learning if that fits the problem, but keep learning about other things like how EM works, variational inference, graphical models etc on the side. One day you might find a problem where deep learning doesn't work as well as some of the other techniques. Sure there are data science jobs that can be done without much knowledge, but people tend to stop when they see math and are just happy to use some API. This IMO is a wrong approach.

You're completely right. There's a cool quote from http://waitbutwhy.com/2015/06/how-tesla-will-change-your-lif...:

    "I’ve heard people compare knowledge of a topic to a tree. If you don’t fully get it, it’s like a tree in your head with no trunk—and without a trunk, when you learn something new about the topic—a new branch or leaf of the tree—there’s nothing for it to hang onto, so it just falls away. By clearing out fog all the way to the bottom, I build a tree trunk in my head, and from then on, all new information can hold on, which makes that topic forever more interesting and productive to learn about. And what I usually find is that so many of the topics I’ve pegged as “boring” in my head are actually just foggy to me—like watching episode 17 of a great show, which would be boring if you didn’t have the tree trunk of the back story and characters in place."
I think learning to use APIs is probably the way to start, but it definitely pays off to keep digging deeper as you go along.

I guess one problem is that deep learning requires a different mindset than most people who are currently in computing are willing to have. I'm not a deep learning expert, but I guess that working in this field requires a lot of trial-and-error: "hmm this doesn't work, let's see what happens if we change this parameter". This is directly in contrast with the more "engineering" or "mathematical" mindset that computer science requires.

(Of course, deep learning certainly requires mathematics and engineering skills, but the driving force behind their use remains vague and difficult to grasp.)

I don't think the mental process is necessarily all that different in pure mathematics. The process of creating new structures and proofs in mathematics definitely involves a lot of trial-and-error and vague intuition. When the evidence mounts that a particular approach is likely to succeed, more effort goes into refining the intuitive approach into a valid proof. Further iterations often build on the original proof by tweaking the parameters to make it more general or change the context, just to see what happens.

You have a good point. But at least in pure mathematics, you end up with a rigorous proof, whereas in deep learning, you end up with a system that may-or-may-not work.

This means that in pure mathematics you can build on your previous results, but in deep learning you cannot (without danger).

Rigorous proofs are a formality that gets less attention than you'd guess at first.

Isn't the above mindset the mindset used for debugging?

I would say not. While debugging, you are using inference techniques to bracket your bug, and using such an approach you are almost certain that you will find it. Whereas in deep learning, you are more blindly trying things (where those "things" can certainly be rigorous techniques, it is just that they are being applied more or less haphazardly), and you are never sure where you will be arriving.

How about the process of creating user interfaces? That's not my specialty, but I suppose one might do lots of iterations like "oh, how about explaining how it works by first launching a pop-up? Ha no, users don't like it and close the webpage right away... let's try...".

My takeaway is something I've observed and commented about for some time: understanding how any important/relevant technology works gives you an unbeatable edge over most of your peers. Andrew Ng makes a similar comment about 2/3 of the way through his Coursera course on machine learning.

With respect to deep learning, most of the work I see involves pulling existing networks off of the Caffe Zoo and retraining them with proprietary datasets or fiddling with Theano to reproduce the works of others. These efforts mostly end in failure, usually after a blind hyperparameter search.

The easiest way to separate the winners from the losers in deep learning and elsewhere is to ask them to derive the method they are using and explain what it's doing. For the most part, I find the above people are incapable of doing so for backpropagation. Bonus points if they can write (relatively) efficient C++/CUDA code to implement backpropagation from the ground up.
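As a rough idea of what "deriving backpropagation" looks like at small scale, here is a pure-Python sketch (not the C++/CUDA version mentioned above). The network shape (2 inputs, 3 tanh hidden units, one linear output, squared-error loss) and every name in it are assumptions picked for illustration. The analytic gradients are checked against finite differences.

```python
import math
import random

random.seed(0)
N_IN, N_HID = 2, 3
W1 = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HID)]
W2 = [random.uniform(-1, 1) for _ in range(N_HID)]

def forward(x, W1, W2):
    """Hidden activations h = tanh(W1 x) and scalar output y = W2 . h."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hi for w, hi in zip(W2, h))
    return h, y

def loss(x, t, W1, W2):
    """Squared error 0.5 * (y - t)^2 for one example."""
    _, y = forward(x, W1, W2)
    return 0.5 * (y - t) ** 2

def backprop(x, t, W1, W2):
    """Chain rule applied layer by layer, from the loss back to W1."""
    h, y = forward(x, W1, W2)
    dy = y - t                                    # dL/dy
    dW2 = [dy * hi for hi in h]                   # dL/dW2[j] = dy * h[j]
    # dL/dz[j], where z is the pre-activation; tanh'(z) = 1 - tanh(z)^2
    dz = [dy * w * (1 - hi * hi) for w, hi in zip(W2, h)]
    dW1 = [[dzj * xi for xi in x] for dzj in dz]  # dL/dW1[j][i] = dz[j] * x[i]
    return dW1, dW2

def numeric_grad_W1(x, t, W1, W2, j, i, eps=1e-6):
    """Central finite difference for one entry of W1."""
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[j][i] += eps
    minus[j][i] -= eps
    return (loss(x, t, plus, W2) - loss(x, t, minus, W2)) / (2 * eps)

x, t = [0.5, -0.3], 0.7
dW1, dW2 = backprop(x, t, W1, W2)
max_err = max(
    abs(dW1[j][i] - numeric_grad_W1(x, t, W1, W2, j, i))
    for j in range(N_HID)
    for i in range(N_IN)
)
```

A small `max_err` is the usual sanity check that the derivation and the code agree.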

Edit the above paragraph for PCA, ICA, GP, GMM, random forests or any other technology/algorithm. If one has a decent grasp of the underlying logic, the instincts for adapting it to new domains/datasets come at no additional charge.

Can you give some anecdotes of where "understanding how any important/relevant technology works gives you an unbeatable edge over most of your peers"? Besides being able to pass a job interview. Naively I would assume that being able to apply these techniques (understanding cross validation, overfitting, domain expertise, etc) would be far more important than knowing the underlying logic of how to write them from scratch.

I think that it's way easier to apply something when you understand how that something actually works.

i.e. There is a difference between being able to re-implement an algorithm, and understanding why that algorithm has the behavior it has. You'll probably write cleaner and more maintainable code, quicker, in the latter, and be able to make modifications as necessary.

That's an easy claim, but putting more time into learning foundations means you have less time for applications. Most pure mathematicians are poor web developers.

Quality over quantity: sounds like a winning strategy to me.

For work a month ago I implemented stochastic gradient descent in some arbitrary directed graph of operations (I applied a skill I learned from neural networks in a system that had nothing to do with neural networks) which required me to know how to take derivatives of a loss function with respect to parameters. I did all of the math on paper, and then coded it and it worked (to be frank this surprised me). It often may be hard to see the immediate gains from understanding things in a more technical, low level fashion - but I think in the long run they give you a great deal of power.
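The general idea above, gradient descent through a graph of operations that is not a neural network, can be sketched with a tiny reverse-mode differentiation helper. This is not the commenter's system: the `Node` class, the `add`/`mul` operations, the line-fitting task, and the learning rate are all assumptions made up for this illustration.

```python
class Node:
    """A value in the operation graph plus the gradient flowing back to it."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # tuple of (parent_node, local_derivative)

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def backward(out):
    """Propagate d(out)/d(node) to every node, in reverse topological order."""
    order, seen = [], set()

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for parent, _ in n.parents:
            visit(parent)
        order.append(n)

    visit(out)
    out.grad = 1.0
    for n in reversed(order):
        for parent, local in n.parents:
            parent.grad += n.grad * local

# Toy task: recover w=2, b=1 from noise-free samples of t = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
w, b = 0.0, 0.0
for step in range(500):
    x, t = data[step % len(data)]       # one example per step
    wn, bn = Node(w), Node(b)
    err = add(add(mul(wn, Node(x)), bn), Node(-t))
    loss = mul(err, err)                # squared error as a graph node
    backward(loss)
    w -= 0.05 * wn.grad                 # gradient step on each parameter
    b -= 0.05 * bn.grad
```

The same `backward` works for any graph built from `add` and `mul`; adding a new operation only means supplying its local derivatives.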

Sounds like he's viewing things very much from a research-scientist perspective. From that point of view, sure, a lot of the groundbreaking work has already been done. From the point of view of someone who just has a job to do, though, it's a tool in the toolbox. If it's the most appropriate tool then you'd be stupid not to use it.

I'm not in CS, just an enthusiastic person using software for his own projects. I personally want to learn more of the tools that make "hard things easy", not to come up with new methods, but to apply these now-achievable techniques to different problems, the problems around me. "Learning deep learning should not be the goal but a side effect" sounds useful in general; at the same time I wish there were a wider set of people actually using (as opposed to merely learning) the awesome methods and technology we have these days.

I don't know. If you go that route then it's just: Study mathematics or theoretical physics and then learn everything you need.

The problem is pulling that off. Sure if you are clever and pursue a PhD it's worth going the abstract theoretical route but in my personal experience as a somewhat stupid student it's fine to look for a niche and get hands-on experience on that.

Mastering Hadoop is still far from easy and if you need to use it without burning tons of CPU cycles it's even harder to get right.

I am interested in this, mostly as a computational biologist. Any intro on Hadoop? What is it used for?

Hadoop itself consists of different parts: HDFS is a distributed filesystem that can span lots of machines and stores data in blobs of varying sizes.

MapReduce is Google's idea, from before 2004, for how to do calculations on lots of data. Now there is also YARN, which could be described as a general job scheduler.

At the moment a lot of people use software on YARN like Spark (does more in memory, is faster, can use the GPU on the cluster machines).

So if you have biological data you could feed that into HDFS and have Spark or MapReduce jobs that process the data. The point of Hadoop is that you don't need to care about getting the clustering and distributed setup right. This is done by Hadoop for you. You program like you would program a single-threaded algorithm (at least in the simple cases).

E.g. this Python code counts words over as much data as you want in parallel: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapr...
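To give a feel for the programming model without a cluster, here is a local sketch of the map, shuffle, and reduce phases for word counting. It is not the code from the linked tutorial and it does not use Hadoop at all; `mapper`, `shuffle`, `reducer`, and `word_count` are names made up here, and the shuffle step that Hadoop performs across machines is imitated with an in-memory dictionary.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle phase: group all values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce phase: combine all values seen for one key."""
    return word, sum(counts)

def word_count(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(mapped).items())

counts = word_count(["I pat a dog", "I pat a cat"])
```

The mapper and reducer only ever see one line or one key at a time, which is what lets a framework run many copies of them in parallel on different machines.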

If you google each of these projects you'll find a lot of information.

Here are the original papers that should give a good idea:

- HDFS (Google File System was the original idea; Hadoop was a free implementation by Yahoo) - http://static.googleusercontent.com/media/research.google.co...

- MapReduce - http://static.googleusercontent.com/media/research.google.co...

- Spark: http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark...

If you want a good introduction read Hadoop: The Definitive Guide, 3rd Edition

We used it for building a search engine out of multiple terabytes of crawl data, something that doesn't fit well on a single computer.

You can do all kinds of computations, but graph problems or things that don't parallelize well often require other solutions beyond MapReduce, or more thought. MapReduce also only fits a certain class of problems: it's great for counting and aggregating stuff, but beyond that it's often not usable.

Also, it is boring, because 90% of the time, you will be waiting for your networks to be trained :)

In short, don't go for it because it is a fad. Over time, like the _data scientist_ buzzword, this too shall become something everyone can do. It is what you can do with Deep Learning that matters, and where you can push beyond this.

There's a bit of optimism about technology adoption here; I'm guessing Ferenc is part of a bleeding edge research community from how he talks about the ease of uptake and adoption for new technologies.

The reality is there are still companies looking for their first 'big data' data scientist years after this was a 'hot area'. While aggressive researchers, then aggressive companies, then fast follower researchers / companies, then ... all do eventually adopt useful technology, it can take decades in some cases.

So, do learn deep learning. But from my own experience, he's right in that you'll think "that's it?" once you've constructed a few networks and read through some basic 2014+ literature on tuning and tips. But, that will put you ahead of the VAST majority of technologists with respect to deep learning. Seriously.

I think he's 100% right that the search for usefulness / novel applications is where the game is going to be for deep networks, but it seems silly to tell people not to learn how to work with them.

What about reinforcement learning? It seems neglected to me. There are only one or two good reinforcement learning resources I'm aware of compared to the hundreds of books and blogs on supervised learning. Most machine learning books will mention supervised, unsupervised, and reinforcement learning in the first chapter, then never mention reinforcement learning again. This is my experience at least, and I've read a few introductory machine learning books.

Reinforcement learning can drive your car and mow your lawn. Supervised learning can tell you which of two pictures is a cat. Granted, supervised learning techniques can have an important place in the broader framework of reinforcement learning.

Reinforcement learning is one of the harder problems.

The hard thing here is reinforcing the right thing. I did a bunch of work on genetic programming and RL and it worked... uhh... too well. Once it succeeded in learning, near as I could tell, my hard disk geometry. There was a subtle timing difference between data sets and this seemed related to where they were on disk. The literature is loaded with anecdotes like that, some even more hilarious.

But you're right. Nature clearly does learn this way, at least for many things. Maybe the way forward is to combine it with supervised and unsupervised learning and let different forms of learning work together.

This kind of unintended learning is something that all machine learning practitioners should be careful with. The example of the satellite photo analysis algorithm that turned out to have learned to classify the difference between night and day photos comes to mind.

This means that your particular reinforcement learning example needs something similar to regularization of cost functions or cross validation for checking for over or under fitting to correct these latency effects.

It may be cheaper and more effective to use SSDs instead, but the magnitude of the latency effects in your case is not clear enough to determine whether your algorithm would benefit substantially from a modern SSD.

The fix was to (1) take steps to make sure the image was entirely pre-loaded into RAM before the test was given, and (2) randomize the latency of the load. That worked and I actually got some real results, which unfortunately were decent but not as impressive as the cheating was. :) When it cheated the thing got 100% right of course.

I also learned that rapid convergence on incredible accuracy is suspect.

This is what Deepmind seems to dedicate most of their efforts to, and their new AlphaGo model is a great example of it (NN + tree search Reinforcement Learning). Pedro Domingos's thesis in his new book is the potential for progress to be made by combining distinct schools within the field of ML.

You should write that up. I love examples of GAs and other optimization algorithms going haywire.

I think it's been too long. I don't even have the exact code anymore.

One of my faves was a GA that evolved circuits on FPGAs that would evolve wonderful, sometimes almost magical results... that only worked on the exact FPGA the evolver used to test phenotypes. Apparently it ultra-fitted to exact little quirks of that particular physical piece of silicon. They were never able to figure out exactly what those quirks were. It was a total mystery and the circuits were bizarre. The solution was to use a big pool of different FPGAs and mix them randomly.

I also heard once of a case where the use of a poor random function was problematic. Evidently Mr. Darwin is able to reverse engineer the simple three or four line K&R rand() function. After that I started actually using cryptographic PRNGs.

We came up with the term "herding shoggoths" for trying to get evolutionary systems to evolve the right thing.

Reinforcement learning involves low-information objective signals. Hence, it sounds "universal" but doesn't actually work very well.

