The part I am having trouble with, with ~5 years of real world data science experience in industry, is the implicit assumption that we all need or want to become good at deep learning.
In my experience, most businesses are, at best, still struggling with overfitting logistic regression in Excel, let alone implementing/integrating it with a production code base. And we all know that toy ML models that sit on laptops create ZERO value beyond fodder for Board presentations or moving the CMO's agenda forward.
The fact of the matter is that the vast majority of businesses, with respect to statistics/ML, aren't even doing super duper basic shit (like a random forest microservice that scores some sort of transaction) that might increase some metric 10%. This is due to a lack of sophisticated analytics infrastructure, bureaucracy, lack of talent, and being too scared of statistics. Ultimately, when you're rolling out a machine learning product internally (I'm not talking about Azure or other AWS model-training-as-a-service type things), the hardest part isn't: "We need to increase our accuracy by 2% by using Restricted Convolutional Recurrent Bayesian Machines!" The hardest part is convincing people you need to integrate a new process into a "production" workflow, and then maintaining that process.
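To make the contrast concrete, here is roughly what that "super duper basic" piece looks like: a toy sketch of a random forest scoring service. The feature names, training data, and the /score endpoint are all made up for illustration. The point is that this part is a few days of work; the integration, maintenance, and convincing are the months.

    # Toy sketch: random forest transaction scoring behind a tiny HTTP endpoint.
    # Feature names, training data, and the /score route are hypothetical.
    import numpy as np
    from flask import Flask, jsonify, request
    from sklearn.ensemble import RandomForestClassifier

    # Stand-in training data; in practice this comes from the warehouse.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 4))   # e.g. amount, hour, merchant risk, account age
    y_train = (X_train[:, 0] + X_train[:, 2] > 1).astype(int)  # toy "fraud" label

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    app = Flask(__name__)

    @app.route("/score", methods=["POST"])
    def score():
        # Expects JSON like {"features": [12.5, 3, 0.7, 2.1]}
        features = np.asarray(request.get_json()["features"]).reshape(1, -1)
        return jsonify({"score": float(model.predict_proba(features)[0, 1])})

    if __name__ == "__main__":
        app.run(port=5000)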
Completely agreed. My timeline for a project usually goes like:
A. 2-4 weeks: deeply understand the problem, talk to stakeholders, gather requirements, plan out the project.
B. 2 weeks: explore the problem and the data. Build and tweak models, build a functional prototype.
C. 8-24 weeks: put the system into production on top of the company's tech stack, either myself or working closely with engineers.
D. 4-12 weeks: sell the system internally, prove that it's a superior solution, get buy-in that it should replace existing processes.
So yeah, in a typical 6 month project I only spend about 5-10% of my time on actual data and modeling. This % has gone sharply down as my career has progressed.
Take word2vec for example. It's a two-layer (i.e. not deep) neural network that uses a relatively simple training algorithm. Yet, after 30 minutes it can learn impressive relationships between English words. The method is only 3 years old and businesses continue to find novel ways to apply word2vec. Check out this article:
They are using word2vec to find clothing items that resemble other products by adding or subtracting descriptors such as "stripes", "blue", or even "pregnant."
For my company, we are attempting to normalize medical terminology using word2vec. Here are a couple examples:
1. "myocardial" + "infarction" = "MI"
2. "ring" + "finger" = "fourth" + "metacarpal"
We never would have thought to normalize the names of different body parts, but word2vec does it for us. There's a great deal yet to learn about how this model alone works, let alone deeper networks.
Source (great paper):
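For the curious, here is roughly how we query those analogies, using gensim. The model file and the exact vocabulary are assumptions; you'd need word2vec vectors trained on a clinical corpus that actually contains these terms.

    # Sketch of querying word2vec analogies with gensim (hypothetical model file).
    from gensim.models import KeyedVectors

    # Vectors trained on clinical notes (assumed to exist and contain these terms).
    vectors = KeyedVectors.load_word2vec_format("clinical_word2vec.bin", binary=True)

    # "myocardial" + "infarction" should land near "MI".
    print(vectors.most_similar(positive=["myocardial", "infarction"], topn=5))

    # "ring" + "finger" should land near "fourth" + "metacarpal".
    print(vectors.most_similar(positive=["ring", "finger"], topn=5))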
Don't think of it as a two-layer neural network. What it really is is a physics system: a universe where each word is a planet. When words appear in the same context, that assigns some gravitational pull to the "planets". Eventually, words with the same meaning will cluster into a galaxy.
For example, look at the following two sentences:
1. I pat a dog.
2. I pat a cat.
Dog and cat appear in the same context. The first time we see the word dog, we calculate the center of mass of "I", "pat", and "a". Think of them as little planets.
Then you give the planet "dog" a little force to pull it toward the center of mass of the previous three words.
When you see the word cat, you do the same, since it appears in the same context.
Eventually, cat and dog will cluster together with other similar words and form a galaxy, and the galaxy will be called "pet".
This page has the clearest explanation of word embeddings, the objective function, and why vector translation captures meaning.
In the word2vec world, the pull only applies between words found in similar contexts. That's one major difference.
I think there is also an anti-gravitational force that pushes words away, but again I need to double-check.
Notice that I only pull the word "dog" toward the center of gravity of the other words, instead of pulling all of them together. I think the full version even pushes the other words away from that center of gravity, but I need to double-check the math.
This is not just an analogy; this is what word2vec's math says.
The only analogy part is that word2vec works in a high-dimensional space, while my analogy is in three dimensions.
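Here's the analogy as a deliberately simplified numpy sketch. Real word2vec (CBOW/skip-gram with negative sampling) keeps separate input and output vectors and scales each update with a sigmoid gradient, which keeps things bounded; this toy only keeps the pull-toward-the-context / push-negatives-away structure, and only moves the target word, as described above.

    # Toy "gravity" version of word2vec-style training. Not the real update rule:
    # just the pull/push structure, in 3 dimensions to match the planet analogy.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["i", "pat", "a", "dog", "cat", "car"]
    idx = {w: i for i, w in enumerate(vocab)}
    vectors = rng.normal(scale=0.1, size=(len(vocab), 3))

    def train_step(context, target, lr=0.1):
        """Pull `target` toward the center of mass of `context`;
        push one random negative word away from that center."""
        center = vectors[[idx[w] for w in context]].mean(axis=0)      # center of mass
        vectors[idx[target]] += lr * (center - vectors[idx[target]])  # gravitational pull
        negative = rng.choice([w for w in vocab if w not in context + [target]])
        vectors[idx[negative]] -= lr * (center - vectors[idx[negative]])  # anti-gravity push

    for _ in range(50):
        train_step(["i", "pat", "a"], "dog")
        train_step(["i", "pat", "a"], "cat")

    # dog and cat both get pulled toward the same center of mass, so they end up
    # close together (the "pet" galaxy); car only ever gets pushed away.
    print(np.linalg.norm(vectors[idx["dog"]] - vectors[idx["cat"]]))  # small
    print(np.linalg.norm(vectors[idx["dog"]] - vectors[idx["car"]]))  # much larger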
I have noticed that Google seems to have recently incorporated something like this as one of its scoring factors. That's fine if you're searching for the latest Kardashian faux pas (and such searches dominate Google usage IMO, hence the results of any A/B test, I suspect), but my searches for obscure technical terms have been returning much less relevant results than they once did.
Has anyone else noticed this?
Big data, deep learning? This stuff may be "obvious" for a computer scientist or a competent developer, but it sure as hell is not "obvious" to a lot of real world businesses out there, and there's a lot of value to be created by applying these things.
This very much still holds true...excellent point...
I've been in the profession for a number of years, paid attention to detail, and put all my efforts into nailing the skills I've observed to be relevant...
Much to my amusement, I learned a few years ago that some in my network of friends had given me the nickname of "the fixer"...which led to more referrals than I could possibly handle...
I know what I know, and if you call me in I may take a bit more time than you'd like to fix your problem(s), but when I'm done it will be "fixed"....
That's still worth quite a bit, even though the pace has picked up dramatically over the years...your "core" needs to be rock solid for long-term viability...
My advice to beginners is master something, then master something else...before long your reputation will precede you in ways that you'll be delighted by...
"My advice to beginners is master something, then master something else...before long your reputation will precede you in ways that you'll be delighted by..."
...is totally true. It's why I tried to wear many hats. At some point, wearing too many makes you look like you might not be that good at any one. So, the next trick I learned is to make different versions of a resume that leave off stuff and target specific audiences. Each one might think you only have 2-3 skills. High skill or mastery in them is still believable at that point. So, let them believe. :)
I'm full stack in 3, often 4 (depending), discrete environments...that took years, and (still takes) 50-some-odd active bookmarks and RSS feeds...my traveling "hotshot" case has 34 thumb drives loaded with goodies...a good memory helps a great deal...
Regardless, I never considered that use case.
> Clearly, at the speed things are evolving, there seems to be no time for a PhD.
Sure, if you want to innovate in the field of academic AI, you probably need to know much more than how to throw some data at a neural net. But there are SO MANY problems supervised learning can solve out there - much like there are so many people and businesses who need a website done.
It's a tool, like any other - and right now it seems very hard to beat. So please go ahead, learn how to use it, and apply it to your own problem domain.
But that's very solid advice, not just for an academic. A lot of the web development jobs are either outsourced or marginalized by newer and more powerful frameworks. It's an unfortunate side effect of the fast moving field of informatics - your skills become either redundant or invaluable very quickly.
I don't want to discourage anyone from learning hard topics - it can be very rewarding and useful. I just object to the sentiment that learning how to use a tool has no value if you don't have an intimate understanding of it. It still gives you the power to solve new problems.
I think the author is assuming that "data scientist" means scientist more than developer.
But I agree, there's more than enough applications coming out of this for developers.
I'm not saying that knowing fundamental (and/or difficult) topics is not useful - it absolutely is! It's just a matter of prioritizing what you learn about. I think if you want to maximize your impact, it makes sense to invest in learning the currently-most-promising tool before going for things with lower reward/effort ratio.
"I’ve heard people compare knowledge of a topic to a tree. If you don’t fully get it, it’s like a tree in your head with no trunk—and without a trunk, when you learn something new about the topic—a new branch or leaf of the tree—there’s nothing for it to hang onto, so it just falls away. By clearing out fog all the way to the bottom, I build a tree trunk in my head, and from then on, all new information can hold on, which makes that topic forever more interesting and productive to learn about. And what I usually find is that so many of the topics I’ve pegged as “boring” in my head are actually just foggy to me—like watching episode 17 of a great show, which would be boring if you didn’t have the tree trunk of the back story and characters in place."
(Of course, deep learning certainly requires mathematics and engineering skills, but the driving force behind their use remains vague and difficult to grasp.)
This means that in pure mathematics you can build on your previous results, but in deep learning you cannot (without danger).
With respect to deep learning, most of the work I see involves pulling existing networks off of the Caffe Zoo and retraining them with proprietary datasets or fiddling with Theano to reproduce the works of others. These efforts mostly end in failure, usually after a blind hyperparameter search.
The easiest way to separate the winners from the losers in deep learning and elsewhere is to ask them to derive the method they are using and explain what it's doing. For the most part, I find the above people are incapable of doing so for backpropagation. Bonus points if they can write (relatively) efficient C++/CUDA code to implement backpropagation from the ground up.
Substitute PCA, ICA, GPs, GMMs, random forests, or any other technology/algorithm into the above paragraph. If one has a decent grasp of the underlying logic, the instincts for adapting it to new domains/datasets come at no additional charge.
i.e. There is a difference between being able to re-implement an algorithm, and understanding why that algorithm has the behavior it has. You'll probably write cleaner and more maintainable code, quicker, in the latter, and be able to make modifications as necessary.
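To make the bar concrete, here is the kind of thing I mean, sketched in plain numpy rather than C++/CUDA: backpropagation for a one-hidden-layer network learning XOR. The architecture and toy data are my own choices for illustration, not anyone's production code.

    # Backpropagation from the ground up: one hidden layer, sigmoid activations,
    # cross-entropy loss, full-batch gradient descent on the XOR problem.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

    W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
    W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    lr = 1.0
    for step in range(10000):
        # forward pass
        h = sigmoid(X @ W1 + b1)                 # hidden activations
        p = sigmoid(h @ W2 + b2)                 # predictions

        # backward pass: the chain rule, layer by layer.
        # sigmoid output + cross-entropy loss => dL/d(logit2) = p - y
        d_logit2 = (p - y) / len(X)
        dW2 = h.T @ d_logit2; db2 = d_logit2.sum(axis=0)
        d_h = d_logit2 @ W2.T
        d_logit1 = d_h * h * (1 - h)             # sigmoid derivative
        dW1 = X.T @ d_logit1; db1 = d_logit1.sum(axis=0)

        # gradient descent step
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    print(np.round(p.ravel(), 2))   # should approach [0, 1, 1, 0]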
The problem is pulling that off. Sure, if you are clever and pursue a PhD it's worth going the abstract theoretical route, but in my personal experience as a somewhat stupid student it's fine to look for a niche and get hands-on experience with it.
Mastering Hadoop is still far from easy, and if you need to use it without burning tons of CPU cycles, it's even harder to get right.
MapReduce is the Google idea from before 2004 for how to do calculations on lots of data. Now there is also YARN, which could be described as a general job scheduler.
At the moment a lot of people run software on YARN, like Spark (does more in memory, is faster, can use the GPUs on the cluster machines).
So if you have biological data you could feed it into HDFS and have Spark or MapReduce jobs process the data. The nice thing about Hadoop is that you don't need to care about getting the clustering and distributed setup right; Hadoop does that for you. You program like you would program a single-threaded algorithm (at least in the simple cases).
E.g. this Python code counts words over as much data as you want in parallel: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapr...
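The link is truncated, but the idea is roughly this (a sketch in the spirit of Hadoop Streaming word counting, not the tutorial's exact code): a mapper that emits "word TAB 1" pairs and a reducer that sums them, with Hadoop handling the splitting, shuffling, and sorting in between. You submit the pair with the hadoop-streaming jar, pointing -mapper/-reducer at the scripts and -input/-output at HDFS paths.

    # mapper.py -- reads raw text lines on stdin, emits "word<TAB>1" pairs on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop sorts mapper output by key before this runs, so all
    # counts for a given word arrive consecutively and can be summed in one pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")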
If you google each of these projects you'll find a lot of information.
Here are the original papers that should give a good idea:
- HDFS (the Google File System was the original idea; Hadoop was a free implementation by Yahoo) - http://static.googleusercontent.com/media/research.google.co...
- MapReduce -
- Spark: http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark...
If you want a good introduction read Hadoop: The Definitive Guide, 3rd Edition
We used it for building a search engine out of multiple terabytes of crawl data - something that doesn't fit well on a single computer.
You can do all kinds of computations, but graph problems or things that don't parallelize well often require other solutions beyond MapReduce, or more thought. MapReduce only fits a certain class of problems: it's great for counting and aggregating stuff, but beyond that it's often not usable.
The reality is there are still companies looking for their first 'big data' data scientist years after this was a 'hot area'. While aggressive researchers, then aggressive companies, then fast-follower researchers/companies, then ... all do eventually adopt useful technology, it can take decades in some cases.
So, do learn deep learning. But from my own experience, he's right in that you'll think "that's it?" once you've constructed a few networks and read through some basic 2014+ literature on tuning and tips. But that will put you ahead of the VAST majority of technologists re: deep learning. Seriously.
I think he's 100% right that the search for usefulness / novel applications is where the game is going to be for deep networks, but it seems silly to tell people not to learn how to work with them.
Reinforcement learning can drive your car and mow your lawn. Supervised learning can tell you which of two pictures is a cat. Granted, supervised learning techniques can have an important place in the broader framework of reinforcement learning.
Reinforcement learning is one of the harder problems.
But you're right. Nature clearly does learn this way, at least for many things. Maybe the way forward is to combine it with supervised and unsupervised learning and let different forms of learning work together.
This means that your particular reinforcement learning example needs something similar to regularization of the cost function, or cross-validation to check for over- or under-fitting, in order to correct these latency effects.
It may be cheaper and more effective to use SSDs instead, but the magnitude of the latency effects in your case isn't clear enough to determine whether your algorithm would benefit substantially from a modern SSD.
I also learned that rapid convergence on incredible accuracy is suspect.
One of my faves was a GA that evolved circuits on FPGAs and would produce wonderful, sometimes almost magical results... that only worked on the exact FPGA the evolver used to test phenotypes. Apparently it had overfitted to the exact little quirks of that particular physical piece of silicon. They were never able to figure out exactly what those quirks were. It was a total mystery and the circuits were bizarre. The solution was to use a big pool of different FPGAs and mix them randomly.
I also once heard of a case where the use of a poor random number generator was problematic. Evidently Mr. Darwin is able to reverse-engineer the simple three- or four-line K&R rand() function. After that I started actually using cryptographic PRNGs.
We came up with the term "herding shoggoths" for trying to get evolutionary systems to evolve the right thing.