In my experience there are very few domains within machine learning where you can draw useful conclusions from the data without being an expert in the field.
Even if you have a high-level conceptual understanding of the statistical methods, tuning the parameters to yield something relevant -- let alone adapting existing algorithms to meet your needs -- requires some pretty serious dedication to the field.
What scares me is that MOOCs are really pushing the data scientist field -- see Udacity and Coursera -- but giving people only enough knowledge to be dangerous. It's tough because data science is a fascinating field and many people who have the interest and aptitude don't have the means or life situation to go to grad school for it. These MOOCs are trying to appeal to such people, but they're not nearly rigorous enough.
The popular MOOCs don't take you far enough to start doing serious machine learning, but you don't need a PhD to be ready to solve those problems.
It takes work. Lots of work. Re-learn linear algebra until you know why "eigenvectors" are so important. Know what the most important matrix factorizations (LU, QR, SVD, Eigen, Cholesky) do. Read the papers until the math becomes "no big deal". Pick up a probability textbook and read the whole thing; also, get a working knowledge of real analysis. It won't happen quickly.
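The factorizations listed above can all be poked at directly in NumPy (LU lives in SciPy); a quick sketch with a random matrix, just to make the list concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A @ A.T + 4 * np.eye(4)      # symmetric positive definite

Q, R = np.linalg.qr(A)           # QR: orthogonal times upper-triangular
U, s, Vt = np.linalg.svd(A)      # SVD: rotation, scaling, rotation
w, V = np.linalg.eigh(S)         # eigendecomposition of a symmetric matrix
C = np.linalg.cholesky(S)        # Cholesky: lower-triangular "square root"
# (LU is scipy.linalg.lu: Gaussian elimination with pivoting.)

# Each factorization reconstructs the matrix it came from:
assert np.allclose(Q @ R, A)
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(V @ np.diag(w) @ V.T, S)
assert np.allclose(C @ C.T, S)
```

Knowing *why* each of these exists (least squares, low-rank approximation, solving SPD systems) is the part that takes the work.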
The PhD is some classes, plus 3-7 years of focused work. Some of that's compressible and unnecessary to becoming a data scientist. Some of it isn't. The Coursera courses are great for getting you started; they're entry-level college courses, and if you read the papers and the seminal textbooks (e.g. Elements of Statistical Learning by Hastie et al) you can get into the intermediate territory in a couple years or so. It's not easy, but it can definitely be done. Getting to the expert level, I think, just requires real-world experience on real-world problems... but, one hopes, you can start attacking such problems once you're at the intermediate level.
In terms of mathematical difficulty, Elements and the like (but also current research papers) are readable by anyone with a solid understanding of undergraduate mathematics (which is essentially a decent linear algebra course, multivariate analysis, a probability course, and a numerical computing course).
I think the reason why employers look for candidates with a PhD is that too many people "scrape by" when getting their CS degree -- e.g. they somehow fulfilled the required coursework, and somehow got their degree. The PhD requirement is essentially a bureaucratic substitute for answering the question "has this person understood math in sufficient depth to be able to do independent work with it?"
That being said, I haven't given up completely. I'm starting to read "The Haskell Road to Logic, Maths, and Programming" in the hopes of finally being able to grok proofs. At the very least, I feel that learning more math can only help me as a developer.
For others reading this, this edx course on Probability seemed like it was really good, until my lack of maths background caught up to me: https://www.edx.org/course/mitx/mitx-6-041x-introduction-pro... For Linear Algebra, check out http://www.ulaff.net/
I see it almost the other way around: companies strictly demand PhDs for Big Data jobs and can't find this unicorn. Yet we live in a time where we don't need a PhD program to receive an education from the likes of Ng, LeCun and Langford. We live in a time where curiosity and dedication can net you valuable results. Where CUDA hackers can beat university teams. The entire field of big data visualization requires innate aptitude and creativity, not so much an expensive PhD program. I suspect Paul Graham, when solving his spam problem with ML, benefited more from his philosophy education than his computer science education.
Of course, having a PhD still shows dedication and talent. But it is no guarantee of practical ML skills, and it can even hamper research and results when too much weight is given to theory and reputation is at stake.
In my experience machine learning was locked up in academia, and even there it was subdivided. The idea that "you need to be an ML expert before you can run an algo" is detrimental to the field, and does little to help wider industry adoption of ML. Those ML experts set the academic benchmarks that amateurs were then able to beat by trying out Random Forests and Gradient Boosting.
I predict that ML will become part of the IT-stack, as much as databases have. Nowadays, you do not need to be a certified DBA to set up a database. It is helpful and in some cases heavily advisable, but databases now see a much wider adoption by laypeople. This is starting to happen in ML. I think more hobbyists are right now toying with convolutional neural networks, than there are serious researchers in this area. These hobbyists can surely find and contribute valuable practical insights.
Tuning parameters is basically a gridsearch. You can bruteforce this. In goes some ranges of parameters, out come the best params found. Fairly easy to explain to a programmer.
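A minimal sketch of what that brute-force search looks like (the objective function and parameter names here are invented stand-ins; a real search would train and score a model at each point):

```python
from itertools import product

# Toy "validation error" as a function of two hyperparameters.
def validation_error(learning_rate, regularization):
    return (learning_rate - 0.1) ** 2 + (regularization - 0.01) ** 2

grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "regularization": [0.0, 0.01, 0.1],
}

# In go some ranges of parameters, out come the best params found.
best_params, best_err = None, float("inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    err = validation_error(**params)
    if err < best_err:
        best_params, best_err = params, err

print(best_params)  # -> {'learning_rate': 0.1, 'regularization': 0.01}
```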
Adapting existing algorithms is ML researcher territory. That is a few miles above the business people extracting valuable/actionable insight from (big or small or tedious) data. Also, there is a wide range of big data engineers making it physically possible for the "necessary" PhDs to extract value from Big Data.
> Tuning parameters is basically a gridsearch. You can bruteforce this. In goes some ranges of parameters, out come the best params found.
This sounds so simple. However, if you just do a bruteforce grid search and call it a day, you're most likely going to overfit your model to the data. This is what I've seen happen when amateurs (for lack of a better word) build ML systems:
(1) You'll get tremendously good accuracies on your training dataset with grid search
(2) Business decisions will be made based on the high accuracy numbers you're seeing (90%? wow! we've got a helluva product here!)
(3) The model will be deployed to production.
(4) Accuracies will be much lower, perhaps 5-10% lower if you're lucky, perhaps a lot more.
(5) Scramble to explain low accuracies, various heuristics put in place, ad-hoc data transforms, retrain models on new data -- all essentially groping in the dark, because now there's a fire and you can't afford the time to learn about model regularization and cross-validation techniques.
And eventually you'll have a patchwork of spaghetti that is perhaps ML, perhaps just heuristics mashed together. So while there's value in being practical, when ML becomes a commodity enough to be in an IT stack, it is likely no longer considered ML.
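The guard the comment above is pointing at is simple to state: score each parameter setting by cross-validation on the training data, and report the final number from a held-out test set the search never touched. A sketch with scikit-learn (synthetic data, arbitrary parameter grid):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # inverse regularization strength
    cv=5,                                      # 5-fold cross-validation
)
search.fit(X_train, y_train)

# The number to put in front of the business is the test-set score,
# not the (optimistic) score the grid search was tuned on.
print(search.best_params_, search.score(X_test, y_test))
```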
Trying to emulate biological brains might not be the way forward. People tried to fly by constructing bird-like feathers and wings - it obviously didn't work. We had to understand the underlying principles governing flight. The same applies to creating neural networks.
There's some underlying principle the brain uses. It doesn't mean we have to crack brain structure to achieve strong intelligence.
We should look for inspiration in biological systems but we should not try to copy them.
See experiments such as this one: http://web.mit.edu/msur/www/publications/Newton_Sur04.pdf
The plasticity of the brain is phenomenal.
The fact that you think it takes a PhD to be a decent data scientist indicates that you're out of your depth on this one.
You don't need a PhD to get useful work done in these fields. You need to work hard and tackle difficult math. It takes years, but it can be done if you have the talent and drive. A prestigious (top-10) PhD certainly makes your life easier in getting the top jobs, but it doesn't really make you more (or less) able to fulfill them. I don't have a PhD and can do what 95+ percent of "PhD Data Scientists" do for work.
A PhD is (a) focused work on a specific, usually narrow problem, and (b) the years of self-study required to know enough to attack said problem. For real-world data science, (a) only matters in the ~1% chance of overlap between your dissertation and the needs of your employer, and (b) doesn't require five years in an academic institution (although it probably does require about that much time if you study on your own, since you're likely to be doing much of the work on your own time).
The PhD is a valuable experience and I don't mean to denigrate it. I often wish I had gotten one, in my 20s, instead of becoming a world-class expert on software office politics and "only" an intermediate-plus Haskell/Clojure/machine-learning guy. The PhD is a great experience for many people, but I don't think it belongs on a pedestal.
You can look at some of the modern ML algorithms and see what I mean; many people that I know have worked with Latent Dirichlet Allocation, but they have no idea how the model works, and there's no way they could extend it to work online or under certain performance or storage constraints without having spent months or years working on that problem.
That's not a realistic expectation for anyone in the field. Yes, the algorithms you find in Weka and other ML toolkits are useful, but the actual "Big Data" problems have their own performance and algorithmic constraints that are far, far beyond dedicated self-learners.
This feedback loop explains a great chunk of why we on HN spend so much time nit-picking through stories on e.g. Wired. What we read is not so much "reporting," but designs-by-committee: researchers doing things they think the public wants/needs, and reporters bending stories toward what they think the public wants and needs.
“Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.
In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.”
It's actually worse than that. What I see is that when companies have the ability to store and "analyze" large amounts of data, their appetite for data tends to increase. So they seek to take in as much data as they can find. More often than not, the quality of the data is mixed at best. Frequently, it's horrible, and because the focus is on data acquisition and not data quality, nobody notices the bad data, missing data and duplicate data.
The result: even if you manage to come up with decent hypotheses, you can't trust the data on which you test them.
This is not necessarily a bad thing. Take the domain of application performance management. You're collecting hundreds of thousands of metrics from all over the place, OS, network, middleware, end user. Occasionally there is a performance problem that is non-obvious. You go through the obvious metrics and find nothing. It is a great thing at this point to just throw all this data at some algorithm and let it come back to you with "metric X, Y, Z looks related". This gives me some hypothesis I can go check that I would probably never have thought of on my own. And I have a direct way of verifying if it was a correct hypothesis: oh, it looks like there's 2 disks in this cluster, 1 is running at 100% the other at 0% so the overall utilization only shows 50%, I didn't think that was a problem. Investigate. Oh this disk has compression enabled, the other doesn't, turn it off, the application runs fast now.
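A toy version of "throw all this data at some algorithm": rank candidate metrics by their correlation with the symptom and hand the top ones to a human as hypotheses. The metric names and data below are invented for illustration; real APM tooling does something considerably more robust than plain correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
response_time = rng.standard_normal(n).cumsum()   # the symptom we noticed

metrics = {
    "disk_util_a": response_time + rng.standard_normal(n) * 0.5,  # planted: related
    "cpu_user":    rng.standard_normal(n).cumsum(),               # unrelated walk
    "net_retrans": rng.standard_normal(n).cumsum(),               # unrelated walk
}

# Sort metrics by absolute correlation with the symptom, strongest first.
ranked = sorted(
    metrics,
    key=lambda m: abs(np.corrcoef(response_time, metrics[m])[0, 1]),
    reverse=True,
)
print(ranked[0])  # the planted "disk_util_a" surfaces as the hypothesis to check
```

The point, as above: the algorithm only proposes "metric X looks related"; a human still has to go verify the causal story (the half-idle disk, the compression setting).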
Just what would the energy function look like for real neurons? I don't know but we do know that the activation function would have to "spike" in bursts. So that is a clue. We also have rudimentary ideas about the learning rule used in biological neural networks, so you would also want to take this into account when determining the actual energy function. Finally, real neurons do not send retrograde signals but are instead wired recurrently, which must also be taken into consideration.
back propagation = reverse-mode (adjoint) differentiation = the chain rule, applied systematically (forward-mode differentiation is the same chain rule, run in the other direction)
and that different disciplines have different words for what is just the chain rule.
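The point is easy to see on a tiny composed function: backprop is just the chain rule evaluated step by step from the output back to the input. A sketch (checked against a numerical derivative):

```python
import math

# f(x) = sin(x^2), written as a two-step "network":
def f_and_grad(x):
    # forward pass
    a = x ** 2            # a = x^2
    y = math.sin(a)       # y = sin(a)
    # backward (adjoint) pass: chain rule, one factor per step
    dy_da = math.cos(a)   # d sin(a) / da
    dy_dx = dy_da * 2 * x # da / dx = 2x
    return y, dy_dx

x = 0.7
_, grad = f_and_grad(x)

# Sanity check with a central finite difference.
h = 1e-6
numeric = (f_and_grad(x + h)[0] - f_and_grad(x - h)[0]) / (2 * h)
assert abs(grad - numeric) < 1e-6
```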
What I would say, though, is that I think it is less an issue of the statistical strength of the data, and more one of the methods used to turn the data itself into statistics. For example, I was working with what by now (size projections are paramount in sysadmin planning for stuff like this) should be close to a petabyte's worth of genetic data. The real issue we were running into was that the traditional tools tend to fall apart on data of this size.
What we ended up doing was writing a distribution protocol for a certain application that worked well but wasn't very concurrent, and then every machine on the network besides the storage/sequencers/backup would crunch the data, helping even the big servers out. A big server would get 10-30 workers and a workstation would get 1-4. We turned 2 day analysis into 4 hour analysis.
And once we did the analysis, only one person, the company owner/genius, could decipher it.
I have to say, as a sysadmin, it was probably one of the most challenging and most educational positions I ever had. I actually enjoyed always being the only person in the room without a Phd.
I can see how some people might feel like being between a rock and a hard place: The data firehoses are all in place, our key-value stores are getting fuller by the hour, and we're supposed to sit and wait for decades before we'll be able to make any sense of it? I wouldn't be surprised if some will much rather play roulette today than make a sure bet in 10+ yrs.
"we have no idea how neurons are storing information, how they are computing, what the rules are, what the algorithms are, what the representations are, and the like."
"...you get an output from the end of the layers, and you propagate a signal backwards through the layers to change all the parameters. It’s pretty clear the brain doesn’t do something like that. "
So why can't the brain do some kind of backpropagation?
When we don't know how a thing works that does not preclude us being able to eliminate some of the possibilities. Our ignorance is profound, but not total.
As an aside, are they really trying to patent a slight twist on backpropagation? That seems pretty counter-productive to me.
For some neural nets, you still have a gradient, but the concept of back or forward propagation is not definable. Based on the topology and structure of biological neural nets, what would you think is the case?
Overall though, I think there is a lot of truth to his message.
Hebbian learning .... maybe. I never thought too much about contrastive divergence.
It doesn't really matter; brains most certainly don't work like ANNs, except maybe in some weird mean field sense for a few things like liquid state machines. It would be a huge coincidence if LeCun or Hinton or whoever magically wrote down the brain equation....
It's Hebbian learning. When a post-synaptic neuron fires shortly after a pre-synaptic one fires, the synapse in question is strengthened (the surface area actually becomes larger). I hope he's talking about higher level concepts of learning, because otherwise he's wrong.
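The timing-dependent rule described above can be cartooned in a few lines. Everything here (spike times, learning rate, time window, the exponential weighting) is an illustrative assumption, not a model of real synapses:

```python
import math

def hebbian_update(w, pre_spike_t, post_spike_t, lr=0.1, window=20.0):
    """Strengthen the synapse when the post-synaptic neuron fires
    shortly AFTER the pre-synaptic one ("fire together, wire together")."""
    dt = post_spike_t - pre_spike_t
    if 0 < dt < window:
        w += lr * math.exp(-dt / window)  # closer in time -> bigger change
    return w

w = 1.0
w = hebbian_update(w, pre_spike_t=10.0, post_spike_t=15.0)  # causal pairing
assert w > 1.0  # synapse strengthened

w2 = hebbian_update(1.0, pre_spike_t=15.0, post_spike_t=10.0)  # wrong order
assert w2 == 1.0  # no change
```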
It seems like the idea is that machine learning and data-driven inference have to grow up and become a real scientific discipline. "Why can't you be more like Civil Engineering?" This isn't the best way to look at it. Machine learning is designed for situations where data is limited and there are no guarantees. Take Amazon's recommendation engine for example. It's not possible to peer into someone's mind and come up with a mathematical proof that states whether they will like or dislike John Grisham novels. A data-driven model can use inference to make predictions based on the person's rating history, demographic profile, etc. It's true that many machine learning approaches don't have the scientific heft of civil engineering, but they are still very useful in many situations.
I'm not disagreeing with the eminence of Michael I. Jordan. I think this is a philosophical question with no correct answer. Is the world deterministic, can we model everything with rigorous physics style equations? Or is it probabilistic, are we always making inferences based on a limited amount of data? Both of those views are valid, especially in different contexts. Some of the most interesting problems are inherently probabilistic, such as predicting the weather, economic trends and the behavior of our own bodies. "Big Data" is obviously a stupid buzzword, but the concept of data driven decision making is very sound. We should put less focus on media hype terms and continue to encourage people to make use of large amounts of information. Get rid of the bathwater, keep the baby.
> Similarly here, if people use data and inferences they can make with the data without any concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re an engineer and a statistician—then you will make lots of predictions, and there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.
He is not saying anything about the relative heft of machine learning and civil engineering. He is saying that if you don't worry about whether your predictions coming from big data are accurate, and whether you know a priori that they are accurate, you will still make predictions, but some of them will be wrong, and you don't know which ones. The analogy with engineering is only incidental to his point, which is mainly about overfitting.
You can point out afterwards that a certain prediction made using big data was correct in hindsight by collecting data after the prediction was used to make some decisions, like Amazon might. But you would really like to know whether a decision is likely to be a good one before you make it. And he, as a scientist, is interested in knowing for sure whether his results are correct.
So I typed that in google just to see and indeed I got nothing. I guess their knowledge graph still has a long way to go.
For example, you can set a query "what's the 2nd biggest city in CA not near the river that has weather same as Seattle and is not among the top 500 cities in US".
As you can see, a generalized query would literally require the system to create a program on its own. If we could do this, we would not need programmers, and it would very likely be the same kind of breakthrough as a practically unlimited supply of energy.
I was playing with it, had to go to a meeting and forgot I'd modified the question.
Can't even handle
"What is the second biggest city in northern California"
I don't think that anybody in the research community (except for maybe an occasional crazy) believes that neural networks have any biological significance beyond inspiration. NIPS (Neural Information Processing Systems) has been a reputable venue for work in statistics for some years now with no confusion over the idea that "Neural" does not mean a precise (or even imprecise) imitation of biological neurons.
"Well, I want to be a little careful here. I think it’s important to distinguish two areas where the word neural is currently being used.
One of them is in deep learning. And there, each “neuron” is really a cartoon. It’s a linear-weighted sum that’s passed through a nonlinearity. Anyone in electrical engineering would recognize those kinds of nonlinear systems. Calling that a neuron is clearly, at best, a shorthand. It’s really a cartoon. There is a procedure called logistic regression in statistics that dates from the 1950s, which had nothing to do with neurons but which is exactly the same little piece of architecture."
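The "cartoon" he describes is literally a linear weighted sum passed through a nonlinearity; with a sigmoid nonlinearity it is exactly logistic regression's prediction function. A sketch (the weights here are arbitrary):

```python
import math

def neuron(inputs, weights, bias):
    """One 'cartoon neuron': weighted sum, then a logistic sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, and sigmoid(0) = 0.5:
print(neuron([1.0, 2.0], weights=[0.5, -0.25], bias=0.0))  # -> 0.5
```

Stack layers of these and you have a deep net; take one of them alone and a statistician from the 1950s would call it logistic regression.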