My experience: I had basically zero math background, but I took ML with Ng and probabilistic graphical models with Koller during my master's degree, later TA'd Ng's ML class, and thought I was all set to go into machine learning jobs. To my surprise, I was consistently stumped in interviews by questions from basic stats, particularly significance testing, which people with more traditional stats backgrounds assume is basic knowledge (and it should be), but which wasn't taught in any of my ML classes.
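For what it's worth, the significance-testing questions I kept getting wrong fit in a few lines of code. A minimal sketch of Welch's two-sample t statistic (the sample numbers here are made up):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic: how many standard errors
    apart are the two sample means? (Unequal variances allowed.)"""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical A/B-test samples (made-up numbers):
control = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]
treatment = [4.6, 4.4, 4.8, 4.5, 4.7, 4.3]
t = welch_t(treatment, control)  # compare against a t distribution for a p-value
```

None of this is deep, which is exactly why interviewers treat it as table stakes.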
I'm in a job now that involves some machine learning, but the ML component is 50% marshalling data (formatting, cleaning, moving), 40% trying to figure out how to get enough validated training examples, and 10% thinking about the right classifier to use (which someone else already implemented). Which to be honest is not very interesting.
So yeah, becoming a real data scientist is hard, requires a lot more knowledge than you get in one ML course, even from Andrew Ng, and the reality of the work often doesn't make it some dream career. And the competition for jobs isn't from other people who also just took that course -- it's from PhD statisticians and statistical physicists who might have taken one ML class to show them how to use all the mathematical tools they already have to do the new hot thing called machine learning.
And this can be powerful, of course, but it doesn't really have much of the magic of AI.
This is a key insight. Bravo!
To generalize a bit more, most of ML is applied mathematics. Getting a good grounding in the underlying math is the most illuminating step in learning ML (spoken as someone who wasted a lot of time doing other things thanks to an irrational fear of learning mathematics, and who is still bad at it).
Deep math/stats understanding, combined with the engineering bits (like programming, cleaning the data, running clusters) and the communication bits (like visualization), brings you to (what should be) 'data science' (imvvho, ymmv, etc.).
I am still not sure one person can pull it all off; it probably needs a solid team of specialists. But hey, 'data scientist' is a hot job description, so you can't blame people who know bits and pieces (sometimes very small bits and pieces ;) ) for calling themselves 'data scientists' or whatever. "Machine Learning for Hackers" and all that jazz. We've seen all this before with "HTML coders" in the nineties.
Except for running the clusters, I've done pretty much all of those steps myself. I started with a nice statistical idea, built some simple models, played with feature selection and learning algorithms, built model viewers, built classifiers, validated classifiers, built demos, validated demos, built a production implementation, optimized the production implementation to make it small/fast enough, and finally launched a big search quality improvement.
 I certainly write distributed code that runs on them, but maintaining the DCs definitely isn't part of my job description.
Validation of the final quality in prod is actually someone else's job -- not because I couldn't do it, but because you might not want me telling you how good my own stuff is; you know, I might be biased.
But in college you never get to apply them to real problems. I think if you apply them to real problems you'll have the revelation. Especially when you try other approaches first. But actually applying them requires domain knowledge, data cleaning skills, and programming skills beyond what many people have (certainly myself as an undergraduate).
I think people's expectation of a new scientific or engineering discipline to be "sexy" or "magic" is simply a sign of widespread ignorance of the field, so it's a good thing when something becomes less "magic" and more "real". I bet airplanes were "magic" until we learned how to fly them consistently and safely :)
Predictive analytics has the potential to radically transform many industries. This trend of hyping up its difficulty is a dishonest tactic to inflate salaries. Pretty soon we'll get the fraudulent gatekeeping of lawyers and colleges with their accreditation requirements if this trend isn't kept in check.
The only value any professional can reliably claim to add - data scientist, software architect, CEO, etc. - is the value they can prove. One must be able to justify their job or salary by superior results (quantitatively better predictions, lower costs, faster development cycles). If you can't do this, but instead insist your superior training protects the company from some scary, unseen, vague 'bad consequences' of not having a 'true professional' performing your work, then you're just as big a drain on enterprise value as a lawyer.
Or worse, an MBA. :)
The OP and resulting discussion is about whether people calling themselves (or having titles saying) 'data scientists' actually are 'scientists' or know anything about 'data'.
So one guy having a 'data scientist' title and spending his time plugging data into black boxes says very little about what 'the least part of the job' in general is.
If anything your "some social skills and some database skills" is an even poorer description. By that measure, the business analyst next door who can write some SQL scripts is a 'data scientist'. But then otoh, maybe he is, who knows? ;)
Edit: I just noticed you are the "Are there practical applications for proving theorems? It seems like it's the full employment act for pencil pushers." guy.
My apologies for replying to your comments. Won't happen again. (beware @jacquesm !)
I should log out of here anyway, thanks for the reminder
chmod 444 news.ycombinator.com
edit: Ok, ad hominem attack. And in that thread no one managed to make a convincing case for the utility of millions/billions of people learning maths as traditionally taught. You're probably a data scientist who wants to inflate your salary (there's my attack).
edit: LOL. you're a joke plinkponk, hahaha - being called out on your lies about your chosen profession being hard (so that you can feel special) really gets you doesn't it! http://www.urbandictionary.com/define.php?term=Wiener%20Didd...
Really knowing your stuff in either one of those fields is hard, knowing them very well and knowing enough computer science to apply it all (properly) is more the work of a small team than a single individual.
People like that are rare. There is nothing stopping anybody from calling themselves 'data scientist'. Just like there is nothing stopping anybody from calling themselves software architect or system administrator.
In the end that's just people marketing themselves as good as they know how but that does not mean there isn't a sliding scale between warm body and excellence. I think that is the distinction the article tries to make.
Just because your algorithm can "beat" an industry's worth of work doesn't mean it can be implemented in a practical or efficient way.
This of course devalues your own skills, as you are one of the elite few. Unless you start writing textbooks. Which you should do, if you're one of the few. And self-promote like hell. If that course doesn't cover it, what does? Do you acknowledge that in a few years some shortcuts might be possible -- that budding data scientists might not need to have read every book that you have? I bet if you try, you can make your own shortcut in the form of a book.
Which is followed by a link to my own book on this topic, Agile Data: http://shop.oreilly.com/product/0636920025054.do which attempts to demystify as much as teach.
To fill that kind of need we can use a lot of bodies with Coursera degrees.
Also, someone who has to constantly shift their attention between statistics and database servers might get less done than somebody who can concentrate on the mathematics and let their co-workers handle the implementation details.
This paradigm shift is happening at intelligent companies. You hire a competent software developer who took stats 101 and knows how to Google some stuff, then have him oversee a pair of interns, one who has a math/stats degree and the other who has a physics/engineering degree. In six months you'll have a trained Data Science (tm) team.
OK, so let's say I'm the scientist working alongside a couple of computer science people. Now, every time I have to remove a comma from the files they exported for me, should I ask them to please write a magic line into the shell for me? Every time I have to parse data in JSON or whatever else I don't know anything about, do I wait for them to do it for me? Every time data has to be loaded from a database, do I explain what data I want and wait for the computer science person to write and execute a query for me? It just doesn't work this way.
When working with data you have to be able to experiment, and to experiment you have to have an idea about what's possible and what's not. If, as a scientist you do not understand what these practical tools can do for you, your experimentation will be severely limited. You have to be able to pair up the most promising mathematical approach with the simplest techniques that get you there. This is very hard to do, unless the same person knows about both maths and computer tools.
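A toy illustration of the kind of everyday munging this implies (hypothetical data, stdlib only): the same person who picks the model also strips the thousands commas and parses the JSON blobs.

```python
import csv
import io
import json

# Hypothetical raw export: counts with thousands separators plus a
# JSON metadata column -- cleanup you shouldn't need a colleague for.
raw = 'id,count,meta\n1,"1,234","{""src"": ""web""}"\n2,"567","{""src"": ""app""}"\n'

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    rows.append({
        "id": int(rec["id"]),
        "count": int(rec["count"].replace(",", "")),  # strip thousands commas
        "src": json.loads(rec["meta"])["src"],        # parse the JSON blob
    })
```

Trivial, yes, but if every such step requires a hand-off, experimentation grinds to a halt.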
But more fundamentally, machine learning (or data science) does not equal statistics + computer science. The skillset you need to be a powerful part of the team is not simply a union of various computer science and statistics skills. It requires a different mindset.
No one ever claimed that taking one class made someone an expert "data scientist." Instead, that single class whetted the appetites of Luis, Jure, and Xavier (the three competition winners), and pushed them to learn more about machine learning and natural language processing. They then went on to dive much deeper, and excelled specifically in one area of applied NLP.
However, without that first class, there's a good chance none of them would have ever focused on (or heard of) machine learning. Their story is growing increasingly common. Like the Netflix Prize, Andrew Ng's first Coursera class did its part in shining a spotlight onto our dark little corner of the universe.
I'd be very cautious about a long checklist of items that are necessary to be a successful data scientist (which is a pretty ill-defined and encompassing term at this point). That is a decent summary of many useful tools of the trade, but they are by no means useful for all problem domains. For example, I could spend years working on machine learning for EEG brain-computer interfaces without a good reason to use databases or "big data" NoSQL technologies. I especially enjoyed MSR's take on the matter in "Nobody ever got fired for using Hadoop on a cluster" http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...
When we're hiring data scientists or seeking successful ones, we've found focusing on demonstrated excellence in one relevant area plus general quantitative competencies and the curiosity and tenacity to learn new tools and techniques works far better than a laundry list of skills and experiences.
It shouldn't be surprising or bad news that some "data scientists" have deeper knowledge than others. We're going through a quantitative revolution -- many fields and industries are nearly untouched by statistical analysis/machine learning, and so there's a lot of low-hanging fruit in going from "nothing" to "something." Even somebody who only knows a little can add value at these margins. But, of course, that won't be true forever -- look at quantitative finance, which is very competitive and requires a lot of education, because the low-hanging fruit was picked in the 90's.
There's room in this world for the statistician, the mathematician, the database engineer, the AI guy, the data visualization expert, the codemonkey who knows a few ML methods, etc.
This probably needs to be clarified a bit to say that Ng's course skipped this. Daphne Koller's "Probabilistic Graphical Models" (running now at Coursera) covers this in great detail.
Minor tweak in an otherwise nice post.
That seems weird, considering Bayesian learning/methodologies are a big cornerstone of ML.
Anyway, I consider Bayesian logic a cornerstone of ML modeling. It's not so much the content that needs to be memorized/understood as much as it is the way of thinking that Bayesian methodologies present over frequentist statistics.
Still, I'm sure it's a good class.
I'm excited about what the future brings. Many industries have seen the value in data science, and universities are following (see, e.g., Columbia's new data science institute).
Now, even if I weren't a mathematician, it doesn't mean that following Coursera courses for a couple of years and doing a lot of work at home wouldn't get me somewhere. You don't have to switch jobs as soon as you finish the ML course, but you can certainly practice your skills at home.
The "science" part of the term involves testing and iterating, as the scientific method would imply. Not necessarily the knowledge of esoteric mathematics.
I actually found this list encouraging because the things I don't know well on that list are things I'm working on and am aware are holes in my knowledge.
But in the end the reality will always be that the people who are "real" data scientists will be the people that are actually solving real problems whether or not they can check off every bullet point on a check list.
Sound familiar? The entire team I worked with had a similar workflow, but we went by life science domain specific titles rather than "data scientist." I'm willing to bet that other professions have similar roles, merely called something else.
I think "data scientists" are out there in the sciences. They just don't go by the latest buzzword.
It's an XWP vs. JAP issue: http://michaelochurch.wordpress.com/2012/08/26/xwp-vs-jap/
Once you're at or near 30, you realize that you won't be able to stand the software career unless you get an edge in picking projects, because the vast majority of the engineering workload is line-of-business bullshit that you don't learn much from. To grow as a programmer, you have to beat (or cheat) the project allocation game. One avenue is to go into management, but that doesn't work because bosses who take all the interesting work for themselves get undermined. An alternative is the "architect" designation, but becoming an "architect" is even more political than moving into management. Right now, "data science" is a title that has enough of a "+1" to it that it gives engineers the ability to put themselves on the most interesting projects.
It's a "The way to Tara is via Holyhead" kind of thing.
where's the science? it is, after all, a data scientist role. where is learning to do actual science?
what the world's been describing is an analyst or an engineering position, not science. if you don't know how to ask questions, interpret results, structure experiments - then you don't know science, so quit calling yourself a scientist. science involves a rigor of thinking and doing that has been omitted here.
For any field, one has to provide positive encouragement (and a good platform/set of tools and techniques) to people seeking to get into that field, while being grounded in reality.
Also, it's not clear folks are using the term consistently vis-à-vis scale/scope. Consider an analogue of knowledge and expertise (real estate example):
Is a data scientist an architect? Or the person who builds the building? Is he the guy who does the plumbing? Though the up-and-coming "quantitative systems analyst" probably doesn't quite ring the same tune on a biz card. And most refer to lower-level quant mastery, e.g. social engineering or quantitative finance, as a "black art," not a science. Without a high-level vision, the concept/title seems... grandiose, until you get to very extreme levels of skill. And then it makes sense.
Good plasterers get paid (and treated) pretty well, but they're never going to be the architects who designed the building.
Alternatively, I'm starting to see the clearer picture: "Data Scientist" is more a contraction than a description. Viz: "Data-Analytics-Literate Computer Scientist" becomes "Data Scientist" if you just omit the middle terms. For marketing purposes, the moniker "Scientist" is highlighted as a status signifier (academically, ~akin to "Artist").
Instead of chasing after some title ("data scientist..."), find ways of solving useful problems with whatever you've learnt. The article argues that becoming an expert is difficult, which may be true. But that doesn't mean you don't know enough to start digging into the problems at hand.
I've been testing Tableau, but it's only for Windows and I'm on a Mac.
I'm looking for something which easily connects to your SQL database and allows you to produce all kinds of fancy graphs, with an easy user interface. Tableau comes close to what I'm looking for, but my gut feeling is there should be a whole bunch of good commercial and free software in this field; my googling hasn't given any good results yet.
So if anyone has any good suggestions, please let me know.
Working out the question is what makes it a hard (and creative) process, and then you can apply your ML toolbox.
edit: what's different about a data scientist vs. an analyst/statistician is that they build their own tools, as the datasets are too massive and non-standard for the usual toolset.
The software engineering career is in somewhat of a mess right now. It comes down to the "bozo bit" problem. Being a software engineer (even with 10+ years of experience, because there are a lot of engineers who only do low-end work and don't learn much) is not enough to clear the bozo bit, and you won't be able to prove that you're good unless you have a major success, and it's hard to have that kind of success without people already trusting you with the autonomy to do something genuinely excellent.
It's not enough to write code, because LoC is a cost and source code is rarely actually read at the large scale. At least for backend developers, the only work-related (as opposed to political) way to establish that you're worth anything as a engineer is to have an architectural success, but it's very hard to have architectural successes unless you've established yourself as an "architect" to begin with. So there's a permission paradox: you can't do it until you've proven you can, and you can't prove you can do it until you've done it. Hence, the vicious politics that characterize software "architecture" in most companies.
Functional programming is one way to put yourself head and shoulders above the FactoryFactory hoi polloi. The problem is that most business people don't understand it. They just think Haskell's a weird language "that no one uses". Elite programmers get that you're elite if you know these languages, but most companies are run by people of mediocre engineering ability (and that's often just fine, from a business standpoint).
It is true that functional programming is superior to FactoryFactory business bullshit, but not well-enough known. Good luck making that case to someone who's been managing Java projects for 10 years. What is better known is that mathematics is hard. It's a barrier to entry. I doubt more than 5% of professional programmers could derive linear regression.
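For the curious, the derivation isn't long. A minimal sketch of simple least squares from the normal equations (pure Python, one feature plus intercept; the example data is mine):

```python
def ols_simple(xs, ys):
    """Minimize sum of (y - b0 - b1*x)^2: setting both partial
    derivatives to zero gives b1 = Sxy/Sxx and b0 = ybar - b1*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = ols_simple([1, 2, 3, 4], [3, 5, 7, 9])  # points on the line y = 1 + 2x
```

Ten lines of high-school calculus in code form, and yet it really does function as a barrier to entry.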
So I see the data science path (and yes, it takes a long time to learn all the components, including statistics, machine learning, and distributed systems) as a mechanism through which a genuinely competent software engineer can say, "I'm good at math, and I can code; therefore, I deserve the most interesting work your company can afford to fund." It's a way to keep getting the best work and avoid falling into that FactoryFactory hoi polloi who stop advancing at 25 and are unemployed by 40.
I have seen some ninja FP artists that have written some really terse solutions in a FP style which are completely unreadable to the uninitiated.
Also, I do not think that there is anything intrinsically difficult about FP. Knowing it proves that you are ambitious and that you have mental plasticity but not necessarily that you are a good developer.
You're just as likely to become that weird dude that writes code no one understands as you are to become Mr Wizard.
It is rare the case when a company will hire an academic person to directly lead a company effort, and when this happens it is because the person is already famous and has a lot of consulting/real world experience.
The path you mentioned is great if you can find a company where you can make a major contribution using your skills, but for the reasons I mentioned it has still the same problem (bozo bit) that you cited before.
I actually hear of this happening a lot for economists in Silicon Valley, even with unfamous people.
I guess at least on the math side you can prove your algorithms against historical data and so merit assessments could be fact based.
I got interested in financial semantics lately and was amazed to find how many people are doing algorithmic trading with accounts anywhere from $10k to a few million. In fact, I know of many startups developing products and services aimed at such people.
If you read the MSM you'd think it's all about HFT and big players, but I think algorithmic trading today is a lot like day trading was in the late '90s.
Being able to derive a linear regression is less important than being able to see the end from the beginning and being able to read math. I know a coder-cum-CEO (freshly minted) who has a hard time understanding how to program a numerical derivative, yet has produced amazing work based on what he does know. I said it before: we all need each other, especially as the focus has been on specialization (so that no one person has the entire skillset required).
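To be fair, the numerical derivative itself is a couple of lines once someone shows you the trick; the hard part is knowing when and why to reach for it. A central-difference sketch:

```python
def central_diff(f, x, h=1e-5):
    """Approximate f'(x) by (f(x+h) - f(x-h)) / (2h); error is O(h^2),
    better than the one-sided (f(x+h) - f(x)) / h version's O(h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

d = central_diff(lambda x: x ** 2, 3.0)  # true derivative of x^2 at x=3 is 6
```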
You can make Java a lot more fun if you use generics, functional patterns and frameworks like Google's Guava, even if you still need to create a FactoryFactoryFactory once in a while.
When I program Scala or C# I tend to curse type erasure in Java generics, but when I actually program Java the generics system makes me smile. I can very efficiently turn what's in my head into a design and code.
In a language like Haskell whatever you gain from the language you tend to pay back in missing or buggy standard libraries. If I was writing a webapp in any off-brand language, for instance, I wouldn't trust that the urlencode function works correctly until I'd written some test cases.
I'll admit functional Java is a little verbose (you can't write a closure in just one line), but the Hotspot compiler is very good at inlining functional code and collecting the garbage from it. I can definitely say that I get much better performance with functional Java than I usually see with the Interpreter pattern, with two orders of magnitude less code.