Why becoming a data scientist is not easier than you think (josephmisiti.com)
192 points by misiti3780 1626 days ago | 94 comments

I couldn't agree more with this. The ML courses that Ng and Koller teach are really missing a lot of the statistical tools you need to do real-world data mining and ML.

My experience: I had basically zero math background, but I took ML with Ng and probabilistic graphical models with Koller, and later was a TA for Ng's ML class, during my Master's degree, and thought I was all set to go into machine learning jobs. To my surprise, I consistently found myself in interviews stumped by questions from basic stats, particularly significance testing, which people with more traditional stats backgrounds assume is basic knowledge (and it should be), but which wasn't taught in any of my ML classes.

I'm in a job now that involves some machine learning, but the ML component is 50% marshalling data (formatting, cleaning, moving), 40% trying to figure out how to get enough validated training examples, and 10% thinking about the right classifier to use (which someone else already implemented). Which to be honest is not very interesting.

So yeah, becoming a real data scientist is hard, requires a lot more knowledge than you get in one ML course, even from Andrew Ng, and the reality of the work often doesn't make it some dream career. And the competition for jobs isn't from other people who also just took that course -- it's from PhD statisticians and statistical physicists who might have taken one ML class to show them how to use all the mathematical tools they already have to do the new hot thing called machine learning.
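For anyone wondering what the basic significance testing mentioned in this comment looks like in practice, here is a minimal sketch of one such test, an exact two-sample permutation test, in plain Python. The function name and the toy numbers are made up for illustration; a real analysis would more likely use a stats library and would need some thought about which test actually fits the data.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Only practical for small samples, because it enumerates every way
    of relabelling the pooled observations.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        # Difference in means when the first group sums to sum_a.
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        relabelled_sum = sum(pooled[i] for i in idx)
        # Small tolerance so float rounding does not drop exact ties.
        if abs(mean_diff(relabelled_sum)) >= observed - 1e-12:
            extreme += 1
    return extreme / count

# Hypothetical measurements for two variants of something.
p = permutation_p_value([1, 2, 3, 4], [10, 11, 12, 13])
print(round(p, 4))  # 2 of the 70 relabellings are as extreme: 0.0286
```

The p-value here is just the share of relabellings that produce a gap at least as large as the one observed, which is the idea most textbook tests approximate with a formula.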

My day job is as a data scientist, and most of the applied ML I perform is simply plugging data into some optimizer/detector for best algorithm and running split train/test loops. Most of the work is herding the cats in the business units and justifying my salary to the Board of Directors.
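As a rough sketch of the "split train/test loop" described above: fit a few candidate models on a training split, score them on the held-out split, repeat over several random splits, and keep the one with the best average. Everything here (the function names, the two toy "models", the data) is hypothetical and chosen only so the example runs with the standard library; in practice the candidates would come from an ML library.

```python
import random

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_threshold(train):
    """Predict True when x >= t, with t chosen to maximise training accuracy."""
    best_t, best_acc = None, -1.0
    for t, _ in train:
        acc = sum((x >= t) == y for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: x >= best_t

def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def pick_best(rows, fitters, n_splits=5):
    """Average held-out accuracy per fitter over several random splits."""
    scores = {name: [] for name in fitters}
    for seed in range(n_splits):
        train, test = split(rows, seed=seed)
        for name, fit in fitters.items():
            scores[name].append(accuracy(fit(train), test))
    means = {name: sum(v) / len(v) for name, v in scores.items()}
    return max(means, key=means.get), means

# Toy data: one numeric feature, label is True when the feature is >= 20.
rows = [(x, x >= 20) for x in range(40)]
best, means = pick_best(rows, {"majority": fit_majority, "threshold": fit_threshold})
print(best, means)
```

The loop itself is the easy part, which is rather the commenter's point; the effort goes into getting trustworthy `rows` in the first place.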

I was pretty gung ho about getting into ML two years ago and put a lot of time into online courses like Ng's, books, and ground-up implementations of a lot of the common algorithms. I enjoyed it, but after a while it became clear to me that a lot of this stuff is better described as applied statistics.

And this can be powerful, of course, but it doesn't really have much of the magic of AI.

"a lot of this stuff is better described as applied statistics"

This is a key insight. Bravo!

To generalize a bit more, most of ML is applied mathematics. Getting a good grounding in the underlying math is the most illuminating step to learning ML (spoken as someone who wasted a lot of time doing other things thanks to an irrational fear of learning mathematics, and who is still bad at it).

Deep math/stat understanding combined with the engineering bits (like programming, cleaning the data, running clusters) and the communication bits (like visualization) brings you to (what should be) 'data science' (imvvho ymmv etc etc).

I am still not sure one person can pull it all off; it probably needs a solid team of specialists. But hey, 'data scientist' is a hot job description, and so you can't blame people who know bits and pieces (sometimes very small bits and pieces ;) ) calling themselves 'data scientists' or whatever. "Machine Learning for Hackers" and all that jazz. We've seen all this before with "HTML coders" from the nineties.

Work as a search quality engineer at Google and you do pretty much all of that.

Except for running the clusters[1], I've done pretty much all of those steps myself. I started with a nice statistical idea, built some simple models, played with feature selection and learning algorithms, built model viewers, built classifiers, validated classifiers, built demos, validated demos, built a production implementation[2], optimized the production implementation to make it small/fast enough, and finally launched a big search quality improvement.

[1] I certainly write distributed code that runs on them, but maintaining the DCs definitely isn't part of my job description.

[2] Validation of the final quality in prod is actually someone else's job, not because I couldn't do it, but you might not want me to tell you how good my stuff is, cause you know, I might be biased.

Right, it's less sexy than people think. That was my reaction when taking an Artificial Intelligence class and a Machine Learning class over 10 years ago as an undergrad. I was like, "these are unprincipled hacks". I liked graphics better. There were actual algorithms.

But in college you never get to apply them to real problems. I think if you apply them to real problems you'll have the revelation. Especially when you try other approaches first. But actually applying them requires domain knowledge, data cleaning skills, and programming skills beyond what many people have (certainly myself as an undergraduate).

Not disagreeing with your second paragraph, but just wanted to point out that Machine Learning has matured way beyond the "unprincipled hacks" phase, and as was correctly pointed out above, can be seen as a direction in applied statistics. If you look at a modern course in multivariate statistics, there's a significant overlap with ML (http://goo.gl/GTDUC).

I think people's expectation of a new scientific or engineering discipline to be "sexy" or "magic" is simply a sign of widespread ignorance of the field, so it's a good thing when something becomes less "magic" and more "real". I bet airplanes were "magic" until we learned how to fly them consistently and safely :)

Machine learning is full of hacks to connect theory to producing results given constraints. That's not too different from hacks that are done for the sake of performance in graphics rendering.

There isn't much in AI that has "the magic of AI".

There is a successful school of thought that says significance testing is hogwash, and that school may be found near Palo Alto.

Specifically in the Gates building, I believe... but what about the tribe across the road in Sequoia Hall?

So some social skills and some database skills. That's pretty much the description of every IT job. What's so hard?

Predictive analytics has the possibility to radically transform many industries. This trend of hyping up its difficulty is a dishonest tactic to inflate salaries. Pretty soon we'll get the fraudulent activities of lawyers and colleges with their accreditation requirements if this trend isn't kept in check.

I'd add math skills, but I really appreciate this comment. Hyping the difficulty and complexity of these tasks to protect economic turf - and insisting on accreditation and experience vs. actual results - is never a good long-run play.

The only value any professional can reliably claim to add - data scientist, software architect, CEO, etc. - is the value they can prove. One must be able to justify their job or salary by superior results (quantitatively better predictions, lower costs, faster development cycles). If you can't do this, but instead insist your superior training protects the company from some scary, unseen, vague 'bad consequences' of not having a 'true professional' performing your work, then you're just as big a drain on enterprise value as a lawyer.

Or worse, an MBA. :)

You missed the stats and the math.

As practicing data scientists have pointed out in this thread, that is the least part of the job. And, really, how hard is it? As icelander points out, it comes down to plugging data into standard algos.

'As practicing data scientists have pointed out'

The OP and resulting discussion is about whether people calling themselves (or having titles saying) 'data scientists' actually are 'scientists' or know anything about 'data'.

so one guy having a 'data scientist' title and spending his time plugging data into black boxes says very little about what 'the least part of the job' in general is.

If anything your "some social skills and some database skills" is an even poorer description. By that measure, the business analyst next door who can write some SQL scripts is a 'data scientist'. But then otoh, maybe he is, who knows? ;)

Edit: I just noticed you are the "Are there practical applications for proving theorems? It seems like it's the full employment act for pencil pushers." guy.


My apologies for replying to your comments. Won't happen again. (beware @jacquesm !)

> (beware @jacquesm !)

I should log out of here anyway, thanks for the reminder

chmod 444 news.ycombinator.com

edit: I thought you were agreeing with that slandering fool plinkplonk so I got a little out of hand. Sorry jacquesm, I've edited it all out.

There isn't a thing in what I wrote here that warrants this comment.

I don't think there's such a huge body of knowledge in "data science" as people claim. It's not astronomy or biology etc. As a "data scientist" the predictive ability of your model is all that counts. So the social skills and basic knowledge of stats + your programming skills/creativity is what's important (and the kaggle competitions have borne this out).

edit: Ok, ad hominem attack. And in that thread no one managed to make a convincing case for the utility of millions/billions of people learning maths as traditionally taught. You're probably a data scientist who wants to inflate your salary (there's my attack).

edit: LOL. you're a joke plinkponk, hahaha - being called out on your lies about your chosen profession being hard (so that you can feel special) really gets you doesn't it! http://www.urbandictionary.com/define.php?term=Wiener%20Didd...

Keep it civil man. You can argue your points without calling names. As a party with no horse in this race, I find the conversation interesting enough without the need to get nasty.

A bit of stats and a bit of math are not that hard.

Really knowing your stuff in either one of those fields is hard, knowing them very well and knowing enough computer science to apply it all (properly) is more the work of a small team than a single individual.

People like that are rare. There is nothing stopping anybody from calling themselves 'data scientist'. Just like there is nothing stopping anybody from calling themselves software architect or system administrator.

In the end that's just people marketing themselves as well as they know how, but that does not mean there isn't a sliding scale between warm body and excellence. I think that is the distinction the article tries to make.

The evidence that is coming out from kaggle doesn't support the claim of teams with experienced specialized skill. Teams of a single student have beaten out an entire industry of companies (the essay score competition for example).

Kaggle competitors work on small data sets, so algorithmic problems don't really surface there.

Just because your algorithm can "beat" an industry's worth of work doesn't mean it can be implemented in a practical or efficient way.

I think the idea is... at the moment there is this elite few people that actually have the entire skill-set of 'data scientist,' to one degree or another... but we need more of them. And the way to achieve this isn't to say, "Get a decade's experience across an enormous area of math, computer science, domain experience, and then come talk to me." The way to achieve this is to make data science seem more approachable than it is at the moment... in the hopes that it will be more approachable as we build courses like the one you critique.

This of course devalues your own skills, as you are one of the elite few. Unless you start writing textbooks. Which you should do, if you're one of the few. And self-promote like hell. If that course doesn't cover it - what does? Do you acknowledge that in a few years some shortcuts might be possible - that budding data scientists might not need to have read every book that you have? I bet if you try, you can make your own shortcut in the form of a book.

Which is followed by a link to my own book on this topic, Agile Data: http://shop.oreilly.com/product/0636920025054.do which attempts to demystify as much as teach.

There are different classes of skills. For lots of organizations simply adding someone who can run a MongoDB mapreduce into a t-test (or SVM or whatever flavor you like) and print out a bar chart is going to be an upgrade.

To fill that kind of need we can use a lot of bodies with Coursera degrees.

I still don't understand why it's important to have all these areas of expertise embodied in a single person called a "data scientist". Rather than hire one of these rare and expensive people, why couldn't a business hire a statistician and a couple of computer science people and have them work as a team? Given how few data scientists there currently are and the high demand for them, you might even be able to get these three people for less money than one data scientist.

Also, someone who has to constantly shift their attention between statistics and database servers might get less done than somebody who can concentrate on the mathematics and let their co-workers handle the implementation details.

"Rather than hire one of these rare and expensive people, why couldn't a business hire a statistician and a couple of computer science people and have them work as a team?"

This paradigm shift is happening at intelligent companies. You hire a competent software developer who took stats 101 and knows how to Google some stuff, then have him oversee a pair of interns, one who has a math/stats degree and the other who has a physics/engineering degree. In six months you'll have a trained Data Science (tm) team.

The problem with this is that people's skills combine non-linearly, just like a cluster of 100 machines is not 100 times faster than a single machine. After all, why do we need kite surfers? Just tie a surfer and a kite jumper together.
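The cluster half of that analogy can be made concrete with Amdahl's law, which is one standard way to model it (the comment itself doesn't name a model). The 95% parallel fraction below is an assumed number for illustration, not a measurement.

```python
def amdahl_speedup(parallel_fraction, n_machines):
    """Amdahl's law: overall speedup when only part of the work parallelises."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_machines)

# Assumed for illustration: 95% of the job parallelises perfectly.
print(round(amdahl_speedup(0.95, 100), 1))  # 16.8, nowhere near 100
```

Even a small serial portion caps the gain, which is roughly the claim being made about splitting one role across several people.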

OK, so let's say I'm the scientist working alongside a couple of computer science people. Now, every time I have to remove a comma from the files they exported for me, should I ask them to please write a magic line into the shell for me? Every time I have to parse data in json or whatever else I don't know anything about, do I wait for them to do it for me? Every time data has to be loaded from a database, do I explain what data I want and wait for the computer science person to write and execute a query for me? It just doesn't work this way.

When working with data you have to be able to experiment, and to experiment you have to have an idea about what's possible and what's not. If, as a scientist you do not understand what these practical tools can do for you, your experimentation will be severely limited. You have to be able to pair up the most promising mathematical approach with the simplest techniques that get you there. This is very hard to do, unless the same person knows about both maths and computer tools.

But more fundamentally, machine learning (or data science) does not equal statistics + computer science. The skillset you need to be a powerful part of the team is not simply a union of various computer science and statistics skills. It requires a different mindset.

They are, and continue to. However everyone needs to be able to talk and work together. It's like hiring a programmer to write genome sequencing software. They need to know some biology! In this case the stats people need to know some CS, the CS people some stats, and they both need to know some business.

Not only that, "data scientist" is just a marketing gimmick for consultants. It's a standard set of skills that anyone trained in science or engineering has. It's ok to use it in front of pointy headed bosses but in front of nerds it's slightly dishonest.

^this. I've often wondered where the science in data science is. I don't see it. If you're an engineer or scientist, you probably don't either.

The science in data science involves testing hypotheses, like anyone following the scientific method would do. It's mandatory for validating ML models.

But most public examples of "data science" don't do this. They just publish pretty graphs that are completely unrigorous.

OK, but that doesn't mean "if you're an engineer or scientist, you don't see the science in data science" is a true statement. It's categorically false.

Well written, but I believe you missed the point of the original article.

No one ever claimed that taking one class made someone an expert "data scientist." Instead, that single class whetted Luis, Jure, and Xavier's (the three competition winners) appetites, and pushed them to learn more about machine learning and natural language processing. They then went on to dive much deeper, and excelled specifically in one area of applied NLP.

However, without that first class, there's a good chance none of them would have ever focused on (or heard of) machine learning. Their story is growing increasingly common. Like the Netflix Prize, Andrew Ng's first Coursera class did its part in shining a spotlight onto our dark little corner of the universe.

I'd be very cautious about a long checklist of items that are necessary to be a successful data scientist (which is a pretty ill-defined and encompassing term at this point). That is a decent summary of many useful tools of the trade, but they are by no means useful for all problem domains. For example, I could spend years working on machine learning for EEG brain-computer interfaces without a good reason to use databases or "big data" NoSQL technologies. I especially enjoyed MSR's take on the matter in "Nobody ever got fired for using Hadoop on a cluster" http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...

When we're hiring data scientists or seeking successful ones, we've found focusing on demonstrated excellence in one relevant area plus general quantitative competencies and the curiosity and tenacity to learn new tools and techniques works far better than a laundry list of skills and experiences.

Absolutely agree. Also, the author didn't take into account that there are software engineers with a good math background who don't know how cool it is to do machine learning. I'm one of them and I've spent the last month and a half doing a lot of ML and enjoying the heck out of it! My next step is Kaggle.

As someone who writes software for data scientists (https://github.com/snowplow/snowplow) I definitely agree with his analysis. But I would go further: without _domain knowledge_, a data scientist is really just an ETL guy who cleans up big data for the real analysts to make sense of. Applying the whole toolkit to a specific domain (and SaaS B2B looks totally different from supermarket loyalty schemes and from F2P mobile games) is key.

Look -- as long as "data scientist" is a sexy job title, a lot of different jobs are going to claim they fall under that umbrella. I have an applied math background, and I'm fine with scientific computing, but I have much less experience with databases. I'm a very different candidate than a software engineer who took a machine learning course. Maybe in a few years we'll have more intelligent language for making those distinctions.

It shouldn't be surprising or bad news that some "data scientists" have deeper knowledge than others. We're going through a quantitative revolution -- many fields and industries are nearly untouched by statistical analysis/machine learning, and so there's a lot of low-hanging fruit in going from "nothing" to "something." Even somebody who only knows a little can add value at these margins. But, of course, that won't be true forever -- look at quantitative finance, which is very competitive and requires a lot of education, because the low-hanging fruit was picked in the 90's.

There's room in this world for the statistician, the mathematician, the database engineer, the AI guy, the data visualization expert, the codemonkey who knows a few ML methods, etc.

"Coursera skipped over Bayesian learning"

This probably needs to be clarified a bit to say that Ng's course skipped this. Daphne Koller's "Probabilistic Graphical Models" (running now at Coursera) covers this in great detail.

Minor tweak in an otherwise nice post.

"This probably needs to be clarified a bit to say that Ng's course skipped this."

That seems weird, considering Bayesian learning/methodologies are a big cornerstone of ML.

How much can you expect someone to learn in 10 1-hour sessions, starting from a college freshman background?

A fair amount; I've seen some of his work and I think it's pretty good. I don't really like Octave (his program of choice), but I understand why he used it. I would have gone with something higher-level, since a lot of decent tools are out there that add a layer of abstraction to ML.

Anyway, I consider Bayesian logic a cornerstone of ML modeling. It's not so much the content that needs to be memorized/understood as much as it is the way of thinking that Bayesian methodologies present over frequentist statistics.

Still, I'm sure it's a good class.

Indeed, PGMClass is "the" Bayesian class.

Is "Probabilistic Graphical Models" a good course? Do I need to buy the $95 book it links to?

You don't need the book. It's a very good but difficult class.

As I argued in the comments of "becoming a data scientist might be easier than you think", the attitude that entering the field is easy is dangerous because it is very untrue. I really wish there were more qualified people in the field (do you know how hard it is to hire?), but entering it without the proper knowledge doesn't help anyone.

I'm excited about what the future brings. Many industries have seen the value in data sciences, and universities are following (see, e.g. Columbia's new data sciences institute).

Really though, unless you have a strong understanding of both calculus and statistics you'll never be a "data scientist", you'll just be a library jockey.

And probably database theory, probability calculus, matrix analysis, graphical models, stochastic processes, information theory, &c &c...

Yes, yes, it's not like you started from something. Were you born a data scientist? Just try to accept that there may be people with an almost full skill set who didn't know where to apply it. I'm a pure mathematician by education and I really didn't know where to apply my skills; having followed the ML, NLP, Big data, and PGM courses I have a much better understanding now.

Now even if I weren't a mathematician it doesn't mean that following Coursera courses for a couple years and doing a lot of work at home wouldn't get me somewhere. You don't have to switch jobs as soon as you finish ML course but you can certainly practice your skills at home.

I'm not saying the Coursera classes are bad---just that there's a chasm between being able to implement or derive Naïve Bayes and being able to do meaningful work or study in applied statistics.

The way I see it you also learned it somewhere some time ago. Coursera offers more and more courses, some of which are quite in-depth. A motivated person may be able to finish a university degree in a couple of years. Coupled with the fact that this person may be gainfully employed in a similar area, it is not that hard to imagine him or her eventually switching to a position that requires all the old and new knowledge. IMHO, the examples that were given in the original article are in line with what I described.

Sounds like we need to all work together.

Eh, maybe. I have a strong background in mathematics (my focus was Game Theory), and while understanding the underlying math in ML models is certainly very useful, tools are coming out on a regular basis that make it simple for good developers and DBA-types to apply some ML modeling to their data.

The "science" part of the term involves testing and iterating, as the scientific method would imply. Not necessarily the knowledge of esoteric mathematics.

Maybe I'm mistaken but I think most of the people interested in becoming 'data scientists' are either currently doing lots of software with an interest in stats, or people doing lots of stats with an interest in software. Given one or the other, half of this list is probably already very familiar territory.

I actually found this list encouraging because the things I don't know well on that list are things I'm working on and am aware are holes in my knowledge.

But in the end the reality will always be that the people who are "real" data scientists will be the people that are actually solving real problems whether or not they can check off every bullet point on a check list.

I think it's interesting to note the number of professions that may already fulfill a "data science" role, just with a different title. I worked a job where my primary role was data analysis: parsing data with Unix commands, feeding it to classifiers, applying standard algorithms, drawing meaningful conclusions, etc.

Sound familiar? The entire team I worked with had a similar workflow, but we went by life science domain specific titles rather than "data scientist." I'm willing to bet that other professions have similar roles, merely called something else.

I think "data scientists" are out there in the sciences. They just don't go by the latest buzzword.

Replace scientist with analyst and all of a sudden 75% of the people interested in this career path don't care anymore. I don't get the absolute obsession and sexification of a role that has existed for a long long time already.

The data scientist title implies responsibility for creating product. By analogy with Wall Street: a data scientist is like a high-level quant designing trading algorithms, while an analyst is like a BA analyst plugging numbers into Excel for the boss.

Go check what "quant" is actually a short hand for.

Data scientist: in many companies, this means a software engineer with an additional credibility that gives him dibs on the most interesting projects. A lot of data scientists end up working on distributed systems problems that would typically be considered closer to hard-line engineering than machine learning or data analysis.

It's an XWP vs. JAP issue: http://michaelochurch.wordpress.com/2012/08/26/xwp-vs-jap/

Once you're at or near 30, you realize that you won't be able to stand the software career unless you get an edge in picking projects, because the vast majority of the engineering workload is line-of-business bullshit that you don't learn much from. To grow as a programmer, you have to beat (or cheat) the project allocation game. One avenue is to go into management, but that doesn't work because bosses who take all the interesting work for themselves get undermined. An alternative is the "architect" designation, but becoming an "architect" is even more political than moving into management. Right now, "data science" is a title that has enough of a "+1" to it that it gives engineers the ability to put themselves on the most interesting projects.

This makes sense.

It's a "The way to Tara is via Holyhead" kind of thing.

what's missing from any discussion here or in many of these "this is the new hotness" posts is this: science.

where's the science? it is, after all, a data scientist role. where is learning to do actual science?

what the world's been describing is an analyst or an engineering position, not science. if you don't know how to ask questions, interpret results, structure experiments - then you don't know science, so quit calling yourself a scientist. science involves a rigor of thinking and doing that has been omitted here.

Indeed. It may come as no surprise to you, then, that I push people with a degree in physics far ahead of the line when hiring for my Data Science team.

While it is not easy to become an expert in any field or pursuit, one should neither overplay the "it is a very hard field / not very easy" argument nor underplay the effort involved in becoming good. 10000 hours (roughly 5 years at 40 hours per week) to expertise seems like a good rule of thumb to keep in mind. Someone said - "The test of a vocation is the love of the drudgery it involves."
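The "roughly 5 years" figure checks out as back-of-the-envelope arithmetic, assuming 52 working weeks a year with no time off (that assumption is mine, not the comment's):

```python
HOURS_TO_EXPERTISE = 10_000   # the rule of thumb quoted above
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52           # assumption: no holidays or time off

weeks = HOURS_TO_EXPERTISE / HOURS_PER_WEEK
years = weeks / WEEKS_PER_YEAR
print(weeks, round(years, 1))  # 250.0 4.8
```

With a more realistic number of working weeks the figure drifts a little past five years, so the rule of thumb is, if anything, slightly optimistic.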

For any field, one has to provide positive encouragement (and a good platform/set of tools and techniques) to people seeking to get into that field, while being grounded in reality.

Isn't the real skill for a Data Scientist one of scalability and abstraction from data? While it's critical to be able to get the data and make it more plastic, for measurement of metrics, real-time pricing, or even various weak-form predictive variables, it's ultimately the analysis and understanding that is critical to monetization/value extraction. And to build a good system for this level of data transparency, you need some good high-level understanding for clarity of vision. There are lots of people good at all manner of quant work, but the interview question "how much complexity can you explain in 5 minutes?" is not one all answer equally. The ability to scale from granular detail to abstract levels of organization, meaning, and pattern recognition is critical to extracting value in these contexts.

Also, it's not clear folks are using the term consistently vis-à-vis scale/scope. Consider an analogue of knowledge and expertise (real estate example):

L1 Architect>

L2 Contractor>

L3 Sub-contractor>

L4 Builder/Laborer>

Is a Data Scientist an Architect? Or the person that builds the building? Is he the guy that does the plumbing? Although the up and coming "quantitative system analyst" probably doesn't quite ring the same tune on a biz card. And most refer to lower level quant mastery, e.g. social engineering or quantitative finance, as a "black art" not a science. Without a high level vision, the concept/title seems...grandiose, until you get to very extreme levels of skill. And then it makes sense.

I think quite a few data scientists are taking the position that they're a specialist plasterer in your metaphor, someone who does one job very well, but can't help design the skyscraper they're working inside.

Good plasterers get paid (and treated) pretty well, but they're never going to be the architects who designed the building.

You raise an interesting point here. A good (old world) plasterer is best thought of as an Artisan. This word has been lost in the modern era (Artist!=Artisan), but that's arguably our loss. The "black arts" are inherently those that require practitioners to get their hands slightly dirty in the details.[1]


[1] Alternatively, I'm starting to see the clearer picture: "Data Scientist" is more a contraction than a description. viz: "Data-Analytics-Literate Computer Science" becomes "Data_Science" if you just omit the middle terms. For marketing purposes, the moniker "Scientist" is highlighted as a status signifier (academically ~akin to "Artist").

I have to really thank you for this article, but it may have had the opposite effect on me. I've always felt like I have a disjointed skill set, but this makes me a bit more confident in looking into this field and it gives me some good ideas of what I should brush up on. I know this may not be what the author intended, but it's appreciated nevertheless.

I came away with similar thoughts. The article made it seem like becoming a data scientist is far easier than I ever imagined.

What was supposed to be a rant turned out to be a nice little list of things worth learning in the data analysis field. Thanks!

Thanks for validating I have been pursuing the right skill set. I just wish that list existed 20 years ago when I started college. Instead, I have been teaching myself off and on since 1999.

If I have learned anything from ml-class, pgm-class, nlp-class, and now neural-nets, it is that becoming a data scientist is one of the hardest things I'll ever eventually succeed in doing.

The good man was trying to do the opposite, which is to make things as simple as possible for early students and get them to do the hardest thing they will ever do: "solve some problems using these tools"

Instead of chasing after some title ("data scientist ..."), find ways of solving some useful problems with whatever you learnt. The article argues that becoming an expert is difficult, which may be true. But that does not mean you don't know enough to start digging at the problems at hand.

What are the best tools to visualize table data?

I've been testing Tableau, but it's only for Windows and I'm on a Mac.

I'm looking for something which easily connects to your SQL database and allows you to produce all kinds of fancy graphs, with an easy user interface. Tableau comes close to what I'm looking for, but my gut feeling is there should be a whole bunch of good commercial and free software in this field out there, but my googling hasn't given any good results yet.

So if anyone has any good suggestions, please let me know.

being a good data scientist is about having enough intuition about the dataset to ask the right question aka form the hypothesis.

working out the question is what makes it a hard (and creative) process, and then you can apply your ML toolbox.

edit: what's different about a data scientist vs an analyst/statistician is that they build their own tools, as the datasets are too massive & non-standard for the usual toolset.

So is there a good middle ground between being a web developer and a data scientist? If so what would be the most useful problems for such a skillset to solve?

I was a math major in college, with a focus on pure math. I did a year of grad school (math PhD program) and left to work on Wall Street (and worked for a couple startups, and Google, in that mix). In all, I spent 6 years as a mix of quant, trader, software engineer, startup entrepreneur, data scientist.

The software engineering career is in somewhat of a mess right now. It comes down to the "bozo bit" problem. Being a software engineer (even with 10+ years of experience, because there are a lot of engineers who only do low-end work and don't learn much) is not enough to clear the bozo bit, and you won't be able to prove that you're good unless you have a major success, and it's hard to have that kind of success without people already trusting you with the autonomy to do something genuinely excellent.

It's not enough to write code, because LoC is a cost and source code is rarely actually read at the large scale. At least for backend developers, the only work-related (as opposed to political) way to establish that you're worth anything as an engineer is to have an architectural success, but it's very hard to have architectural successes unless you've established yourself as an "architect" to begin with. So there's a permission paradox: you can't do it until you've proven you can, and you can't prove you can do it until you've done it. Hence, the vicious politics that characterize software "architecture" in most companies.

Functional programming is one way to put yourself head-and-shoulders above the FactoryFactory hoi polloi. The problem is that most business people don't understand it. They just think Haskell's a weird language "that no one uses". Elite programmers get that you're elite if you know these languages, but most companies are run by people of mediocre engineering ability (and that's often just fine, from a business standpoint).

It is true that functional programming is superior to FactoryFactory business bullshit, but not well-enough known. Good luck making that case to someone who's been managing Java projects for 10 years. What is better known is that mathematics is hard. It's a barrier to entry. I doubt more than 5% of professional programmers could derive linear regression.

So I see the data science path (and yes, it takes a long time to learn all the components, including statistics, machine learning, and distributed systems) as a mechanism through which a genuinely competent software engineer can say, "I'm good at math, and I can code; therefore, I deserve the most interesting work your company can afford to fund." It's a way to keep getting the best work and avoid falling into that FactoryFactory hoi polloi who stop advancing at 25 and are unemployed by 40.
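
For anyone wondering what "deriving linear regression" amounts to in the one-variable case: minimize the sum of squared residuals for y = a + b*x, set both partial derivatives to zero, and you get b = cov(x, y) / var(x) and a = mean(y) - b * mean(x). Below is a minimal Python sketch of that closed form. The function name `fit_line` and the sample data are made up for illustration, and it only covers the single-predictor case.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x (proportional to var(x)).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Sum of cross deviations (proportional to cov(x, y)).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = 1 + 2x.
    a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(a, b)
```

With several predictors the same argument leads to the normal equations, which is where the linear algebra starts to matter.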

I don't know how you can say that knowing functional programming inherently makes you a better programmer than another. A bad programmer is a bad programmer regardless of domain.

I have seen some ninja FP artists that have written some really terse solutions in an FP style which are completely unreadable to the uninitiated.

Also, I do not think that there is anything intrinsically difficult about FP. Knowing it proves that you are ambitious and that you have mental plasticity but not necessarily that you are a good developer.

You're just as likely to become that weird dude that writes code no one understands as you are to become Mr Wizard.

Mr Wizard was accessible to children!

A big problem that science graduates have is that the market doesn't believe in academic training. A big company will take a smart engineer and train him/her on a particular technique (data mining, for example) rather than hiring someone with that particular skill (usually from grad school). Even if they eventually hire such a person, the new hire will usually be working for the in-house "expert" until they can prove themselves, which is pretty much a political process.

It is rarely the case that a company will hire an academic person to directly lead a company effort, and when this happens it is because the person is already famous and has a lot of consulting/real world experience.

The path you mentioned is great if you can find a company where you can make a major contribution using your skills, but for the reasons I mentioned it still has the same problem (bozo bit) that you cited before.

> It is rarely the case that a company will hire an academic person to directly lead a company effort

I actually hear of this happening a lot for economists in Silicon Valley, even with unfamous people.

I would expect it to be a lot harder to break into writing trading algorithms than into enterprise architecture at most companies. The latter I found so easy it was both embarrassing and frightening. ("really? this is IT? this is how we decide how to spend this 12 million dollars?! with this powerpoint?")

I guess at least on the math side you can prove your algorithms against historical data and so merit assessments could be fact based.

it's easier to break into algorithmic trading because you can trade on your own account

i got interested in financial semantics lately and was amazed to find how many people are doing algorithmic trading with accounts anywhere from $10k to a few mil. in fact i know of many startups developing products and services aimed at such people.

if you read the MSM you'd think it's all about HFT and big players, but I think algorithmic trading today is a lot like day trading was in the late 90's.

Thanks for the insight about algorithmic trading. Any tips on finding out more? eg: good aggregators or blogs?

try the "Algorithmic Trading" group on LinkedIn; when I joined it I was amazed to find I already had hundreds of people who were contacts in that Group.

One would see that open-source projects and volunteer coding projects (such as the one I will post at the end) are a good way to get a leg up.

Being able to derive a linear regression is less important than being able to see the end from the beginning and being able to read math. I know a coder-cum-CEO (freshly minted) who has a hard time understanding how to program a numerical derivative, yet has produced amazing works based on what he does know. I said it before--we all need each other, especially as the focus has been on specialization (so that no one person has the entire skillset required).
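
For context, "programming a numerical derivative" usually means something like a finite-difference approximation. Here is a minimal Python sketch using the central difference; the function name `numerical_derivative` and the default step size are assumptions chosen for illustration, and the best step size depends on the function and on floating-point precision.

```python
import math


def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation: (f(x + h) - f(x - h)) / (2 * h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


if __name__ == "__main__":
    # d/dx x^2 at x = 3 should be close to 6.
    print(numerical_derivative(lambda x: x * x, 3.0))
    # d/dx sin(x) at x = 0 should be close to cos(0) = 1.
    print(numerical_derivative(math.sin, 0.0))
```

The central difference has error on the order of h squared for smooth functions, which is why it is usually preferred over the one-sided version (f(x + h) - f(x)) / h.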


Right on.

You can make Java a lot more fun if you use generics, functional patterns and frameworks like Google's Guava, even if you still need to create a FactoryFactoryFactory once in a while.

But "fun" isn't the same as efficient and maintainable.

As far as maintainability goes, I'd say the internal DSLs I've made in Java are as maintainable as internal DSLs made in other languages -- that is, they're very maintainable until you run into some complexity cliff.

When I program Scala or C# I tend to curse type erasure in Java generics, but when I actually program Java the generics system makes me smile. I can very efficiently turn what's in my head into a design and code.

In a language like Haskell whatever you gain from the language you tend to pay back in missing or buggy standard libraries. If I was writing a webapp in any off-brand language, for instance, I wouldn't trust that the urlencode function works correctly until I'd written some test cases.
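
The sort of sanity test cases meant here might look like the sketch below. It uses Python's standard `urllib.parse` as a stand-in for whatever library is under suspicion; the helper name `check_urlencode` and the sample strings are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode


def check_urlencode():
    """Run a few sanity checks against a URL-encoding implementation."""
    # Spaces and reserved characters are percent-encoded when safe="".
    assert quote("a b&c=d/e", safe="") == "a%20b%26c%3Dd%2Fe"
    # Non-ASCII text is encoded as its UTF-8 bytes by default.
    assert quote("é", safe="") == "%C3%A9"
    # Form-style encoding uses "+" for spaces.
    assert quote_plus("a b") == "a+b"
    assert urlencode({"q": "a b&c"}) == "q=a+b%26c"
    # Encoding followed by decoding should give back the input.
    for s in ["", "100% sure?", "x=1&y=2", "日本語"]:
        assert unquote(quote(s, safe="")) == s
    return True


if __name__ == "__main__":
    print(check_urlencode())
```

The round-trip loop is the cheap part to port to another language; the exact expected strings are what catch a library that forgets to escape "%" or mishandles non-ASCII input.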

I'll admit functional Java is a little verbose (can't write a closure in just one line) but the Hotspot compiler is very good at inlining functional code and collecting the garbage from it. I can definitely say that I get much better performance w/ functional Java than I usually see with the Interpreter pattern with two orders of magnitude less code size.

The author clearly has a bias (I'm a data scientist so respect me and pay me a lot). He's then gone on to describe some standard programming and maths skills that a huge number of people have (taught to engineers/scientists/programmers). I'm going to get downvotes and be labeled troll but I just have to plainly disagree. Data science isn't some magic new career field, it's simply the application of standard scientific tools to tables of numbers. As netflix and kaggle competitions have clearly demonstrated, literally anyone from anywhere has a shot at being the best on any particular spreadsheet of numbers (that's what it boils down to: a (possibly large) spreadsheet of numbers).

You actually do not contradict the author that much. The thing is, the skills that are required are somewhat standard programming and graduate level maths, which narrows down the number of people to graduate level computer scientists. As he mentioned, data science is NOT JUST the application of someone's algorithms to the data, you need to have skills to preprocess the data, be able to use distributed systems for big data, actually know what is going on under the hood, etc. Also, I doubt that the netflix and kaggle competition winners were just anyone from anywhere, they probably already had quite a bit of experience with ML.

Agree, unless you're arguing that graduate level math(s) are only 'truly understood' by individuals with post-graduate math(s) degrees. In my opinion, reading (and practicing) is a viable alternative to the same skills. Or, as more famously said by Matt Damon... (http://www.youtube.com/watch?v=ymsHLkB8u3s)

Actually, the strongest candidates I've screened have been self-learners who list Coursera or their own OSS projects on their resume with little academic background. (Our best analyst is a guy who is qualified to repair VCRs with his trade skill diploma in rudimentary electronics.)

The original gigaom article was about neophytes (people who only took a Coursera course) beating out people with much more experience. This has occurred many times on kaggle. All the preprocessing etc. you're talking about is not a new thing, a lot of database developers have been doing that for decades (and that's in addition to my original claim that pretty much anyone with a science or engineering degree has those skills as well). As I said elsewhere, this "data science is hard" meme is just a ploy to inflate salaries (which every career does anyway, so there's nothing particularly sinister about it).
