Big Data's Big Problem: Little Talent (wsj.com)
162 points by Liu on Apr 28, 2012 | 156 comments

I'm not sure that the kinds of employees that this article describes will ever be a large number. There could be more of them in the future, but someone who is top-notch at all of statistics, programming, and data-presentation has long been less common than someone who's good at one or two of those. Companies might consider looking at better ways to build teams that combine talent that exists, instead of pining for more superstars.

I'm reminded indirectly of an acquaintance of mine who works on repairing industrial machinery, where companies complain of a big skills shortage. They either fail to realize or are in denial about what that means in the 21st century, though. It might've been a one-person job in the 1950s, a skilled-labor type of repairman job. But today they want to find one person who can do the physical work (welding, etc.), EE type work, embedded-systems programming (and possibly reverse engineering), application-level programming to hook things up to their network, etc. Some of these people exist, but it's more common to find boutique consulting firms with 3-person teams of EE/CE/machinist or some such permutation. But companies balk at paying consulting fees equivalent to three professional salaries for something they think "should" be doable by one person with a magical combination of skills, who will work for maybe $80k. So they complain that there is a shortage of people who can repair truck scales (for example).

I see this in software dev consulting, and it probably happens in many other fields as well.

I see companies looking for one person who's a highly-skilled DBA, sysadmin, developer and who can interact affably with all levels of people in a company, including customers, at the drop of a hat. Often it's because they had one 'magical' person who did all that, though usually not very well, and the next person (or team) who comes in after on that project has to deal with a bundle of undocumented crap that never really 'worked', but worked well enough to keep some people happy.

This scenario has been surprisingly common, and I worked with a company last year that was committed enough to hire for dedicated DBA and sysadmin positions, rather than continuing to rely on app devs to handle all that stuff. The separation of concerns has worked out pretty well, though at first there was some concern about the cost of adding 'dedicated' people. In reality, all that work was being done anyway, often in ways that weren't terribly understandable by anyone outside the project.

The time it takes to get feature X done, tested, db upgraded, rolled out, servers maintained, patches applied, etc... is going to be roughly the same. If that's going to take, say, 100 hours, it's either 3 weeks of one person, or 1 week of 3 people. It's not worked out quite that cut-and-dried, but it's coming close. And the ability for each person/team to focus on their core skills and let someone else handle the "other stuff" has meant that, generally, the quality of things is better all around.

Perhaps slightly offtopic, but at least in Software dev consulting, there's a market for the multi-hatted individual in consulting directly, and pay is generally commensurate with the number of hats you can speak intelligently on to a customer.

This isn't necessarily true everywhere -- when I lived in Memphis, TN, I often felt that I was dooming my professional career as, every time I ran into a challenge, I'd fork my efforts and start tackling it. This means that I was getting skilled in (but not expert in) a wide variety of things. In short, my knowledge set was very wide, but somewhat shallow.

It wasn't until I moved to the DC area that I realized there's not only a market for that type of person, but a fairly lucrative one.

There are other tracks too, Architect, Management, whatever... whereas the scale repairman who can weld, machine and EE on a scale system is not exactly limited to just working on scale systems, but isn't necessarily opened up to as many positions as with software dev.

I don't think you're off-topic at all. I think your point speaks to the core of the issue at hand, which is that companies only want to hire folks who have already mastered whatever particular skillset they deem important at that specific point in time.

I don't for one minute think there aren't enough people who can do "data driven analysis" or whatever it is that companies find important "today". Companies just don't want to invest their time and money in creating that someone who will be able to do the specific thing that they need.

They always seem to think that someone will be "out there" to solve their specific need, and all it takes is a job posting on some job board.

Take a note from the old record label industry, which had A&R departments that would take on an individual artist who had merit and then nurture them into the world-class musician who makes the label a ton of money.

The record industry analogy is brilliant. Major labels of the past were not unlike production lines. They weren't just management and promotion companies, but rather more like schools where promising resident talent was honed and music would be written for and played by the most fitting artists. If one goes through hits of the '50s through '70s, they will find that the same songs by one label would often be performed by multiple artists, sometimes for years, before that final breakthrough performance would come along when the perfect match between the singer and the song would be found.

It was, then, a matter of knowing what to look for in new talent and finding ways to bring out their strengths through good and appropriate songwriting. People these days often complain about the quality of music being lower than it used to be, but the truth is that for every hit of the early times there would be many times the number of "duds" and just unsuccessful attempts that would either be mediocre songs or bad matchings between the artists and the records they perform.

Despite all that, I still think records of the past are a lot more full of substance and soul than anything released today.

That's good to hear you found your niche. I sometimes look back with pangs of regret, to be honest. When my compan(ies) needed visual basic help, I figured it out. Then never used it again. Then repeated that over and over with asp, php, sas, sysadmin, graphic design, crm, erp, even non-tech things such as accounting, loss prevention in varying degrees... So I can't really apply as an 'expert' at any. I don't dwell on it, but I am not sure if I would do it exactly the same way again.

It also depends on what it is you "go wide" on.

Early in my career I had the option of taking the proprietary track (mainframes, DEC/AXP, Microsoft Windows NT, several other proprietary platforms), or the open one (Unix, Linux, shell tools, GNU toolchain). One thing I realized was that just on an ability to get my hands on the tools I wanted to use, the open route was vastly more appealing. It's gotten somewhat better, but you could still easily pay $10-$20k just for annual licenses for tools you'd use.

The other benefit I found was that there was a philosophy of openness and sharing which permeated the open route as well. I've met, face to face, with the founders of major technological systems. And while there are many online support channels for proprietary systems, I've found the ones oriented around open technologies are more useful.

Dittos on training for proprietary systems. It's wonderful ... if you want to learn a button-pushing sequence for getting a task done, without a particularly deep understanding of the process. The skills I've picked up on my own or (very rarely) through training on open technologies have been vastly more durable.

There is a dark side to the separation of concerns. Lack of understanding. One person doing three jobs can fully understand what choices make the overall work more efficient. One DBA, one sysadmin and one dev only have direct insight into their own work. People, in my experience, can only optimize for what they understand and generally only care about their own work.

I believe this is why we are seeing roles like devops becoming popular. Specialization introduces a communication overhead and most companies already do a terrible job of employee communication. Merging tightly coupled roles back together helps reduce friction and improve productivity.

In a small example such as your own, it probably works out closer to pipelining. The more common case, from my experience, is that the communication overhead and lack of understanding eat up more and more of your time as each new person is introduced. Law of diminishing returns takes effect and a year later everyone is always in meetings.
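
To put rough numbers on that, here is a minimal sketch. It assumes a model that is not from the comment above: every pair of people on a team forms one communication channel, and each channel costs a fixed amount of syncing time per week. The function names and the half-hour figure are hypothetical, chosen only for illustration.

```python
def pairwise_channels(team_size: int) -> int:
    """Number of distinct person-to-person links in a team of the given size."""
    return team_size * (team_size - 1) // 2


def weekly_coordination_hours(team_size: int, hours_per_channel: float = 0.5) -> float:
    """Hypothetical cost model: every pair spends a fixed time syncing each week."""
    return pairwise_channels(team_size) * hours_per_channel


# Channels grow roughly with the square of the team size, not linearly.
for n in (1, 3, 6, 12):
    print(n, pairwise_channels(n), weekly_coordination_hours(n))
```

Under these assumptions a 3-person team has 3 channels and a 12-person team has 66, which is one way to read "a year later everyone is always in meetings". Real teams rarely talk all-to-all, so treat it as an upper bound.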

There's a balance to be struck. Specialization brings with it tunnel-vision, and having the bulk of people on a dev team have a decent understanding of the other parts of the stack is certainly useful to help avoid that.

Jack-of-all-trades devs have a place, but dedicated people in specific roles also have a place. Those places will change and move over time as the nature of the project and the business changes (initial dev into maintenance, early market upstart vs established leader, etc). Understanding and accepting that role changes may be necessary is probably the hardest thing for some businesses to accept, and properly making those changes (filling roles with good hires) is arguably one of the hardest things to execute on.

I think you are very much on point with your comments.

I am one of those people who are actually capable of going from MIG welding to designing websites, writing iOS apps, developing embedded hardware and software as well as GHz-range electronics with FPGAs, mechanical design and FEA.

The only option for someone like me seems to be to run your own business. Nobody is likely to pay for the combined skill set. Which means that having these skills is both a blessing and a curse depending on your point of view. I can take any product from drawing to completion. My resume scares most employers. And, in many ways, rightly so.

Hiring someone who can do the work of five people is very risky. You lose one person and your entire team is gone. And, of course, there is no way that one person can have the productivity offered by a team of specialists.

In the end, someone with my skill set either ends up doing their own thing or in a managerial position where the wide knowledge base and context gained from actually being able to do the work can be harnessed to guide and assist a team in achieving the required goals.

As an entrepreneur, having a wide skill set can be priceless so long as you start letting go as soon as you can start hiring specialists. This can be hard for some. It's great to be able to do it all when you want to launch something and save a bunch of money. Once launched, you need to divest yourself from responsibilities as quickly as possible because you will hit productivity walls and you simply can't focus on everything at the same time.

The age of the generalist is pretty far gone. If employment is the goal it is best to focus on one subject and become really good at it.

I think another issue is that no one is interested in doing 'on the job' training. Few people learn statistics, programming, and data-presentation in college (not all three anyway). Companies might consider finding someone smart with one or two of the skills and expecting them to learn the other skills on the job. And what is talent? To me, talent is the ability to learn to do something quite well. To say there is a lack of 'talent,' when you have only searched for people who /already/ have those specific skills is disingenuous. You have barely looked.

Who is complaining about the lack of talent? Companies! Who produces talent? Companies! Who can fix this? Companies!

Part of running a company efficiently is hiring, grooming and retaining juniors (who will become seniors). For God knows what reason, few companies actually understand this. If the skills you are looking for are lacking, hire someone smart and teach them.

I agree somewhat, but retention is a fairly major problem. The shift away from career-length employment means that neither employers nor employees assume there will necessarily be much loyalty or longevity in the relationship. I think the decline in on-the-job training is directly related. Engineering firms used to be able to assume that it's okay to lose money on the first five years or so of an employee's work, if they built up skills that will make the company lots of money over the next 30-40 years of the employee's career. But if the company invests five years of significant training in a junior employee, and then the employee jumps ship to do freelance consulting or work for a competitor, the training never ends up paying for itself.

That's definitely true, but I think that a part of on the job training is building loyalty. If you like the people you work with and the salary and benefits are pretty decent, you are not likely to want to go looking for another job.

For my last job (at a really big company), the only time that substantially increasing my salary came up was when I was already on my way out the door, and they realized 'oh shit, we really depend on this guy' and tried to counter offer.

I would get glowing reviews but the maximum my salary could possibly ever increase in a year was 5%. Changing companies it could increase by as much as 50-60%. So we can say that it is a 'bad relationship' but mostly it is simple math. Of course you are going to have to pay the highly skilled person a salary that is commensurate to their skills. Additionally, you should work to keep them at the salary the market will bear rather than waiting for them to get a better offer from someone else. It will cut into your profit margin, but it is a lot better than being on the defensive and having to counter offer against a company that the employee has already talked herself into wanting to be at.
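
The "simple math" can be sketched as follows, assuming raises compound annually at the 5% cap and using the 50-60% jump quoted above. The function name is hypothetical.

```python
import math


def years_to_match(jump: float, annual_raise: float) -> float:
    """Years of compounding annual raises needed to equal a one-time jump.

    Both arguments are fractions, e.g. 0.50 for a 50% jump.
    """
    return math.log(1 + jump) / math.log(1 + annual_raise)


print(round(years_to_match(0.50, 0.05), 1))  # 8.3
print(round(years_to_match(0.60, 0.05), 1))  # 9.6
```

In other words, under these assumptions it takes eight to ten years of maximum raises to reach what a single job change could offer.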

The whole thing seems to be a problem created by the volatility and rapid growth/movement of the tech industry. Job loyalty originally started to decline in part because companies didn't last, and began a downward spiral as logical "next steps" were taken, such as cutting on the job training.

As for salaries, I think the gradual increase was historically normal, but the potential for dramatic growth in skill and effectiveness is new and enabled by the tech sector. We haven't figured out how to cope with it yet. (No, you can't just give everyone 50% raises YOY).

What prevented employees from jumping ship before? Were there better long term benefits associated with staying with a company?

Yes, on the latter point: vacation time and pensions typically increased based on length of service with the company. You ended up with a much worse pension if you had 10 years' service with each of four companies, than if you had 40 years' service with one company, under the traditional defined-benefit pension schemes.
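
A rough sketch of why that happens, under assumptions that are mine rather than any particular plan's rules: each employer pays a fixed accrual rate times years served times your final salary at that employer, benefits are frozen at the leaving salary with no inflation adjustment, and salary follows the same 3% annual growth path whether or not you change jobs. All figures and function names are hypothetical.

```python
def salary_in_year(year: int, start: float = 40_000.0, growth: float = 0.03) -> float:
    """Salary during the given career year (year 1 is the first year)."""
    return start * (1 + growth) ** (year - 1)


def pension(stints: list[int], accrual: float = 0.015) -> float:
    """Total annual pension from consecutive jobs of the given lengths in years.

    Each employer pays accrual * years_served * final salary at that employer.
    """
    total = 0.0
    year = 0
    for length in stints:
        year += length  # last career year spent at this employer
        total += accrual * length * salary_in_year(year)
    return total


one_employer = pension([40])
four_employers = pension([10, 10, 10, 10])
print(round(one_employer), round(four_employers))
```

With these numbers the four-employer career ends up with roughly a third less pension, purely because three of the four benefits are calculated on salaries from decades earlier.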

There are probably a lot of cultural changes influencing it though, perhaps more; changing jobs frequently as a salaried professional just wasn't something many people in my dad's generation actively considered. One of many factors might be the change in how promotions are done; it used to much more often be "within the ranks". You worked your way up to FooBar VP or even FooBar CEO by starting in a regular job and getting promoted up the ladder, which required staying at the company for a long time. Now it's more common to hire external people right into senior posts.

Yeah, one of the long term benefits was that you could work there your whole career, if you wanted to. By the mid-80s though, it was clear that layoffs were to become a regular fixture of corporate life.

I completely take most of your points, but I think that pretty much all quantitative PhD's are going to be close to "data scientists". Given that stats and explaining your research are requirements, all that's left is to train them to program, which a lot of people are already doing. As a matter of fact, since I heard about this big data stuff I've been honing my skills in this area, in case the hype actually manifests.

> I think that pretty much all quantitative PhD's are going to be close to "data scientists"

Having taken all but one of the core requirements for a master's degree in statistics at a university with a well-respected statistics department, I can tell you that's very much not true.

The true challenges in data science have almost nothing to do with what you spend 90% of your time as a graduate student studying (whether you're getting an MA or a PhD, this applies the same). You may happen to end up a qualified data scientist, but that's not by design of the program.

The big problems in data science are almost a disjoint set from the big problems in statistics (at least the solved ones), and that's because the things that are tractable from a theoretical/mathematical perspective are very different from the ones that we hope to solve in the workforce. We're just starting to bridge this gap in recent years (particularly with the advent of computers), but that's a very, very nascent trend.

This isn't unique to my university, either - most schools just simply aren't teaching the type of skills that a data scientist - not a statistician, but a data scientist - would need to be competitive in the work force. Those that do know these skills mostly do by chance - either because they branched into statistics from another discipline, because they were forced to learn it on the job, or because they took the time to learn it themselves.

All three of those are pretty rare - I recently took a class in applied data mining and Bayesian statistics. Except for a few undergraduates majoring in comp sci, the class was mostly graduate students in statistics, and those who knew how to program were in the stark minority (and were very popular when we were picking project groups!)

> all that's left is to train them to program

And to turn everything that they've learned and studied for the past two, four, or more years on its head so that they can actually put it to use. Okay, not everything, but at least 80% of it. Seriously, studying statistics at a high level is incredibly valuable, but it's not sufficient - it's not even going to get you half of the way there.

I'm in the same boat and one of the funnier professors in math stat loves to talk about the students he hasn't "ruined" because they manage to learn programming and practical finite sample wisdom and go on to be successful in the industry.

And then he talks about his other students, with great love, who just like proving theorems.

Again, I completely see where you are coming from. However, the difference between a Masters and a PhD is huge, far bigger than the gaps between any other forms of education.

In a PhD, essentially everything you learn is to master a particular topic, or solve some kind of problem. This can often involve programming (it did for me) and almost certainly involves statistics (again, it did for me). The most important characteristic of a PhD is that you learn all this yourself (I certainly did). For instance, I was the only person in my department to learn R (although there were some oldtime Fortran and C programmers in my department), and then I ended up learning some Python and Java along with bash to deal with data manipulation problems and administering psychological measures over the internet. These are the kinds of skills that led to my possessing some of the skills needed to be a data scientist, and with some experience in the private sector, I'll get there.

Bear in mind that I (almost) have a Psychology PhD, and this would all have been far easier for me if I had worked in physics, chemistry or any of the harder sciences. So from my perspective, I can see that this is where the data scientists of the future are going to come from.

Note that I looked up the job market, and made a conscious decision to train myself in these kinds of skills throughout my PhD, but if you are not capable of performing this kind of analysis then you probably shouldn't be doing a PhD anyway.

I really don't see how programming upends all that grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers. I'm not that good a programmer though, certainly not outside the application of stats, but within the next few years I will be.

> In a PhD, essentially everything you learn is to master a particular topic, or solve some kind of problem. This can often involve programming (it did for me) and almost certainly involves statistics (again, it did for me).

Yes, but the original point was that more or less any quantitative PhD would be expected to have these skills.

> For instance, I was the only person in my department to learn R

Case in point - and I can tell you from knowing the PhD students that if I found someone who knew how to program in any language not used primarily for statistical computation (R, Stata, Matlab, SAS etc.), I would consider them the exception, not the norm.

The exact opposite is true about a data scientist.

> but if you are not capable of performing this kind of analysis that you probably shouldn't be doing a PhD anyway.

Or you just don't care about those types of jobs - and apparently there are plenty of those, because many, if not the majority, of PhD students I can think of aren't looking for data science jobs.

> I really don't see how programming upends all that grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers.

It's not programming per se, but the computational power that it brings makes certain techniques feasible, and other concepts and methods aren't obsolete - just no longer optimal. This is really a comment about statistics specifically. Most job postings for data science positions mention some form of the phrase 'machine learning' - and if they don't, they often have that in mind. Unfortunately, while demand for machine learning dominates the job market, in the grand scheme of things, it's just one branch in the field of statistics, and its 'parent' branch was relatively obscure until very recently. To this day, if a PhD student finished their program having next to none of the required academic background for machine learning, I doubt most academics would bat an eyelid. It's just not considered important from an academic standpoint. It's unfortunate that we have such a disconnect between academic interest and industry demand, but it's very much the case.

A basic example that I often cite about how computational power has fundamentally changed statistics from how it was for the previous few decades is in our selection of estimators. (I often cite this because anybody who's ever taken a statistics class probably had this experience). In every introductory statistics class (and for many non-intro classes as well), when studying inference, you spend 90% of your time talking about unbiased estimators, i.e. those whose error has a first moment (expectation) of zero, and the other 10% is a 'last resort' when no 'better' estimator exists. Who decided that the first moment was the most important? What about the second? Third?

Well, it turns out that the first moment is easier to calculate, and, by coincidence, it happens to be the most relevant when your dataset is small (say, between 30 and 100). But once you're talking about datasets with observations which number in the thousands (which is still 'small' by some standards today!), you'd be insane to throw out any estimator that converges at a linear rate (rather than at the rate of \sqrt{n}) just because it introduces a small bias.

But we do - and that's reflected in the sheer amount of academic research and literature that discusses the former, and the sheer lack of research that discusses the latter. In many cases, the theory exists, but it was developed in an era in which it could never feasibly be applied.
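
A small simulation makes the trade-off concrete. The example is my own choice, not one from the comment: estimating the upper bound theta of a Uniform(0, theta) distribution. Twice the sample mean is unbiased but its error shrinks like 1/sqrt(n). The sample maximum is always slightly too low (biased), but its error shrinks like 1/n. The function name, sample size and trial count are assumptions for illustration.

```python
import random


def simulate_mse(theta: float = 1.0, n: int = 1000, trials: int = 200, seed: int = 0):
    """Return (mse_unbiased, mse_biased) for two estimators of theta.

    Data are n draws from Uniform(0, theta).
    Unbiased estimator: 2 * sample mean (error shrinks like 1/sqrt(n)).
    Biased estimator: sample maximum (always a bit low, error shrinks like 1/n).
    """
    rng = random.Random(seed)
    se_unbiased = 0.0
    se_biased = 0.0
    for _ in range(trials):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        se_unbiased += (2 * sum(sample) / n - theta) ** 2
        se_biased += (max(sample) - theta) ** 2
    return se_unbiased / trials, se_biased / trials


mse_unbiased, mse_biased = simulate_mse()
print(mse_unbiased, mse_biased)
```

At n = 1000 the biased estimator's mean squared error should come out far smaller than the unbiased one's, which is the sense in which insisting on zero bias can throw away the better estimator.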

Vestiges of this era are visible even in many statistical software packages - for another basic example, regressions by default assume homoskedasticity in errors, even though this is almost never valid in real life. Why? Because in a previous era, everyone imposed this assumption, because while the theory behind the alternative had been developed, it was expensive to carry out in practice (it involves several extra matrix multiplications).
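
For a regression with a single predictor the "extra matrix multiplications" collapse into a few sums, so the difference can be sketched without any libraries. The block below compares the classical standard error of the slope with White's heteroskedasticity-robust (HC0) version. The choice of HC0, the function name, and the synthetic data are all assumptions for illustration, not anything a specific package does by default.

```python
import math
import random


def slope_standard_errors(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Fit y = a + b*x by least squares; return (b, classical_se, robust_se)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    # Classical formula: assumes every error has the same variance.
    s2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(s2 / sxx)

    # White (HC0) formula: each squared residual is weighted by its own
    # squared deviation of x from the mean, so no common variance is assumed.
    robust_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx

    return b, classical_se, robust_se


# Synthetic data (an assumption for illustration): the noise grows with x.
rng = random.Random(0)
x = [i / 5000 for i in range(1, 5001)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 2.0 * xi ** 2) for xi in x]
b, classical_se, robust_se = slope_standard_errors(x, y)
print(b, classical_se, robust_se)
```

On data like this, where the error variance rises with x, the classical formula understates the uncertainty in the slope, and the robust figure comes out noticeably larger.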

I'm painting with a broad brush, but the general picture still very much holds.

I completely see your point on estimators and the nature of many if not most statistics courses. I suppose that I was lucky enough to study non-parametric statistics in my first year of undergrad, as there's a subset within psychology that's very suspicious of all the assumptions required for the traditional estimators.

That being said, I think you're missing my major point which is that a PhD should be a journey of independent intellectual activity, so the courses one takes should be of little relevance, and so can therefore be downweighted in considering what PhD students actually learn. I accept that this is an idealistic viewpoint (FWIW, the best thing that ever happened to my PhD in this context was my supervisor taking maternity leave twice during my studies, which forced me to go to the wider world for more information about statistics).

I accept your point about machine learning not being a major focus of academia (well, except for machine learning researchers), and I think it's awful. It's very sad that the better methods for dealing with large scale data and complex relationships are only used by private companies. It's not that surprising, but it is sad.

That being said, I firmly believe that any halfway skilled quantitative PhD can understand machine learning, most of which is based on older statistical methods. It may not be taught (yet), but it's not that much of a mindblowing experience. I do remember that when I first heard about cross-validation I got shivers down my spine, but that may just be a sad reflection on my interests rather than a more general point.

_delirium's post acknowledges your point but is looking for an even rarer person:

"There could be more of them in the future, but someone who is top-notch at all of statistics, programming, and data-presentation has long been less common than someone who's good at one or two of those".

Someone that can program, understands statistics and can present the data in an appealing manner without losing significant fidelity. Many people underestimate the difficulty and skill required in presenting data in a way that makes sense and also actually says something.

There is a significant gap between presenting data that is satisfactory to a research advisor and something that a business person with barely enough time to think can grasp without misconception.

Again, I completely see the difference (and am actually in the process of moving full time to the private sector from academia, so will probably understand a lot more in six months) but visualising data well is not that hard.

Step 1: Learn R
Step 2: Learn PCA
Step 3: Learn ggplot2
Step 4: Play with the different geoms until you understand them (seriously though, everyone's eyes are optimised to find patterns, and if you can apply significance testing to these then you should be good)
Step 5: Profit!?

Note that I am being somewhat facetious here, but I suspect that the mathematical knowledge and ability to apply this to business problems will be the real limiting factors, as good practices in data analysis, programming and visualisation can be learned. Granted that will take a long time to learn, and there will be individual differences, but it's doable.

Whether or not it will be done at all though is another matter.

Again, delirium's point is trivially true if one requires these people to know all of statistics, programming and data presentation as I don't think there's anyone who knows all of any one of these subjects.

I suppose it somewhat depends on what the skill levels for each of these areas need to be, and that varies from person to person as well as from application to application.

Allow a short vignette from a former academic and now management consultant.

We spent six months at a major pharmaceuticals client examining their reimbursement data. Poring over many millions of rows of transaction data and thousands of payment codes (which, of course, were unique across sales geographies), we determined the ten regions at highest risk of reimbursement collapse. R was used, maps were created, beers all around.

But almost none of it was used for the executive presentation. In fact, the only part that was included was that we had ten regions that needed fixing, and our suggestions on how to fix it. You see, the CEO was dyslexic, the chairman of the board was colorblind, and the COO was a white-boarding kind of gal, so given this audience the nuts and bolts of our advanced statistical analysis were simply irrelevant.

This is hardly surprising. If we are having so much trouble hiring people who are fluent in Big Data, how can we expect business leaders to be even conversant? With only slight exaggeration, the way you do your analysis and the visualizations that you create are not important.

Companies are demanding Big Data scientists because they suddenly have lots of data and see the term Data Scientist in the news. But what they really want is not Data Scientists, it's business insights and implications from Big Data. The customer needs 1/4" holes, but we're all arguing over which brand of laser powered diamond drill they should buy.

Nailed it.

I still remember my first business presentation. I had a slide talking about how I did a statistics study. I was told to take the word "study" out because it had bad connotations for the target audience (middle managers at Bristol-Myers Squibb if you're curious).

The comment was probably right. But I was horrified.

I agree, and it's all the more true if you consider that "presenting" data may actually be more like creating an interactive environment to explore data.

I believe that data analysis yields the best results when perusing the data and tuning the models are closely connected tasks.

"I think that pretty much all quantitative PhD's are going to be close to "data scientists"."

Exactly. Fundamentally, "data science" is known far more widely by its other name: science. Yet we've reached the bizarro-world place where there are huge numbers of un- or under-employed scientists looking for work, while companies are freaking out about hiring the "rare" computer programmer who happens to know some statistics and self-identifies as a "data scientist". It's rather absurd.

Anyone who can earn a PhD can learn to program, but being good at engineering the very complex processes that are needed for effective machine learning applications is a skill that is not so easily acquired.

I've worked with quite a number of quantitative PhD level people in my career and most often the quality of their code leaves much to be desired.

This is a total ploy by large companies to increase the H-1B visa cap. I have seen companies post job openings with starting salaries of 40K for experienced developer positions so they can then claim there were no American applicants.

It's a self-fulfilling prophecy: if companies outsource the jobs, then people don't study those skills for fear of their job going to India; then the companies complain that there aren't enough Americans with CS degrees, so they need more H-1B visas.

The US needs tariffs in the IT industry to save our skill base so we can be competitive long term.

Protecting the industry with trade barriers definitely sounds like the way to make it competitive.

If we do this our IT industry might one day lead the world, like our sugar industry, automotive manufacturing and softwood lumber do.

Or: Like the South Korean electronics industry does.

One other thing is that the traditional entry route to software for the non-traditional candidate was via the helpdesk or QA department. You got your foot in the door, impressed the established engineers by turning around tickets quickly or by writing comprehensive bug reports, then when they needed another developer, you got the tap on the shoulder. Some of the best engineers I've worked with have come in through this route, including the best manager I've worked for.

If these kinds of positions are outsourced, how do we tap into that seam of talent anymore?

If dev is outsourced, then they'll be right there in the building when they need to tap a QA shoulder.

I've seen this method of advancement among the people I work with. However, my experience has been the exact opposite of yours, although I'm willing to admit to selection bias in my sample of candidates.

That's a great point. In theory, having a shortage of X at price Y is nonsensical. It doesn't make any sense to say there's a shortage of worker drones with a mastery of multiple advanced disciplines. There are "enough" of them, they're just called consultants and they make like a thousand dollars an hour. Or if there aren't enough of them, offer a salary that will enable them to at least pay off the student loans they'll accumulate getting a Ph.D and a couple masters degrees on top.

...in theory.

Good take on the article. I.e., there's a sense that money is to be made through "big data" if only we could find the right, highly skilled person.

Now, we notice that we're not making those gobs of money, as promised in the business press (and shown through a few example successes), and conclude that there must be a shortage of such people.

"claims of severe talent shortage in Big Data http://online.wsj.com/article/SB1000142405270230472330457736... Ok... where are the high salaries (500k$ a year)? No? No real shortage."


Business has a shortage of "big data" folks in much the same way I have a "huge sailboat" shortage. Neither of us wants to pay for it. We want it, but not for the going rate. Only one of us has a media platform, though.

The salaries are already moving north of $200k even outside of Silicon Valley and New York City and getting more expensive by the month. How high do they have to be before we have a "shortage"? The problem is not lack of money, it is that demand has greatly outstripped a finite supply.

Very high wages do not automagically create new people with the requisite skills and this is the real bottleneck. It takes significant aptitude and years of training/experience to become useful as a "data scientist". It is not as easy as I think people are imagining. We train people with excellent raw skills where I work, usually strong applied mathematics backgrounds with natural programming skills. It is much easier than trying to find someone outside with these skills, though we do attempt outside recruitment. It still takes years to develop the people we train into a good, basic data scientist.

Finally, some basic labor market economics. Just like how employers have restrictions on the number of people they are able to hire, employees have restrictions on the number of hours they are able to work. Or for the sake of this example, whether they are able to be a data scientist or not. The aspiring data scientists' restrictions all have to do with their ability. And as you point out, the cost of human capital investment in this area is very, very high.

One doesn't need to be an econometric theorist to be a data scientist, but I feel that far too few hackers truly appreciate the elegance of some standard, say, 1st year economics grad school econometric models. Things like panel models, IVs, 2SLS, GMM and even just taking seriously the basic assumptions of OLS regression -- there's a reason most econometrics classes (grad or undergrad) always start with the ~5 assumptions of OLS regression.

TL;DR Economists would make great data scientists (and better economists) if only they understood and appreciated computer science more.

In context: $200k is roughly what attorneys are paid in their 5th-6th years at large firms that service large corporations. Lawyers are in surplus right now.

In finance, talented individuals are routinely paid well multiples of $200k for their work (even post-crash).

So while $200k is high for salaries generally, it certainly is not high enough to imply a shortage in a highly specialized field.

Look, this job title is at most 2 years old. How can someone have years of experience in this? OTOH, there are plenty of people with strong applied math and good programming skills.

The set of skills existed before it had a trendy job title so you can have the experience even if it was called something else. This is true of most of the people currently working as data scientists. In a similar vein, I was designing big data systems years before "big data" became a term or trendy. For any particular odd skill mix you can come up with, there are people with that skill mix who are already doing a similar job. But usually people do not intentionally build that skill mix until it becomes an official job title and career path in the eyes of the public so it is a very small pool of people.

In the case of modern data scientists, having strong applied mathematics and programming skills is about halfway to where you need to be and a good starting point. The demand has temporarily grown much faster than the convertible talent pool can develop the additional set of skills required.

I am of the opinion that if demand is high enough, companies will start hiring "halfway there" people. But this will happen only if the market grows big enough. Right now it is still a niche market where companies are cherry-picking the right candidates, it seems. At least this is the impression I get from reading this thread.

The question of the size of the market is crucial. Small labor markets are very inefficient. The number of qualified people may be small, but the number of companies they can choose from is small too. It is hard to find a job when the number of companies hiring is probably less than 100.

Even more,

I don't notice an effort to expand the workforce by training or by recruitment of non-traditional workers, etc.

The contra-logical statement "99% of programming applicants are unqualified" gets a lot of play in this field. But I would suggest something like "we can make 99% of applicants look like idiots with our circus-like hiring process".

Yes, we've decided we have a shortage once we decide on five arbitrary disqualifications, expect all applicants to work 18 hours a day and start yesterday having no time to get up to speed (so experience on earlier large systems, say, is indeed not useful).

The base-level skill set is being a very good applied mathematician with some good computer science skills. This is why a lot of "data scientist" types have degrees in things like physics. A lot of the database ETL stuff can be learned.

This is the reason why I cannot be a "data scientist", despite being an expert in parallel algorithm design and with strong database ETL experience. It would require me spending a couple years studying mathematics in depth that I do not currently know. The vast majority of programmers are at least as deficient as I am in critical skills for these positions.

We train our data scientists at my company but we usually do not start with software engineers. Our feedstock is strong applied mathematicians with some programming skills because the mathematics part is by far the most difficult to train for someone who has not already been doing it for years.

This is the reason why I cannot be a "data scientist"

Are you worried about this outcome at all? Do you see yourself playing an important role on a data team, one with fewer modeling responsibilities but more infrastructure/DB responsibilities?

I'm considering this path and would love to hear your opinion.

To be clear, I chose this outcome. I am good with mathematics but not the mathematics usually needed as a data scientist and I have relatively little interest in investing the time to learn. Being a data scientist is a great job for some people but probably not what I would choose even if I was a developer again.

There is a continuum of skill balances; some people are more "data" than "scientist" and vice versa. The most useful balance varies from job to job. There are plenty of opportunities for people that have strong skills standing up clusters even if you have relatively weak analysis and model building skills. I would not dissuade anyone from becoming a data scientist, it will pay very well for the foreseeable future, but the skill set requires real effort to acquire. At a small company there is likely opportunity to learn the trade by coming at it from the infrastructure side of things.

It is a young enough area that it should be pretty easy for talented individuals to invent a career if they apply themselves.

Thanks for the excellent summary. This line:

I am good with mathematics but not the mathematics usually needed as a data scientist

resonates with me. If you're thinking of data science, you're facing a loooooong road of coursework (scientific computing or numerical methods, linear algebra, PGMs, machine learning, AI, possibly some optimization too) to get your foot in the door. I'm going to try, but one could spend years finishing that work.

In some ways, getting a data science gig is the opposite of getting a web developer gig. In DS you're competing with a large supply of intelligent PhDs, so credentials are very important; for web dev, your portfolio goes much further than any credentials.

That's a very good point: insisting that a single person must have skills in math, computer science and data interpretation is creating a purely arbitrary set of qualifications.

If we make a comparison with other fields, we can see that trying to find a single person who has skills in several diverse areas is not something that's usually done. For example, do companies try to hire bond traders who can implement their own trading software? Or do we insist that airline pilots or surgeons or CEOs should be able to build and repair the technology they use?

And if a company did manage to find a person who was both a good statistician and a good software developer, wouldn't the combination of responsibilities pull this person in too many directions, making it hard to focus on what they were doing? Also, it would take a lot of effort to stay current with the latest developments in both math and computer science.

If a company was too small to be able to afford to hire three full-time specialists to analyze their data, they could outsource their data crunching needs to consulting companies that specialized in this kind of work.

I also have a problem with the newly-coined term "data science". Scientists are engaged in discovering fundamental truths about the way the physical world works. I don't think that finding trends in a company's data counts as science. (I don't think that 99% of "computer scientists" are scientists either, including the professors I knew in grad school.) I liked the older term "data mining" much better, but I guess it's not trendy enough anymore.

I don't notice an effort to expand the workforce by training or by recruitment of non-traditional workers, etc.

"Code Year" for data scientists would actually need:

-- "Code Previous Year," where everyone would need to kick ass on Bayesian inference, linear algebra, and production-level software development; then,

-- during "Code Year," they'd proceed to learn about distributed algorithms, graphical models, and HMMs, then learn about distributed frameworks like Hadoop.

See Joseph Misti's comment here for more--it's really the most accurate list of skills one needs to become a data scientist:


>We want it, but not for the going rate.

Does it cost 500k$/year for someone who got into that field to come out net positive? I mean, I know that people in the US are complaining about the high cost of higher education, but 500k$/year for it to be a viable career path? I think saying there is a shortage is justified: if the salaries are decent (and from anecdotal evidence I know that they are), people should be made aware that this field might be worth entering and that they should look into it.

pmb is (correctly) saying that there is, by definition, no shortage of big data people. There's just a shortage at the (obviously below market clearing) price employers wish to pay. Also, you're ignoring the steep lead time to become a deep expert in stats / ML -- most likely a PhD plus significant programming time plus work experience.

>by definition, no shortage of big data people

A shortage is defined by the price being below the market equilibrium. We can debate what the equilibrium is, but I think even at current wages people should be interested in getting into this field (I know a friend who is doing a postgrad in math and interning in BI because the prospects are great); it's just that they can't get in fast enough. Therefore IMO there is a temporary shortage.

>most likely a PhD plus significant programming time plus work experience.

Is it normal to expect 500k$/year for that experience in some other field? I would sure like to know, maybe I can still switch :) I mean, I know there are people making that kind of money, but it can't be the average for a PhD with work experience?

Of course, by that definition, there is never a shortage of anything.

This is a very good question, and it made me think a little. Here's my stab at it:

If we define the shortage as "shortage of people willing to do X for $200,000 a year", that's clearly a bad definition. You should just pay more (as earl suggested) to get what you want. But what if that's just not possible on a macro level?

Consider if you have an aggregate demand of "the market needs a total of 500 Data scientists". If there are only 250 data scientists in the world, their salaries will be bid up, then I can see somebody crying that there's a shortage for affordable data scientists (whatever that means). But any capitalist will tell you that they're just looking for a free sailboat. On the other hand, the 250 data scientists are being paid a lot, and on some level skills are somewhat fungible, so you end up with non-data scientists (maybe vanilla statisticians/actuaries) moving sideways to get in on this payday. So you have some retraining, and in the long run things tend to work out. So there's no shortage.

But in the long run we are all dead. Thus even if we define a shortage as the shortfall in supply at _any_ price, this isn't sufficient. In the short run there can very well be a shortfall.

If it takes 3 years to train a data scientist (I'm just making stuff up here), and there are 250 data scientists on the market, if you have an aggregate demand for 500 data scientists _today_ --- completely price unconditional --- you just cannot fill it. Price is almost irrelevant (on the macro level --- you as an individual can always outbid your competitors). There is a temporal shortage that cannot be filled.

I think this is the precise definition of what a shortage is that you are looking for. Shortages exist in the macro scale. Shortages do not exist for individual companies (they should just pay more, and if the benefits are not worth the cost of hiring, there isn't a shortage, they're just cheap).

Well, I applaud your effort but you've missed things.

The market doesn't need anything. Human beings need and desire things and, in the capitalist model, the market is the means to balance those needs and desires.

If 50 entrepreneurs desire 500 data scientists for their enterprises and only 250 such scientists are on offer, they'll bid up the prices until some of them decide "I don't want them that much" and then we're done.

Of course, if our 50 entrepreneurs all have vast, vast wealth and a very strong desire for those scientists, we may see them creating bootcamps for quick data scientist development or whatever. Then they might indeed wind up with another 250 data scientists and again, we're done with no mystery.

Of course, you could argue that markets aren't as efficient as some claim, but that's just about irrelevant to our reasoning here since we're mostly reasoning by definitions and extremes, and so the OP basically holds true - you say you want it but you've shown you don't want it that much, and so you're really blowing smoke...

By that definition, the only absolute shortage is in things that aren't available at any price.

There can still be a relative shortage. There's a shortage of gold-per-pound relative to rocks-per-pound, for example.

As an engineer who's investing in developing "deep expertise in statistics and machine learning" I can only stand to benefit from it, but something about the current wave of Big Data hype makes me instinctively a bit wary.

Does this skills shortage really exist to the extent claimed? Are there really enough people out there who would know what to do with a 'data scientist' if they were able to hire one? I see more talk than action, I see vendors circling around looking to flog freshly-buzzword-compliant BI tools, prognosticators trying to push nervous businesses into engaging in an arms race over data.

Of course there's real value there too, for some at least. I hope my concerns prove unfounded, but worth retaining a healthy skepticism I feel :-)

As someone in the big data field on the ground (VP of Engineering), let me give you my thoughts on it.

Your impression about the hype is correct. There are a lot of vendors offering BIG solutions, if you pay them BIGGER money. Where I used to translate the word enterprise to $$, now I translate Big Data to $$$$$$$.

When I'm hiring, I don't go looking for Big Data people, because generally they don't exist. Statistics is a really great general addition to a programmer's toolkit. Machine Learning is valuable as well, although in my experience the application is more limited. What this article doesn't mention is a whole host of other skills required.

Modeling, and not just a formal mathematical model, but applying any type of model to your data to get insight. Check out the model-thinking class on coursera.

Exploratory Data Analysis, much different skill than confirmatory statistics.

Design of Experiments, specialized subfield within statistics.

Logistics, how to set up, maintain, and maximally utilize an efficient distributed cluster and build a pipeline getting your data to the cluster, cleaning it, building it into a model, and then extracting insight and delivering that end value.

Those are a couple of the skills at a high level. At a more nuts and bolts level, Hadoop is the de facto standard for Big Data. Learning how to build a data pipeline out of the Linux tool chain is very common in the data science world.

The overall value stream for Big Data is deep and wide. Most companies don't have expertise in much of these, and so at the current time you have to learn them yourself or find a company focused on building a team around it.

If you are just learning this yourself, you'll probably get an academic knowledge. If you want to make yourself valuable in the marketplace, you'll really want to get hands-on experience. Knowing a z-score is one thing, building a process to gather data and compute a model against it is a whole different ball game. As the article mentions, if you have nice clean data it's easy to apply a model. If you have messy ugly data from 20 different vendors and 200 clients with various failures, anomalies, and you have to figure out what type of model is helpful, oh and you have a deadline because for the 500th time someone promised something impossible to the client, then you have something closer to what Big Data is today.

* grammar edits

deadline because for the 500th time someone promised something impossible to the client

This is a killer in machine learning applications. The toolsets rarely cover the entire extent of what needs to be done, so at least some custom code needs to be written. But results aren't deterministic - you don't really know if it's going to work until you run it. Several iterations are often needed to get to the first useable results. It has all the problems of building any piece of software, plus another layer of risk that the accuracy just won't be there with the first thing(s) you try.

My point is... actually agreeing to be the machine learning guy on a project totally sucks because time estimates are almost meaningless, and the modern business culture is to label anything late as a failure.

I couldn't agree more. Accuracy is a problem, variation is another problem. Dealing with layers in the business who have no math or statistics background but very strong opinions is yet another complication.

These types of conversations aren't uncommon.

Other - "I need you to prove our stuff does X, Y, and Z".

Me - "Ok.."

<time elapses>

Me - "Ok the data shows our stuff does X but Y and Z are just random noise"

Other - "We ran it once before with this other guy and it showed our stuff did X,Y and Z. We've been promising it to our clients for a year. He gave us several examples, but when the clients asked to see the underlying data he couldn't produce it. So we just need you to prove it does X,Y, and Z."

Me - "The data only shows it does X. Y and Z are impacted positively through X, but once you condition on X, Y and Z are not causally affected by our stuff"

Other - "Yeah...well I promised the client we would give them a report by {{insert random ridiculous date here}} proving it did X, Y and Z. We are going to lose them if we don't deliver a report saying that"

Me - trying for the 50th time to explain they shouldn't promise a positive result when we've never looked at the data.

There are hundreds of variations on this conversation. "Your code is wrong" is one variant (which, depending on the timeline, is hard to dispute). Of course, if you take long enough that your code is correct, then you are too slow. "This isn't a science experiment, just make it work" is another. Watching someone go slack-jawed and start drooling because you accidentally used a math term is always interesting.

I have a whole new perspective of being on the cutting edge. It seems like it mostly means you are on the cutting edge of comments from people who don't know how ridiculously hard what you are doing is.

Man, this is statistics. You should be able to get any result you want!

I'm only half kidding. I can remember writing my first report (project summary) when I did a contract right after grad school. I put in maybe 5 graphs. Two looked good, three looked bad. The project manager just deleted the bad looking graphs and sent it on to the client.

"Some people use statistics like a drunk uses a lamp post - for support rather than for illumination."

The company I used to work for had a performance-based product. They only got paid if they actually showed improved accuracy against a given evaluation set. Then they got a fraction of the cost savings (say 1 year's worth).

This seems like it could be a good model for machine learning consulting, and one that I would certainly be willing to explore.

It would work something like this:

  1) You show me your problem and your data.

  2) We come to an agreement on how accuracy would translate into financial results and on a fair split of the savings or earnings.

  3) I develop a model.

  4) You evaluate it based on 2.

  5) I get paid based on 2.

If my model doesn't meet minimum performance criteria I don't get paid. If it does very well, and assuming the problem was economically interesting in the first place, you save a lot of money and I get a fair sized chunk of it.
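
To make steps 2 and 5 concrete, here is a minimal sketch of how such a payout could be computed. The 25% share and the 5% minimum saving are made-up numbers for illustration, not terms anyone in this thread proposed, and the function name is hypothetical:

```python
def performance_fee(baseline_cost, new_cost, share=0.25, min_saving_fraction=0.05):
    """Return the consultant's fee under a pay-for-performance deal.

    All parameters are hypothetical: the costs are annual figures agreed on
    in step 2, `share` is the consultant's cut of one year's savings, and
    `min_saving_fraction` is the minimum improvement needed to be paid at all.
    """
    savings = baseline_cost - new_cost
    if savings < min_saving_fraction * baseline_cost:
        return 0.0  # the model missed the agreed minimum: no payment
    return share * savings


# A 10% saving clears the 5% bar, so a fee is due.
fee_when_it_works = performance_fee(1_000_000, 900_000)
# A 2% saving does not clear the bar, so nothing is paid.
fee_when_it_fails = performance_fee(1_000_000, 980_000)
print(fee_when_it_works, fee_when_it_fails)
```

The hard part is not this function but agreeing on how `baseline_cost` and `new_cost` get measured, which is what step 2 is really about.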

Feel free to explain why this business model wouldn't work.

Edited for formatting.

Most business people aren't interested in model accuracy as a term. They want something that provides benefit, e.g. cost savings, increased revenue, increased profits, etc.

The sales process of convincing someone they need an accurate model is tough, especially because robust models are time consuming and expensive to build.

If you can come up with a model that shows good results, and people know they need those results, then you can start a company selling either a service or product to get those results. If people don't know they need your results - then you have to educate them, in which case it's a much more difficult business to start.

I don't know many business people with the temperament, understanding, or the pocket book to deal with general research type problems.

I think there will be a day when this will work. But right now my concerns would be:

1) The people looking for outside help probably don't have ANY model, so baselines are difficult. (I use synthetic baselines like just predicting the average of the predicted variable every time, and that's a very valuable tool, but I don't think you could get paid by beating them.)

2) When would you cut your losses on a failed project and move on? That would be incredibly difficult to do on a project that you had spent weeks or months on and not been paid. It's like cutting your losses on a failed trade in the stock market... it seems like it would be easy and obvious until you actually experience it yourself.

3) Once a company was in a position to take you up on your offer, they might as well post it on Kaggle and get a hundred people to work on it for peanuts. I'm really rooting for Kaggle b/c one of their long term goals is to let people like you (and me) make a living doing analytics work like you describe. But right now they just don't have the volume and all the projects pay out just a few thousand dollars (and only if you beat the other hundred participants).

4) If I was a company, and didn't have the expertise in house to build the model myself, I'd be wary I was really getting what I paid for. If I'm paying for a 10% boost in accuracy, how do I measure it rather than just taking your word for it?
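
For what it's worth, the synthetic baseline from point 1 is easy to sketch. Everything below (the numbers, the choice of mean absolute error as the metric, the function names) is an assumption for illustration only:

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_baseline(train_targets, n):
    """Synthetic baseline: predict the training average every time."""
    avg = mean(train_targets)
    return [avg] * n


# Made-up numbers, purely for illustration.
train_targets = [10, 12, 14, 16]
test_targets = [11, 15, 13]
model_predictions = [11.5, 14.0, 13.5]  # output of some hypothetical model

baseline_error = mean_absolute_error(
    test_targets, mean_baseline(train_targets, len(test_targets))
)
model_error = mean_absolute_error(test_targets, model_predictions)

# Fraction of the baseline's error that the model removes.
improvement = 1 - model_error / baseline_error
print(f"baseline MAE {baseline_error:.3f}, model MAE {model_error:.3f}, "
      f"improvement {improvement:.0%}")
```

This also speaks to point 4: if both sides agree on a held-out test set and a metric up front, the client can run this comparison themselves instead of taking the consultant's word for it.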

Another thing that increases difficulty is that depending on people's background and experience there are very different views on what is most important.

For example, in my view exploratory data analysis and visualization are less important than using strong models and figuring out how to apply them to problems. I say this because I haven't seen any visualization methods that really tell you much about how hard or easy it will be to develop a predictive model. Sure, you can do a 2-D LDA projection, and if there's a huge amount of overlap you know you're not looking at a trivial problem. But if the problem is linearly separable, someone's probably already got a good solution in Excel.

As for the "Big Data" buzzword it applies well to some problems like NLP or web analytics where massive datasets are available. In these cases it's clear that the more densely your data samples the problem space the better your performance will be and even very simple models will perform well.

However there are many applications where the amount of available training data is not so large and you need to use models which are powerful enough to discover non-obvious patterns. Applying such models and adequately evaluating them, which is critical to avoiding over-fitting with relatively small data sets, requires developing quite complex processes.

Thanks for the explanation. I'm trying to get into that business myself, and it's good to know where the remaining gaps in my knowledge are. (For me it's the "logistics" part.)

Some years ago, there was a similar wave of enthusiasm for "data mining." Plus ça change...

@NyxWulf: where do you work / email me / curious

I felt a similar kind of skepticism when I saw it took ~3 years to improve the Netflix recommendation system by just ~10% - in the context of the Netflix Prize, with great minds (data scientists and practitioners) participating and collaborating.

Maybe the initial system was quite good and it had no space for easy-and-fast enhancements, I don't know.

But a 10% overall improvement in 3 years (just as a quantitative ratio, esp. if it translates directly to the same growth pattern in revenue) is something that makes the business types yawn.

There's a fantastic paper, Hand 2006, which notes the strong tendency for simple models to get nearly all of the performance possible out of solvable problems.

Hard problems do better with complex algorithms but there's also just less to be gained.

The best solution tends to be simple models applied to the right kind of data such that the problem has become easy. This is sometimes pretty difficult though since the simple models are designed on simple data, which might not always be what you've got.

But what if a 10% improvement means 10 M$/year?

Anyway, I think there are many applications where getting the absolute best performance isn't as important as finding the problem, figuring out how to apply a machine learning model to it (which includes getting the necessary training data), then training an off-the-shelf model. The latter of these may take a day or less; the other phases may well require both more thought and more effort.

A 10% increase in 3 years translates to around a 3.3% yearly growth rate. So in your example the 3.3% increase would be that 10M$, so 1% of your annual business revenue is 10/3.3, or just above 3M$.

But that means you already have a really significant business that makes ~300 M$ per year. And you manage to increase it just by peanuts (relatively speaking).
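
The arithmetic above, as a quick sketch. The variable names are mine; the compound rate is included because dividing 10% by 3 is only a linear approximation:

```python
total_gain = 0.10  # 10% improvement over the whole period
years = 3

linear_rate = total_gain / years                     # simple division
compound_rate = (1 + total_gain) ** (1 / years) - 1  # compound annual rate

# If one year's gain is worth $10M (the figure from the parent comment),
# the implied annual revenue is that gain divided by the yearly rate.
yearly_gain_musd = 10
implied_revenue_musd = yearly_gain_musd / linear_rate

print(f"linear: {linear_rate:.2%}, compound: {compound_rate:.2%}")
print(f"implied revenue: ~${implied_revenue_musd:.0f}M")
```

The compound figure comes out slightly below the linear one (roughly 3.2% versus 3.3%), which doesn't change the conclusion.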

And there is inflation in the economy, and the opportunity cost of not investing such a huge sum, or part of it, in Apple stock (for example) during those years.

My point explained better:

The startup success of getting from zero to millions just because of clever ML/data-science/statistics is something to be respected and admired. But for already big-business all this big-data buzz might provide just minor enhancement opportunities at best.

Of course all these numbers are hypothetical, and I have no idea what the actual Netflix numbers are, but aren't you assuming a 100% profit margin?

If you actually have a machine learning application that increases annual revenue by 10% and your initial annual revenue is $300M (like in the example) and your profit margin is 50% then (neglecting the $1M cost of the model because it's small and amortized over many years) your annual profits go from $150M to $180M which is a 20% increase. I don't think that is a number to yawn about.
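
A quick check of that arithmetic. Note the 20% figure assumes the extra $30M of revenue carries no extra cost; a second, hypothetical case where the extra revenue earns the same 50% margin is shown for comparison:

```python
revenue = 300.0      # $M per year, from the example
margin = 0.50        # profit margin
revenue_lift = 0.10  # 10% revenue increase from the model

profit_before = revenue * margin
extra_revenue = revenue * revenue_lift

# Assumption A: the extra revenue carries no extra cost (all of it is profit).
profit_all = profit_before + extra_revenue
# Assumption B (hypothetical): the extra revenue earns the same 50% margin.
profit_same_margin = profit_before + extra_revenue * margin

growth_all = profit_all / profit_before - 1
growth_same_margin = profit_same_margin / profit_before - 1

print(f"A: ${profit_before:.0f}M -> ${profit_all:.0f}M ({growth_all:.0%})")
print(f"B: ${profit_before:.0f}M -> ${profit_same_margin:.0f}M ({growth_same_margin:.0%})")
```

Either way the profit growth is at least as large as the revenue growth, so the point stands; how much larger depends on what the extra revenue costs to serve.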

On your last point I actually think the opposite is true. The larger a company's operations the more potential cost savings there are. If profit margins are slim, as they are in many industries, the effect on earnings of relatively small cost savings or revenue increases can be large indeed.

Well, the widely known real example I mention is the Netflix Prize result -> 3.3% CAGR (compound annual growth rate; the 10% is for the whole period of 3 years, not for 1 year).

The numbers you mention are the hypothetical ones. Go and find a real, publicly documented example that is close in values and margins to what you describe and I might agree.

NB. #1) Google and web-search, or some similar startup success story, as example doesn't count - they're the 2-person startup gone wildly successful, not a previously big entity that hired 2 ML-geniuses to open their eyes.

NB. #2) If I could bring a revolutionary increased value to a company - through vastly enhanced data processing and analysis - rather than doing a consultancy and educate somebody I'd rather enter the industry as competitor and prove the "old" guys don't understand the business anymore.

Last month Forbes reported that Netflix said 75% of what its customers watch comes from recommendations. Definitely some bang there.

Also, note that the 10% improvement was far from linear: year 1: +8.43%, year 2: +1%, and year 3: +0.6% (!!).

An interesting observation, indeed - so the enhancement opportunities were effectively explored within the first year or so. The next two years count for far less, though I guess the cleverest approaches started to emerge just then.

At a conference I attended last month, one of the keynotes estimated that there might be 250 people in the country with the skills needed to build non-trivial, ontology-based data systems. Even if that is a wild exaggeration, it is at least evidence of a perceived shortage.

Also note that an ability to transfer domain experts' knowledge into working models is at least as important as the Stats+ML bits.

I'd say the current academic research in ML is not oriented towards producing people who can use ML in real applications.

I've hovered around the periphery of a world-leading ML research group, and the first takeaway I have is that 7 years ago I thought the stuff they were working on was going to take the world by storm, but looking back, I can say it hasn't.

This group does a number of research projects on narrowly defined topics. 4 out of 5 of these projects try out some refinement of the method that doesn't really work. Maybe 1 out of 5, if that, points to a real improvement.

The big thing that's lacking is serious attempts to push the state of the art by attacking a problem holistically and "taking no prisoners" -- yet this is exactly the kind of thinking necessary to commercialize ML.

The leader of the group got tenure so he thinks everything is going OK. He won't even offer an analysis of why this technology hasn't been widely commercialized. PhD students from this group usually interview at Google, Microsoft and Facebook but these three employers are the only ones they consider as an alternative to academic employment.

There's definitely a need for people to continue working on developing new models and ideas. I think that's where academic research fits. That said the academic world could do a much better job of effectively evaluating and comparing models so that practitioners have a clearer view of what works where. There also seem to be a lot of biases that get perpetuated in the academic world, like the bias against neural networks.

However I think what's really needed for this technology to develop to its true potential is figuring out how to apply it to real problems and I think that's more a role for industry practitioners than academics. The problem is that for people to make a living at this there needs to be a market. I think what we are seeing in this area is a shortage of both supply and demand with the supply side hindering the demand side and vice versa.

I agree with Paul's comment. The 250 number feels low to me, but that is applying a specific model. Typically people come with some set of favorite models, and many of them provide the vast majority of the benefit a business needs. Especially when the current model in use is slipshod and busted at best.

The 250 number refers to the medical field. So the requisite background includes at least biology, and more preferably clinical experience. That diminishes the pool somewhat.

A similar phenomenon certainly manifests in other highly specialized fields though. There are far fewer people with both the skills we're talking about, and deep water E&P or big 3 audit experience, for instance.

Managers frequently wail about skill shortages, but very often it's pure hypocrisy. The real problem is the reluctance to do any training (and I don't mean formal training) combined with the desire to get proven experts in whatever field. Proven experts must have years of experience in applying their expertise. If no one lets people with less experience work in that field, where the hell would those experts appear from? Another dimension?

Can I be an 80% developer and 20% "data scientist" in your company to try the new role out? The bigger your company is, the less likely the answer is to be "yes". Since Big Data implies a big company, the resulting "shortage" is not surprising. It's self-made.

True statement. It seems the trend nowadays is "get a grad degree, foot the bill and the time yourself". Many companies I see and work with tend towards that mentality, as opposed to building a base of highly skilled workers from the inside. It's easier and cheaper for a company to ask if you have a piece of paper than for them to train you and get you up to speed.

Companies looking for "proven experts" are working on yesterday's problem. Find a better employer. Move somewhere with more choices if you need to.

How media sees Big Data:

BIG database => BIG machine learning algo => BIG MODELS => PREDICTIONs, Insights => $$$

How it is actually done:

awk -F"\|" '{print $1}' SCRAPED_file_pipe.txt | sort | uniq -c | head -n 10 => $$$
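For anyone who wants the same thing outside the shell, here is a rough Python equivalent of that pipeline. One caveat about the one-liner as written: `head -n 10` after `uniq -c` returns the first ten keys in sorted order, not the ten most frequent; ranking by count would need a `sort -rn` before `head`. The sketch below ranks by count. The sample rows are invented stand-ins for `SCRAPED_file_pipe.txt`, and the function name is hypothetical.

```python
from collections import Counter


def first_field_counts(lines, sep="|"):
    """Count occurrences of the first `sep`-delimited field on each line."""
    return Counter(line.rstrip("\n").split(sep, 1)[0] for line in lines)


# Hypothetical sample standing in for SCRAPED_file_pipe.txt.
sample = [
    "acme|2012-04-01|click",
    "globex|2012-04-01|view",
    "acme|2012-04-02|view",
    "initech|2012-04-02|click",
    "acme|2012-04-03|click",
]

counts = first_field_counts(sample)
# most_common(10) ranks by frequency, like `sort | uniq -c | sort -rn | head -n 10`.
for key, n in counts.most_common(10):
    print(n, key)
```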

Ah, you've used Ab Initio then.

This article is nonsense.

Talented developers are talented developers. At Quantcast we used Hadoop in production before it was even called Hadoop and now we process 10PB a day. We forbid our sourcers from using Hadoop as a resume search term because it meant absolutely nothing.

Statisticians who can code are scarce, but companies that know how to use them are scarcer.

Actually that is silly -- McKinsey should know that there is not and never will be a talent shortage. There will only be a shortage of talent at a particular wage rate.

If the companies paid newly graduated 'data-scientists' (what other kind of scientists are there? The tea-leaf reading kind?) 200k/year then they would have a lot more. It is pretty simple economics.

Companies already do pay close to $200k/year for entry level data scientists.

(what other kind of scientists are there? The tea-leaf reading kind?)

"Data scientist" refers to the guy who can set up a hadoop cluster, do statistics on TBs worth of data, derive useful conclusions and speed it up by tweaking the low level data formats or microoptimizing the calculation.

The issue is rarely paying these guys an extra $20k, it's simply finding them.

Setting up some lasers and a photonic crystal, imaging the output, making a graph in excel or matlab and drawing conclusions is a different skillset. Someone who can do the latter is a scientist who uses data, but he is not a data scientist.

The problem as I see it is that most companies are looking for the all-in-one perfect candidate. There is indeed a shortage of such people.

Say you need someone who knows a lot about Hadoop and Amazon EC2 and is also intimately familiar with most learning algorithms and has a PhD. You are having trouble finding the guy. You start crying about "the big data talent shortage".

And here is the problem. Most PhDs have no experience with Hadoop or Amazon EC2. Some of them might know Java well enough.

Now, consider a smart guy with a PhD who knows Java and has done something parallel with it, working on real "dirty" data. He can pick up Hadoop in no time from your software engineers. He will learn to tweak and optimize in time - it is domain specific and cannot be learned off the job.

Will he be hired? Probably not. But people will keep crying about shortage.

How hard can it be, though? Like taking a normal CS person and making them versatile with hadoop and so on? Could it be done for 20K$?

Making a CS person versatile with hadoop is not that hard. Making a CS person versatile with statistics is much harder.

See Zed Shaw's seminal article "Programmers Need To Learn Statistics Or I Will Kill Them All".


Making a math/science person versatile in CS is somewhat easier, but even that can be tricky. Many of them are bored by file formats, architecture, etc, and simply don't have the mindset of engineering.

How hard can it be?

Very hard. You run into all types of candidates who just aren't there yet: people working on research that's irrelevant to real world applications, people who have done data analysis/BI work that brand themselves as "data scientists," those who have the pedigree but cannot process and explore real-world data, those who have good analytical chops but not the distributed or advanced modeling experience, etc.

I've witnessed it first-hand, and it's tough to find the right person.

If it is that hard the bar is probably set too high. Most of the skills are learned on the job after all. Most smart PhDs who can program well and have sound knowledge of statistics can learn to do this stuff.

Given enough time, anyone smart enough to finish a PhD can acquire a set of skills. :)

But it's more than just solid statistics. We're talking about having enough mathematical fluency to develop models rigorously (not just "oh, we'll minimize MSE!!"), test those models, then implement those models--possibly using a distributed algorithm.

From what I hear, these skills take years to develop. Choosing to groom the wrong person is an extremely costly mistake, so making the choice is difficult.

All mathematics consists of rigorous models. But choosing and tweaking a model is more of an art. Most data scientists apply existing models to new data, they do not develop new ones.

I am sure it takes much less than "years" for any smart PhD in applied mathematics to learn most of data analysis tricks. It is not theoretical physics after all.

Most data scientists apply existing models to new data, they do not develop new ones.

I meant "develop" in the software sense. Data scientists use off-the-shelf libraries during initial research, but those libraries usually lack an important feature preventing them from going into production (typically, no support for concurrency).

I am sure it takes much less than "years" ... to learn most of data analysis tricks.

I used to be cynical about "data science," too. After four months of working on a data science team, though, I'm a believer.

A data scientist is really a "full-stack data developer." He or she needs the ability to work with advanced models, use them to analyze large amounts of data, and modify those models to work concurrently or in a distributed system if desired (and it's often desired). It's more than just "analysis tricks."

> do statistics on TBs worth of data, derive useful conclusions

That's gonna be the hard part. Most CS people I've met flee from math and, more generically, theory.

They do? Which companies? How do I find them? :p

Build a demo project showing you can do data analysis and they will find you.

Who is paying $200k for entry besides maybe google?


*Equity value, may vary unpredictably.

And it's entry post-PhD, not entry from college.

So $120k and (very expensive) lottery tickets is what you're saying =P

$120K as a startup employee? Damn, I live in a wrong country.

Bags of money are already being waved around, that is not the problem. Wages are already moving north of $200k for these positions because you can't find people with the basic skills for any amount of money.

Being a "data scientist" as currently defined in practice requires someone to be a polymath with skills that are individually high value and not commonly found together. Roughly speaking, you need some aptitude and experience in the following areas:

- mathematics, particularly statistics, computational geometry, machine learning, and probability theory

- parallel algorithm design, something for which most software engineers have no skill

- database ETL processes, formerly a highly specialized discipline only found in the database administration world

You can learn the mathematics in school or with some study. Most software engineers never develop a knack for parallel algorithm design even when they try e.g. virtually all software engineers who claim to know parallel algorithms can't explain why hash joins do not parallelize well. Lastly, ETL is something that isn't normally found mixed with the other two but which usually requires some significant experience to do correctly. Even if you are a master of mathematics and parallel algorithms, ETL skills are something you usually learn by apprenticing with someone who is an ETL master for a couple years.

Finding people that even have basic levels of skill at all three of these things is very difficult even if you loosen the criteria significantly. Unlike some other tech job fads, you can't mint a crop of data scientists in a year.

When I look at the junior level data scientists we trained internally with great basic skills out of school, it has taken years to develop them. This level of effort and length of time is the real bottleneck.

Computational geometry??? That's a new one for me. Do you mean only linear/convex programming?

Incidentally, I would really like to hear about the kind of Real Work that data scientists end up doing with TBs of data, because I'm always fuzzy on the details. MCMC? Variational methods? SVMs? Or is it more oriented towards frequentist statistical methods, applied at "web-scale"?

I mean actual computational geometry. Reality is significantly non-Euclidean in complicated ways that have to be accounted for if precision matters.

Spatio-temporal analytics or the processing of sensing data frequently requires this. For a simple example, the surface of the Earth is approximately an oblate spheroidal surface, not even a 2-sphere. You can use Euclidean approximations for many cartographic purposes but for analytics this can introduce large errors in the analysis. Understanding how to compute non-Euclidean geometry models is surprisingly useful.
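To make the size of the error concrete, here is a minimal sketch comparing a great-circle (haversine) distance with a naive flat-plane distance that treats degrees as x/y coordinates. Note the assumptions: the haversine formula itself models the Earth as a sphere with a mean radius, so it is still an approximation to the oblate spheroid mentioned above, and both function names are invented for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius; a spherical simplification


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_planar_km(lat1, lon1, lat2, lon2):
    """Treat degrees as flat x/y, scaled by km per degree at the equator."""
    km_per_degree = math.radians(1) * EARTH_RADIUS_KM
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude at 60 degrees north.
true_d = haversine_km(60, 0, 60, 1)
flat_d = naive_planar_km(60, 0, 60, 1)
print(f"great-circle: {true_d:.1f} km, naive planar: {flat_d:.1f} km")
```

At 60 degrees north the naive version overstates an east-west distance by about a factor of two (roughly 111 km against roughly 56 km), which is the kind of error that quietly wrecks clustering or nearest-neighbour analytics. Accounting for the spheroid as well would need a geodesic calculation on an ellipsoid rather than this spherical formula.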

Isn't that point of view also colored by ideology? Even supposing there are enough people capable of becoming that kind of "talent", what if those talents are also sought after in other kinds of jobs?

Granted, if it were really urgent, perhaps companies would start looking in the most remote places for talents, so with a population of 6 billion perhaps there really would be enough who could be trained. How many of those 6 billion are "free" in a sense, as in not needed for maintenance of human life (farming, medicine, building shelter and so on)?

But do economics really work that way? Could we extrapolate that logic to conclude that there is no problem in the world at all? All it takes is enough money to solve every problem - alas, the money doesn't seem to be there, or allocating it properly is apparently hard. (Hm, some of those talents might be able to help, for a true bootstrap solution).

It's not that you can solve every problem, but supply and demand are very real. If the wage rates for Big Data get high enough more and more people will try out the field, which will generate a larger supply.

The reason companies don't just throw out huge salaries, though, has to do with the demand side. The salary a company is willing to pay is related to the marginal advantage it can gain from hiring someone with that skillset. If, for example, a company will gain say 200k per year in total advantage, that would place a hard cap on how much it would be willing to pay in salary.

So if the advantage is very high, companies will pay more. If the supply increases sufficiently, wage rates will drop because there is oversupply. If the supply doesn't increase enough, wages will increase more - however each company will drop out at its own value point. This provides the natural limit to where most salaries cap out.
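A toy model of that argument, under heavy assumptions: each firm wants exactly one hire, knows the yearly advantage it would gain, and drops out once the wage exceeds that number. The function name and all the figures are invented for illustration.

```python
def clearing_wage_range(firm_values, num_candidates):
    """Return (low, high) bounds on the market wage in a one-hire-per-firm toy model.

    firm_values: the yearly advantage each firm expects from one hire,
    i.e. the most it would pay. When candidates are scarcer than firms,
    the wage must be at least the best excluded firm's value (otherwise
    that firm would bid) and at most the worst hiring firm's value
    (otherwise that firm drops out).
    """
    if num_candidates <= 0:
        raise ValueError("need at least one candidate")
    values = sorted(firm_values, reverse=True)
    if num_candidates >= len(values):
        # No scarcity: every firm can hire, so only the lowest valuation caps the wage.
        return 0, values[-1]
    return values[num_candidates], values[num_candidates - 1]


firms = [250_000, 200_000, 150_000, 120_000, 90_000]
for supply in (2, 4):
    low, high = clearing_wage_range(firms, supply)
    print(f"{supply} candidates: wage between ${low:,} and ${high:,}")
```

With two candidates the wage lands between $150,000 and $200,000; doubling the supply to four pulls it down to between $90,000 and $120,000. Each firm's own value is the point where it stops bidding, which is the "natural limit" described above.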

If those talents are also sought after for other jobs then the price will go up until one of the jobs will be done by some other method or some other person. I do have trouble imagining that anybody who is working on a farm would be a good data-scientist but then I know very, very little about farming.

Economics is not tainted or colored by ideology, it is a science. It is the study of how best to allocate limited resources that have multiple conflicting uses.

In this world there is nothing that is free; everything comes with some price. As long as there is a human want that is not fulfilled, there are no spare humans.

That isn't necessarily bad though. You can chart a society's progress by how few people are required to provide food to the rest. Once most Americans worked in agriculture; now only a few do. That is a good thing, because the rest of us can then do something else and satisfy some other human want.

And the remaining farmers are better off too, since they don't have to work as hard and have things like TVs and computers.

> Economics is not tainted or colored by ideology,

You must be joking. Economics is the most ideological of the sciences, because so much of it cannot be tested in nature or a lab, as it can only be tested at an impractically large scale and with many confounding factors.

You're absolutely right. That said, I don't buy that big data is as revolutionary as the Internet. While in theory every single business can collect data and optimize based on what it sees, this is way too complex for most businesses to deal with. While big data has certainly been critical for the business model of ad-based startups, I don't see it being used in other industries. People keep alluding to data-driven medicine and genetic analyses. These are some of the most complex information-analysis domains, and yet I don't see benefit commensurate with the big data hype. I'd love to hear counterexamples though!

This technology will change the world more than the Internet or any other technology in human history.

You're right that adoption is very slow. I'm convinced that businesses could save trillions of dollars by applying existing weak AI to their problems. Why aren't they doing it ? For one thing there's a huge gulf between the average business person's understanding of what is possible and what actually is. On the other hand the people who understand the technology don't have domain experience in various businesses. You can't develop solutions if you don't know what the problems are and it's very hard to guess at what economically relevant problems exist in fields you've never worked in.

There are probably other barriers as well. Domain experts are unlikely to champion technologies that may, well, replace them. Bayesian networks were developed in academia 20 years ago that outperformed doctors at medical diagnosis. Why aren't they being applied ? There are probably many reasons but I suspect conscious or unconscious resistance on the part of the medical community plays a significant role.

As for a talent shortage, I don't buy it. I'm exactly the sort of person this article talks about, with a strong mathematical background, excellent implementation skills and real world experience in developing and applying machine learning algorithms that have made millions of dollars for my former employers. I have had a website and a LinkedIn profile for over a year that make this fairly clear. How many consulting inquiries have I had ? Exactly zero.

Whether or not you have these skills: potential employers need to SEE them. There are lots of pretenders out there, and employers are appropriately wary.

Are you showing employers results on your web page that a worse ML practitioner can't match?

Putting results from a kaggle competition on my LinkedIn page landed me my current job (and I am still contacted by potential employers every couple weeks).

The employers of the world aren't stupid, but they aren't omniscient either. So you need to make it easy for them to see that you have the skills you claim.

Did you win the competition ? If not what was your rank, if you don't mind sharing ? I'd be really interested to know how much impact this could have.

I was #4 at the time my current employer contacted me. I've continued in the competition with a small team of their employees, and we are currently #2.

I don't know how sensitive the number of job contacts is to your ranking. My hunch is that a lower ranking would still establish credibility.

I couldn't reply to Estragon's comment below, so I'm replying here.

I'm convinced of it primarily because I'm convinced that the majority of activities that human workers perform do not require general intelligence (ie. strong AI). This includes most manual tasks : cleaning, cooking, customer service in restaurants, etc, but also many tasks performed by office or even "knowledge" workers.

A product my previous company developed replaced over 20K workers over a period of years. Few people even know it exists.

Once you are aware of this and follow the news you see it happening in virtually all domains: e-discovery systems replacing attorney hours, automated news story generation, etc. One area that has been lagging is robotics but this will start to develop very quickly, especially with the new DARPA Robotics Challenge.

Now to be clear many of the "Big Data" applications people are talking about may not fall into this human labor replacement category, but the underlying technologies are essentially the same.

I have had a website and a LinkedIn profile for over a year that make this fairly clear

It doesn't work like that. Your LinkedIn profile might easily land you any job in software development, but not consulting.

In my opinion, if you want to do consulting for big corp. you should figure out what it takes to do it. An attractive website and presentation, a few buzzwords, client testimonials, business cards, and the other blablabla. Yes, it's irrelevant (and shitty) to what you are actually doing, but that's actually the world of consulting.

I think you can go further than that. To get people asking for your time as a consultant you have to demonstrate experience and get close to vendors who already support clients you are interested in. For example, targeting a niche "big data" problem with a particular tool, and then developing a relationship with the community supporting that tool. That gives you access to the people who are looking for consulting.

That's definitely not my world and one reason I left big-corp in the first place.

EDIT ADDED: It seems like a really broken market if buyer decisions are completely orthogonal to the product being purchased.

I hope you are right. If you are, you have identified a potentially very, very lucrative option for a start-up. Broken markets can provide you a lot of money when you fix them.

This market is already (at least partially) covered by small consulting shops which provide a sales front for competent freelancers who don't feel like doing the whole corporate networking & sales ritual.

Where do I find these small consulting shops?

Well, the companies I know tend to do their recruiting based on word-of-mouth recommendations - so they have to know you from some previous job or you need to be recommended by someone etc. It's a tiny sample of a couple of companies though, and that probably doesn't generalize to all "small consulting shops".

Big Corp - "We can't find people who can do X!!!!"

Person who can do X - "I've been saying over here I can do X."

Big Corp - "Oh. We don't look over there, it's not how it's done."

I think I'm seeing part of the issue here.

  I'm convinced that businesses could save trillions of
  dollars by applying existing weak AI to their problems.

Why are you convinced of this? Do you have any case studies with obvious broad applications?

  I have had a website and a LinkedIn profile for over a year that make
  this fairly clear. How many consulting inquiries have I had ? Exactly
  zero.

Neither LinkedIn nor your ISP is responsible for marketing you. Have you been presenting your software at restaurant conferences? Do you have any case studies where it saved someone x% of their food costs?

Slightly off-topic, but your website breaks after visiting the RDMS page, as all the other links seem to be relative, so they attempt to go to pages such as /products/people.html

Sorry about that, thanks for letting me know. It should be fixed now.

Until the pay is comparable to finance, good luck?

I'd love to work on (arguably) cooler problems, but the combination of lower pay and the constant need to use the "hot new thing" to solve problems doesn't make transitioning look remotely attractive.

Really, the second is the HUGE obstacle:

- You don't know anything about aNNs? Sorry, no job.

- Nobody uses aNNs anymore, SVMs are all that matters. Sorry, come back after you catch up.

- SVMs? Man, we need someone who's got expertise in optimizing RFs and Bayesian Trees. We don't want "black box" machine learning. We need to "understand" the results. Sorry no job.

- Decision trees? GTFO man. We're doing rNNs now.

- I'm pretty impressed with your data mining knowledge, but we're looking for someone with a background in DLMs and GPs. Sorry, no job.

- repeat until vomit/suicide

I kind of wonder about the need for "badass" math skills; I'm not terribly convinced that math wizards are extraordinarily high value relative to people with other types of data analysis skills.

It seems the problem is that some companies are looking for a person who is an expert in setting up scalable systems (Hadoop cluster, storage, high availability, etc.) and who also knows statistics and efficient ways of processing and understanding the data. Good luck with that.

My observation is that requirements like this come from people who mainly did web programming (which was making a lot of money, so with money they became influential) and who assume that this is the equivalent of writing both Ruby code and JavaScript code.

Building a team is hard, and in order to solve a "big data" problem you need to build a balanced team.

Basically a solution looking for a problem.

They are right: the complexity that big data caters for requires expertise at both the technical and business level, which would be costly (though maybe not at the infrastructure level). In the current economy it looks even more difficult, since businesses want to squeeze the maximum out of every dollar invested.

IMO, it's too early a stage for big data solution adoption. However, the stage could be set for startups that can come up with innovative solutions that bring the cost down and are simple, useful, and easy to grasp.

I've done this kind of thing most of my career, including doing it for NASA and Unilever Research. You can't really train an average graduate to do this. You need someone with a pretty highly developed integration between 1:intuitive/creative abilities, 2:mathematical/analytical skills, and 3:engineering/ability to make things happen. Add to that 4:work experience in the real world, and 5:ability to easily understand how things work in a field you delve into for the first time... And there are very few people in the world who can do this. At my previous workplace we tried for a whole year to hire someone who would have at least some of these skills and seemed promising enough to develop the rest on the job. We couldn't find anyone although we interviewed about 30 different people (from about 500 resumes, most of them with a PhD in ML from a good university). And this was in central London, UK.

But you never tried training them. Sure, no one can do it right off the bat, without prior experience.

So I was wondering if any fellow HNer is on a quest to be at least comfortable around these problems. Can you share your plans? Currently I am starting with some linear algebra and I have plans to move to statistics then pick up a book on machine learning. I could really use some advice.

I'm on this path right now. Working simultaneously on an MSCS at NYU and a full-time developer gig at Knewton (a company which truly understands the value and risks of data R&D).

I'd really start with this awesome curriculum[0] by Joseph Misiti. He nails the mix of modeling, algorithms, math/stats, development, and distributed/Unix skills one needs to become comfortable around these problems. More importantly, his advice agrees with my experience on the data team at Knewton--we really use a bit of all of the above skills to solve our problems.

My email is in my profile if you (or anyone else) would like to chat about this a bit more. I'm also in NYC, and totally willing to grab tea and chat in person.

[0] http://www.quora.com/What-skills-are-needed-for-machine-lear...

Andrew Ng's machine learning class on coursera is a very nice, easy introduction to the subject.

Yes, and I started that before realizing that I need more background knowledge, hence starting with some math. :)

Not entirely sure this is true, to be honest. Most of "data science" lies in the work of collecting and cleaning the data to get it into a usable state. A recent story on the "fallacy of the data scientist shortage"[1] goes into more detail, but in reality what we want in this quantity are better data analysts. I love the idea of data science as, essentially, viewing statistical analysis from a computer science perspective, but the breathless predictions of a huge shortage seem a little overblown.

[1]: http://smartdatacollective.com/nraden/48952/fallacy-data-sci...

Can anyone explain what talent refers to in this context? Is it someone who has learned this stuff, someone who is capable of learning it, or someone who was born with an innate understanding?

Looks like the prelude to yet another H-1B buildup.

The only way we can really measure "shortage" is via compensation. If that's so, we need more hedge fund managers and surgeons, not grunt data crunchers. There are many problems with guest workers. The richest people in the world get special access to indentured labor. It targets specific industries, thus amounting to a subsidy. It helps big business crush small business. The H-1B in particular is a tool to increase outsourcing and keep wages down (wages which are typically earned in the highest cost of living areas in the country at 60 hours a week). H-1B? No, thank you! On the job training and good wages? Yes, please!

Only this time the US is less attractive. Even Mexicans have started to move out.

I just listened to a lecture on this at BU. Emerging Internet Technologies at IBM or something of that nature. He was basically trying to sell us his product that crawled the internet (mainly a firehose at Twitter) and gathered statistics for advertisers and presented it in pretty graphics.

The main issue they had was developing language recognition. Deciding if a user 'liked', 'loved', 'hated' or was 'neutral' about a product. Another issue that stood out to me had to do with their reliance on the internet. Just because 200 users tweeted that 'this movie is going to suck' does not really represent the overall opinion.

To reiterate, the whole buzz of the lecture was the biggest turn off. He wasn't explaining about how to expand on his product or where to go from here. Just that they had developed a product and we could use it instead of attempting to develop one ourselves.

Maybe 200 is too small a sample size, but you can glean and predict stuff with that sort of data. My master's thesis was about predicting box office sales based off twitter data. (coincidentally these guys published a couple months before me: http://www.fastcompany.com/1604125/twitter-predicts-box-offi...) but we had incredibly similar results. It is pretty interesting to see what you can do with that data.

If it makes you feel any better, you can build it yourself, I certainly did. I didn't even use any libraries like NLTK to build my sentiment analysis. Read some research papers and built it from scratch (code wise at least, the ideas used were fairly common). It's a fun challenge. I still work with that code every day and use it in my startup now :)
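For a sense of what "from scratch" can look like at its very simplest, here is a minimal lexicon-based sketch. It is not the approach described in the comment above, which isn't specified; the word lists, the negation rule, and the function name are all illustrative assumptions, and a usable system would need far more than this.

```python
import re

# Tiny illustrative word lists; a real system would need far larger ones.
POSITIVE = {"love", "loved", "like", "liked", "great", "awesome", "good"}
NEGATIVE = {"hate", "hated", "suck", "sucks", "awful", "bad", "boring"}
NEGATIONS = {"not", "never", "no"}


def sentiment(text):
    """Classify text as 'positive', 'negative', or 'neutral' by counting lexicon hits.

    A negation word flips the polarity of the next sentiment word.
    """
    score = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweets = [
    "I loved this movie",
    "this movie is going to suck",
    "not good at all",
    "saw a movie today",
]
for tweet in tweets:
    print(f"{sentiment(tweet):8} {tweet}")
```

Word counting like this misses sarcasm, slang, and context, which is presumably why the language-recognition part was the hard problem mentioned in the lecture.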

Surprised Ben Rooney did not mention IBM's acquisition of Vivisimo (Apr. 25) the day before this article (Apr. 26): http://www-03.ibm.com/press/us/en/pressrelease/37491.wss "IBM Advances Big Data Analytics with Acquisition of Vivisimo"

This is good news.

Why don't the people who hire doctors, dentists, and lawyers suffer from the same talent shortage that the people who hire 'big data' computer scientists feel?

Because it's better across the board to start your own startup than work your ass off for a 4% raise at a place which recognizes you as top talent. I'm on the verge of starting a startup myself, removing myself from the people in this list. There is a shortage of talent in computer science, but never in the other disciplines, it may take another 30 years for the suits to have the ability to understand why.

I agree with your overall line of thinking.

In your last sentence you wrote: "There is a shortage of talent in computer science, but never in the other disciplines, it may take another 30 years for the suits to have the ability to understand why."

Could you elaborate on this point -- do you feel that there is not a shortage in disciplines such as dentistry and law because many people are willing to work very hard for only 4% raises?

Thank you!

You can actually make good money by going into law or medicine. You have to work hard and be skilled of course, but let's be honest, that is also required for a start-up.

I can't help but think that's because you can't run a law firm without being a lawyer, so the boss has some idea of what it means to be a good lawyer and how to treat one.

Most lawyers at the top-income end hate their bosses and jobs. The partnership track at large law firms is a dog-eat-dog, 80-hour-week hell. The only happy people are the few who claw their way to the top and the many who drop out. The ones in the middle are suffering as badly as any stereotypical bank programmer.
