Being able to spin a good yarn isn't really enough here. If data science just becomes a code word for brogramming your way through a set of black-box ML algorithms, then I will welcome the inevitable crash of data science.
If insight, rather than just "making it go", is the goal, then classic applied statistics plus reproducibility feels like a much better story.
A fundamental challenge I see here is how bottom-heavy data science feels now. There are tons of people out there trying to "get into data science" from other fields, but the number of people with substantive domain knowledge, strong programming skills, and the math background to be able to understand the ML black boxes is quite small relative to the number of people calling themselves data scientists. In other words, real insight definitely is (or should be) the goal, but real insight is really hard, and scikit-learn is so easy.
My hope is that this improves over the next 5-10 years - the more mature data science becomes as a discipline/career, the better the education will be and the more experienced people there will be. There is a risk in the meantime, though, that a flood of relatively inexperienced people causes a collapse in expectations for data science, making businesses less eager to hire them in the future.
Furthermore, there are a number of practitioners who expect their data to be ready for them in some perfect state. Probably the majority of the task is creating a pipeline for acquiring data and labeling it appropriately if necessary, which may require developing some ontology or classification with guidelines rigid enough that someone in India can delegate the task to a large team. Then the practitioner spends an inordinate amount of time optimizing some heuristic whose meaning drifts over time, or that is completely inconsistent with the goals of the product. These are both problems outside the realm of domain knowledge or experience.
-Some candidates can write great code, but don't have the math background to understand what ML black boxes are doing.
-Then there are STEM PhDs who have never written non-research (i.e. maintainable) code or had to translate a qualitative business problem into a quantitative problem they can solve.
Both types of candidates need to come in at a "junior" level and do some on-the-job learning in order to be fully successful data scientists. IMO it appears to be easier to teach STEM PhDs how to code than programmers how to do math, but that might be personal bias (since I came from the former group).
Also, quant devs are heavily involved in building the calculation engines that invoke the models. These engines handle real-time dataflow, calibrations, etc., and are often highly non-trivial.
My guess is that that type of role is relevant in a data science context. This is much more than data cleansing and piping data between databases.
Turns out programming never really required much of a math background; it's just that the level of brain teaser programming poses is about the same as what a math education poses. So anyone who has survived an advanced math degree will find programming a piece of cake, but that doesn't mean people from non-STEM backgrounds have no hope of mastering data science.
Yet it's a joke to talk about data science without reference to advanced math concepts. Although it takes significant domain knowledge, data science is not just business analysis aided by a spreadsheet. Modelling is an essential part of it.
Then what's the point of the Ph.D.? Why not just go straight from B.S. to junior data scientist then?
Anyone have any good resources for self-teaching stats? I have a BS in math but only took one stats course, and it was as terrible as all intro-stats classes are. I have a strong, proof-based understanding of probability theory, but haven't found a similar approach to stats. It all seems to be "if data looks like this, use this test, watch for these pitfalls" which is terrible for building intuition.
Datacamp also launched a bunch of new stats courses recently. I haven't checked them out yet, but their courses are usually good quality.
We solve the latter problem by having business analysts or product managers who "get" the technology enough to provide direction, even if they wouldn't be effective implementing it themselves. I think there's a next phase where, as we try to do data science at scale, we look for a similar role that deeply understands the business and knows enough about the analytical techniques to define the problem and work with a team of specialists to figure out the best analytical approach.
People talk about data science teams being multifunctional - with programmers, data engineers, data scientists, and designers - but we always leave out the role for someone with deep business expertise and shallow but meaningful data science expertise.
Also, depending on the political priorities of the organization, data science may not even be really used. Executives/management may look for analysis results to support their ideas, and just throw out the ones that don't align with what 'they already knew to be right.' After all, who wants to be proven wrong?
EDIT: One anecdote -- I worked for a company and showed pretty plainly that the length of customer engagement had fallen since the previous year. My boss basically said "why did you point that out?" because it made them look bad to the owner of the business.
huh. Hasn't crashed yet.
The number of people who can't work out what kind of solution a DS scenario needs is very disappointing. I'm not even talking about giving a "correct" solution: most can't even work out the class of problem!
Here's something to think about: Are you doing visualization? Building some kind of model to explain existing behavior? Building a predictive model? Is it supervised or unsupervised?
This is pretty basic stuff (surely it's close to the FizzBuzz of data science?), and yet it is borderline impossible to find people who just nail it.
Why is this?
Some people think that anyone who uses a SQL RDBMS doesn't qualify as a "data scientist" and that the role is limited to people who have experience in "big data".
In general, if you want better applicants for this type of position, your job description should be explicit about the actual activities associated with the job, you should post it where people who know how to do those things hang out, and you should make sure it's apparent that you're willing to compensate well. You'll still get plenty of bad applicants, because every job posting does, but this should help refine it a little bit and clue in some good people that you're worth applying to.
So the answer to your question is basically "Well, what is a 'data scientist'?"
> 95% of the time I solve my questions through iterative SQL queries in a few hours, while I see most people using laborious statistical methods the first chance they get.
I do hiring for a data team, and explicitly don't advertise a data science role. While we do have projects that are advanced enough to fall under a data science moniker, the majority of candidates we got for that role had very... academic expectations. But a business isn't a static, cleanroom environment with everything already collected, cleaned, standardized, validated, and normalized for use.
Re-titling the job posting to Data Specialist or Data Analyst resulted in a lot more candidates that are perfectly well suited to the type of problem solving you mentioned. There's an endless number of business problems where this skillset can be applied, making them very flexible and providing high labor utilization. That includes getting to a "good enough" state for the few problems we have that could benefit from the more advanced statistical methods a data science candidate would bring to the table.
That said, even with the vast majority of analyst candidates, I find them very eager to apply known methods; flexibility and problem-first thinking are rare and extremely valuable.
The nuance between analyst and scientist is less clear. Can you describe what type of candidates the two titles draw, or what you look for depending on the title?
My background is in Engineering (I'm a materials engineer by qualification). What differentiates me from a statistician, analyst, etc. is my domain knowledge. I have almost 15 years of experience working with industrial processes. I have the background knowledge of chemistry, thermodynamics, mechanics, etc., which someone with a stats background would be lacking. So when I am asked to optimize an industrial process I can utilize that expertise whilst developing models.
I would expect that a data scientist would know more about machine learning and would have a much stronger stats background than me. They'd also probably write much better code (I work in C/C++ and SAS; from what I have seen, data scientists tend to be Python/R focused).
The core concept many people seem to miss is that the point of data science is to find meaning in large quantities of data, to recognize patterns, and to present them in a meaningful and easy-to-understand way. Really, to allow for educated, data-driven decision making. Each approach is a tool for you to make an informed decision, but if you apply them the wrong way then... well, could be worrisome down the road.
Understanding the full problem and then finding the right tools or approaches to solve it is necessary instead of putting everything inside a black-box model.
Some people are doing the take-home test, but they make it into a multi-week ordeal, or say it should take "about a full work day", which always stretches out into 4-5 evenings after the real world is accounted for. I've never had enough interest in working somewhere to finish those long take-home tests; I always get like half to three-quarters of the way done before I decide I don't really want to work there that much anyway.
If someone needs to improve their conversion funnel and help with segmentation and reporting, they need an analyst. If you want to build an algorithm to determine what content is shown to each customer when they make a request, you need data scientists.
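To make the analyst side of that split concrete, here is a rough sketch of what a conversion funnel report can boil down to: counting distinct users per step with plain SQL and dividing neighbouring counts. Everything in it is an assumption for illustration only (the `events` table, the step names `visit`/`signup`/`purchase`, the sample rows, and the helper names `funnel_counts` and `conversion_rates`); a real funnel would also have to deal with event ordering, time windows and deduplication.

```python
import sqlite3

# Hypothetical event log as (user_id, step) pairs; the data is made up.
events = [
    (1, "visit"), (1, "signup"), (1, "purchase"),
    (2, "visit"), (2, "signup"),
    (3, "visit"),
    (4, "visit"), (4, "signup"), (4, "purchase"),
    (5, "visit"),
]


def funnel_counts(rows, steps=("visit", "signup", "purchase")):
    """Return [(step, distinct_users)] in funnel order, computed with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, step TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    counts = dict(
        conn.execute(
            "SELECT step, COUNT(DISTINCT user_id) FROM events GROUP BY step"
        ).fetchall()
    )
    conn.close()
    # Steps with no events at all are reported as zero users.
    return [(step, counts.get(step, 0)) for step in steps]


def conversion_rates(counts):
    """Users at each step divided by users at the previous step."""
    return [
        (later, n_later / n_prev if n_prev else 0.0)
        for (_, n_prev), (later, n_later) in zip(counts, counts[1:])
    ]


if __name__ == "__main__":
    counts = funnel_counts(events)
    for step, n in counts:
        print(f"{step}: {n} users")
    for step, rate in conversion_rates(counts):
        print(f"-> {step}: {rate:.0%} of previous step")
```

Nothing here needs a model, which is roughly the point of the distinction: the hard part is agreeing on what counts as a step, not the arithmetic.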
However, product managers aren't typically involved in solving data-science-related problems. This is primarily because most product managers don't have the math/stat/compsci background to be useful.
However, I predict this will change in the next 5 years.
I've been hearing from multiple people that this is a gap that's really hard to fill right now -- PMs who can work with heavy DS and AI products. It's much easier to train experienced data scientists to be PMs than the other way round.
There's nothing in "data science", as per this post, beyond what has always been true of building a piece of software for non-technical clients. It has always been true that having domain expertise provides a huge boost. It has always been true that requirements are moving targets, that objectives are fluid, that clients don't talk computer science ("data science" in this case). Clients (internal or external) often don't know how to describe their own workflows, and especially edge cases, in rigorous ways. All deja vu. It has always been true that you need to "frame the problem", "clean the data", "design and apply the algos", "communicate the results". We've been grappling with this for 50 years.
The article is marketing for a data science bootcamp which likely answers those questions. There has been a lot of discussion on HN about the merits of bootcamps for developers, but not much about the merits of bootcamps for statisticians, or even the entire hiring workflow in that field.
I've been looking into Data Analyst/Science jobs at companies in the San Francisco Bay Area and almost every position wants a Masters/PhD, either explicitly stated as a requirement or implied. If there is a high demand/low supply of data science jobs out there, I'm unsure how a data science boot camp/tutorial would be able to compete.
My sense is that while there's a huge variance in quality on both, the median bootcamp seems to be more in touch with industry and better at imparting real-world skills than the median master's program. I'm not sure if employers have started to recognize this yet (from your comment, it seems that they haven't). But once the feedback loop completes, I'd wager that they will.
Also, getting a graduate degree and attending a data science bootcamp don't seem to be mutually exclusive. For instance, there are data science bootcamps that specifically target PhDs.
Honest question - is this really necessary and applicable in this scenario? We're talking about a full time employee accessing company data, presumably with any necessary permissions, to generate insights for internal consumption within the company about its customers?
I think marketing is the obvious first use case, but in large organisations there are often gains to be made looking at operational data.
I'm finishing up a Ph.D. in engineering (heavy into climate change research, so tons of programming + mathematical + statistical knowledge in addition to combing through TBs of data with R and other languages).
What kinds of problems come up frequently in the data science industry that differ from those in academic research?
"A decade in academia taught me a bunch of sophisticated algorithms; a decade in industry taught me when not to use them."
