Hacker News
Data Science: Reality Doesn't Meet Expectations (dfrieds.com)
462 points by danielfriedman on April 7, 2020 | 163 comments

> Moreover, you may quickly realize much of this work is repetitive and while time-consuming, is “easy”. In fact, most analyses involve a great deal of time to understand the data, clean it and organize it. You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.

This. Universities and online challenges provide clean labeled data, and score on model performance. The real world will provide you... “real data” and score you (hopefully) by impact. Real data work requires much more than modeling. Understanding the data, the business, and the value you create is what matters.

As per #6, better data and model infrastructure is crucial in keeping the time spent on these activities manageable, but I do think they’re important parts of the job.

I’ve seen data science teams at other companies working for years on topics that never see production because they only saw modeling as their responsibility. Even the best data and infrastructure in the world won’t help if data scientists do not feel co-responsible for the realization of measurable value for their business.

Training integrative data professionals could be a great opportunity for bootcamps. Universities will (understandably) focus on the academically interesting topic of models, while companies will increasingly realize they need people with skills across the data value chain. I know I would be interested in such profiles. :)

I took a data visualisation class in uni that handled this really cleverly. The second assignment sounded very easy. The teacher provided links to the sources where we could find data.

Most people figured that with such a simple assignment (not significantly harder than the first one, which was also easy-ish) they could put off doing it until the last moment.

Most people failed.

This real world data needed hours upon hours of cleaning before it was in any way useable. Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start.

Never again will I underestimate the dirtiness of real world data. One of the best teachers I had.

This is universal to STEM degrees I think. In mechanical engineering classes you analyze a beam, in real life you analyze an assembly with 50 components that have undergone 100 revisions with 20 different materials and loading from 4 directions that vary with time. Oh, and you have 4 sensors to give you information to analyze critical stresses. But one of them is broken, and Bob who can fix it is on PTO until next Monday, so...

Internships are supposed to fill this gap but it'd be nice if all students could get a taste of real world systems and data. For tech, maybe if they could partner with the IT department at the school to get them exposed to real, messy data. Maybe there are some teaching datasets with over a billion rows that people could play around with.

> get them exposed to real, messy data

This times 1,000.

The biggest surprise to me when I got out of school was how messy things were - data, systems, management, priorities...everything.

When I went back to grad school, we had arguments about the assumptions. It was a total 180 from undergrad, and much more useful. So when I came out of grad school, I was able to deal with the ambiguities - maybe even thrived because I understood them.

I majored in nonprofit management and every class had a required field work component with an area charity. I learned so much from the combination of intense coursework and real world experience. Now that I'm the head of data science at a corporation, I wish such integration existed in this field.

> This is universal to STEM degrees I think. In mechanical engineering classes you analyze a beam, in real life you ...

Hard to believe this. Don't these degrees require rigorous laboratory assignments where the student learns to differentiate the best-case scenario from real-world uncertainties? STEM is not just some IT certification.

As a mechanical engineer: no, my education didn't.

The problem is that most real-world problems take too much time to really solve to fit into any modern curriculum.

Hmmm. We had a whole course on measurement systems that got to the heart of understanding that the source of your data, and its inevitable bias/error, is more important than just crunching the data as given. That was in a typical four-year degree.

Not really. MechE courses are really theoretical, and the labs are focused on just being enough to demo the theories. Most of my professors had never worked in industry, they had been in academia their entire lives. Even they wouldn't know how to bridge the gap.

In an ideal world, we'd have separate tracks for people entering industry versus academia/research, but that's a long way off.

That's insane. ME degrees that I know seem to be defined by industry (i.e. application of theory). Nobody pursues that degree to stay in academia/research. Anyway, you can always pursue an advanced degree if you want to stay in academia. Don't get it twisted though: STEM is not a vocation, as per your suggestion that "people entering industry" deserve a special path.

Edit: the comment I replied to has since been edited to show that the teacher understood what he was doing, and how he made it a teachable lesson and not a punitive one.

Not to miss the point, but I don't think "All my students failed" is the mark of a good teacher. It sounds like the teacher failed to prepare their students for the nature of the assignment. Perhaps he was as surprised as they were when they all failed, as I doubt failing most of his class was his intention.

You are being downvoted but you are exactly on point. If some fail they may be bad students, but if the majority of my students fail they're not bad students, it is me who is a bad teacher.

> Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start.

I think what he meant is they 'failed' to get it completed on time and it was meant as a teaching lesson.

That wasn't in the original comment.

Failing is a form of learning. Enabling students to fail (preferably in a safe way) is very valuable for learning.

Agreed, and the now edited comment illustrates how the teacher made it a safe lesson. That portion wasn't in the comment when I replied, and it sounded more like the teacher simply failed to prepare their students.

"Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start."

That wasn't in the original comment. It has been edited since I replied, which is fine. I do it all the time, sometimes you miss that someone replied during your editing.

Similarly, it's not good teaching practice to trivialize deadlines.

Yeah, plenty of time for workplaces to do that for you. I can count on one hand the number of times something has been a hard deadline. This teacher taught a valuable lesson usable for the rest of the students' careers. The "most students shouldn't fail" mentality has led professors I know personally to question the caliber of student they are receiving, and this is a top-30 program I'm referring to. More people should fail; maybe they'd start treating things seriously, and the problem of underqualified technical applicants would resolve itself.

I’m currently preparing a data visualization course to be taught this fall, and I would love to hear more about this! If you’d be willing to share some of those resources or the contact information for your professor, I’d really appreciate it. You can find contact info at the link in my profile :)

Not parent poster, but Thomas Powell is the Data Viz instructor at UCSD.

Do you still have the assignment?

>You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.

I don't get why people consider building a model to be the "fun" part. That's mostly feeding data in, watching a loading screen, and then observing the output.

That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be. Likewise, learning the business side and seeing what is possible that no one has considered is great fun too.

My favorite part is feature engineering. Pre-processing and cleaning is fun too, but morphing the data into formats that extract a diamond from coal is a lot of fun, and what data science is all about. Clicking go on some ML algo is just icing on the cake, seeing it reveal bits maybe even I overlooked in the data.

If you like ML why not be an MLE? That's what MLEs do, and they're a more desirable job. DS is all about the research, discovering and learning new information, and making the impossible possible.

The standard whatever.fit(X, y) isn't very appealing but there are much more bespoke models that require creative engagement with stats/CS knowledge, e.g. Bayesian hierarchical models or deep learning models that are more complicated than what can be copy/pasted from Medium.
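For instance, the core idea behind a Bayesian hierarchical model, partial pooling, fits in a few lines of plain Python. This is a toy sketch only: the group data are synthetic, and the within-group and between-group variances are assumed known here, whereas a real hierarchical model would estimate them.

```python
import random

# Partial pooling: shrink each group's mean toward the grand mean,
# with small groups shrunk hardest. Everything here is illustrative.
random.seed(0)
groups = {
    "big":   [random.gauss(10, 2) for _ in range(200)],
    "small": [random.gauss(14, 2) for _ in range(4)],
}
all_points = [x for xs in groups.values() for x in xs]
grand_mean = sum(all_points) / len(all_points)

sigma2 = 4.0  # within-group variance (assumed known)
tau2 = 1.0    # between-group variance (assumed, would be estimated)

for name, xs in groups.items():
    n = len(xs)
    sample_mean = sum(xs) / n
    # Normal-normal posterior mean: precision-weighted average of the
    # group sample mean and the grand mean.
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)
    shrunk = w * sample_mean + (1 - w) * grand_mean
    print(f"{name}: sample mean {sample_mean:.2f} -> pooled {shrunk:.2f}")
```

The interesting design decision (what to pool, and how much) is exactly the kind of creative stats engagement the parent is describing; `whatever.fit(X, y)` never asks it.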

I've done a lot of ensemble and stacked ensemble learning. I've also used BERT and a couple of other advanced ML techniques, but usually I resort to advanced feature engineering first if I can, so I get what you mean, but it's still not as fun to me as figuring out patterns in data.

It's sort of two-sided, I think. It can be fun to figure out _meaningful_ patterns in data. I don't really find it fun to figure out that "so and so didn't use software that understood NA values back in nineteen tickety two, so some NA values are NA because they're newer, and some NA values are 0 because 0 is just like NULL in somebody's head, and some NA values are -999 because that was a thing they did in the Before Times."
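That archaeology does eventually become code. Here's a sketch of untangling those three flavors of "missing"; the sentinel values are the ones from the comment, and the rule that a true zero is impossible for this field is an assumed piece of domain knowledge.

```python
# Legacy encodings of "missing", per the comment above:
#   None  -> genuinely missing (newer records)
#   -999  -> old explicit missing-value sentinel
#   0     -> "0 is just like NULL in somebody's head"
RAW = [3.2, 0, -999, None, 7.1]

def clean(value):
    """Collapse all legacy missing-value encodings onto None."""
    if value is None:
        return None
    if value == -999:
        return None
    if value == 0:  # assumes a true zero reading is impossible here
        return None
    return value

cleaned = [clean(v) for v in RAW]
print(sum(v is None for v in cleaned))  # → 3
```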

MLE is a fairly new title that, as best I can tell, exists primarily in those few places that have a mature enough workflow to have people who can actually dedicate their time to the ML part and have other roles take care of the rest.

Everywhere else, there is only DS, and it involves everything.

To answer your first question though, the training and testing of these models is fun because it feels like a puzzle game: did all my understanding and preparation of the data (and the business) pay off and the model does its job as expected? Is there something I’m missing? What’s the simplest model + configuration I can use that produces acceptable results and what does that say about the problem space? Can I combine models in some way to get the results? Is nothing working because it’s an ultimately fruitless exercise and our hypothesis is wrong? Or is there something we’re missing that is in turn the reason the model is missing something? Etc etc.

Then as the output you get something that ingests some data and then makes a decision with it! That’s cool to me.

I get where you're coming from. I guess just the problem domain I'm in, and my experience level, I tend to get what I expect from a model, and if I don't I'm more like, "wtf?" which isn't anywhere as fun of a way to do that part of the process.

Also, I know what is possible and impossible before I start writing code (if you don't count EDA code). There are exceptions, like it should be possible but it turns out the data is bad, but it didn't look bad from the EDA. Thankfully I've never had that. I always perform a Feasibility Assessment before anything else.

Not to imply what you're doing is somehow incorrect. Problems can vary quite a bit and I recognize that. For example, there have been times where I've had to mine to see if anything is there, doing ML over it to validate a hypothesis then using that information to create a new hypothesis, rinse and repeat. That's scary, because I could turn up nothing. I haven't done a lot of mining I admit though. Usually my problems are much more obvious from the get go, or much more research intensive.

One time I did three months of reading papers on arxiv.org just to figure out if something was feasible and how to best do it. Though that was definitely not a standard problem.

> That's not fun, that's boring. The fun part is looking at the data and gleaming all these potential patterns from it, seeing what potential is there and what could be

Exactly! This is the reason why I love my job. It gets even better when you uncover a non-intuitive insight.

Can you please elaborate on the feature engineering part a little bit?

I have been in the data analytics space for 15+ years. The one mantra I always try to focus on is: what's the business impact of what our team is creating?

This is a simple yet very powerful rule that helps us quickly discard ideas that:

1. Do not have a robust testing mechanism. No model is useful unless it performs in the real world. Measuring this is a severely non-trivial problem with multiple operational considerations.

For example, are you able to run and manage true control/test groups? How do you build a “reverse” data pipeline to verify your models? And, if you are required to update model weights constantly, where and how will you update the model parameters?

2. Conversely, some of the most impactful products I worked on were probably delivered in simple Excel sheets or had just under 20 lines in my Jupyter notebook. Not every business problem demands a deep learning network. For example, we worked on a data-driven capacity forecasting exercise for a call centre. I can tell you that the sophistication of the model was the last thing on my mind, as I had to work on careful interpretation and data collection.

3. Data science departments should sit closer to business than what appears to be the trend currently. At least business data science teams (apart from technical data teams focusing on product analytics to improve performance, etc.). Courses and academic programs, I think, have developed a bias towards tools and techniques without the underlying analytical interpretative techniques needed to work with data. For example, a new data scientist in my team delivered excellent code, but she couldn't detect logical misses in the data (e.g., losing some data during processing, or using columns with almost all data missing).

On the other end of this spectrum, we are in the lagging end of the hype bubble still so there are many top leaders who are expecting to plug in “data science” and realise Billions of dollars in savings, new sales etc.

There was a remark in the old-school linear algebra book we had in university (Edwards & Penney) that stuck with me, to the effect (probably I recall the details wrong) that one of the authors was once involved in data analysis of water samples collected from a bunch of rivers by 15 engineers, and it turned out no six of these engineers' measurements were mutually consistent. The moral of the story was that real world data is messy, and you need to learn least squares and related methods to make sense of the data.

Now with "data science" you've taken a step further, and instead of applying the math to lab reports on meticulously filled out forms, you're going to aggregate all the messy sources you can get your hands on. Of course your headaches will multiply.

>This. Universities and online challenges provide clean labeled data, and score on model performance.

First homework assignment in the stats class I teach is to clean data that the class generated with directions they all perceived as clear. It's just about the most hated assignment I have ever given. Amazing how many ways there are to encode the gender of an experimental participant.

Male, M, m, male, Man, ...

gender.lower().startswith('m')... done! :)

Except a real dataset will have its fair share of "nale", "amle", etc.
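For what it's worth, even most of the typos can be caught with stdlib fuzzy matching. A sketch; the canonical labels, prefix rules, and cutoff are all arbitrary choices for illustration:

```python
import difflib

# Canonical labels we want to map the free-text entries onto.
CANONICAL = ["male", "female"]

def normalize_gender(raw, cutoff=0.6):
    """Map a messy free-text gender entry to a canonical label.

    Obvious prefixes ('M', 'man', 'male') are handled first; typos like
    'nale' or 'amle' fall through to fuzzy matching. Returns None when
    nothing matches well enough, so bad entries surface instead of
    being silently mislabeled.
    """
    s = raw.strip().lower()
    if not s:
        return None
    if s in ("m", "man") or s.startswith("male"):
        return "male"
    if s in ("f", "w", "woman") or s.startswith("fem"):
        return "female"
    match = difflib.get_close_matches(s, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else None

print([normalize_gender(x) for x in ["Male", "M", "nale", "amle", "Woman", "???"]])
# → ['male', 'male', 'male', 'male', 'female', None]
```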

I would pay a student who figured that out $20.

This rings true to me. I've seen a lot of models get built that are never used. Although in my experience it wasn't that data scientists didn't care about business value, it's just that data science often requires breaking down silos and asking other teams to change their behavior.

This article mentions that leadership often doesn't support data science, but I think it actually doesn't go far enough. Leadership doesn't just have to support the data scientists, it has to actually tell other teams to prioritize data science projects over what they are currently doing. Since these data science projects are riskier than standard projects, it makes sense that leadership doesn't often do this (and focusing on the standard projects could be the right call). However, it also means that it's very hard for data scientists to create business value.

As a research-oriented data scientist at one of the larger tech companies, I can confirm that even here, a lot of people are unsure about what exactly data scientists are supposed to do. My most frequent request is "tell us why metric X dropped", to which the answer is often a subtle combination of many different factors (often random fluctuation) that doesn't lead to a pleasing actionable result in the sense of "here's why it dropped; go do this to fix it".

The really interesting research type work (Bayesian modeling, convolutional neural networks, etc.) takes a long time to implement and may produce no useful results, which is a really bad outcome at a company that measures performance in six month units of work and highly values scheduled deliverables and concrete impact. Many of the data scientists I work with tend to stick to methods that are actually quite simple (e.g., logistic regression, ARIMA) because these at least deliver something quickly, despite the fact that many of my coworkers come from research-heavy backgrounds.

In my org, there's nothing stopping anyone from pursuing advanced machine learning; for the most part we set our own agenda (in fact, determining priorities is part of the job role). And some people do in fact go after state-of-the-art ML, with some really cool results to show for it. But in terms of career progression and job safety, the risk is just way too high, at least for me personally. I save the highly mathematical stuff for a hobby.

Edit: while this may sound a bit negative, I will add that my description of data science isn't a complaint per se; I am mainly trying to inform those who are seeking a career in data science of what to expect compared to what is often promised. The work that is most valuable to a business is not exciting all of the time, but I don't think there is another job in the tech industry that I would find more enjoyable than my current one at the moment.

>But in terms of career progression and job safety, the risk is just way too high, at least for me personally. I save the highly mathematical stuff for a hobby.

I think the sad truth is that this is the reality of work no matter if you are a Data Scientist or not. What you thought you would be doing to show your worth and climb the ladder gets blurred in with KPIs you didn't set, politics you didn't create, goals and deadlines you had no input into, etc. One of the unique challenges you can face as a Data Scientist is that you may interface with people in many different groups, all of which have different goals which may be in conflict with each other. Compare this to other roles where you ultimately only follow the goals of the organization you report into.

Sounds more like it simply doesn't work very well, rather than any of the reasons you listed.

It's often the case, I remember when that stupid Amazon infographic was going around about decreased load times meaning big upswings in conversions.

A client paid for a significant project to reduce load times, which we succeeded in to a huge degree, with most of the pages going from 1.5-3 seconds down to 250-500 ms. Absolutely no meaningful swing in conversions at all. I've done this a few times since, but never seen conversion move at all when I've done performance improvements.

Nada, zilch. I honestly think it's absolute bullshit. I've always suspected since that it was someone massaging figures in Amazon to justify their job.

We had this effect on one of our gaming websites, but in reverse: we accidentally added around 900ms to every page load. Gameplays dropped by around 15%. We removed what was causing this and they instantly went back up.

People played it mostly during breaks: lunch breaks (our peak load was during lunch hours in the US), "smoke breaks", etc. So they didn't have a goal, they just had time to spend doing something. Each gameplay took anywhere from 1-5 minutes. Users averaged 5 plays per day. Our guess was the extra load time pushed people past exactly the wrong threshold, where they were able to play one less game during their time allotment.

Edit: we were curious and A/B tested it and saw the effect too. We didn't run it for too long, but a 15% difference is quick to verify when you're measuring something that happens 35 million times per day.
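The "quick to verify" intuition checks out on the back of an envelope, treating daily gameplay counts as roughly Poisson (the 50/50 split is an assumption about how the A/B test was run):

```python
import math

# With Poisson-ish daily counts, how big is the relative standard
# error after a single day of A/B traffic split 50/50?
events_per_day = 35_000_000
n_per_arm = events_per_day / 2

# For a Poisson count N, sd = sqrt(N), so the relative sd of the
# difference between two arms is roughly sqrt(2 / N).
rel_se = math.sqrt(2 / n_per_arm)
print(f"relative standard error ≈ {rel_se:.6f}")
# ~0.03%, so a 15% swing is hundreds of standard errors away from noise
```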

Perfect example of why understanding your domain is so critical to analytics. There are some key assumptions that need to be made before anything is explored. Love it.

> I've always suspected since that it was someone massaging figures in Amazon to justify their job.

Well, the first rule should be looking skeptically at anyone whose "analysis" involves something their core business provides/sells. Facebook and Google have been pushing data-driven narratives about how effective their advertising is, and yet as a data scientist working at a large Fortune 500 company, we were never able to show meaningful impact anywhere close to what was claimed. This was met with pushback, as before my team was created the company relied on external analytics vendors who always came back with results that were magically what everyone was expecting/hoping for. But when my team tried to recreate what they had done, they would withhold information, claiming "trade secrets", or what they did provide was riddled with egregious errors.

I actually think that is the biggest argument for why every company should have some kind of data science team. There are certainly important predictive models and analytics to be built, but the most consistent ROI would be to keep the company grounded and not dropping huge sums of money on the trendiest snake-oil analytics/AI solutions being hawked by vendors.

>...we never were able to show meaningful impact anywhere close to what was claimed. This was met with pushback, as before my team was created the company relied on external analytics vendors who always came back with results that were magically what everyone was expecting/hoping for...

This was why I left my last job managing a data science team at a large company. It's nearly impossible to compete with a slide deck from an external vendor that shows exactly what people want to see. Especially when decision-makers and check-signers move on to different jobs in 2 years, so there is nobody to answer for why it was done in the first place. Arguing against those vendors brings out the worst in the interested parties, and you become the bad guy.

Load times might not affect conversion linearly. People deal with 3-second loads until one day a competitor does 0.3-second loads and gives a better experience; then in a matter of months you lose your customer base.

I think this is where causal inference and experimental design are important.

> People deal with 3 second loads until one day a competitor does .3 second loads and gives a better experience, then in a matter of months you lose your customer base.

I tend to suspect that the effect of pricing will make a difference of 2.7 seconds in load time negligible. A 3 second load just isn't a large cost, even if you run into it repeatedly.

>Sounds more like it simply doesn't work very well, rather than any of the reasons you listed.

The use case you've described has a defined problem and a measurable metric. Problem: we think load times influence conversions. Metric: measure load times and see if they are correlated with conversions. Maybe in your case somebody decided to skip the research part and just pay to reduce load times.

Imagine a totally different scenario. You work for an established (30+ years old) company that sells consumer goods. Executives approve a $25 million budget to "improve the customer experience" over the next 3 years.

This directive goes to all the various organizations: Sales and Marketing, Product Development, Technology, Customer and Market Research, Customer Support, etc. The various orgs have 3 months to come back to executive management to justify how much budget they need and their execution strategy. Each org thinks they are the key mover in improving customer experiences and wants as much of that budget as possible. Every org works at a different speed and with different philosophies (e.g. all work is done in-house versus some or a lot of work done by external agencies).

Let's add some more reality into this. Even if the CXO of an org thinks they don't need to be in this process, it looks bad if they don't say they have a strategy and need budget. There's also a significant chance that 9 months into this project somebody will get restless and the whole initiative will get restructured with different timelines and goals.

I could go on, but armies of analysts and data scientists will get pulled into this to drive "data-driven decision-making." A lot of the expectation will be that the "smart people" will show that each group's particular bias is the most important one and needs the budget.

It's hardly an environment for anybody to do rigorous analysis or for anybody in an Analytical role to shine. Think the scenario sounds insane and made up? It's not. Welcome to big-co.

This is very accurate. I've found that the simplest model with good enough results is often the best in the business world. On the one hand, that means I spend less time pushing the boundaries of what we're capable of doing as an organization. On the other hand, most business questions don't need massively complex answers so a quick regression may suffice.
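As an illustration of how little machinery "good enough" sometimes needs, an ordinary least-squares line fit has a closed form and doesn't even require an ML stack. The spend/sign-up numbers below are made up:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# e.g., weekly marketing spend (in $k) vs. sign-ups, made-up numbers
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
signups = [12, 15, 21, 24, 30]
slope, intercept = fit_line(spend, signups)
print(round(slope, 2), round(intercept, 2))  # → 4.5 6.9
```

Often an answer of the form "each extra $1k of spend buys about 4.5 sign-ups" is exactly what the business question needed.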

> The work that is most valuable to a business is not exciting all of the time

This probably describes just about every job in a for-profit business.

If jobs were exciting, they wouldn't have to pay you to do it.

This article is pretty spot on. As someone who has worked in data science/analytics for over 6 years, I have found that the field is filled with hype, managers who are not sure what data science actually is, and an absurdly wide range of skills that jobs expect you to do well.

Applying for and interviewing for data science jobs is a total nightmare. You are competing against 100s or even 1000s of applicants for every job posting because someone said it was one of the sexiest careers of the 21st century. Further exacerbating this, everyone believes that data is the new oil, and that large profit multipliers are just waiting to be discovered in this virgin data that companies are sitting on. All that is missing is someone to run some neural network or deep learning algo on it to discover the insights that nobody else can see.

The reality is that there is an army of people who know how to run these algos. MOOCs, blogs, YouTube, etc. have been teaching everyone how to use these Python/R packages for years. The lucky few who get that coveted data science job can't wait to apply these libraries to the virgin data, only to find that they have to do all kinds of data manipulation to make the algos even work, which takes days and weeks of mundane effort. Finally, they find out the data is so lacking that their deep learning model does very little in providing actual business value. It is overly complicated, computationally expensive, and in the back of their minds they know they could get the same results using some simple logic.

Managers who don't understand data science fundamentals learn from the news and have their data scientists implement those buzzwords so they can look good in front of their bosses.

I think there is a place for data scientists who understand the fundamentals of the models out there, and know when you should not use them. Data science is also increasingly a subset of software engineering, and a good data scientist in a tech company should be able to code well. I also think that there is not some huge unmet demand for data scientists. Just a huge amount of hype, and managers wanting to look good by saying they managed a data science team.

Any work is dull and depressing when done under the supervision of idiots. Some companies, although probably fewer than claimed, are genuinely data-driven rather than HiPPO-driven, though. This might be the most important thing to look for if you want to do interesting work in the fields of data science.

Data science is correctly valued when you realize how relatively unimportant it is. It is a small cog in a larger machinery (or at least it ought to be).

You see, decision-making involves (1) getting data, (2) summarizing and predicting, and (3) taking action. Continuous decision-making -- the kind that leads to impact -- involves doing this repeatedly in a principled fashion, which means creating a system around the decision process.

For systems thinkers, this is analogous to a feedback control loop which includes sensor measurements + filters, controllers and actuators.

(1) involves programmers/data engineers who have to create/manage/monitor data pipelines (that often break). This is the sensor + filters part, which is ~40% of the system.

(2) involves data scientists creating a model that guides the decision-making process. This is the model of the controller (not even the controller itself!), which is ~20% of the system. Having the right model is great, but as most control engineers will tell you, even having the wrong model is not as terrible as most people think because the feedback loop is self-correcting. A good-enough model is all you need.

(3) involves business/front-line people who actually implement decisions in real-life. This is where impact is delivered. ~40% of the system. This is the controller + actuator part, which makes the decisions and carries them out.

Most data scientists think their value is in creating the most accurate model possible in Jupyter. This is nice, but in real-life not really that critical because the feedback-loop inherently moderates the error when deployed in a complex, stochastic environment. The right level of optimization would be to optimize the entire decision-making control feedback loop instead of just the small part that is "data science".
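The "wrong model is not as terrible as people think" point can be simulated in a few lines: an integral controller whose assumed plant gain is off by 2x still converges to the setpoint, just more slowly. The plant, gains, and setpoint here are made up for illustration:

```python
def run_loop(true_gain, assumed_gain, setpoint=100.0, steps=200):
    """Integral control of a trivial static plant: state = gain * action."""
    state, integral = 0.0, 0.0
    for _ in range(steps):
        error = setpoint - state
        integral += error
        # The controller computes its action using its (possibly wrong)
        # model of the plant gain.
        action = 0.1 * integral / assumed_gain
        state = true_gain * action  # the plant responds with its true gain
    return state

print(round(run_loop(true_gain=2.0, assumed_gain=2.0), 2))  # correct model
print(round(run_loop(true_gain=2.0, assumed_gain=4.0), 2))  # gain wrong by 2x
```

Both runs end up essentially at the setpoint; the feedback on the error, not the accuracy of the model, does most of the work.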

p.s. data scientists who have particularly low-impact are those who focus on producing once-off reports (like consultant reports). Reports are rarely read, and often forgotten. Real impact comes from continuous decision-making and implementing actions with feedback.

Source: practicing data scientist

Had to make an account to upvote this. Absolutely dead-on. I think you can generalize this comment to almost any specialist skill. "No Silver Bullet" should be a business doctrine as well as a technical one. You need to do a lot of things well to succeed in business. Specialists just provide you a capability. You have to implement and use those capabilities as part of a larger system if you want to create a machine that generates profit.

I should add that programmers have the crucial albeit boring role of creating CRUD front ends (forms) for data input.

That is akin to a sensor input, and one that is surprisingly important. Without a good CRUD form, data either doesn’t get entered at all, or is entered in crude, unvalidated ways, like loose Excel files with formatting that is all over the place.

> I attended a 12-week data science bootcamp in mid-2016. ...

Yeah, well there's your problem, my dude. I've been doing what might be described as "data science" since I quit physics in 2004. Aka before the term existed. It's a great area to work in for intelligent people who want to use their brains to impact the real world; vastly better than what people get paid to do in physics. If customers don't know what the tools can do, it's because you as the data scientist have failed to explain it to the customer. If your work product isn't in front of the decision makers, you've also failed: they can tell the bottom line impact and will reward you accordingly. Sometimes there is no data in their data; they should know that up front.

As for whining about poor data quality: n00b. What do you think they're paying you for? Nobody gives a shit what people do in Kaggle competitions.

I don't think the OP would care much for your delivery, but you make some great points.

> If your work product isn't in front of the decision makers, you've also failed: they can tell the bottom line impact and will reward you accordingly.

This one in particular stood out. There is an aspect of salesmanship (or navigating corporate hierarchies) to the role. Things will not be obvious to the decision makers. Perhaps the data scientist has to take some responsibility in bringing their work to the fore.

I stood up a data science operation at my company over the last few years, and have noticed a key difference in data-science projects that have been successful and those that have failed. It hits on a number of points brought up in the article, namely where does data science "fit" in an organization delivering software and how is the value realized by the business.

The worst cases I have seen is when executives take a problem and ask data scientists to "do some of that data science" on the problem, looking for trends, patterns, automating workflows, making recommendations, etc. This is high-level, pie-in-the-sky stuff that works well in pitch meetings and client meetings, but when it comes down to brass tacks it leaves very little vision of what is to be achieved, and even less of a viable execution path.

More successful deployments have had a few items in common:

1. A reasonably solid understanding of what the data could and couldn't do. What can we actually expect our data to achieve? What does it do well? What does it do poorly? Will we need to add other data sets? Propagate new data? How will we get or generate that data?

2. The business case or user problem was understood up front. In our most successful project, we saw users continuously miscategorize items on input and built a model to make recommendations. It greatly improved the efficacy of our ingested user data.

3. Break it into small chunks and wins. Promising a mega-model that will do all the things is never a good way to deliver aspirational data goals. Little model wins were celebrated regularly and we found homes and utility for those wins in our codebase along the way.

4. Make it accessible to other members of the company. We always ensure our models have an API that can be accessed by any other services in our ecosystem, so other feature teams can tap into data science work. There's a big difference between "I can run this model on my computer, let me output the results" and "this model can be called anywhere at any time."

While not exhaustive, I think a few solid fundamentals like the above align data science capabilities to business objectives and let the organization get "smarter" over time about what is and isn't possible.
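Point 4 doesn't have to mean heavy infrastructure; even a thin HTTP wrapper gets a model out of one person's notebook. A stdlib-only sketch (the scorer, field names, and port are all invented for illustration):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in for a real model: a hand-written linear scorer.
    return 0.5 * features["x1"] + 0.25 * features["x2"]

class ModelHandler(BaseHTTPRequestHandler):
    """POST a JSON feature dict to any path, get a JSON score back."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        score = predict(json.loads(body))
        payload = json.dumps({"score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve for real (this call blocks forever):
# HTTPServer(("", 8000), ModelHandler).serve_forever()
```

In practice you'd reach for a proper framework, but the point stands: once the model is behind an endpoint, any other service can call it.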

As a person doing data science / ML for the last 4 years, I mostly agree with your points, especially about the hype-driven demand for DS/ML. One thing that is often neglected though is the exploration part of it. There really is a lot of data out there that your company doesn't know anything about, but can probably benefit from knowing. E.g. even a simple crawl of a popular jobs/ads/... site done diligently for e.g. 6 months can reveal many interesting insights about market structure and trends. Google and its mission to organize all the data in the world exist for a reason.

This however is in stark contrast with the approach that most executives take. Instead of managing it as a well-thought-out strategic/long-term investment, they want to time-box it, to get immediate value and to show off to senior management or customers. I've seen this tendency in both big corporations (mid-level management) and startups, which makes me think that the confounding variable is the fund/incentive management process. In both big corps and startups there is limited time & budget to show meaningful results, and people optimize for that, which often involves taking shortcuts, neglecting strategy and outright lying.

In contrast to that, I've seen projects driven by wealthy individuals who don't look for immediate value but are scratching an itch (e.g. curiosity). These usually fare better than the former as long as budgets don't get out of hand (and exhaust the cash cow). I would argue that these are the most successful, because of better alignment between motivation (the person paying the bill) and execution (the person driving the process).

A math friend of mine often consulted for scientists. His least favorite were those who asked him to "make some clusters". (think k-means) "What are you looking for? What is your hypothesis?" "Just make some clusters and we'll see."

Not utterly without merit, but fairly blind fishing nonetheless.

>The worst cases I have seen is when executives take a problem and ask data scientists to "do some of that data science" on the problem...high-level pie in the sky stuff that works well in pitch meetings and client meetings...

I've been in various external- and internal-facing Data Science roles for 8+ years and this is spot on. IME it's the #1 reason Data Science projects "fail." If you can replace "do some of that data science" with "do some of that black magic", that probably means nobody actually checked to make sure the data and problem made sense in the first place. But somebody somewhere already committed to it, so the Data Science team has to deliver it.

> The worst cases I have seen is when executives take a problem and ask data scientists to "do some of that data science" on the problem, looking for trends, patterns, automating workflows, making recommendations, etc.

While I agree on the point, there's a case that's arguably worse: When those executives hire Data Scientists and then ask them: "So what can we do with Data Science?"

Teams being small, data being crummy, infra being hard, and yet expectations being high aren't so much complaints as they are the job description.

The point of data scientists and the related roles listed in the article is not just to churn out the fun stuff, but to wade through the institutional and technical muck and mire it takes to bring the fun stuff to bear on a relevant business problem, and to communicate the results in a way that people of all walks can understand.

Yeah this guy seems to think Data Science work should be like doing a problem set for CS class. Sorry that you have to deal with messy data, fragile infra, and limited resources - I know it's not "fun", but frankly that's what the money is for.

That's the whole point of the article. Expectation (in this case, his own coming out of the bootcamp) vs. reality (what data science is actually like).

The author wants to be an MLE but doesn't know it.

As somebody in an ML Engineering role, i.e. somebody who could be asked to either fix the logging infrastructure or build some models, I would have agreed with this.

But even in this day and age with ML being the new hotness, you will find people who are quite happy to work on infrastructure and don't have a huge amount of interest in training models themselves, and it is probably a lot easier to hire them than people who can do both, and you may get better results from actual specialists.

I wrestle with this too, there's a lot of context to determine what skillset is better.

I suspect, if there are lots of relatively simple ML problems, then a generalist with integration chops will be more effective in getting them out quickly and "good enough". The specialist may take too long on models that are too heavy and impractical.

If there's one big ML problem (Google search, Netflix recommender, Amazon search, etc), where 1% additional makes a difference, then yes, specialist DS/modeler is probably preferred.

Larger, older org/heavier existing infra/more specialized culture will also tilt the scale towards specialists.

It's obviously a spectrum, but I feel like any org that is considering hiring a data scientist probably needs a data engineering team to begin with, since you can do a lot of the analysis people want by just counting.

I also think it's unfair to specialists to say they will always overcomplicate things more than others, I've seen plenty of generalists with researcher envy do the same thing.

I'm generally confused by the hype around ML and 'data science'. It seems like CS has somehow regressed to the behaviourism era of psychology, or economics before the Lucas critique.

The problem with all this data talk isn't just about implementation or bad structure, the limitations of putting all your bets on inductive reasoning are systemic.

The insight that economists had in the 70s and 80s was that reasoning from aggregated quantities is extremely limited. Without understanding at a structural level the generators of your data, trying to create policy based on outputs is like trying to reason about the inhabitants of a city by looking at light pollution from the sky.

My guess why data science so rarely delivers what it promises is because you can't get any value from historical data if your circumstances change to the point where past data is irrelevant. Which in the world of business happens pretty quickly. To have a competitive advantage, one needs to figure out what has not been seen yet.

And trying to exploit signals suffers from the issue laid out above. There was a funny case of an AI hiring startup trying to predict good applicants, and the result was people putting "Oxford" in their application in a font matching the background color.

There’s also the issue of data scientists just not having a seat at the table. Anyone can validate their point by using data to support their answer just like anyone can validate their opinion by doing a google search.

In my mind I see more data scientists being ignored or turned into “yes men”(https://www.interviewquery.com/blog-do-they-want-a-data-scie...)

I only see ML and data science as having real value when considered as a single component of a larger system, most of which will not consist of anything close to ML. Many real world environments are too entropic to see much accuracy from ML models except in very, very limited bands (facial recognition, for example).

As other commenters here have posted, without the integration of data science into both the business needs and the rest of the existing tech stack it will remain a fun school course activity.

> CS has somehow regressed to the behaviourism era of psychology or economics before the Lucas critique

Can you please elaborate on this?

See: https://en.m.wikipedia.org/wiki/Lucas_critique

At a high level, it argued that basing predictions on historical data is problematic. The details of the argument are somewhat specific to economics, but the principle is more general. That's also why people recommending stocks say "past performance is no guarantee of future results."

One of the key issues is that circumstances change, and information about such changes will often be external to a data set.

In the Lucas critique, policy changes are an example of this. You can't predict future economic performance based on past economic performance if relevant policies have changed. But any complex situation has such factors that are external to the data that one can easily collect about it.

In psychology there was a time period from ca. 1900 to the mid-century where behaviourism rose in prominence, which was the paradigm, simplified, that internal processes of the mind are not really interesting, and what matters is rather only the relationship between input and output, treating the mind as a black box of sorts (roughly analogous to ML models).

This came under heavy attack during what is called the cognitive revolution, which put focus on understanding mental processes at a structural level (for the reasons outlined in the post above).

Economics went through a similar process. Up until the 70s Keynesianism was very dominant, which mostly focusses on using aggregate economic quantified data, i.e output, unemployment, capital and so on to make policy suggestions. This began to be attacked and supplemented with what's called 'micro-foundations', which aimed to not just look at quantified data, but to model, from the individual up, not just top-down, fundamental behaviour and interaction, i.e the actual entities that generate the aggregate data.

There was also a similar movement to this in linguistics starting (mostly) with Chomsky at about the same time applying the same criticism to how we model language.

As a scientist, I've worked with data for decades. There's always been a prevailing belief that scientists and engineers with specialized domain knowledge are mostly fumbling in the dark and can be replaced with a general purpose technique.

This was certainly the vibe that I got from "design of experiments" when it was the statistical method du jour. Then from "Bayesian everything" and now "data science." I remember "design of experiments" studies being conducted with great fanfare and success theater, while producing zero results.

The long term theme is that science is hard for reasons that managers don't understand, can't manage, and are reluctant to reward.

I've seen a few similar articles now. Does this represent the general view of folks working in data science? "Data Science" is such a meaningless catch-all term. The reality is in many organizations it's simply advanced business intelligence or advanced business analytics. There are some industries that lend themselves well to this whole practice, and they tend to be industries born out of the internet age (e.g. social media, internet advertising, etc.)

Some other industries have been doing "data science" for ages. Credit Risk Modelling, insurance and so on.

Every time I read one of these articles, I feel it's just an individual who entered a kind of crummy situation and they're learning what it means to work in a corporate environment. Some are better than others. Some are more motivated than others. Some have better cultures than others. Some are more willing to make technology a key part of their business strategy. Some are more data driven than others.

My recommendation is to always ask the fundamental question before joining: what are you trying to achieve with data science, and is it actually achievable?

I always thought the non-specificity of the term Data Science was a strange criticism for those in the tech industry to make. How many types of SWE are there? Front-end, back-end, full-stack, devops, security, QA...

I agree wholeheartedly with your recommendation. Like any other job, each company has different needs and expectations and if you want something else out of the role you'd best avoid that company.

Frankly I have the same criticism of those who use the term software engineer. Engineering is a pretty established profession with a set of standards, ethics and practices. Most of us who work in software are not engineers. We are developers. Similarly, a scientist is one who follows the scientific method to do research. So by that logic a data scientist should be a person who uses the scientific method to do research on data. Does that make any sense? And let's be serious, is that what most data scientists are being hired to do?

I'd be best described as an ML Researcher/Engineer and I'm not in the private sector so take my opinion with a grain of salt, but my understanding is that many DS roles require application of the scientific method.

A lot of DS can be boiled down to some sort of statistical testing or inference (A/B test email marketing for example) or applied ML (classification, regression). I'd argue that's science (if done right).

Data analysis, the plot-a-few-charts-and-put-it-in-a-slide-deck kind? Totally agree with you. Definitely not science.
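The statistical-testing side mentioned above can be as small as a two-proportion z-test on, say, email click-through rates (a stdlib sketch; the conversion numbers below are invented):

```python
from math import sqrt, erf

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error,
    the textbook test behind a simple A/B email experiment."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant A: 150 clicks of 1000 sends; variant B: 100 of 1000.
z, p = two_proportion_ztest(150, 1000, 100, 1000)
```

A real deployment adds the unglamorous parts the thread keeps mentioning: randomization checks, sample-size planning, and clean event logging.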

>So by that logic a data scientist should be a person who uses the scientific method to do research on data. Does that make any sense? And let's be serious, is that what most data scientists are being hired to do?

I certainly do, but I've been doing data science for the better part of a decade. It seems starting around 2015, when Data Science became a sexy title, a lot of the fresh blood has been overgrown software engineers wanting the title, not knowing what they're getting into, or having faulty expectations. I don't consider this class of "data scientist" a data scientist, which is why the community has started shifting its job title away from Data Science to Research Scientist to better differentiate.

The good side of this is I'll come into a company with them expecting me to be like an overblown software engineer, and it gives me the opportunity to show off and go above and beyond what companies expect, allowing me to come off like a superhero. Though it's definitely an uphill battle, and knowing how to work with upper management is an absolute necessity.

> the scientific method to do research on data

Exploratory data analysis is often overlooked and underrated.

Ppphhh, we don’t need to do exploratory data analysis or prepare the data, don’t you know that neural networks will do all that themselves!

Doesn’t yield the right results? Clearly not enough data.

Still doesn’t work? Change to whatever the latest model google or fb is using and try again.


And the model will train itself right? That means that you'll have all that empty time to do more data science!


But why shouldn't software developers also be engineers? Surely the difference isn't just a professional organization and accreditation.

Liability. Professional engineers are liable for the work they produce and approve. They get a fancy stamp and liability insurance and can be sued when things go wrong. That's why engineers tend to be those who work on things that can kill you. IMO if you're working on an airplane's software for example you should probably either be an engineer or supervised by one. This matters because engineering provides you guidelines you must follow and ethics you must uphold, and if you aren't following the rules your governing professional body prescribes they can strip you of your license to prevent you from being a danger to the public. There are many other critical pieces of software btw. I just mentioned airplanes because it's one of the most obvious ones.

So yes, there definitely exist software engineers who require licensing by their state or country. Most of us just aren't actual software engineers, that's my point.

A decent number of data scientists work on A/B tests, which is science on production data.

Actually I'd say that's as common as the failed ML projects

100% agree with the article. The top misconceptions are spot on. I’m at a big place where data science hype among leadership couldn’t be bigger.

If I may ask, what were you told when you interviewed that convinced you to join?

Reasons for joining: 70% just needed a job, 20% location (silicon valley - wanted to be in tech ecosystem), 10% was combo of: I was told it's a small, entrepreneurial team with undefined remit (so opportunities to forge own path - turned out to be true) and it was impressive in many non-technical ways (company mission, campus, resources, etc)

And no reasons relating to technical or data science know-how on the part of the team/company ;) I already knew coming in that the industry is technologically backwards (big healthcare co)

As a data dude in public/nonprofit healthcare-landia I agree with all this, plus:

- It's essential to have/develop domain expertise in your industry.

- Beware plausible, but incorrect (or poorly interpreted) data that supports yours (or others') assumptions/biases.

- Add on to #4 - at least as bad as this is having well-intentioned people on your team who "know enough" (a bit of SQL or a low/no-code data tool) to be dangerous. Um, why are you joining unnecessary tables, or using a different alias for the same columns/tables in different queries, with no comments or standard formatting?

- Hold your nose, but anything you do in SQL/R/Python/even fancier programming tools is going to pass through MS Excel at least once sooner or later, which can irreversibly bastardize CSVs (even just opening without saving!), truncate precision to 15 digits, change data types, etc.

- So glad for the callout in #7 - there are clearly devs/data folks out there who are happy to take on an "interesting programming project at a great paying job" - that isn't serving the best interests of humanity.
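The Excel precision point above is easy to demonstrate: anything that round-trips a long identifier through a float (roughly what Excel does on open) silently corrupts it, which is why keeping fields as strings matters. A stdlib sketch with made-up field names:

```python
import csv
import io

raw = "account_id,balance\n12345678901234567890,3.10\n"

# The stdlib csv module keeps every field as a string by default:
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["account_id"])
# 12345678901234567890  -- survives intact as a string

# Coercing to float loses digits: a double only holds ~15-16
# significant decimal digits, so the tail is silently rewritten.
print(int(float(rows[0]["account_id"])))
# 12345678901234567168  -- last digits mangled
```

The same mechanism is why trailing zeros ("3.10") and leading zeros in ID columns disappear after a casual open-and-save.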

This rings very true to me. I'm working on moving over to an SWE role in the next few years for many of these reasons.

I'll just add one: the business absolutely doesn't care how you get your answers, only whether they're reliable enough (hand-grenade close is better than most companies have today).

While this seems obvious enough to anyone with a few years under their belt, for the new DS grad who has their time series analysis canned in favor of slapping a simple moving average in place and shipping it, it can be rather disillusioning.

Usually, but I've seen the opposite too.

Sometimes a young startup wants to advertise to the board, and they want you to make a presentation. I've made the mistake of showing near 100% accuracy solving a difficult problem important to the business, and expecting a strong positive result.

Instead I got "But are we using deep neural networks?"-type comments.

Sometimes a company just wants to market, be it to customers, or to the board. It's important to know your audience.

> the business absolutely doesn't care how you get your answer, only if they're reliable enough (hand grenade close is better than most companies have today).

One of the challenges with this is that "reliable" can mean a lot of things when the goalposts of success are constantly moving in large projects with many stakeholders, all of which are clawing for attention. I've seen politics derail so many Data Science projects and destroy the morale of Data Scientists.

It's only natural that a lot of people will realize that a moving average that confirms what people wanted to see anyway will lead to more success (whatever that means).

Nothing reliably consistently beats ARIMA models in time series forecasting to this day

That's pretty sad when you think about it, but it's painfully true.

> Nothing reliably consistently beats ARIMA models in time series forecasting to this day

Not sure this is true in practice. In some situations, Holt-Winters (ie. algorithms in the ETS family) may do better, and it's often a good idea to try both.

There's a claim that Holt-Winters is a special case of ARIMA (the claim is ARIMA is more general), but this is actually not the case. There is equivalence in only a subset of cases. [1]

I've fitted Holt-Winters models that beat ARIMA models. ARIMA models can have trouble generalizing from training data with long horizons because they tend to overfit to the distant past. Holt-Winters on the other hand has a natural "forgetting factor" built-in which moderates this.

As well, my experience is that stacked models with well-chosen exogenous variables (if you have causal variables) tend to outperform pure time-series methods because they are anchored on more independent variables than just t. Pure time-series models bank on the assumption that patterns have a repeatable time-dependence, and most of the time this is just not true, so they have to be augmented with other variables.

[1] https://otexts.com/fpp2/arima-ets.html
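That "forgetting factor" is easy to see in the simplest member of the ETS family, simple exponential smoothing (a minimal sketch; the alpha value is arbitrary):

```python
def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: the weight on an observation
    k steps in the past is alpha * (1 - alpha)**k, so the distant
    past is geometrically 'forgotten' rather than fit exactly."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level  # flat one-step-ahead forecast
```

Holt-Winters adds trend and seasonal components on top of this same recursion, each with its own smoothing parameter, which is where it starts to diverge from the ARIMA family.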

I’ve been doing an MS in Data Science very slowly due to work and 2 new kids. Finishing the degree this year in year 4. I was very excited about the prospect of doing something different. A few things have changed for me.

1) I am hearing about Data Science Teams being furloughed during these times. That isn’t happening in my function (Corporate Finance). I am glad to be secure even though I enjoy much of the data sci work.

2) I’m able to apply Data Science concepts in my current role, and it’s adding a lot of job security and providing me with exposure. I am much less interested now in moving to straight Data Science and instead am applying my learnings in my current role as a sort of in-house Data Science guy. But I have a lot to learn to be honest.

3) There seem to be a lot of “thought leaders” acting like they are big experts in the area who really don’t know anything many of us amateur scientists don’t know. They pull perfectly clean datasets and show magic transformations they just copy from others to get YouTube hits or Twitter followers. That just never happens in real life, and many leaders are seeing this and losing interest in this function given the returns they are getting from data science folks alone.

This isn’t unique to data science. I personally know people in finance that are poor coders and even worse quants, yet they go around lecturing at universities.

I work as a data scientist. Some of the author's points are workplace-specific: lack of leadership, being the only data person, ethical concerns. The others are just aspects of the job - communicating about your job and impact, dealing with vague specs or managing low-quality datasets.

Neither of those quite matches the article's title; perhaps it just refers to the author's personal expectations. Neither of them seems that specific to data science, or without parallels in other software jobs. And neither of the points reads like a slight against data science to me, as some of the other commenters here suggest.

One issue might be that organizations subconsciously resist the data scientist, or more generally, the nerd in his/her attempt to take over decisions. If these decisions are invariably tied to the goals and careers of managers, how can the data scientist have a "seat at the table" in all but the most enlightened and technical companies? The disorganized state of data and infrastructure suits the ambitious manager well, who can just put in enough effort to find data to have their project greenlit or to answer one specific question.

Progress may only come slowly, ideally through products bought from 3rd parties whose results are understood and controlled by management.

I did "data science" for about a decade, consulting with plaintiffs firms and state AGs on antitrust and fraud cases. For each case, the work flow was roughly this:

-- write discovery requests

-- review production, and check out data and documentation

-- write supplementary discovery requests

-- review production, and check out data and documentation

[repeat as needed]

-- analyze data, and write deposition questions

-- help attorneys wring answers from deponents

[repeat as needed]

-- analyze data, and produce required output

-- write parts of briefs and expert reports

I generally did that in consultation with testimonial experts and their data analysts. Sometimes that didn't happen until we'd documented the case enough to know that it was worth it. And occasionally small cases settled with just me as the "expert".

It's a small industry, and not easy to get into, unless you know key players at key firms. But the money's pretty good, and the work can be exciting. I loved being that guy in depositions whispering questions to the attorneys :)

This all involved pretty simple calculation of damages, through comparing what actually happened vs what would have happened but for the illegal behavior. But-for models were typically based on benchmarks.
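That benchmark approach can be sketched in a few lines (all numbers and field layouts invented): estimate the but-for price from a "clean" period, then damages are the overcharge on each affected sale.

```python
# Prices from a "clean" benchmark period, before the alleged conduct.
benchmark_prices = [10.00, 9.80, 10.20]
but_for_price = sum(benchmark_prices) / len(benchmark_prices)

# (quantity, actual price) pairs for sales during the damages period.
affected_sales = [(100, 12.50), (250, 11.00)]

# Damages = per-unit overcharge, times units, summed over sales.
damages = sum(qty * (price - but_for_price) for qty, price in affected_sales)
```

Real but-for models control for costs, demand, and other price drivers (often via regression), but the arithmetic at the end is this simple; the hard part, as described below, is getting and understanding the records.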

After data cleanup in UltraEdit, I did most of the analysis in SQL Server. I used Excel for charting and final calculations.

I would expect "data science" is doing some form of numerical analysis. Otherwise it's just record keeping... with computers.

The hardest part of what I did was getting enough documentation to understand the data. Sometimes we got fixed-width text files with no information about column definitions. Or column names. Or what values in descriptive columns meant. Stuff like "class of trade".

But generally you're right. It was just simple calculations using sales records. But lots of records, at least several gigabytes, and sometimes several hundred gigabytes.

Record keeping is 90% of data projects.

The second 90% is basic math at high speeds.

Right, record keeping. But when it's not your data, things get complicated. Imagine trying to understand how another firm's data systems work. You can talk with managers, who know how the business uses data. But they have no clue how the data are stored or managed. And you can talk with IT people, who know how data are stored or managed. But they have no clue how the data are used.

And yes, speed. Aggregating hundreds of gigabytes was nontrivial to do quickly. I started with Access, and then learned to manage and use SQL Server. And eventually a multi-Xeon server with lots of RAM and SAS-attached storage.

This reads like Indiana Jones teaching Archeology. Yes, as a data-scientist you actually have to work, most of the work is digging in dirt, and mostly you won't find anything of interest.

It works well when subject matter experts exist in the org and collaborate/supervise/drive data folk to solve some issue the SMEs have spent enough of their own time thinking about.

If it's just data folk by themselves getting dumped with org data and told to find pirate gold... then it's a crapshoot.

The real issue with data science, from the perspective of ML pipelines/using ml in products, is most people are straight up not smart enough for it. The second the problem falls outside the bounds of a commonly used model, 90% of data scientists are ill equipped to come up with a profitable solution. So they stumble around in the dark, producing nothing of real value. People underestimate the degree to which extreme mathematical maturity and skill can bend the results of commonly used ml models.

This reads as a series of bad job experiences and I think is explained by a wide variety of job functions that all can have "Data Scientist" as a title. Someone else's experience could be totally different. You have to know what to look for and what to avoid. If you're trying to find a DS job, one of your top priorities is finding out what the actual job consists of. For instance, a Data Scientist at Facebook might be called a Data Analyst at many other places--no modeling required.

I know this because I've been on that journey. But there's no reason to expect some department head that's never been exposed to DS to know this. They just copy/paste some other company's job req. If you're more junior, here are my tips:

- If it's a "new DS team" that supports a variety of teams: beware. Bolt-on DS doesn't work well, as it's really hard to build a meaningful solution that's not deeply integrated.

- If it's an old company or in a conservative industry: beware. There are likely to be data silos and difficult ownership models that make it nearly impossible to get and join the data you need.

- If it's a small company: beware. You're likely going to need a broad set of knowledge that's won with several years of experience to be able to build end-to-end solutions that are integrated into the rest of the tech stack.

- If it's not an engineering-driven culture: beware. DS will often be used to provide evidence to someone else who's already made up their mind and wants to pretend they're being data-driven, and you'll be the disrespected nerd that's expected to do what it takes to deliver the answer they want. Most companies claim to be "data-driven", few are, and even fewer understand data-driven isn't always possible or desirable.

Industry is still trying to figure out how to use ML and is still learning that it's not as easy as hiring someone who knows all the algorithms; rather, it takes deep technological changes to data infrastructure to enable the datasets that can then be used by the ML experts. But you don't have to be the person who helps them figure this out the hard way (i.e. by being paid to not accomplish much due to problems outside of your control). Better to find a place with a healthy data science team that can help you learn and contribute. They exist.

Agree with your points on "old company/conservative industry" and "non-engineering culture"

I'm at a place that is both, and both are huge pains.

On the engineering side, it's a bit different though: technical roles are looked down on, and there is no engineering culture, e.g., for data. Data is just a bunch of flat files everywhere, across many silos, with no leadership to put it together into logical buckets for easy access and interoperability.

> If it's a small company: beware. You're likely going to need a broad set of knowledge that's won with several years of experience to be able to build end-to-end solutions that are integrated into the rest of the tech stack.

For what it's worth, my first job was as a solo data scientist at a series B startup. It was a nightmare and I sucked, but boy did I learn a lot.

Great read. A lot of those problems are real, and some of those I’ve experienced myself. But I think at least some of them are related to the immaturity of the field. We’re only at the beginning of creating the tools and platforms to facilitate DS, making it more reproducible and easier to measure.

For example, I’m working on the tool to make data management easier and convert datasets into a structured representation. If you have experienced that you spend a lot of time on preparing and analyzing data, and it is tedious, please reach out to me michael at heartex.net, would love to get your feedback on the product we have built so far.

> But I think at least some of them are related to the immaturity of the field.

I agree. What's more, I sometimes feel that in the end the field will break up once things start settling down. Some roles will migrate more towards engineering, some will go back towards data analysis.

The expectation that a data scientist is a funnel that can turn anything into magical insights and tools can't last forever.

A really easy way that I try to explain things to people is like this:

You can't compress information until you have it in a format that is appropriate for compression.

That is:

You can't compress (apply/create algorithms) information (data) until you have it (instrumented data collection) in a format (schema) that is appropriate for efficient compression (structured logging/cleaning).

99% of that is Data Engineering and building good engineering practices which have good data practices as a priority.

For any organization that has more than a handful of employees and more than one product, that is a non trivial task and gets more difficult the larger the organization gets.
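As a toy illustration of the "instrumented collection + schema + structured logging" part of that analogy (all field names here are hypothetical, not from any particular system): emitting events as schema-conformant JSON lines makes them trivially machine-readable later, unlike ad-hoc free-text log messages.

```python
import json
from datetime import datetime, timezone

def log_event(event_type, user_id, properties):
    """Emit one event as a single JSON line with a fixed schema."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # when it happened
        "event": event_type,                           # what happened
        "user_id": user_id,                            # who it happened to
        "properties": properties,                      # event-specific payload
    }
    print(json.dumps(record, sort_keys=True))  # one JSON object per line
    return record

log_event("purchase", 42, {"sku": "A-1", "qty": 2})
```

A downstream pipeline can then parse each line with `json.loads` and rely on the keys being present, which is the "format appropriate for compression" the comment describes.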

Totally agree. Non-tech companies that think they need "data science" should instead put same effort into (data) engineering.

It's not quite 99% of the effort but close enough ;)

Search "data science hierarchy of needs"

I wrote a blog post along similar lines in 2018 (https://minimaxir.com/2018/10/data-science-protips/ ); unfortunately, the industry hasn't changed much since then.

As noted in the submission, there's a lot of flexibility in what a "data scientist" is. Normally that's good and healthy for the industry. However, it contradicts a lot of optimistic bootcamps/Medium/YouTube videos, and many won't be prepared for the difference.

My industry (information security) is the same way. Far too broad of a category, leading a lot of people to get confused and frustrated. I see a lot of analysts (defensive security) who thought they were going to be pen testers (offensive security) and didn't realize those jobs were in two completely separate career paths.

I've been a data person for the past year and a half and I'm very disappointed with the bewildering array of titles out there and the rather vague meanings behind them (Data Analyst, Data Scientist, Data Engineer, ML Engineer).

It's overall hurting my ability to build my personal brand and seek roles that are a fit for my existing skillset and aspirations.

What exactly does 'ML Engineer' communicate to employers in terms of baseline skills? Is the role closer to that of a data engineer or an analyst?

I've been working in data roles for 10 years and hold a masters in ML. I've hired and managed each of the roles you mentioned. I think of the responsibilities of each of those roles as:

-ML Engineers as building software infrastructure to scale machine learning inference and training.

-Data engineers focusing on data infrastructure and pipelining into either model inference, training, or other business intelligence platforms

-Analysts consume the product of the data engineer in the BI platform or excel, where the results would be consumed as a report in some form.

-And ML Researchers would be those inventing novel machine learning algorithms to deploy in the ML Infrastructure managed by the ML Engineers

-And data scientists would be those deploying well-known ML algorithms or statistical inference on varying datasets on the ML Infrastructure or as a slide deck.

How hard could it be to find one person who can do all that?

Depends on the amount of data, reports, pipelines... If the company is small you might not have any of these problems. Every Mom&Pop store has some sort of data to run the business but they don't need a "data" person.

Once you have 10s of datastores + pipelines, 100s of reports and a "data lake" in the TBs you'll likely be needing specialized people.

So far I've spent my career in small teams / startups and it's starting to become apparent that a lot of what's assumed in these titles only applies in larger corporations where resources are abundant and it makes business sense to have a specialist focused on a single aspect.

Unfortunately I'm at a point where I have 'jack of all trades master of none' syndrome and it's causing me to fall in between the cracks professionally. I'd like to move to a larger company where I can develop deep expertise in a narrow topic.

ymmv, but as a data scientist at young startups, I often am the one giving new tasks to the software engineers, and facilitate teaching and training if they need help.

Most of those roles a software engineer can do.

Thanks for the breakdown!

This reminds me of the latter days of the LAMP stack. A "web developer" might do front/backend and sysadmin work. I think some people see "data scientist" similarly, wearing many (all) hats, which can work for some environments, but not most corporate ones.

From my perspective as a data person, everything on this list is true. I would extend #4 from "You're likely the only data person" to "You're likely the only data person and expected to do everything you need to do your job yourself" (from sourcing the data to deploying your model).

High Data Scientist salaries and expectations combined with a shortage of qualified people often mean you're expected to be a one-person band, which I find to be miserable.

Point 5, "Your impact is tough to measure", is also shared by Quality Engineering and SRE, and is not unique to Data Science. The point about being a support role holds true for them too, and it is thoroughly frustrating when a front-end dev who makes a small change to a visual element is praised to the roof while complex automation projects by the quality team, ingenious recovery and reliability projects by SRE, and massive and fascinating inferences by data science are undervalued by leadership. The truth is most leaders just can't connect the dots. I've worked as a full stack engineer btw, so I'm not taking a dig at front-end work, but it's clearly easier to measure its impact. I've worked in quality too, and when you're only called in to explain why one bug got out and never asked about the thousands you've stopped, it's demoralizing. It's part of the reason I started Tesults (https://www.tesults.com): if you're in one of these support roles, measure, measure and measure, and throw those reports in the faces of leadership. It shouldn't have to be done, but without it, the point the author is making here will play out.

> and it is thoroughly frustrating when a front-end dev who makes a small change to a visual element is praised to the roof while complex automation projects by the quality team, ingenious recovery and reliability projects by SRE, and massive and fascinating inferences by data science are undervalued by leadership.

I feel this in my bones lol.

The frustration when the results of weeks/months of hard work are glossed over with a “oh that’s nice” in favour of endless praise for the front end team putting a picture backdrop on the search page or something.

Didn’t matter how many times we sold them on the benefits, or explained the work that went into it (at both executive summary level and detail) or did all those things you’re supposed to do, if it was more than one step away from directly causing it, or slightly more abstract than “we moved the button” it was wasted on leadership/management.

Spent a couple of weeks fixing data pipelines and ETL/database infrastructure and processes, and now everything runs faster on a smaller and cheaper cluster, and as a result I managed to put together some analysis and modelling on customer behaviour that shows that if you do xyz you'd expect to see an uptick in this thing. Doesn't matter: Bob changed where the button sits and we saw 20% more sales, good job Bob, everyone: be more like Bob.

Coase: "If you torture the data long enough, it will confess to anything."

via https://www.reddit.com/r/QuotesPorn/comments/b76ujr/if_you_t...

I guess I'm in the minority in these threads..? I've been doing machine learning / model-building / pushing models to prod and maintaining for about 6 years now. It's still 50/50 understanding the data and building/tweaking/training/testing models. But it sounds like most people with this title are analysts? At least that's what posts and threads lead me to believe. I've also met a lot of people with titles like "ML Engineer" or "Data Scientist" who don't do machine learning. They are analysts, engineers, or maintaining data pipelines.

Pushing models to prod is often MLE work depending on the organization, though MLE is often slang like how dev is slang. MLE can be a job title just as Developer can be a job title, but more often than not the job title is Machine Learning Software Engineer, or just Software Engineer for short.

I suspect a lot of people want the sexy Data Science job title, which is why there has been such a push for it, and why most new "data scientists" take the title but do Data Engineer / Infrastructure Software Engineer work or MLE work instead.

I think MLE is more sexy in a lot of ways, and it often pays better than DS work, so it's odd that many haven't flocked to that job title, but maybe the whole software engineer part turns people off for some reason.

Me, I'm more a classic data scientist / research engineer, which involves a lot of digging through data and research and generalized learning, then presenting my findings. I'm not using any ML on the job right now, but often I have in the past. It's just a tool, not an end.

> You’re likely the only “data person... Because people don’t know what data science does, you may have to support yourself with work in devops, software engineering, data engineering, etc.

Nothing has summed up my entire working experience more than this, it’s almost painfully accurate.

On one hand it’s an exciting challenge, you learn a lot and you get good at adapting to these situations.

On the downside I have practically no senior data science people to turn to for help when I do need it, which is frustrating.

I am sorry to sound like I am being obstinate, but my opinion about this is that as a society, since the early 90s, we have put way too much focus on "tech" compared to plain old mathematics or foundational science.

I don't mean manufacturing (which is doing really well), but companies like Microsoft, Google, Facebook (and even Apple) and others do encourage you to try to compete against their founders (or maybe society does that) rather than focusing on being solid mathematically. Yes, Google pays people well with those skills, but movies portray mostly their founders, emphasising how rich they are, while mathematicians are generally portrayed as weird. Society as a whole puts more emphasis on Bill Gates than on fundamental researchers.

In fact, if you really want to have a rich representative, you can pick the Simons guy. (See, I don't even know his name.) His Medallion hedge fund was built on mathematics. Ironically, Bill Gates is these days one of the biggest financial supporters of people with science skills that he doesn't have.

It is a fad to be a techie. Mathematics is not a fad, although it does have internal fads.

I do not understand. I have never understood. "Data Science" is, surely, newspeak. The appropriate term, surely, is "statistics".

There are some differences between stats and data science.

At least initially, DS was a lot about machine learning. While those methods may be statistical, it was the computer science field that drove and embraced the ML revolution. Currently, it’s mostly ML engineers who make the impact (deploy) ML and these are mostly CS folks. Statisticians still can’t code themselves out of a box (2013 MS Statistics here from top school)

Also, there has been a lot of innovation in managing data at scale (tools, infra, etc.). This, again, has been done by engineers, not statisticians. But it is still related to the science of data.

So the difference between the new (data science) and the old (stats) is about culture and about some of the methods for dealing with data at “scale”.

In other words, statistics is just a part of data science, but not the whole.

Data science is an overloaded term, but even so there are some salient differences between it and statistics.

Data science is more closely related to "statistical learning", and the knowledge required overlaps with, but looks quite different from, that of conventional statistics.

An easy way to get a sense of the difference is to compare the table of contents of a book like ISL (PDF free) [1] to the undergraduate curriculum of a statistics program. You'll find that the focus and indeed culture of data science is really quite different from that of statistics.

Leo Breiman wrote about this in his paper "Statistical Modeling: the Two Cultures" [2]. Conventional statistics belongs to one culture, and statistical learning/data science sort of veers toward the other (though not completely).

Much has been made about how "data science" is just statistics dressed up to look new, but I'm not convinced this is true. I'm also not convinced that pure statisticians have the right training to be data scientists -- additional training and mindset changes are needed. The reverse is also true: most data scientists lack the rigor and epistemological training to be statisticians.

[1] http://faculty.marshall.usc.edu/gareth-james/ISL/

[2] https://projecteuclid.org/euclid.ss/1009213726

A new title means a new opportunity to ask for more money and influence. See also, "devops".

Going from IT to devops is a great way to double your salary.

Or microservices developer/architect

Indeed, perhaps applied statistics or even data analysis. It has always felt stupid calling myself a data scientist, but the term statistician has certain connotations that are not always relevant for the corporate context.

I worked as a data scientist for 4 months at a VC firm. I have a PhD and thought the work might be legit when I was hired. After the 4 months I quit when it became apparent that my credentials were being used for managerial intrigue and the work was essentially a joke, with no rigor at all. This article hits the nail on the head, unfortunately, these positions are not often real jobs.

There is a philosophical principle which says that any model superimposed on reality could be seen as reality itself, while it is merely a superimposed interpretation, in principle.

Korzybski formulated these principles, among other things.

Most data science models are as wrong as astrology and numerology. They have no connection to reality, or rather an inadequate one.

This principle explains the abysmal failures of all model-based "sciences", starting from financial markets and going up to virus-spreading models.

Simulation of a non-discrete, non-fully-observable (AI terminology) system has exactly the same relationship with the underlying reality as a Disney cartoon has to the real world.

This is why expectations will never be met, except for natural (non-imaginary) pattern recognition.

A drop of proper philosophy is worth years of virtue signalling.

> internally, you can make inroads supporting stakeholders with evidence for their decisions!

The problem is that this can all too easily become motivated reasoning: one provides a stakeholder with support for the decision he already made. From his point of view, this is a valuable service, but it does the organisation a disservice: decisions should be made after considering the data, rather than consider only those data which support a decision.

Also, while ethical issues certainly arise, I think that Greyball is not a good example. Uber evading police enforcing the taxi monopoly is no more unethical than the Underground Railroad evading fugitive-slave agents. The taxi monopoly is itself unethical, and evading it increases the common good.

Disclaimer: I use the term Data Scientist throughout this post; however, popular titles such as Data Analyst, Data Engineer and BI Analyst are randomly applied by people who know nothing, and these people share none of the responsibilities of a Data Scientist.

I have never had hopes about the potential impact of being a Data Scientist. I felt every company should be a “data company”, but everything I knew told me that companies are political institutions bounded by the pressures of late stage capitalism. Anyone who thinks differently is dim, anyone who blogs about it is a moron.

My expectations did meet reality.

Where did my expectations come from?

I attended a four-year Computer Science degree, followed by four and a half years of earning a Ph.D. I then spent 20 years in industry. 19 of those 20 years' focus were not on machine learning (ML) and artificial intelligence (AI).

I figured I’d spend most of my time buried in code and data. I was right: I had to find shit buried in it and dig it out with my teeth. Executives hated me because I was a threat, but they needed me, so I continued to get paid. I continue to be able to create insight and predictions that almost no one else can, and until this stops I will get a 200k a year salary, benefits and a Tesla.

All of this happened, I can't be bothered to waste my time commenting on this moronic blog post.

How important is a Ph.D for data science?

I think it's quite important - or an equivalent.

From about 2012 to 2018 I went round a lot of universities, conferences and companies doing presentations and I used to often ask the audience for a definition of data science (in the hope of getting a good one). The best one I heard came at the University of Bath where someone (I know who, but he didn't say it to back it with his reputation so it's not fair to name him - it wasn't me though) said "Just drop the data, it's science".

I totally think that - Data Science is about doing Science with found and evolving data sources. We aren't often able to construct our experiments from scratch; we often get to augment them, but we always start from the data we are given - which is why it's a sub-field.

In any case - the Ph.D's I have employed have almost all known how to do Science, and it has really helped. Some people without a Ph.D. learn to do it. Experimental Ph.D's are best.

Maths and theoretical Physics Ph.D's are generally not able to do this!

Would you employ a data scientist who has published articles and years of experience as a data analyst at a university, but no Ph.D? How much of a role would the lack of a Ph.D play in this case?

As I said - if the person is capable of independent scientific investigation then I think they'd be good. I think that a Ph.D is formal training for that - but not the only way to learn.

Work reality rarely meets expectations. I am sure a lot of “UX” people also got a little disappointed that reality is much more mundane than the fancy title suggests. Or the people whose last algorithm work was the interview.

It really depends on the niche or the industry considered, though: I can happily say that I can do materials informatics from my basement at home now and much faster and better than as a cog in any lab anywhere in the world. Same for a great number of STEM applications, if you ask or follow high-level practitioners through conferences, journals and social media. The elephant in the room is the Intellectual Property generated through STEM applied data science, which is hot and even dangerous as you can see from superstars like OpenAI, DeepMind or politically-motivated aggregations.

The most common complaint I've heard from the data science team is that there isn't enough data to work with.

I'm not fully convinced that data science with ML and more modern techniques are applicable across domains out of the box. I think there is value to be added if data scientists can specialise in domains.

If we take humans as an analogy, even with the kind of general intelligence we have, we need domain expertise to be able to have advanced intuitions and make predictions about the future. I believe this is true for data science as well.

I was a data scientist for one year. Experienced many of the adverse situations explained in the article, plus I realized it wasn't for me. I joined my next job as a software engineer (after an extensive interview prep). Couldn't be happier. Still doing plenty of data science. But my product is actually a product, not the analysis (as is often the case with DS). I feel "more central" to the project, to the company. I'm still building ML models, features etc for a living.

As the lead data scientist at a small-ish fintech, I can confirm many of the frustrations and disappointments in the OP. But my trajectory was slightly different - from being the only "data science guy" in 2016, to now leading an autonomous team of four, with quarterly meetings with the CEO, and monthly meetings with our tech leadership. I decide tech stack, workflow, and hiring. Execs decide priorities. Sure, some of it was dumb luck, some of it was actually having a CEO that cares about data strategy, but I like to think at least some of it was me.

So here's what I think I did right:

1. Provide indisputable, obvious business value every month. You should consider yourself an in-house consultant to whichever cost center your salary is drawn from. If you're product development, prove value to them. If you're operations, or sales, or marketing, prove value to them. After about two months, you should be able to justify your existence in two sentences. Just remember, most of your company probably thinks of you as an optional add-on.

Your first few projects should attack high-impact pain points with the simplest solutions possible. My first projects were basically ETL into some basic regression into a dashboard. No machine learning required. But it was better than what they had (which was often nothing), and it was STABLE and RELIABLE. And that leads to the next point...

2. Build trust. With my dead-simple models, nothing ever blew up, there were no nonsensical answers, and there wasn't much brittleness when new categorical features or more cardinality was added. It mostly just worked. And that built my reputation for me. They didn't have to understand what was going on in the model, but they knew, from experience, that they could trust the result. Once I had the credibility, I could start building more complex, more elaborate models, and asked them to trust those as well. If they don't trust your models, then no business value has been created, and your job is worthless.

3. Recognize that data science is being done everywhere in the organization, and respect it. Every department has someone who has built a monster spreadsheet that contains more embedded domain knowledge than you could hope to learn in a month. As data scientists, we like to think that we're helping the organization by building critical metrics to improve performance. But here's the catch. If the metric was truly critical, someone has built it already. It might be ad-hoc, use poor methodology, and be somewhat wrong, but it works and is good enough. You have to find that person, learn from them, and improve on it.

4. Be as self-contained as possible. Ideally, your critical path should not depend on other teams doing things for you (except for IT setting up data access). You should be able to do it all. From front-end dashboards, to ETL, to DevOps. Remember, you're an in-house consultancy. You should be able to take problems and just handle them, rather than be a perpetual bother and distraction to other teams.

There's more, but if you do these four things, I think you can build the reputation in your company for creating useful, accurate data tools that help other people do their jobs better. After that's achieved, people will be breaking down your door to get your help. That's where my team is now - we've got a backlog for at least 18 months, with our work priorities often being set directly by the CEO.
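To make point 1 concrete, the kind of "dead-simple model" described there (ETL into basic regression into a dashboard) can be little more than ordinary least squares. A minimal sketch, with entirely made-up numbers and hypothetical column meanings:

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical example: weekly ad spend (in $k) vs. leads generated.
spend = [1.0, 2.0, 3.0, 4.0]
leads = [12.0, 15.0, 19.0, 22.0]
intercept, slope = fit_line(spend, leads)
print(intercept, slope)
```

Nothing here can blow up on new data the way a brittle ML pipeline can, which is exactly the "STABLE and RELIABLE" property the comment is arguing for.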

My feeling is that a lot of companies think: "We need a data scientist because all the big players also have one!"

In fact, they actually don't need a data scientist. At best they need someone who cleans data, creates pie charts or even worse, they relabel the database admin job as "Data scientist".

Can definitely relate to this. Work for big consulting firm (F500) as a data scientist, end up in this weird software engineer/ml engineer hybrid role.

I personally love it but am doing more pure software engineering now as the infrastructure is not there and I need to build it myself.

To point #5 in the article, in my experience, ascending order of potential to generate value for business:

An astonishingly large fraction of Data Science output goes to die in pretty presentations.

From what's left, a large fraction ends up in Spreadsheets.

A disappointingly small fraction ends up in live services.

Completely agree. I made a post couple of weeks ago trying to find some solutions for this: https://news.ycombinator.com/item?id=22673236

It's a problem with tech in general. Some things come over-hyped, and in the process people forget what the actual problem to be solved is, because they fell in love with the tools/tech. Maybe the solution could easily be done in Excel, but then that's not sexy. I personally prefer to handle most parts in Python because of automation: writing functions in Python is easier than writing functions in SQL or Excel (macros).

Maybe I have a narrow set of experience, but in my mind a “data engineer” is not a substitute for a “data scientist”.

The situation sounds similar to ones years ago for statistics, operations research, optimization, and management science.

I view all of such work as applied math.

My experience is that applied math, from the fields I mentioned and some more recent ones, and more, with emphasis on the more, can be valuable and result in attention, usage, and maybe money.

I've had such good results and have seen more by others.

Some examples:

(1) Airline fleet scheduling and crew scheduling long were important, taken seriously, pursued heavily, with results visible and wanted all the way up to the C-suite.

(2) Similarly for optimization for operating oil refineries: So, here is the inventory of the crude oil inputs and the prices of the possible outputs. Now what outputs to make? The first cut, decades ago, was linear programming, and IBM sold some big blue boxes for that. More recently the work has been nonlinear programming.

(3) The rumors are, and I believe some of them, that linear programming is just accepted, used everyday, in mixing animal feed.

No surprise and common enough, IMHO what really talks is money. If can save significant bucks and clearly demonstrate that, then can be taken seriously.

But from 50,000 feet up, tough to get rich saving money for others. If they have a $100 million project and you save them $10 million, then maybe you will get a raise.

What's better, quite generally in US careers, is to start, own, and run a successful business. If that business is to supply the results of some applied math, and the results pass the KFC test, "finger lick'n good", then charge what the work is worth.

Maybe now Internet ad targeting is an example.

I'm doing a startup, a Web site. The crucial enabling core of what I'm doing has some advanced pure math and some applied math I derived. Users won't be aware of anything mathematical. But if users really like the site, then it will be mostly because of the math. So, it's some math -- not really statistics, operations research, optimization, machine learning, artificial intelligence, or management science -- it's just some math. The research libraries have rows and rows of racks of math; I'm using some of it and have derived some more.

Generally I found that the best customer for math is US national security, especially near DC. E.g., now some people are building models to predict the growth of COVID-19. Likely the core of that work is continuous time, discrete state space Markov processes, maybe subordinated to Poisson processes. Okay: One of the military projects I did was to evaluate the survivability of the US SSBN (ballistic missile firing submarines) under a special scenario of global nuclear war limited to sea -- a continuous time, discrete state space Markov process subordinated to a Poisson process. Another project was to measure the power spectra of ocean waves and, then, generate sample paths with that power spectrum -- for some submarines. There was some more applied math in nonlinear game theory of nuclear war.

Here's some applied math, curiously also related to the COVID-19 pandemic: Predict revenue for FedEx. So, for time t, let y(t) be the revenue per day at time t. Let b be the total market. Assume growth via virality, i.e., word of mouth advertising from current customers communicating with remaining target customers. So, ..., get the simple first order differential equation, for some k,

y'(t) = k y(t) (b - y(t))

where the solution is the logistic curve which can also be applied to make predictions for epidemics. This little puppy pleased the FedEx BoD and saved the company. Now, what was that, data science, AI, ML, OR, MS, optimization? Nope -- just some applied math.
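That little ODE is easy to check in code. Below is a sketch with illustrative numbers only (not FedEx's actual figures): the closed-form logistic solution y(t) = b / (1 + ((b - y0)/y0) e^(-k b t)), compared against a naive forward-Euler integration of y'(t) = k y (b - y).

```python
import math

def logistic(t, b, k, y0):
    """Closed-form solution of y'(t) = k*y*(b - y) with y(0) = y0."""
    return b / (1.0 + ((b - y0) / y0) * math.exp(-k * b * t))

def euler(b, k, y0, t_end, steps):
    """Forward-Euler integration of the same ODE, as a numerical sanity check."""
    dt = t_end / steps
    y = y0
    for _ in range(steps):
        y += dt * k * y * (b - y)  # y' = k*y*(b - y)
    return y

# Illustrative parameters: total market b, growth constant k, initial revenue/day y0.
b, k, y0 = 1000.0, 0.001, 10.0
print(logistic(10.0, b, k, y0))          # S-curve: fast early growth, saturating near b
print(euler(b, k, y0, 10.0, 100_000))    # should closely match the closed form
```

The same curve shape is why the comment notes it can also be applied to epidemic predictions: growth is proportional both to the current installed base and to the remaining untapped market.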

I have high hopes for the importance, relevance, power, fortunes from applied math, but can't pick good applications like apples from a tree.
