We don't need data scientists, we need data engineers (mihaileric.com)
718 points by winkywooster on Jan 14, 2021 | 353 comments

My experience is in quant hedge funds, where sometimes you get some guys who develop the strategy and some guys who put it into production.

Yes, I do admit there can be some specialization in terms of time spent on science vs engineering.

But you really need people who understand both. Particularly if you have a strategist who thinks his job is just to dream up profitable models, he ends up carving that role out in a way that's detrimental to the rest of the team. You get people who just don't appreciate that there's other work to do than finding models, and that models depend on that other work to function.

You also get a huge prestige gap, because inevitably management will think that there's a magician and a blacksmith. One guy needs to be paid a lot, and the other guy needs to be paid enough.

These effects feed each other. Magician will say "where's my data" and expect blacksmith to make it, promptly. He won't do it himself, because spending time on mundane stuff makes the magic disappear. And not doing it yourself, or taking the time to understand it, will eventually lead to problems with the magic.

> Particularly if you have a strategist who thinks his job is just to dream up profitable models, he ends up carving that role out in a way that's detrimental to the rest of the team.

My god, this. These people make me bonkers. Especially because I feel like I have a bit of this tendency myself, the desire just to think big thoughts and do no actual work. Happily, I long ago learned that ideas were approximately worthless without labor, and that I anyway had much better ideas when laboring because it forced me to engage with the details.

And yes, those people can poison a team. My best working experiences have all been with people who a) all valued actual work and b) believed that everybody could have good ideas.

"I'm the idea guy" out of someone's mouth is the stark red-flag warning that their net contribution is 0.

Ideas are so cheap and easy.

Implementation is a long hard road. And where you learn your idea was vague enough that it had almost no value. And only through painstaking iteration can you turn it into something with value.

> Ideas are so cheap and easy.

I doubt this.

It is true however.

Consider: if it were easy to bring an idea into this world, and the hard part were the initial thought, then writing a paper or article with painstaking rigor would be unnecessary. Writing a book would be a breeze and no author would ever go through more than a single draft. The idea was born beforehand, complete, correct and perfect, so putting everything down in words is just a matter of transcribing. An organization would not hire engineers with multiple degrees, but simply writers or an automated system that would listen and transcribe the idea.

An idea is truly born and exists through a lot of effort and iteration and redefinition and refinement.

P.S. There is a hard and subjective issue of where the line is drawn between ideation and minor, uninteresting, menial maintenance/get-the-money-in-the-bank work.

We have to recall however that a parent brings their child to life and tags along through all the effort and work. A child is the result of years of high to low level of unpleasant work. Drawing an arbitrary line of when you shall stop giving as a parent is naive and egotistical.

Very true. I think I recently deliberated on this point about ideas being cheap because they can so deceptively sit in the realm of the ideal where everything is unconstrained and unchallenged. Grounding ideas and putting them to the test is where you begin to discover all the boundaries and tradeoffs and messy details that must be sorted through. Real work has a way of illuminating all those sticky messy points that must line up first for an idea to have a footing in the real world. Our imaginations are free to come up with all sorts of inconsistent and conflicting ideas that just never come to the foreground because of the centration on the beautiful perfect idea.

The best way I've heard this described is... Imagine the best painting you can come up with or have ever seen. Now go paint it.

What if your ability for imagination is lacking? What if this were true in general? What do these implementers actually implement then?

I also think I have a different definition of 'idea' than most people here. It is common, especially in software development, to see the way as the destination and incremental change as development caused by some magical good-ending evolutionary process. This includes a quite unsubstantiated belief in getting the right ideas automagically along the way.

You need both things. A good idea does not descend from heaven. Also, it has a history in the person creating it. This history is hard work on its own, an inner fight against the common and the environment. Jumping out of the box is something all these pure implementers are unable to do.

I have been working in the industry for 25 years. I have written real code from the beginning. But I am also a mathematician and can say I had a few good ideas along the way. I am proud of it.

What's the market price for an idea?

If it's greater than zero, let me know. I have notebooks full of them. Business ideas. Project ideas. Political ideas. Social ideas. I generally can't give 'em away, much less sell them. Why? Everybody has their own ideas, and they like 'em better. And the ones in my notebooks don't have what really matters: validation.

I’m interested, do you have a blog?

Not really! I have an old personal website, but I'm most active on Twitter, where you can find me as williampietri. Thanks for asking.

"Pure" ideas are cheap and easy; good ideas require a very thorough knowledge of implementation which is usually achieved through experience

Right, and to add to this - that's because all ideas originate from the senses.

To demonstrate: try to imagine a new color you've never seen before that's not in any way associated to any of the colors that you've seen. (it's impossible)

Or think of it this way: You could explain a car to someone who has never seen a car before, but only so long as the ideas used to explain the idea of a car (e.g. wheels, doors, windows) are already familiar to the other person. Otherwise, if those ideas weren't familiar, such as a wheel, you'd need to also explain what a wheel is. And if the concepts used to explain the wheel weren't familiar (and so on), you'd eventually hit a point where you must expose the idea(s) directly to their senses (e.g. show them), otherwise they will never understand what you're talking about.

So all ideas come from the senses, and your mind's ability to combine these "pure" ideas that you've sensed. "Pure" ideas are cheap and easy because they're the simplest ideas - they're what you directly sensed. To have good ideas, you need to combine many "pure" ideas together, which is why those who have experience working closely and thoroughly with something will often have the best ideas associated with that something...

They're a dime a dozen (and that's a generous appraisal).


Or even negative. I've seen situations where the idea person is so busy being Mr. Toad that everyone around them is regularly scrambling to clean up messes and it ends up being a constant distraction from actually pushing projects through to completion.

Yes I agree, idea guys are bad, but I’d like to make a contrast between two words that are often thought of as synonymous. There’s the ‘idea’ guy and a man/woman with ‘vision’. The difference between these two is that someone with a vision enables others to see and feel what they see and feel about the future of a project in a manner that produces action on everybody’s part. The visionary man/woman would act in accordance with their vision, because their end goal is to create and/or share something truly valuable to others. I could think of a couple of people with visions, many of them startup founders: Elon Musk, Steve Jobs, Peter Thiel, Paul Graham, and Sam Altman. Visionary men/women take risks on their vision and put their plans into action. Sometimes they’re the business manager, or the scientist, and at other times the engineer, but in some way, shape, or form they lead others toward a goal through their tremendous clarity in their own and others’ collective vision.

It's similar to the expression I hear frequently at my company--"I don't have all the answers, but I have all the questions".

True, ideas are easy, but that does not mean they are next to worthless.

Idea people who just blurt out thoughts that go nowhere are all over the place, but there are a few who convince others that their idea is useful. They don't necessarily need to do the work to make their ideas a reality, but they need to convince others that their ideas have value.

We need both dreamers and workers.

If you are an idea person figure out how to get others to believe in what you are dreaming and the idea will become a reality.

Think Steve Jobs, he was the idea guy that made his dreams a reality. People want to believe that he was some kind of super engineer or programmer but he was the one that was able to get all the super engineers to do their best to develop his ideas.

It's very prevalent in amateur gamedev communities. Every day there's someone who played some game and has some ideas for how to make it better. All he needs is a few programmers and artists to turn his vision into reality. Usually the kind of projects he wants to make are big AAA games in whatever is the trending genre at the moment (used to be MMORPG, now it's battle royale). When confronted they often get defensive and don't want to accept the reality that such big projects are made by hundreds of people with multimillion budgets, not by a few guys working in a basement, no matter how dedicated they are.

Perfectly believable. I used to coach at startup events, and the dreamers who wanted to make the next Facebook were so exhausting. I eventually wrote up a stock answer to give them: https://www.quora.com/Is-it-foolish-to-go-to-Startup-Weekend...

What really bothers me, though, is not the randos with this attitude. It's that some of them will grab enough money or power that they'll be able to live out their fantasy. And woe be unto any who hop on board. Quibi being the latest big example.

That’s much better than the alternative. Where their ideas are crap, and their contribution is a direct detriment.

Also, a lot of data scientists find the science fun and the engineering boring. But they have overlapping skill sets - if you aren't good at one, you're probably not good at the other either. Somebody who shows up to a team with the goal of only modeling and pushing all the dirty engineering work to their teammates is basically a worst case scenario because:

1) They probably aren't going to produce good models since they're not sensitive to data nuances, but now they've taken over ALL the modeling work.

2) They bring down the job satisfaction of everyone else on the team who would like to be doing at least some modeling.

3) They're sucking up the prestige that should be distributed over the entire team and management thinks they should be paid more for work that it turns out everybody thinks is more fun anyway.

My number one advice to entry level data scientists is to not be this guy. Don't give your interviewers the impression that you won't do your own engineering work because they won't want someone who brings negative value to the team.

Here's the tricky thing:

I love your post; I agree with your post; but it takes a 90 degree turn at the end:

"My number one advice to entry level data scientists is to not be this guy. "

Everything most people are saying here indicates it's GREAT to be that guy. You're paid, you're respected, you get the fun parts, you love your job and it's pretty safe. It just happens to suck for everybody else including team and business... but it feels that in a practical sense, the gist of everybody's actual unwitting message is "BE that guy, if you can" :-<<<

If you're that guy and you have a secure job it means you write models no one ever sees in a company which doesn't know or respect data, or you work in some data science factory as a small cog of a fairly well oiled team. The latter does happen from time to time, but it's often the former.

In every other place, your job is on the line to be erased because people will soon realize no one wants a wise-ass who doesn't actually contribute much to the bottom-line in the end.

It sucks being that guy because everyone else ends up hating you.

Depending on the work environment, it's not a stretch to see software engineers complaining to management, sometimes going so far as to create rumors to get the jr data scientist fired.

So, no, the grass is not greener. It's best to not be that person. This is why I go out of my way to prevent that scenario when I lead a team.

Not really.

You just get seen as the product owner/project manager.

That's a really good point.

I tend to be seen as a product lead / owner / stakeholder, so I feel like I'm being called out. lol

I think one difference is the software engineers see me as someone who is helping them by making their life easier. I'm not just throwing work at them blindly. I'm working with them. Also, they like it when I include them in the data science brainstorming sessions to solve difficult problems. I guess it's seen as exotic or something, but whatever the reason, they really love to be a part of it.

I think it's probably seen more as just being a decent boss.

easy to ignore hate when you're pulling a 300k bonus at comp season and can jet to st. barts to go deep sea fishing and drink claws.

Data scientists do not pull that kind of bonus. Today many of them get paid less than the data engineers do.

news to me, and welcome news to hear at that since I'm more in the data plumbing and packaging business, not algo publications.

my personal data points are from folks on buyside. trading margins have been downward trending for years

Quant research work isn't data science work which is probably where the mix up is.

On the quant side bonuses are distributed to the team.

Specifically, this is my advice to ENTRY level data scientists who are trying to find a job and compete against a flood of candidates hot off the bootcamps. I guess once you get your foot in the door, you can be that guy if you want. It seems to be a successful strategy at companies without technical leadership.

The flipside is that there are 4x the job posts for data engineering as there are for "that guy".

Companies understand that you can't hire five of that guy and get things done. If you have 5-8 years of experience as a technical product manager/data science combo then you are very happy as the magician. But very few magicians are being hired out of college, and a lot of "software engineers in data"

Pretty soon companies are going to start realizing that the 4x DEs can largely replace that 1 DS, and they will be more than happy to do so.

I went into DE because I was kind of forced into the space, but I'd strongly prefer doing full-stack DE. Anymore, I still have the opportunity to build models, they just aren't client-facing stuff, but instead are kind of Data Plumber Bots that help me do my job better so I can waste more time building other fun bots that I can't otherwise be paid for.

Seems like a waste of resources, but my manager could have another DS tomorrow, but my role would take months to fill.

Back in the day (3 years ago and earlier) at every company I was at we used the term 'productionization' to describe someone making a model aka a proof of concept, and then someone else, a machine learning engineer or some kind of engineer rewriting it to work on a server.

This process is horrible, and not just because it doubles the work, but because it introduces bugs. When the version up in the cloud does not work as intended, is it a bug in productionizing or is it in the original model? Fixing bugs in this space can take longer than the initial model development and the initial productionization. Many companies have failed over this.

So what's the solution? In recent years the industry has turned to deployment over productionization. The idea is you deploy the model to the cloud directly. Both engineers and scientists work together on the process. The scientist defines what cells in the notebook get called for the final algorithm (as there are EDA / plotting cells and documentation cells too). The engineer sets up the amazon IO stuff, database login stuff, and monitoring services. The scientist works with them to create tests and what to monitor so they get notified if there is a problem with the service.

No more mystery bugs. The model gets directly deployed, the work load is minimal, and it brings people together. The downside is often the engineers and scientists are on different teams, and sometimes companies will not let them merge for a while, so it becomes a telephone game instead of everyone feeling like they're on the same team working together. imo moving the scientist to the engineering team during this time can be helpful, or moving the engineer to the data team.

Some companies have services where entire notebooks get put up into the cloud and all of it gets called, so the scientist has to write the notebook in a way that works for the cloud. It's rarer, but how I prefer it is a wrapper py file is created that calls just the relevant parts of the notebook, kind of like a header file. This process works well for me, but as far as I know it is not standardized in the industry yet.

In short, if you end up in this situation, there is a better way. Import the notebook into a .py file or into the cloud, don't rewrite it. This (hopefully) will remove this scenario you're describing (comment this is replying to) so those issues will become a historical footnote.

Oh my! We were exactly like this many, many years ago. See reply to this thread[0].

The way I view it is frictions and impedance mismatch. People lived in several universes and there were many "taps on shoulders". Data scientist tries to work on a project but the system upgrade messed up their compute environment and their GPU isn't working anymore. Data scientists ssh'ing into a "powerful workstation" to have their notebooks run on more RAM or more powerful GPUs, having a certain convention to start their notebook servers with specific ports.

Building models and then wanting to show results to the client and asking a colleague. Set up a VM on GCP, write a small application, scp the model to the machine, create an environment with the same dependencies to load the model, set up authentication on the machine. Email the client. Client doesn't reply in time. You have a bunch of VMs.

Meanwhile the data scientist has produced another model with a notebook and they want the engineer to deploy it. Others want to reproduce it but have the same trouble with running the notebook (libraries, etc.).

A complete mess. We ended up building our platform[0]. We wanted our PhDs to do what they were good at, and we wanted to handle a lot for them. At the same time, we wanted our more engineering inclined colleagues not to do that work themselves, and we let the platform do many of these things (building images, deploying, scheduling notebooks, etc).

- [0]: https://iko.ai

How do you maintain notebooks in production? You use papermill? What about versioning?

Most libraries load entire notebooks from top to bottom when executing, and I believe papermill does too. (Please correct me if I'm wrong, as I've not used papermill.)

This is great for making a dashboard, a report, or some other kind of analytics, but when it comes to a service the customer uses, you typically never want to load the whole notebook. This is where the industry standard way of loading the whole notebook tends to fall on its face.

What we do is the cells that will end up in prod are written as functions inside of the notebook. This helps reduce globals when writing the notebook, so it is good form when prototyping, but also it allows just those functions to be called from the notebook, instead of running the entire notebook.

You will probably want to write your own library to do this, but in the meantime there is one that works for this purpose https://github.com/grst/nbimporter (Ironically the author doesn't recognize this use case.)

Using nbimporter you can import a notebook without loading it. You can then call functions within that notebook and only those functions get loaded and called.
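To make the "import without running the notebook" idea concrete, here is a minimal standard-library sketch. It is not nbimporter's implementation or API. It only illustrates the approach under the assumption that the prod code lives in `def`/`class` cells: read the .ipynb JSON, keep imports and definitions, skip every other top-level statement. The helper `load_notebook_defs`, the demo notebook, and `add_ratio` are all hypothetical names.

```python
import ast
import json
import types

# Top-level statement types that are safe to keep; everything else
# (plots, prints, EDA, assignments) is dropped and never executed.
KEPT_NODES = (ast.Import, ast.ImportFrom, ast.FunctionDef,
              ast.AsyncFunctionDef, ast.ClassDef)


def load_notebook_defs(nb_json, module_name="nb_module"):
    """Build a module from notebook JSON, keeping only imports and definitions."""
    notebook = json.loads(nb_json)
    kept = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        if isinstance(source, list):  # .ipynb stores source as a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. cells containing IPython magics
        kept.extend(node for node in tree.body if isinstance(node, KEPT_NODES))

    module = types.ModuleType(module_name)
    code = compile(ast.Module(body=kept, type_ignores=[]), module_name, "exec")
    exec(code, module.__dict__)
    return module


# Hypothetical notebook: one markdown cell, one EDA cell with a side effect,
# and one cell holding the function that prod actually needs.
DEMO_NOTEBOOK = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Exploration"]},
        {"cell_type": "code", "source": ["SIDE_EFFECT = print('running EDA cell')\n"]},
        {"cell_type": "code", "source": [
            "def add_ratio(row):\n",
            "    return {**row, 'ratio': row['a'] / row['b']}\n",
        ]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})

nb = load_notebook_defs(DEMO_NOTEBOOK)
result = nb.add_ratio({"a": 1, "b": 4})
```

The EDA cell never runs, so no globals from it leak into the loaded module. Only `add_ratio` is available to the caller.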

In my notebooks I have a process function which is like main(), but for feature engineering. On the prod side the process function is called from the notebook. Process calls all of the necessary cells/functions for me in the correct order. This way the py wrapper only has to call one function, then the ML predict function gets called, so it's pretty small on the .py wrapper side. There are tests written on the .py side, IO functions and what not too.

Data engineers love their classes, so it's easy to write a class that calls the notebook, and best of all calling a single function this way does not load globals, so the data engineers are happy. It's a nice library, because otherwise you'd have to write your own (which you may end up wanting to do).
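A rough sketch of that layout, with made-up feature steps and a stand-in model. `clean`, `add_features`, `predict`, and `handle_request` are hypothetical. Only the shape (one `process` entry point, then one predict call from the wrapper) comes from the description above.

```python
def clean(record):
    """Hypothetical feature step: drop fields with missing values."""
    return {key: value for key, value in record.items() if value is not None}


def add_features(record):
    """Hypothetical feature step: derive a total from price and quantity."""
    out = dict(record)
    out["total"] = out.get("price", 0.0) * out.get("quantity", 0)
    return out


def process(record):
    """Single entry point, like main() for feature engineering.

    Runs every step in the required order so callers never need to know it.
    """
    for step in (clean, add_features):
        record = step(record)
    return record


def predict(features):
    """Stand-in for the trained model's predict call."""
    return 1 if features["total"] > 100 else 0


def handle_request(record):
    """What the .py wrapper does: one call to process, then one to predict."""
    return predict(process(record))


features = process({"price": 30.0, "quantity": 4, "note": None})
label = handle_request({"price": 30.0, "quantity": 4, "note": None})
```

Because the wrapper only knows about `process` and `predict`, reordering or adding feature steps in the notebook does not require touching the prod side.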

This way if the model doesn't work as intended in production it's my fault. We log everything, so I can run the instance prod caught on my local machine, figure out what is going on, update the model, and then it can be deployed instantly.
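As an illustration of the log-and-replay loop, here is a small sketch. The JSON-lines log format, the `score` function, and the `serve`/`replay` helpers are assumptions made for the example, not a description of any particular logging stack.

```python
import io
import json


def score(features):
    """Stand-in for the deployed model."""
    return round(0.5 * features["x"] + 2.0 * features["y"], 6)


def serve(features, log_file):
    """Prod side: compute the output and log the exact input and output."""
    output = score(features)
    log_file.write(json.dumps({"input": features, "output": output}) + "\n")
    return output


def replay(log_file, model):
    """Local side: rerun logged inputs and collect entries that disagree."""
    mismatches = []
    for line in log_file:
        entry = json.loads(line)
        if model(entry["input"]) != entry["output"]:
            mismatches.append(entry)
    return mismatches


# An in-memory buffer stands in for the real log store.
log = io.StringIO()
serve({"x": 2.0, "y": 1.5}, log)
serve({"x": -1.0, "y": 0.25}, log)

log.seek(0)
same_code_mismatches = replay(log, score)

log.seek(0)
changed_model_mismatches = replay(log, lambda f: score(f) + 1.0)
```

When the code running locally is the same code that ran in prod, the replay reproduces what the user saw. Any mismatch points at a change in the model rather than at a rewrite.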

Version numbers on the engineering side I can't comment on as they have their own method, but on my end the second the model writes to a database then I strongly push for having a version number column or a version number metadata table in the database, so it's easy for me to access for future analysis.
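For the version number column, a minimal sketch using an in-memory SQLite database. The table layout, column names, and version strings are all hypothetical. The point is only that every row the model writes carries the version that produced it.

```python
import sqlite3

MODEL_VERSION = "1.3.0"  # hypothetical version of the currently deployed model

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    " id INTEGER PRIMARY KEY,"
    " input_id TEXT NOT NULL,"
    " score REAL NOT NULL,"
    " model_version TEXT NOT NULL)"
)


def write_prediction(conn, input_id, score, model_version=MODEL_VERSION):
    """Store a prediction together with the model version that produced it."""
    conn.execute(
        "INSERT INTO predictions (input_id, score, model_version) VALUES (?, ?, ?)",
        (input_id, score, model_version),
    )


write_prediction(conn, "a1", 0.91)
write_prediction(conn, "a2", 0.15)
write_prediction(conn, "a3", 0.40, model_version="1.2.0")  # row from an older model

# Later analysis can split results by model version with a plain GROUP BY.
counts = dict(conn.execute(
    "SELECT model_version, COUNT(*) FROM predictions GROUP BY model_version"
).fetchall())
```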

Is rewriting your functions from notebook to a py file really something a research scientist can not do?

Or is it infeasible for some other reason?

I'd imagine many data scientists want to publish their work as python packages or libraries during their PhD, so they should be familiar with writing classes or functions that work at a bare minimum.

>Or is it infeasible for some other reason?

I've had projects where the model doesn't perform as intended. Because one person was making the model and another productionizing it, it was hard to identify where the performance difference was coming from. Was the bug in the model itself or in the productionization process itself? It took longer to figure it out than writing the model or productionizing it the first time.

It takes so long to deal with these bugs because the model gets changed, so then prod gets changed to match it. Changing prod (rewriting functions) has the potential to create a new bug, so you solved one but added another, and still can't identify if it is in the initial model or from prod. This continues over and over again, problem after problem.

It's worth mentioning that if one person is doing both the model building and the conversion to production, this problem is significantly reduced, but it is still a problem. The problem is exacerbated by the lack of domain knowledge, in that both people are in the dark about the other person's process.

Furthermore, what if you need to update the model? Do you rewrite prod doubling or tripling your work? Do you take that risk to introduce another potential hard to diagnose bug, even if you're the one doing both roles?

Or do you automate the process, so the code being developed on is the same code running on the server at the end of the day? No more bugs, half to 1/3rd the amount of work. Why not do it this way? It's so much easier to debug a problem in prod this way. You can take the log data and spit it into the local machine and know what you're seeing is what the user saw. No more guessing where the problem is.

One way to think of it is software engineers would think it is absurd to write their code, then hand it off to someone who doesn't completely understand it, to rewrite it in another language and put it up on a server. "Why would you ever want to do that?" they would think, and I agree with this sentiment. It is absurd to have someone (even you) rewrite your work unless you have no other option, and you do have other options. Transpilers are a thing if prod needs to be in another language. I've written models that have to go onto embedded environments. I know these challenges all too well.

>I'd imagine many data scientists want to publish their work as python packages or libraries during their PhD, so they should be familiar with writing classes or functions that work at a bare minimum.

It depends if you're writing a library, like doing ML / machine learning engineer type work, or you're solving a domain challenge and writing an end to end solution for that problem, and are using standard cookie cutter ML for your phd, aka data science type work.

One leads to an engineering role, and not surprisingly writing a library for it is ideal, so other people can use it. Another leads to a data science type role and not surprisingly showing code snippets in your paper with plots / EDA and all, the same way you'd write a notebook at work, is ideal.

I'm a data scientist, not an ML specialist (though I have invented a new form of ML for work once, but that was just once and not my primary thing). I specialize in end-to-end domain problems I'm solving. I'll write a notebook to solve it, not that I have to. I've been in the industry longer than notebooks were a thing, so I'm fine doing it the old fashioned way. What I am not is an MLE. I don't need to write libraries for other users to use. I don't need to write custom ML. I don't need to do that engineering bit. To be fair, I have, and I know when it's the right tool for the job. On stackoverflow all of my points come from helping people with the glue parts between C++ and R, so they too can write libraries for R. I'm proficient in modern C++ too. I can do the library ML type work, and I have enjoyed it, but I really do enjoy solving domain problems more, so it's what I'm doing, and it's what the previous comments in this chain you're responding to are all about.

Exactly why I left Dolby. Didn't want to be a part of that process.

It's hard for most people entering this field because the incentives are perverted - there's this perception that DS is sexy and you actually don't need to know coding that much (just enough to use scikit-learn). Thus people with pipe dreams of tweaking model hyperparameters to spin gold come in and get a rude awakening. Not unlike people flocking to LA to become actors.

I worked in investment banking (as an analyst, not an engineer), so very different part of finance, but this was my take as well. Companies might love to talk about how important engineers are, but at the end of the day, if you can't directly link someone to revenue, they get viewed as a cost center and take on second tier status in the organization. Then the same companies complain that they can't find enough (or retain) engineering talent. Not many places get the balance right. Silicon valley treats engineers well because for the most part, the value they bring is more obvious (and also, they don't threaten the existing hierarchy in the company). Curious to hear if anyone has had the opposite experience.

Yes, quite a few developers left our investment bank and went to work for our suppliers (of trading software), stating they'd rather work somewhere where they're seen as value creators rather than a cost center.

I worked for 15 years as a software engineer at Morgan Stanley, where they valued the process of taking a 3 martini lunch idea into a production platform, so the value of engineers was recognized and rewarded as such ... it's somewhat easier to whip up a new financial wrinkle; it's a whole other level of magic to design and implement that idea when it takes 60 software developers 3 years to get that idea to market before the rest of the street ... of course the IT department was/is the largest budgeted portion at the entire bank, and for a good reason

Engineers get paid well in SV because they are in demand, have lots of employment opportunities, and therefore are more difficult to retain.

And because their contributions can be tied back to revenue. You need both, demand for talent, as well as the ability & justification to pay for it.

Engineers are in high demand all over the world. But most companies do not profit enough from technology to justify paying similar SV salaries.

Not always. Frequently, the connection between code that’s written today and revenue tomorrow is tenuous and difficult to package in a way that says “look at me! I’m valuable!”

And, then there are those somewhat rare occasions where a project is not intended to increase revenue, and may even decrease it. At my last employer, we guesstimated that a project I worked on for months could possibly have ended up costing us $2M per year in revenue. That was both accepted and expected, because we were doing it to gain goodwill with users, but in such a way that it might end up pissing off a small minority of our customers.

I really wish, just once, I could work on a project and put underneath it on my resume “Increased revenue by X%,” because I’ve never worked on anything that was so easy to directly trace back to the top line.

Cost savings are another story, because engineers can fairly easily quantify how much less money is being spent by doing $THING a bit more efficiently....

My partner leads engineering talent programs for a large SV company and I can assure you they do not track value added by engineers. In the overwhelming majority of cases the value added cannot be tracked. Now, if you're talking Product Managers, that's a bit of a different story. It's simply a supply and demand issue.

tell that to Mr Nugget from The Wire. How much you get paid is a reflection on the CEO | culture not on the value you provide.

I really like your magician/blacksmith analogy.

I'm in industrial automation, but it's much the same. Projects where someone developed a strategy but has never been involved in the details of a machine are doomed to failure (or at best to be unreliable and producing low quality parts). Projects built by machine fabricators are over-engineered, frequently late, and sometimes unprofitable, but damn if they don't work well.

The main trouble, I think, is that when a shiny new contraption is brought to the king, it's too often the magicians doing the talking - whether they're speaking words of power or Common, their job is to talk. Meanwhile, the blacksmith is probably busy in his workshop at some ornate scrollwork for the next thing, or repairing the previous gizmo, because he'd rather be hammering away at his anvil than talking.

The higher you go in an org chart, the fewer the number of people who understand the work their company actually does, and the more voices you have between the workers and the decision-makers to take some of the credit for work as it passes up the chain.

That seems to be true in every field I can think of. The smaller the gap, or rather the more practical experience the strategy people have, the better a given org seems to be.

One common issue I run into is that when the blacksmiths start talking, nobody listens.

The funny thing is, even when it's unintentional, people seem to attribute credit to the magician rather than the blacksmith. At my workplace, I even have situations where I explicitly tell people: "I am familiar with X and how (some) of it works, but not all of it. I did not create it nor was it even my idea; all credit goes to Jim. If you need anything to do with X, you're best off asking Jim. But if Jim's too busy to help, I may be able to provide some minor help."

Yet even when I do this, I somehow become the arbiter and authority for all problems and questions on X. 5 years go by and everyone thinks X was all my genius. And I hate it, because personally I do not like X created by Jim - even if everyone else does...

Then make sure Jim knows and feels your appreciation. Maybe Jim doesn't want the attention or questions (or at least doesn't value the prestige/benefits at the cost of the attention it requires). Many times the person who will churn out work in a dark, unappreciated corner will continue to do so if 1 person who they respect shows them enough respect in kind. They know the world they're in. No one expects plumbers to get a free lunch even while many know they should.

To add, quants that can't do the data engineering work are always crappy quants. I haven't seen a counter-example to that. Profitable models aren't going to be delivered on a silver platter. They need to be able to process pretty low level data effectively and build ad-hoc custom tools and data pipelines around that to test out their ideas. Otherwise they're constrained to the tools others have built and that massively narrows the search space that they're capable of traversing.

The best quants are 1/3 statistician, 1/3 developer and 1/3 trader, in my view.

I'm not sure about crappy quants. Some people of the "quantitatively inclined trader who has learned Python" variety are never going to be good at the engineering side - it takes years to learn to be a good software engineer, and that's not a good use of time, for them, or for their employer. But they can still do useful work.

The trick is to figure out how to work effectively with those people. Build infrastructure that keeps them on the rails, refactor their code, push them in the right direction, tell them when they've fucked up, teach them little things with high leverage. As long as that doesn't turn into being their slave, that's fine.

If they're using a dynamically typed language to do monetary calculations, it's not going to be ideal.

Researchers do not need to have deep programming experience, but they have to be comfortable enough to use an environment that can lend itself to the problem at hand. On the quant side, unlike on the data science side, the barrier of entry on the programming side is a bit higher. To solve this problem many firms have their own internal programming language.

This is dogmatism swung too far in the other direction, IMO. There are many, many successful production code bases written in dynamic languages. In my own experience as a vision scientist/engineer, there is tremendous value in being able to quickly whip up a concept in Python and then being able to easily visualize the results. Doing this exploration in C++ is wasteful. Implementation takes much longer, the correctness brought by static typing is dubious since the code isn’t in prod, and the canned CV/visualization libraries are fewer and frequently suck in at least some way. That said, there is also tremendous value in understanding how to map your Python prototype into production code, too. Someone strong in this field can do both.

This was addressed in the previous comment

>On the quant side, *unlike on the data science side*,

Vision scientist is on the data science side. You're not dealing with monetary values where floating point error compounds on itself to the point your models become garbage. Quant work is its own unique field with its own unique prerequisites.

Nothing precludes you from doing integer arithmetic in a dynamic language.

I’m not a quant and this isn’t my area of expertise, but, for example, I’m pretty sure various differential equation solving methods depend on variables taking on continuous values, so floating point basically must be used. Understanding the impact of that is definitely very important. Analogously, I frequently run into numerical precision issues in image processing. Understanding how numbers are represented on a computer isn’t unique to being a quant. Understanding how the choice of representation can impact prod is also not unique to being a quant. The dynamicness of the language isn’t particularly relevant, either.

>Nothing precludes you from doing integer arithmetic in a dynamic language.

You would be surprised. The second you use pandas with a custom data type (let alone any other library you'd want to use) it can randomly auto convert it to a float. Furthermore identifying when it randomly converts the type on you is a pain.
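A minimal sketch of one well-known way this happens, assuming plain pandas with made-up integer "cents" values (the labels and numbers are illustrative, not from anyone's real data): reindexing an integer Series against a label that doesn't exist introduces NaN, and the column is upcast to float64 without any warning.

```python
import pandas as pd

# Hypothetical data: prices stored as integer cents, one of them too large
# for a float64 to represent exactly (2**53 + 1).
cents = pd.Series([1999, 2500, 2**53 + 1], index=["a", "b", "c"])
original_dtype = str(cents.dtype)            # int64

# Aligning to an index with a missing label introduces NaN, and NaN only
# exists in float columns, so the whole Series is upcast without a warning.
realigned = cents.reindex(["a", "b", "c", "d"])
upcast_dtype = str(realigned.dtype)          # float64
precision_lost = int(realigned["c"]) != 2**53 + 1

# The nullable extension dtype keeps integers and marks the gap as <NA>.
nullable = cents.astype("Int64").reindex(["a", "b", "c", "d"])
nullable_dtype = str(nullable.dtype)         # Int64


def require_integer_dtype(series: pd.Series) -> pd.Series:
    """Cheap guard that turns a silent conversion into a loud failure."""
    if not pd.api.types.is_integer_dtype(series.dtype):
        raise TypeError(f"expected an integer dtype, got {series.dtype}")
    return series
```

Calling a guard like the hypothetical `require_integer_dtype` at pipeline boundaries is one way to find out where a conversion happened, instead of discovering it downstream.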

>so floating point basically must be used.

Quants tend to use fixed precision types. It is like a float in every way, except base 10 instead of base 2 so there is no floating point error.

> The second you use pandas with a custom data type

That's a pandas (and maybe numpy) issue, not a dynamic language issue. (If you want to generalize from the specific libraries more accurately than “dynamic language”, it's a “using a low-level library whose type system doesn't match the host language type system” issue.)

> Quants tend to use fixed precision types. It is like a float in every way, except base 10 instead of base 2 so there is no floating point error.

No, a type that is like binary floating point in every way except base 10 instead of base 2 would be decimal floating point, not fixed point. Decimal fixed point is different from binary floating point in more ways than base.
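To make the distinction concrete, here is a rough sketch using Python's standard `decimal` module (the specific values and the two-decimal `CENT` step are just illustrative assumptions). `Decimal` is decimal floating point: its precision is a number of significant digits, so the number of decimal places moves with the magnitude. Fixed point is what you get when you pin the exponent, for example with `quantize`.

```python
from decimal import Decimal, ROUND_HALF_EVEN, localcontext

# Binary floating point: 0.1 and 0.2 have no exact base-2 representation.
binary_exact = (0.1 + 0.2) == 0.3                                    # False

# Decimal floating point: the same sum is exact in base 10.
decimal_exact = (Decimal("0.1") + Decimal("0.2")) == Decimal("0.3")  # True

# It is still *floating* point: precision counts significant digits,
# so the number of decimal places depends on the magnitude.
with localcontext() as ctx:
    ctx.prec = 6
    small = Decimal(1) / Decimal(3)          # 0.333333 (six decimal places)
    large = Decimal(1_000_000) / Decimal(3)  # 333333   (zero decimal places)

# Decimal fixed point: every result is rounded to the same exponent.
CENT = Decimal("0.01")  # illustrative choice of two decimal places
fixed = (Decimal("10.00") / Decimal(3)).quantize(CENT, rounding=ROUND_HALF_EVEN)
```

Note that the decimal types still round (1/3 is not representable in base 10 either), so "no floating point error" really means "no surprise when parsing decimal literals".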

Quants don't care about floating point precision in research. It's just applied stats

I do, because the results from my research vary when I'm validating the model.

> If they're using a dynamically typed language to do monetary calculations, it's not going to be ideal.

And yet, Q.

> If they're using a dynamically typed language to do monetary calculations, it's not going to be ideal.

I think this is an inaccurate take. No one in finance is doing accounting or model estimation using Python's floats; they are using numpy's float32 (or float64) type instead. I think a more accurate version of what you're saying is that static type checking is useful when modeling complicated contracts; this might be true, but I think it's not that important, as those things aren't that liquid anyway.

Jane Street's decision to use OCaml is almost as much about hiring and history as it is about language features.

> No one in finance is doing accounting or model estimation using Python's floats

We are. When your input data only has five significant figures, and probably less than that of real information, numerical accuracy is the least of your worries.

Or, they're using ints instead, at least for market data.

Fixed precision types technically. Internally they are an int under the hood, so yah basically that.
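A rough sketch of the "int under the hood" idea, assuming a hypothetical `FixedPrice` type with four decimal places (the name and the scale are made up for illustration; real feeds define their own tick sizes):

```python
from dataclasses import dataclass

PRICE_DECIMALS = 4                    # hypothetical scale: 4 decimal places
PRICE_SCALE = 10 ** PRICE_DECIMALS


@dataclass(frozen=True)
class FixedPrice:
    """A price stored as an integer count of 1/10000 units."""
    units: int

    @classmethod
    def parse(cls, text: str) -> "FixedPrice":
        negative = text.startswith("-")
        whole, _, frac = text.lstrip("+-").partition(".")
        if len(frac) > PRICE_DECIMALS:
            raise ValueError(f"more than {PRICE_DECIMALS} decimal places: {text!r}")
        # Pad the fractional digits on the right, then combine as integers.
        units = int(whole or "0") * PRICE_SCALE + int(frac.ljust(PRICE_DECIMALS, "0"))
        return cls(-units if negative else units)

    def __add__(self, other: "FixedPrice") -> "FixedPrice":
        return FixedPrice(self.units + other.units)

    def __str__(self) -> str:
        sign = "-" if self.units < 0 else ""
        whole, frac = divmod(abs(self.units), PRICE_SCALE)
        return f"{sign}{whole}.{frac:0{PRICE_DECIMALS}d}"


# Adding 0.1 ten times: exact with integers, slightly off with binary floats.
total = FixedPrice(0)
for _ in range(10):
    total = total + FixedPrice.parse("0.1")

float_total = sum([0.1] * 10)
```

Here `str(total)` is `1.0000`, while `float_total` is not equal to `1.0`. Addition and comparison stay exact; multiplication and division still need an explicit rounding rule.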

> "To solve this problem many firms have their own internal programming language."

Any examples other than Jane Street?

Goldman Sachs (Slang)

> 1/3 statistician, 1/3 developer and 1/3 trader

How is being a trader different from being a statistician? Curious as I've never worked in finance before.

By trader, I mean domain knowledge about the markets. Statistics is the toolbox that this domain expert uses to test their hypotheses and turn them into a profitable model. But if the person isn't a domain expert and only knows statistics, their ideas about what to test won't be good.

And to knowledge, I'd add disposition. It's been years since I've been in finance, but the best traders I worked with were all very driven to succeed, to dominate, to win. Markets were really interesting to me, but I never cared much about that part.

Yeah, it's a performance discipline like any other (competitive gaming, athletics, etc) where only the top few % can succeed. If someone isn't very driven then they won't make it.

This is almost precisely the thought process behind how my company hires data scientists who build user-facing analysis.

1/3 statistician 1/3 engineer 1/3 product person who can learn the user's domain-specific needs

What must be communicated to management: It is easy to find other magicians. It is not easy to find another blacksmith. Without the right blacksmith, there can be no magic.

Magicians will be magicians, always hustling (bullshitting), but they will never have the value and job security of the blacksmith. The blacksmith can see the fruits of her own labour, whilst the magician must lie to herself and others in order to claim the blacksmith's value as her own.

If the blacksmith is good enough, she will earn the trust of management and management may consult the blacksmith in the selection of magicians. Management may ask the blacksmith to interview magicians and seek her advice on the final hiring decision.

The blacksmith may not carry the "prestige" of the hustling, bullshitting magician but she can command a high salary and dictate her own working conditions. This is only if management understands her value. What the magician thinks of the blacksmith is irrelevant.

Reliable blacksmiths are hard to find. Magicians are a dime-a-dozen.

> It is easy to find other magicians. It is not easy to find another blacksmith. Without the right blacksmith, there can be no magic.

What? That runs counter to my experience at every company where I've either seen data engineers or worked as one. My observations of how management treats the two groups are these:

Data engineers ("blacksmiths"): Blacksmiths are paid less. People think of them as less highly educated. Their work is less creative. When they are successful, their work is mostly invisible. They are interchangeable. People think of what blacksmiths do as more like scripting than writing code. Blacksmiths mostly work on configuring systems they didn't build. Blacksmiths do more troubleshooting than building. Their roles are focused on support.

Data scientists ("magicians"): Magicians are paid more. Much more. People think of them as more highly educated. By definition, what they do is magic. They work on prominent projects. Their successes are highly visible. They build large systems that only they can comprehend. They use support staff to clear away mundane obstacles so they can focus on unique, highly creative aspects of work.

Saying that we need more data engineers than data scientists is like saying that we need more janitors than CEOs. That's true, but it's true because we made it true by structuring projects around one prominent, well-paid person supported by a staff of invisible drudges.

This smacks of the positive self-talk that QA and software testers used to give each other: "We are indispensable! We take pride in our craft! Nothing can ship without our signoff!" And then lots of companies reduced their QA or eliminated it wholesale by focusing on continuous delivery and changing consumer expectations of what "broken" or "acceptable" means. The same fate awaits data engineers.

Good Data Engineering is what enables good Data Science. With a good infrastructure you can go 100 times faster. Getting rid of Data Engineers means killing Data Science.

Sure. So in a few years, an all-in-one 80% solution like Palantir Foundry will come along and there will suddenly be a lot less demand for data engineers.

Anecdotally, the former head of QA for Palantir UK is now the head of data engineering for Palantir UK, and Palantir does have an out-of-the-box, end-to-end, it-just-works product that handles 80% of ML workflows. You're betting your career that they won't put it in a box and sell it at commodity software prices?

If something like that becomes mainstream, data scientists who just glue together canned ML algorithms from libraries should also be concerned. The kind of automation that enables eliding finding and cleaning the right data probably can also manage running PCA and basic ML.

I remember the same claims for ISL Clementine before IBM bought it and turned it into SPSS modeler.

The way the evolution in software went, platforms became more capable and allowed individuals to automate more common tasks. QA/DevOps/SRE teams were consolidated and replaced with smaller platform teams which empowered internal engineers to quickly write scalable and well tested services. If data management, instrumentation, and ML tooling become sufficient then perhaps the data engineers will be replaced by a science platform team.

Caveat is that many scientists are expected to publish novel research to advance their career. "Infrastructure" and "data management" do not tend to produce the kinds of sexy projects which are attractive to publish.

I think this model (an integrated team) is what I see, there is a huge benefit in terms of short decision loops from having one team - but data engineering skills are really important in enabling it. Also if the people doing the data engineering are close to the data science then there's much more likelihood that they will produce effective solutions for the backend of the project.

aye - this pattern often trends towards the roles merging over time. The counter example to the platform team approach from the software side is Software Engineers owning the full infra and ops that their services need.

There's a big incentive for companies that need to hire more people to give the "better" title out for the same kind of work. If managers do maintain a gold/silver role on their team all of the folks in the silver role will look at the gold role as their next move. Worse there is a net-negative productivity drag where the gold/silver role constantly debate what's in-scope vs. out of scope for their work.

I once saw a team where the scientists were meant to be equivalent to SDEs in coding skill, but the scientists could only in practice do some light python/bash scripting. They tried to make the SDEs responsible for "productionalizing" the projects which meant adding tests/etc. The engineers who could all left the team in 6 months, the ones who remained were also unable to perform more than light bash scripting/python work.

>reduced their QA or eliminated it wholesale

This is usually the case only for companies that work on low risk applications (I.e. not safety related or critical industries) or have been lulled into complacency (sometimes, ironically, “we haven’t had a major issue so obviously QA isn’t needed” when strong QA is precisely why they didn’t see issues)

> low risk applications (I.e. not safety related or critical industries)

My anecdotal experience is with Palantir (software used in war zones). Between 2016-2018, they eliminated most of the 150-200 QA people they had. Testing is now done by devs and users with the help of sound CD principles like blue/green deployments.

That’s fair, because there’s process elements in place to mitigate risk. I read your comment to say “get rid of QA” not “get rid of the separate QA team”. My comment was more pointed to those who feel testing is inherently wasteful. In your example, the testing and configuration is tightly controlled to mitigate the same risk as a QA team (although there may be something to be said about the best practice of having the QA team be independent). Where I get nervous is when testing gets cut in hazardous systems because of cost or schedule. I personally wouldn’t want to get on a plane or autonomous car built like that. My own personal anecdotal experience is that organizations that were cavalier about QA on safety critical hardware/software inevitably had their comeuppance.

>Without the right blacksmith, there can be no magic. Magicians will be magicians, always hustling (bullshitting)

>management may consult the blacksmith in the selection of magicians

I mean when you put it like that, why hire a magician (bullshitter), if the magic relies on the blacksmith?

And, if management needs to consult someone (a blacksmith) on hiring (another blacksmith or for whatever reason a magician), then arguably management is made of magicians.

Don't get me wrong, I agree with the point you're making. It's just the problem is with BS and BS is rampant, or like you say: a dime-a-dozen.

> I mean when you put it like that, why hire a magician (bullshitter), if the magic relies on the blacksmith?

Because magicians are better at the smoke and mirrors that drives funding rounds and closing big sales deals.

I see this same attitude about TDD adoption - teams in my company say things like “testing is for lackeys / that work is beneath us”, i.e. they see that as the responsibility of QA testers who are less important in their view. This is short sighted, arrogant and encourages similar problems with superiority complexes. TDD is still controversial in some circles, but engineers who have a deep understanding of both tests and implementation are far more valuable than those who only understand one side. Anyway, sorry for the somewhat off topic rant, but a lot of what you said resonated with me.

TDD is fine when you have a specification to work to. A lot of software development in the real world is quite "experimental". Requirements are poor so devs need to provide what is essentially a prototype and receive feedback until it is good enough.

Agreed and that’s what spiking is for! Once you have an idea of the solution then you can implement properly with TDD, which generally results in a much better design than just going with the original spike. However, I find it bizarre that many engineers think unit testing is beneath them. I can understand that people find it difficult to stick to the test first discipline. But it’s crazy to me that you wouldn’t want to write tests at all.. I feel so much happier with the additional protection.

I think this insight exists across a lot of fields. Basically, if you want to be a really excellent magician you also better be a decent blacksmith. More concretely in this case, if you’re unable to do the data “engineering” yourself then it will close a lot of doors for interesting and novel work on the “science” side. Beyond that, if the scientist’s job just involves gluing sklearn models together I think that job is more on the engineering side of things than the supposed scientist usually wants to admit.

This problem only grows as the company scales and the science and engineering pieces are formally split along some role guideline.

Inevitably if you treat a job role as a support role, you'll attract weaker individuals into that role than you would get if it wasn't considered a support role. The problem with Science oriented teams is that all roles other than the science role morph into science support roles over time. The same pattern used to occur with Engineers and QA, or Engineers and ops.

How do you achieve people like this? From my limited experience (college senior joining a hft firm shortly, so I've recently been in several quant finance SWE interview loops), firms seem to vastly downplay the financial aspects of the job for software engineers. Compounded on top of that, firms don't expect or encourage financial backgrounds for engineers (at least new grads)- the expectation is that whatever limited financial background we'll need to work will be given to us when it becomes necessary.

Is this because it's easier (obviously) to teach a quant engineering than it is to teach an engineer quant finance? Or rather because it's expected now that traders will become the bridge between researcher models and implementation, and engineers will simply provide the underlying infrastructure to power these implementations?

As I see it you need people who have shallow knowledge of many areas and deep knowledge of one area. That lets you have a group of experts but ones that know enough about other areas of expertise to work with those other experts.

this perception of classes within engineering is the greatest frustration of my career. People with a PhD or “scientist” in their title are not more valuable than engineers who end up being the ones to get things to work.

The "scientists" are often far less valuable or have outright negative contribution.

From experience, the magician will take every chance to make this divide greater, and sell their expertise, rather than grow with (and help grow in domain) the blacksmith skills. You end up with magic: closed siloed knowledge. “How would the blacksmith ever understand magic?” was often something thrown around by magicians at meetings.

The (repetitive) blacksmith role is not an interesting one, digital revolution needs to come into place. Architects that build tools, self service systems are much more interesting.

That's interesting - I just completed a book on Jim Simons/Renaissance (The Man Who Solved the Market). One of their early advantages was having a person who was just focused on acquiring and cleaning data. I expect that advantage has largely gone away at this point due to wide availability of market data but I thought it was interesting in the context of this article.

Same for CFM too, they have an entire team working on alternative data and they feed it to a modeling team.

maybe hedge funds would be able to find more people if they didn't only hire "guys".

Can you show me a job ad where it is specific to guys?

How does compensation tend to differ? And the education levels?


The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.

We need more Data Engineers involved at time zero in projects to help:

1. Plan out what data should be produced/captured by the product

2. Instrument systems to actually generate data consistently and effectively

3. Build ETL pipelines and data management systems

4. Manage enterprise data sharing and resiliency


What ends up happening is you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team. That is typically missing data or poorly formatted and they spend 90% of their time cleaning it up then running some basic regression with numpy or whatever.
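As a rough illustration of what "missing data or poorly formatted" looks like in practice, here is a minimal sketch of the kind of check that could run when the file is produced rather than after it lands on a data scientist's desk. The column names, the sample rows and the rules in `validate_row` are all invented for the example.

```python
import csv
import io
from datetime import date

# Hypothetical flat-file dump; in reality this would be read from disk.
RAW = """user_id,signup_date,plan,monthly_spend
1,2021-01-03,pro,49.00
2,,free,0
3,2021-01-05,PRO ,
4,not-a-date,free,0.00
"""


def validate_row(row: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not row["signup_date"]:
        problems.append("missing signup_date")
    else:
        try:
            date.fromisoformat(row["signup_date"])
        except ValueError:
            problems.append("signup_date is not an ISO date")
    if not row["monthly_spend"]:
        problems.append("missing monthly_spend")
    if row["plan"] != row["plan"].strip().lower():
        problems.append("plan is not normalized")
    return problems


# Validate every record, then keep only the ones with at least one problem.
report = {row["user_id"]: validate_row(row) for row in csv.DictReader(io.StringIO(RAW))}
bad_rows = {user_id: problems for user_id, problems in report.items() if problems}
```

In this made-up sample, three of the four records end up in `bad_rows`, which is roughly the ratio the "90% of their time cleaning" complaint implies.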

Need better understanding of the data lifecycle by organizations and investment in instrumentation and data management.

> What ends up happening is you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team

Not to disparage the amazing data scientists I've worked with, but I've been on teams where this is very much the approach to operationalizing models. It's basically, "Here's the sklearn model and some fragile featurization scripts we built. Can you take this to prod ASAP?"

The problem I've seen is that DS & DE teams were in different parts of the org and had their own sprints that were in no way connected. So they kept chucking models over the wall and we kept trying to faithfully operationalize. Once we convinced leadership that we had to collaborate from the get-go, things went a whole lot better. It also improved the working relationship of engineers and scientists.

I learned a hell of a lot from the scientists; they learned how to write better code. They also learned what code they didn't need to write because I could do it faster or better than them, leaving them to focus on more important things. It was pretty amazing to find what manual processes they would setup in lieu of proper (or even any) engineering support. Again, these are amazingly smart people, but they were being square-pegged into a lot of round-hole engineering tasks.

Now, the much more frustrating issue I had was being in a very data-heavy organization and being told by a distinguished engineer (my skip-level) plus my direct manager that, "data engineering isn't a real discipline." I left that org very shortly thereafter.

Not only that. If you have DS & DE in different orgs, often DE is in an IT org that also has to support legacy systems that sometimes become very time intensive. So then the DS org says they can't get stuff done because DE does not deliver, and DE can't get things done because the "elder engineers" in their org are not allowing reformation. So they are stuck between a rock and a hard place.

This is 100% my experience as a data scientist. The engineering support we get is restricted to submitting a ticket for database access or moving data from one system to another. Wouldn't dream of involving an engineer in a data science project team, because I have no evidence that they have any experience or expertise in anything other than tickets to move data around.

That's first line support not engineering

>> moving data from one system to another
> That's first line support not engineering

Assuming the OP meant "setting up a pipeline for moving data from one system to another" and not a one-time copy, it is definitely engineering.

Yeah, it's usually a pipeline

Yeah, because moving data around (which is hardly the entire responsibility of data engineer) is not useful at all

>The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.

Reading this thread has made me realize just how lucky I am to work very closely with a very strong Data Scientist, who is complemented by a very strong Data Engineer. Conversations with the Data Scientist are always about strategy, product alignment, and ensuring we're optimizing what we build for learning. The Data Engineer works very closely to ensure we're actually capturing the data we think we are, getting it to analysis systems, and making sure those data pipelines stay healthy.

We have a sister company with many data scientists, and very few (actually I don't think that they ever hired any with the specific title) data engineers.

And, their production alleged "machine learning" (it's pretty much standard linear regression, but calling it ML is sexy) systems are slow motion train-wrecks. If the string and duct tape holds, then it works, but it's unfortunately continually breaking.

Hell, in Slack, I watch their data scientists continuously wrestle with how to actually make their Jupyter notebooks work in production.

Whereas my company has far more data engineers than data scientists. The plan from higher up the corporate food chain was always that we'd give them our data to do their data science voodoo on, but we ended up getting a few data scientists of our own for specific projects.

So, we focused on ensuring our data stream was reliable, consistent and sufficiently timely, for them to work on. But as soon as it hits their systems, it's a forest fire of hacks upon hacks, which inevitably break.

In the end, we had to send in our data engineers to stabilise their flagship "real time reporting" product that corporate was so amped about.

So yeah, I think that there's probably a happy ratio of data scientists to data engineers, of about, say, 1:5 or 1:10, because the maths generally scales O(1), it's the beauty of maths, but the actual engineering to get clean data delivered timely without breaking anything scales very differently indeed.

>Hell, in Slack, I watch their data scientists continuously wrestle with how to actually make their Jupyter notebooks work in production.

Could you go into more details on what their struggles are? We had many problems as a company doing machine learning projects, and we built our internal platform (https://iko.ai) to keep our sanity. I'm always interested in problems others may be having.

Python hell, basically. Getting the right Python version, right dependency versions etc.
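One small mitigation, sketched here under the assumption that the notebook or job can run a check at startup: fail fast when the interpreter or a pinned dependency doesn't match what the code was developed against. The function name `check_environment` and the pins in the usage note are hypothetical; `importlib.metadata` is in the standard library from Python 3.8.

```python
import sys
from importlib import metadata


def check_environment(min_python: tuple, pinned: dict) -> list:
    """Compare the running environment to expected versions.

    `pinned` maps a distribution name to the exact version expected.
    Returns a list of mismatches; an empty list means everything matches.
    """
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"python {found} is older than required {wanted}")
    for name, wanted_version in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed != wanted_version:
            problems.append(f"{name}=={installed} installed, expected {wanted_version}")
    return problems
```

Something like `check_environment((3, 8), {"numpy": "1.19.5"})` at the top of a job would then report a mismatch up front instead of failing halfway through. It doesn't replace a lock file or a container image, it just makes the failure obvious.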

We didn't have it any better with Scala. We'd run sbt and it would download the internet.

Our belief was that there are some odd behaviors in every tool and we had to figure out a way around.

This same sentiment (which I personally agree with) applies to software engineering. As in: engineers deliver more practical value than comp scientists. Now you can down-vote me to oblivion.

I think generally, Computer Science is a degree and Software Engineer is a job description. So many people get Computer Science degrees, then have a career as a Software Engineer.

Yes, there are Software Engineering degrees. But I think a minority of Software Engineers have a Software Engineering degree.

What this means in practice, is that Computer Science majors need to learn the engineering skills on the job or on their own after they graduate. Although some programs help students pick up some of those skills as part of the degree program.

Indeed, I think it's possible that the majority of people with degrees, have jobs that are not identical to their college major.

Certainly, most of us with unemployable majors. ;-)

Another phenomenon is employers applying the "engineer" title to any technical worker, such as designers, programmers, technicians, and so forth.

Part of it is that businesses evolve towards having caste systems. When this happens, then folks in the lower castes will try to rearrange their job titles to resemble the upper castes, or change jobs.

Anecdotal: University of Washington considers (considered?) them two separate degrees, holding CS as more theory and research-driven, and CE as more practice and career-driven.

I think it would apply if companies were hiring large number of computer scientists and using them to try to build usable software. I don't see many making that mistake. Most recognize that computer scientists belong in a research or academic setting.

Is this sentiment perhaps due to someone "practicing CS" on your engineering schedule? What's the real harm you're describing?

You and me both, friend. Except I normally get down-voted to oblivion for saying the opposite.

I guess depends on what valuable means. I imagine most comp scientists are less replaceable than most software engineers, so point for compsci.

The two are complementary. Engineers can't do anything without the fundamental insights scientists provide. But scientists don't have the practical experience of writing end products that real users use.

Obviously this is a huge generalization but I think it's a useful way to think about it. And when I say scientist, I mean "Professor of CS" not "24 year old with a BS in CS".

Depends if you have a computer scientist doing a software engineer's job

I agree, if anything the data engineers (folks with engineering backgrounds) should be doing the applied work while a department of data scientists works on the theoretical or novel data analysis methods.

Right now our product has accumulated a lot of technical debt on the data validation side because data scientists designed the test code in a way that dramatically slows the development process.

> novel data analysis methods

Many "data scientists" (not all, but many) have little to no ability to do anything other than apply "recipes" of algorithms or classification methods or logistic regressions, etc. Asking them to develop a "novel" method would be fruitless. Asking them to clean and scrub the source data set is like telling an amateur pie-baker the store was out of pie crusts, you'll have to make your own from scratch -- it's not going to happen, they just don't have that skill, the instructions on the box don't account for that possibility. As soon as the task diverges from the simple step 1, step 2, step 3 that they were originally taught, you realize they have very little ability to adapt. YMMV of course.

> Many "data scientists" (not all, but many) have little to no ability to do anything other than apply "recipes" of algorithms or classification methods or logistic regressions, etc.

This is because they rarely hire people with scientific thinking ability. They just hire people who can code and program from set recipes. Once you hire such people you can not expect them to do non-recipe work. If you don't want recipe work, don't hire people with recipe skills. Do not have job interviews that select for recipe people. But, that is exactly what most companies do.


Yep. The key is really software skills. If you’re unable to even filter the data yourself, you’re also unlikely to be able to implement novel analysis techniques, especially if the analysis algorithm has many complicated steps or is computationally expensive.

In all fairness, it’s basically impossible for a new grad to have those skills. 4 years of a bachelor’s in any field isn’t enough to cover such a wide area. Even for people with graduate degrees it’s a stretch.

The hope is that the 4 year degree gave you the ability to quickly pick up those skills on your own.

If your four year degree didn't give you the ability to learn and expand your knowledge on your own, it's a colossal waste of your time and money.

Sure, but depending on what you’re doing, “quick” might be years. You can get a PhD in understanding the theory, a PhD in designing fast numerical algorithms, or spend many years becoming a strong software engineer. I think the willingness to learn a diverse set of things is much more important than learning narrow areas fast. The short length of a bachelor’s usually isn’t enough to get this diversity.

>Asking them to clean and scrub the source data set...it's not going to happen, they just don't have that skill

I think you've been working with conmen/conwomen. I've never seen a data science project that doesn't involve data cleaning or wrangling of some sort.

Have you read through the comment thread? Did you read the article? Most everyone is in agreement that projects require a lot of cleaning & wrangling and a lot more -- the point is that data scientists are generally not doing that stuff, they expect academic-quality, pre-processed, pristine data, so it's data engineers who are stuck preparing the data, and who are in high demand.

Yes I read both the article and the comments.

I meant a data science project in terms of a project completed by data scientists. In my experience, all data scientists are accustomed to doing extensive cleaning etc.

This feels especially true when you have access to things like BigQuery ML.

It's very easy for an average engineer (like me) to start using ML using these tools, but a lot harder to explain how it works, or exactly which type of models to use.

In my mind a DS would be really useful to just point us in the right direction and check work. Like a super specialist QA...

>The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.

This matches my observations as well. I'm an engineer (the non-software kind) at an industrial plant, and I have noticed something similar in my involvement with data scientists.

I think in a lot of cases it needs to be acknowledged that data scientists are not domain subject matter experts. Very often the data scientists we have worked with lack knowledge I take for granted as an engineer, such as basic chemistry, physics etc. I can sanity check plant data almost instantly. For example, I will know if a material reacts in an endothermic or exothermic manner and can verify that its effect on a temperature prediction model makes sense.

As a result I often feel like Data Scientists are not empowered to bring their full expertise to bear: they don't understand our process fully and lack a lot of "engineering" knowledge needed to make value-added inferences about what their models are demonstrating. Often they can deliver a model and show that a particular term is significant, but they have a very shallow understanding of what the term actually represents and can't provide concrete recommendations as to how we could modify our plant to benefit from what their model is demonstrating.

Sometimes I feel like we need an additional translator sitting in between who can speak both "Data science" and "Engineer". I don't think this is quite what "Data Engineer" as suggested by the parent article is, but possibly the role could be expanded to incorporate this.

> you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team.

I feel seen. At a previous job, our output after some cleaning and transforming was a pg_dump for the data scientists to load. We had little visibility of what they did to that database once they got it.

I suspect in rare cases this is by design, because engineers would object to the behavior of the Business Intelligence department on ethical grounds.

If anything, we would have objected to the quality of the code they were writing.

This is a systemic problem. We ask non software engineers to write code, and then we expect them to apply a level of robustness and long term planning that even we have difficulty achieving. Not because we're being picky, but because we know the failure modes that are likely, and we know that people convince themselves that they aren't.

We've been through this with installer writers, database admins, test automation, operations people, and now 'devops' people who were supposed to be the answer to these problems. It never stops.

It's a two way street. SWE need to learn some data practices and data folks need to learn some SWE practices.

Oh absolutely. We'll build completely the wrong thing, but build it well (which just makes it all the harder to throw it away).

As a Data Scientist, I do everything you mentioned because I came from a SWE background. I think Data Scientists, even if they are only interested in the "fun" part of the job, should know how and why data are captured the way they are, so that they understand which models are better suited, which saves a lot of time.

Who will do the proper cleaning then?

The aspiration that GP was getting to was that less cleaning is required as a result of better data engineering, I believe.

Correct. If you build your instrumentation correctly, then you don't really need to do any "cleaning."

Doesn't mean you might not need to do transformation for different uses, but ideally you wouldn't need to, for example, change data types, like turning a bool into an int.

The problem is that data engineers that are geared towards analytics very very rarely control the systems that create the data. If you're lucky, you have the task of hounding a team within your company to get their data management practices in order. And the conversation there is whether they should make their job harder in order to make your job easier.

Unfortunately, data engineers rarely deal with purely in-house data. You're gonna be pulling data from a variety of data sources. I can assure you that if you're pulling from government data sources, you're gonna have a hell of a time. Speaking from direct experience, my team is probably going to spend $10M/year just trying to keep a government dataset in order, because they won't do it themselves. I'm talking lawyers, legal analysts, data engineers, data scientists, data entry personnel, etc.. just to fix data that should have never been broken in the first place.

It shouldn't be a shock that cleaning the data is the path of least resistance for many.

Hence why I said DE need to be involved as early as possible. Aspirational sure, but that's what I've seen work the best and repeatably. It's the only scalable solution IMO otherwise you're perpetually playing catch-up.

On the point about the govt I literally built a completely new contract type and civilian hiring practices for the DoD to bring in Data Engineers so they could do exactly what I describe to make your life easier.

Do data engineers have good analysis skills? Do business analysts have good engineering skills? I don't think either of them can fill the data scientist role.

The scientific training and mindset (scientific method, hypothesis, experiment setup, etc.) to even create an accurate model is an undervalued skill here no? Even if data cleaning is automated, these skills cannot be easily learned.

There is a reason why so many PhDs get into the field, because they were trained in the exploratory/research mindset that no engineering or analytics skills can fill. Correct me if I am wrong.

> Do data engineers have good analysis skills?


> Do business analysts have good engineering skills?

Depends on the analyst.

> I don't think either of them can fill the data scientist role.

> The scientific training and mindset (scientific method, hypothesis, experiment setup, etc.) to even create an accurate model is an undervalued skill here no? Even if data cleaning is automated, these skills cannot be easily learned.

It's not about replacing data scientists with data engineers, it's about both roles working together to make everything more efficient.

The hiring rate for data scientists has plateaued. The industry doesn't need any more of them. Why? Because data scientists often can't solve problems fast enough. It's a commonly quoted statistic that 70% of any data science task is data cleansing and/or etl. A data engineer's job is to take that 70% and turn it into 10%. The data engineer saves the data scientist time, meaning they can focus on what they're supposed to do -- build models.

If we only had to use 1st party data, that might be easier. But then again, if you’re building your product incrementally, you’re still going to have instrumentation holes that you may or may not be able to partially backfill.

It doesn't matter, as long as you don't make the person with the PhD in biostatistics spend their time writing ETL pipelines, which is a wildly inefficient use of a very expensive resource.

Do people with PhDs in biostatistics earn significantly more than programmers? I honestly know nothing about the market for biostatisticians, but my impression was that advanced degrees in the natural sciences don't really pay that well compared to software engineers, especially given that they're much more educated.

If they work e.g. in a hedge fund / trading firm, then - yea. And you see lots of PhDs from unrelated fields working as quants there.

Not to worry, corporate will just outsource to a firm which hires Data Janitors

In specific research areas such as biomedical science it is certainly tricky to get involved because of the data governance / confidentiality issue... so we have to do both roles to some extent

I have a dream - and it looks like this!

I can't recommend the Data Engineer career enough for junior developers. It's how I started and what I pursued for 6 years (and I would love doing it again), and I feel like it gave me such an incredible foundation for future roles :

- Actually big data (so, not something you could grep...) will trigger your code in every possible way. You quickly learn that with trillions of inputs, the probability of hitting a bug is either 0% or 100%. In turn, you quickly learn to write good tests.

- You will learn distributed processing at a macro level, which in turn enlightens your thinking at a micro level. For example, even though the orders of magnitude are different, hitting data over the network versus on disk is very much like hitting data on disk versus in cache. Except that when the difference ends up being in hours or days, you become much more sensitive to that, so it's good training for your thoughts.

- Data engineering is full of product decisions. What's often called data "cleaning" is in fact one of the important product decisions made in a company, and a data engineer will be consistently exposed to his company's product, which I think makes for great personal development.

- Data engineering is fascinating. In adtech for example, logs of where ads are displayed are an unfiltered window on the rest of humanity, for better or worse. But it definitely expands your views on what the "average" person actually does on their computer (spoiler : it's mainly watching porn...), and challenges quite a bit what you might think is "normal".

- You'll be plumbing technologies from all over the web, which might or might not be good news for you.
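The "0% or 100%" remark in the first bullet falls out of simple arithmetic. A rough sketch, assuming each input independently triggers a given bug with some tiny probability (the independence assumption and the one-in-a-billion rate are illustrative, not from the comment above):

```python
import math


def p_bug_hit(p_per_input: float, n_inputs: int) -> float:
    """Probability that at least one of n independent inputs triggers the bug.

    Valid for 0 <= p_per_input < 1. Computes 1 - (1 - p)^n via log1p/expm1
    so that very small p values are not lost to floating-point rounding.
    """
    return -math.expm1(n_inputs * math.log1p(-p_per_input))


if __name__ == "__main__":
    # Hypothetical one-in-a-billion bug at different data volumes.
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>17,} inputs -> P(hit) = {p_bug_hit(1e-9, n):.6f}")
```

At a million inputs the bug almost never shows up; at a trillion it is effectively guaranteed, unless the per-input probability is exactly zero.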

So yeah, data engineering is great ! It's not harder than other specialties for developers, but imo, it's one of the fun ones !

The other thing I'd emphasize here is dealing with "state". Data is effectively state.

As application engineers build increasingly "stateless" code (e.g. pure functions, serverless deployments, etc), that state gets pushed elsewhere. Someone has to manage the queues, file versions/locations, logs, databases, configurations and so on. That is all "data".

State management is a tricky problem even in a single-threaded application. It's doubly so in distributed systems, where state can be inconsistent between all the moving pieces. This is the source of endless data integrity issues. I think data engineering is a great way to get some exposure to all of this.

> As application engineers build increasingly "stateless" code (e.g. pure functions, serverless deployments, etc), that state gets pushed elsewhere.

Exactly. You can't magically make a stateful problem stateless, you can merely move that state around. Sometimes moving state around means moving it somewhere that is appropriate and capable of expertly handling that data. But if you make those choices wrong, it makes every aspect of your application more complex.

UI programming tried going down this path of stateless programming, and for a while it was trendy to do so with stuff like redux. The problem is that UIs are state machines. That's not an analogy, that is a literal statement. And it is true of all UIs...it's just as true of the transmission lever in your car as it is for your saas dashboard. You can't program stateless UIs...they would cease to be a UI. So at best, you can move that state around. And with most of these solutions (e.g. redux), you end up pushing that state into a massive global singleton, where even simple things like the state of a single radio button need to be fed through dozens of tightly coupled components in order to "statelessly" render. And even worse, you lose the extremely helpful distinction between UI state and domain state, mixing them both together into a gigantic shit stew.

>The other thing I'd emphasize here is dealing with "state". Data is effectively state.

It gets even more complicated. It’s not just the current state that matters, but also the history (sometimes the entire history) up to that state.

And where would you recommend someone to start a data engineering path. Any book, learning source?

The book "Designing Data-Intensive Applications" is really really good, and covers all the concepts (although not per se the tools) you need to understand.

Long time DE here. I recommend trying to build your own data warehouse around something you're interested in. Don't worry about the scaling - focus on the core engineering: taking data from different places, combining it into a sensible data model, updating it automatically every day. Add in more sources.

It's shockingly difficult, and something that only experience can teach.
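For anyone wanting a starting point, here is a minimal sketch of that exercise using only Python's built-in `sqlite3`. The two sources, the table names, and the sample rows are all made up for illustration; a real version would pull from APIs or files and run on a scheduler.

```python
import sqlite3

# Hypothetical daily extracts from two sources; in practice these would be
# API calls or file downloads.
WEATHER = [("2021-01-14", "Toronto", -3.5), ("2021-01-14", "Ottawa", -8.0)]
RIDES = [("2021-01-14", "Toronto", 1200), ("2021-01-14", "Ottawa", 310)]


def load_daily(conn, weather_rows, ride_rows):
    """Idempotently load one day's rows, so reruns do not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather ("
        "day TEXT, city TEXT, temp_c REAL, PRIMARY KEY (day, city))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rides ("
        "day TEXT, city TEXT, trips INTEGER, PRIMARY KEY (day, city))"
    )
    # INSERT OR REPLACE keyed on (day, city) overwrites instead of appending.
    conn.executemany("INSERT OR REPLACE INTO weather VALUES (?, ?, ?)", weather_rows)
    conn.executemany("INSERT OR REPLACE INTO rides VALUES (?, ?, ?)", ride_rows)
    conn.commit()


def daily_model(conn):
    """Join both sources into one row per (day, city)."""
    return conn.execute(
        "SELECT w.day, w.city, w.temp_c, r.trips "
        "FROM weather w JOIN rides r ON r.day = w.day AND r.city = w.city "
        "ORDER BY w.city"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_daily(conn, WEATHER, RIDES)
    load_daily(conn, WEATHER, RIDES)  # simulate a rerun of the daily job
    print(daily_model(conn))
```

The part worth noticing is the primary key plus `INSERT OR REPLACE`: a daily job that can be rerun safely is where a lot of the "shockingly difficult" starts.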

I have the same question and I believe the answer is in the same vein as someone who asks about software engineering. Books/courses are great for the concepts, but your goal should be to build something ASAP since that's where actual learning will come from.

Work at a company that has a good data engineering discipline. Shopify is hiring: https://www.shopify.ca/careers/2021

A lot of these are just "garden variety" (distributed) systems problems. Dealing with systems with differing latency distributions, recovering from failure, acceptable tradeoffs between speed and accuracy, etc

I wonder how: 1. one finds organizations that have data engineering 2. gets hired to said organization with software engineering background.

Nearly any field of computational science likely needs skilled data engineers. You could search for topics that interest you online and contact people accordingly.

I cold-emailed my current lab's P.I. and just asked for work. Search for "research software engineer" or "scientific computing professional" positions. Plenty of data engineering goes on in many fields (environmental science, climate modeling, high energy physics, physical chemistry, etc), and plenty of fields desperately need to develop an engineering culture (e.g., plant biology, my field), whatever interests you. Availability and compensation will vary by discipline.

Shopify is hiring 2,021 engineers (not just data engineers) in 2021: https://www.shopify.ca/careers/2021

Indeed, adtech is a great place to work for anyone interested in working with data. And yes, people working in adtech hate, and block, ads too.

Any recommendations on how to get started? Books, courses?

A couple of us inherited a machine learning project a while back. The code was horrible. Riddled with copy pasta (nearly half of the entire thing was copy paste and no code reuse). We basically refactored everything, standardized input and output file names. We put up a small Flask service to allow outside services hit it easily and wrapped it up in a Docker container so it was ultimately easy to deploy. Yes it was all the plumbing. However we also looked at the code, and the ML strategies, and while there was "some" level of competence, it was nothing more than word2vec add and divide. Totally horrible for actually finding key phrases that matter to the subject we're matching. So we started tackling that too with LSTM but our time got cut short and shifted off to another area. So not only was the "scientist" they hired completely crappy at the engineering, they weren't really helpful in the ML either.

This is obviously of lesser value to the topic at hand, and more about making sure you hire good people I think.

This is 100% my experience. I got hired as an ML engineer to bring a data scientist's models into production. I did the same as you by tearing the whole thing apart and engineering it properly. I also looked at the models, and oh boy... that data scientist had no idea what he was doing. Couldn't explain why he chose the model, didn't have any performance metrics (or even know what metric to use to measure the performance) and just generally did not understand the basic concepts of his field. I had to try really hard to drag answers out of him, but in the end I came out dissatisfied.

I am curious what your take is on things like this article:


It has been my experience too. Basically, ML / DS engineers are thrown under the bus for being poor general software engineers, but in practice it’s totally the opposite.

The problem is that ML engineers are not the people who wrote GP's garbage code. Data scientists wrote it, and I know at least a few of my very intelligent, high-functioning data scientist colleagues who are alarmingly, astoundingly bad programmers.

For me it's just the one experience since I haven't had any other interactions with an ML / DS person since.

The phrase "If you can't dazzle them with brilliance, baffle them with bullshit" comes to mind.

I always felt that tech-focused data scientists should also be required to know how to process data end-to-end; at minimum, from a SQL database to deployed model, but knowing how to collect & clean data is important too. It seems like the industry is trying to fill the gap that was created by a glut of people without math/cs backgrounds going into 5-week data science courses who then need hand-holding when they get real jobs.

Data science & engineering should be treated as a single collection of skill-sets. Lacking ETL experience is a major deficit, considering how prevalent that kind of work is.

This might just be my personal biases coming through. I consider myself a "full-stack" data scientist & engineer. But because data scientists who can work on the backends are rare, I always end up doing the plumbing while other people do the fun analysis work.

I think companies that are data "science" heavy are going to be at a huge disadvantage soon. Tools like Rekognition and Google AI APIs are making the model training & deployment aspect almost trivial. At some point, the only real work involved in this space will be the data "engineering."

> Data science & engineering should be treated as a single collection of skill-sets.

This can be tough because there could be a lot in that skill set. You can't realistically expect someone to have solid knowledge of statistics including specialising in the sub-field and type of algorithms that your product needs, and also be able to write good code and act as a developer, and also have solid knowledge of all the tools for data streaming/processing/ETL. There is a point at which you're just stretching yourself too thin if you try to do all of these at once.

Of course, stuff like knowing how to interact with a database or employing good software development practices should be a very basic prerequisite and some scientists certainly shift things too far in the other direction and use their academic knowledge as an excuse to write poor code and not learn new tools.

I guess what I'm trying to say is that they are distinct skills but you still need all of them to some extent and striking the correct balance in one's skillset is really difficult.

These are all skills taught in standard computer science programs. Granted, some are electives, like high-level stats. But even back in 2010, data science electives were available to fill the gaps. I took three DS&E classes in college with projects that were end-to-end platforms, where you'd have to collect, clean, and analyze the data, then build, test, and deploy models from it.

I would certainly hope that college courses are even more comprehensive after 10 years and an explosion in interest for the field.

Also, much like being a full stack developer, a full stack data engineer doesn't need to know everything at a master level. But that you can at least handle tasks at most points in the chain.

I teach engineers for a living. I struggle to see how this is not just a straw man argument based on colloquial usage of terms. It is just inferences drawn based on job ads that are rarely written by people doing the job and instead are effectively human-as-seo-optimized so the best candidates can find the job they hopefully fit for and not be too confused to apply for it.

It's not a straw man, I've seen it clear as day in several companies. When it comes to data science, it's "garbage in, garbage out". I've seen companies do lots of "data science" with a bunch of data scientists skilled in python and jupyter notebooks, only to discover a ton of work was useless because the incoming event data was tagged incorrectly due to a bug.

The actual process of collecting, aggregating, cleaning and verifying data is a hugely important skill, and not one I've really seen typical data scientists possess.
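To make the "verifying" part concrete, here is a minimal sketch of a check that would have flagged mis-tagged events before anyone built on them. The field names and the allowed event types are hypothetical, standing in for whatever tagging contract a real pipeline has.

```python
from collections import Counter

# Hypothetical tagging contract: event types the pipeline expects to see.
ALLOWED_EVENT_TYPES = {"page_view", "add_to_cart", "purchase"}


def validate_events(events):
    """Split events into (valid, rejected) and count rejection reasons."""
    valid, rejected, reasons = [], [], Counter()
    for event in events:
        if event.get("user_id") is None:
            reasons["missing user_id"] += 1
            rejected.append(event)
        elif event.get("type") not in ALLOWED_EVENT_TYPES:
            reasons[f"unknown type: {event.get('type')!r}"] += 1
            rejected.append(event)
        else:
            valid.append(event)
    return valid, rejected, reasons


if __name__ == "__main__":
    sample = [
        {"user_id": 1, "type": "purchase"},
        {"user_id": None, "type": "page_view"},
        {"user_id": 2, "type": "purchse"},  # typo introduced by a tagging bug
    ]
    valid, rejected, reasons = validate_events(sample)
    print(len(valid), "valid,", len(rejected), "rejected")
    for reason, count in reasons.items():
        print(f"  {reason}: {count}")
```

Tracking the rejection reasons matters as much as the filtering: a sudden spike in one reason is usually the first visible sign of an upstream bug.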

>The actual process of collecting, aggregating, cleaning and verifying data is a hugely important skill, and not one I've really seen typical data scientists possess.

Then they are not scientists. They have the label "scientist" but lack the rigor of actual science.

I don't see why changing the label to "engineer" would suddenly make them have rigor.


This is sort of the meta failure of the argument. They are arguing that people's data skillsets are wrong. To make that argument they are analyzing based on the wrong variable in a data set.

I have experienced the same thing...but I just don't think it has anything to do with whether the positions are labeled data scientist or data engineer.

And I would warn you from my experience teaching statistics to undergraduate engineers...they are not going to be much better. Regularly get 'hey we have this data what test can we run?' 'what are you trying to show?' 'we don't care we just need to run a statistical test' conversations.

To be clear, I totally agree with you. I wasn't just arguing for changing labels, I was arguing that there is one set of "engineering" focused skills (e.g. building data pipelines, data warehouses, tagging events, etc.) and a different set of analysis skills (e.g. machine learning, statistical tests, etc.) and you shouldn't over-index on the latter without having enough of the former.

I suspect this may actually be an issue of school vs real world rather than scientist vs engineer.

Data in the classroom setting is pristine and beautiful; data in the real world is messy and buggy. You have to get burned by buggy data a few times (or maybe a bunch of times) in the real world to learn to look for bad data smells -- I don't think schools effectively teach this kind of intuition, regardless of whether the students are training as data engineers or data scientists.

If data scientists are spending more time in school getting advanced degrees, they're not getting as much exposure to buggy data, whereas data engineers with a BS and a few years of industry experience would already have built up this skill.

>Data in the classroom setting is pristine and beautiful; data in the real world is messy and buggy.

I got to take over our department's undergraduate statistics course a few years back.

The first change I made was that all homework, tests, and projects use real data sets. I intentionally have them collect bad data (they don't know it's bad beforehand). First day of class we collect data using the board game Operation...I give basic instructions and then halfway through ask everyone to stop and agree on how they are entering data for the variable of 'success or failure' of the surgery. Oops...
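A small sketch of what that 'success or failure' column tends to look like once everyone has entered it their own way, and one way to reconcile it. The raw values and the mapping are invented for illustration; the real point is that anything ambiguous comes back as `None` instead of being guessed.

```python
# Hypothetical raw entries for the 'success or failure' variable.
RAW = ["success", "S", "1", "fail", "F", "0", "yes", "buzzed", ""]

# Agreed-upon mapping from each known coding to a single boolean.
CANONICAL = {
    "success": True, "s": True, "1": True, "yes": True,
    "fail": False, "failure": False, "f": False, "0": False, "no": False,
}


def normalize_outcome(value):
    """Return True/False for known codings, or None when ambiguous."""
    return CANONICAL.get(value.strip().lower())


if __name__ == "__main__":
    for raw in RAW:
        print(f"{raw!r:>10} -> {normalize_outcome(raw)}")
```

Deciding whether "buzzed" counts as a failure is exactly the kind of decision the data cannot make for you.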

In my experience teaching the course, the reason the students (engineers) find statistical reasoning hard is:

* They have never been given anything 'broken', everything is curated to avoid things not working. The result is they think data has inherent meaning. A right answer.

* Their entire learning experience has been stripped of context and the need to make decisions with information. They can give me a p value but are terrified (not unable, just unwilling) to interpret it or give it meaning.

* They have never encountered the concept of variability...everything is presented as systems with exact inputs and outputs.

When I work with postdocs, I sometimes (less frequently) encounter many of the same challenges. Data is treated as sacred and external and inherent. It's wild to me.

So I think that the delineation between the scientist working with the content, and the Engineers who actually provide the mechanics for it is very fair.

If there is a question mark here - it's really how much value are we deriving from all of these data people?

Where is all the ML that's changing our lives? Search, Alexa and TikTok, I can see it.

In the future obviously vision systems for autonomous cars etc..

But I'm really wary about the heavily decreasing marginal returns after that.

It will surely change the world, but I think in specific areas. Most of the entire field seems like an optimization on something rather than anything new.

Washing machines freed up an immense amount of labour and toil. Alexa telling me the weather does not.

I used to work at a legacy automaker and you’d be shocked at how much ML has changed certain areas of the business. It used to take an entire department to sort warranty claims and it’s now mostly automated. Aluminum part defects are now spotted automatically on the plant floor. Don’t even get me started with telematics data.

Most software isn’t consumer facing but just because you don’t see it doesn’t mean it’s not changing things around you. ML tends to be overhyped but your assessment is too pessimistic.

I wouldn't think of a system that would automate the processing of warranty claims as ML. That's mostly applying the policy/rules to each claim.

However, finding defects in aluminum parts, which involves using computer vision, would absolutely be an ML solution.

There's millions of claims and thousands of car parts with all sorts of underlying issues. Unfortunately a rule based approach isn't feasible.

Most engineering and science jobs aren't a binary as much as they are a spectrum.

If the article is trying to make a point about skill development and diversification, I'm totally on board. Bifurcating the roles instead is going to be less effective.

To the value point...my sense has been we are seeing the Webcommerce 1.0 bubble Machine Learning edition. Lots of uses of it, not all of them have value. I am excited for where we will be in 10 or 15 years, but I suspect the difference will be huge. If you put me to a guess, I would say better data handling practices and ethics will likely be the linchpins of value creation vs. using tools for the sake of tools.

The vast majority of applications of machine learning that are changing the world aren't happening on a consumer level. They're happening in factories, warehouses, farms, logistics chains, etc.

The article is so true, my latest mantra at work is “engineering is more important than data science”.

Everyone is buzzing about the latter, and few even realize what is the former.

eh...I think this can be analogized to what we already see in code...

You need architecture, you need backends, you need a front end, you need product design...all with data.

Why are computer scientists computer scientists not engineers? Why is computer science about the code side? Why did computer engineering end up being more on the hardware end of the spectrum?

Words, especially newly coined terms are pointers to meaning. That meaning is socially mediated, it is not inherent.

You're saying this (and I think the author is too) because there is a need for this group of people to look beyond titles to skillsets, and the existing titles carry linguistic baggage of the difference between science and engineering that has existed for decades.

I'm late to the comment party, but: this is classic "commoditize your complement".

This guy would have you believe that Pytorch has Solved the entire, vast field of data analysis as inherited from Newton, de Moivre, Laplace, Bayes, Fisher, Neyman, Pearson, Wald, Savage, Jaynes, Breiman, Pearl.

This is a lot like saying that photography has Solved art, and now we need people who can climb ladders and glue the posters on them big billboards. It would be delusional if it didn't have a self-interested angle.

What, we with math degrees are fully confident that the plumbing problem is easier to commoditize than the problem of making sense of data.

It’s about diminishing returns.

E.g. getting a model from 0% accuracy to 70% accuracy might be a couple Pytorch library calls that any dummy who watched Andrew Ng’s course can do. But getting that same model from 70% to 75% accuracy might be deeply mathematical and require the latest and greatest mathematicians and statisticians.

But in this hypothetical example, an engineer who stands up the 70% model and keeps it running 99.9% of the time is more valuable to the bottom line of the business than the 75% accuracy model hacked together with scripts with 50% uptime.
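The arithmetic behind that comparison, as a quick sketch. It assumes that a request arriving while the service is down simply counts as not answered correctly, which is a simplification of how a business would actually value uptime:

```python
def effective_accuracy(model_accuracy: float, uptime: float) -> float:
    """Fraction of all requests answered correctly.

    Assumes a request that arrives while the service is down counts as
    not answered correctly.
    """
    return model_accuracy * uptime


if __name__ == "__main__":
    engineered = effective_accuracy(0.70, 0.999)
    hacked = effective_accuracy(0.75, 0.50)
    print(f"70% model, 99.9% uptime: {engineered:.4f}")
    print(f"75% model, 50% uptime:   {hacked:.4f}")
```

Under that assumption the well-run 70% model answers about 69.9% of all requests correctly, versus 37.5% for the more accurate model that is down half the time.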

One annoying thing about being a generalist is that domain experts in any given area that you need familiarity with can't help but complain about how little you know about that domain, ignoring the fact that your job requires equally deep knowledge of several other domains simultaneously.

In the case of data scientists, I think the business folks that want them to understand the business domain better generally have the strongest argument, followed by the statisticians - good data scientists need to personally understand both of those things well, while the engineering and ops stuff that data scientists are also expected to do is easier to compartmentalize on other teams. So I agree that we should have more data engineers, but apparently for the opposite reason as most people in this thread.

Having to deal with data scientists, I absolutely agree. The thing that I've seen that lands in the "lab" vs production distinction is that these people expect their data to be pristine. They flip out when the world isn't as perfect as their models want. Leads to me as just a normal software developer having to do the data analysis and figure out how to clean it up.

I also end up having to be the one to talk to data vendors to understand their data feeds and essentially translate that for the data scientists. Having to sit in the middle is annoying for me and suboptimal for the business.

The data science field has been flooded with PhDs with nowhere else to go that have no background in engineering, and sadly often have a very poor understanding of both machine learning and statistics.

Companies were in a rush to hire "data scientists" and boot camps like Insight were more than happy to pump out very impressive PhDs with just enough understanding to build a Keras model.

I've worked in industry a while doing DS work and have been astounded at the number of PhDs who both don't know how to write Python that doesn't live in a notebook and throw away years of disciplined experimentation experience to just throw Keras models at data until the needle moves.

There do exist excellent data scientists out there, who are both very solid software engineers and really know their stuff mathematically, but I've found most of these people can't reliably find jobs because the people interviewing them know so little that good data scientists will be penalized for answering a stock question correctly.

The field has been so flooded with amateurs that have no idea what they're doing, that potential mentors have been driven out, and now it's just a mess. To get a job doing DS if you do know what you are doing you have to play a weird game where you guess the incorrect answer the interviewer has in mind.

Not to mention the dark pattern of giving data scientist candidates an unsolved industry problem as their interview take-home task, and then telling them to only spend 4 hours on it. Data science hiring often feels like a competition where the winner is the one who has the most free time and willingness to do other people's work without compensation.

It's kind of a fucked up field right now.

I work at a place with a very high count of PhDs. Some of them write code. All of them view writing code as something menial and unimportant and it shows in the resulting work, which in my experience is atrocious.

Of course I understand that YMMV, but I will forever be skeptical of anyone writing code with a PhD after working here.

Are they CS/EE PhDs?

Think back to your own CS professors: were any of them particularly good software engineers?

I've found that Physics PhDs tend to have the highest probability of being good coders since a certain subset of them get bit by the software bug when they need to write non-trivial amounts of code to solve research problems.

I got my physics PhD in the early 90s. Physics has had a tradition of interest in programming that goes back decades. We've always had "big data," meaning big relative to the tools available at any given moment. We ran out of problems that could be solved by pencil and paper in the 1930s.

Every physics student at my college had to take FORTRAN, plus programming was assumed in many of the other courses, and we also took an electronics course that included digital techniques. And maybe the main thing was simply that programming was interesting and fun.

We've also had a tradition of learning to do everything ourselves, for better or worse. I had no access to a professional programmer.

The ones whose work I've seen are EE.

I do an introductory Python lab course at my university. It's targeted at engineers who still create graphs from Excel and then normally level up to MATLAB, if things get complicated (think insets, ...). I guess about 30% of the people previously did at least some of the YT/Udemy "courses" on data science. It's really horrifying for me (not being an engineer myself, but imo having a relatively engineering-like mindset) to see these people horrified at simple tasks like writing a variadic function. "What do I need this for?". Well, it's using the programming environment. And then let them code up a simple version of Levenberg-Marquardt. The level of "why do I need to do this" is astonishing again...
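
For anyone wondering what those two exercises look like together, here is a minimal sketch, not the course's actual assignment. The model `exp_decay`, the helper names, the forward-difference Jacobian and the "multiply or divide the damping by 10" rule are all my own assumptions; it only needs NumPy.

```python
import numpy as np


def exp_decay(x, *params):
    """Variadic model: params is expected to be (a, b) for a * exp(-b * x)."""
    a, b = params
    return a * np.exp(-b * x)


def numerical_jacobian(f, x, p, eps=1e-6):
    """Forward-difference Jacobian of f(x, *p) with respect to the parameters p."""
    base = f(x, *p)
    jac = np.empty((x.size, p.size))
    for j in range(p.size):
        step = np.zeros_like(p)
        step[j] = eps * max(1.0, abs(p[j]))
        jac[:, j] = (f(x, *(p + step)) - base) / step[j]
    return jac


def levenberg_marquardt(f, x, y, p0, lam=1e-3, max_iter=100, tol=1e-10):
    """Least-squares fit of f(x, *p) to y with a simple damping schedule."""
    p = np.asarray(p0, dtype=float)
    residual = y - f(x, *p)
    cost = residual @ residual
    for _ in range(max_iter):
        jac = numerical_jacobian(f, x, p)
        jtj = jac.T @ jac
        grad = jac.T @ residual
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = J^T r
        delta = np.linalg.solve(jtj + lam * np.diag(np.diag(jtj)), grad)
        new_p = p + delta
        new_residual = y - f(x, *new_p)
        new_cost = new_residual @ new_residual
        if new_cost < cost:
            # Step helped: accept it and lean toward Gauss-Newton.
            improvement = cost - new_cost
            p, residual, cost = new_p, new_residual, new_cost
            lam /= 10.0
            if improvement < tol:
                break
        else:
            # Step hurt: reject it and lean toward gradient descent.
            lam *= 10.0
    return p


# Noise-free synthetic data generated from known parameters (a=2.5, b=1.3).
x = np.linspace(0.0, 4.0, 50)
y = exp_decay(x, 2.5, 1.3)
fit = levenberg_marquardt(exp_decay, x, y, p0=(1.0, 0.5))
print(fit)
```

The variadic `*params` is what lets the same fitting routine drive models with any number of parameters, which is one practical answer to "what do I need this for?". For real work you would reach for a tested library routine rather than this.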

> why do I need to do this

IMO this is the number one problem of our modern culture around education. Popular culture makes it popular to treat education as pointless, and this even affects students who are pursuing difficult degrees. "Why do I need to study humanities? Why should I learn to code if I think I am born to be someone else's boss?"

On the other hand, many teachers in K12 and early university have no ability to connect the "what" with the "why." "The curriculum is the curriculum. The test is the test."

If we can solve these problems, our societies will be much better off.

If the educator cannot explain why the knowledge is useful, then he is unfit to teach it.

> The data science field has been flooded with PhDs with nowhere else to go that have no background in engineering, and sadly often have a very poor understanding of both machine learning and statistics.

I am a PhD student in a non-engineering field. I've been taking as many math and stats courses as I can, but what other courses should I be trying to take if I want to excel as a data scientist? Software engineering CS type courses?

My question is: "Why are you pursuing a PhD if you want to end up as a data scientist?"

I've known a surprisingly large number of people that are mid-phd thinking about data science as a career. Don't pursue 5+ years of learning to master the world of academic research if your goal is to help people sell t-shirts or whatever.

Certainly there are some people pursuing specific PhDs, such as those in computer vision and NLP where there are some industry options that might offer more challenging/interesting research than academia. It makes sense if you're a PhD at NYU or Stanford in CS fields related to neural networks to go work for Yann LeCun at Facebook or Geoffrey Hinton at Google.

But if you're, say, a biologist that wants to sell clothes online... why spend 6 years working in academia to do that? Is your dream really to optimize clothing sales? If so don't be a biologist. If your dream is biology, why in the world would you set your course on selling clothes?

I get it if your dream is biology but you can't find a tenure track job and so you pivot to industry... but if you are mid-phd, what are you doing there? If you love your subject, try to find a way to work in that and if you don't, don't waste your time.

Data Science is not a glamorous job, and at the vast majority of companies it is literally bullshit. The people solving mind-bendingly hard problems are already in programs specializing in those problems because that's what they are passionate about. On top of that DS is way over-indexed at most companies. If you're mid-phd now I would expect a serious contraction in DS jobs in the next 5 years. DS will be a niche job after the next market "correction".

I mean you have to think about cause and effect here. DS will contract because many/some data scientists simply aren't good enough, and most DS just doesn't do what it is supposed to.

First, like you said, there are the stray PhDs who do it since they know research and some statistical applications. Second, there are hordes and hordes of DS people who "learned" their skill with some bootcamps or online courses, which means they know enough to write notebooks and glue together functions. Their understanding of theory is often shallow. In either case, it is hard to "blame" someone for taking an attractive job. But it isn't good for the discipline.

The appeal of DS is clear for companies. But the problems it promises to solve are much more complex than we collectively recognize - or are willing to admit. In my opinion, doing causal inference is a difficult, unsolved, and deep topic and no single course would equip you to tackle it. It takes domain knowledge and multiple years of stats/math/ML (all of them, not one of them). And yet, causal inference is what 90% of people want ML to be. A model that works on some dataset is not a model that is useful in light of the true latent DGP. Yet, when we want to sell T-Shirts, what do we really want?

Hence, when I look at the problems that ML is supposed to solve, I think that most people calling themselves DS on linkedin are not really equipped for it. And there is a case to be made that some fields where PhD researchers train to solve such causal inquiries indeed are better equipped to tackle the issue.

For example, if it's about selling shirts, I would take an econometrician with some data engineering skills over a coursera superstar any day of the week. I think if you do a PhD in ML/Stats/Biostats/Econometrics/etc., it is reasonable to pursue a career in DS. It's what statistics _is_ now.

If you have some other PhD and know some Anova, OLS and Stata - or if you have a CS background but know some Jupyter and Keras - then it's essentially a career change. It might work, but probably not without a hitch.

So I agree with you, but I'd reframe it: It's unclear to me whether we need a contraction, or whether we instead need a quality update.

I disagree with you in one point: I do not think we will make progress in DS (getting it to work in more use cases) by treating it like a solved problem, a skill like milling that needs talent and experience, but not academic education. If we do that, I think DS will contract because it will stagnate in usefulness.

My point here is not to accuse anyone of being a bad DS. I am sure there are many ways to become efficient. But even the theory of causal inference with simple linear models goes far, far beyond what I saw in ML hiring tests, online courses and so forth. And solving the problems it tackles is not accomplished by throwing more layers at it. For other ML algos, we aren't even close yet to understanding these issues on a similar level.

In the end, what we need are actual ML scientists. They should neither be pure statisticians, nor pure subject-matter experts, nor pure computer scientists - as we mostly have now. We also need more than the current ML programs that are mostly cobbled together from other areas. For example, people who publish in ML research are probably very useful in a company that has to deal with that exact problem. Any scientist knows, of course, that even a fairly adjacent question may already require tons of different knowledge. DS is, will remain, and probably should be an academic field, because there are more open than solved problems right now.

Do you have any suggestions for where to start looking for good places to apply that don't suffer from this?

I personally have given up and turned mercenary. Even if you're passionate about statistics, machine learning, or any DS-related discipline, don't think of work as being a bigger part of your identity than the average Starbucks barista does. Find a team/company that pays you well and isn't too opinionated, with low ego (if possible). Don't look for challenging work: the few places it exists already have the people they need, whereas the companies that pretend they have challenging problems tend to be insufferable. Look for a team where you can check in and check out without too much stress and get paid well.

Just want to say that while the data science profession definitely includes a wide range of people and skillsets, a good data scientist should be practical and able to work with the available data in whatever state it's in.

No good data scientist should ever expect data to be pristine. And a good data scientist, even if they don't have quite the engineering chops necessary to build a production-quality ETL, should know enough about the process to help guide it. If they aren't a part of that process, they're not being a good DS. They can't expect someone not involved with their problem to know what tradeoffs to make, and if they don't know exactly how their data went from raw form to the ETL-ed form, they're probably going to make bad assumptions, and those assumptions may very well make their architected solution a complete pile of garbage. Not to mention, how can a DS offer suggestions for solutions if they aren't deeply familiar with the raw data that's available?

To me, a good data scientist should, at bare minimum, have several skills.

* They should first and foremost (but not solely) be an in house expert in statistics and machine learning to know what can be done with data, and what can't be done with data. They should arrive with that knowledge. Engineers I think have a tendency to trivialize this, but true expertise in this domain comes only with years of experience.

* They should strive to find modeling solutions that are right for a particular business problem. If they seem to be only applying the hottest research regardless of the tradeoffs for the particular business problem, that's a red flag.

* Their focus should be on integrating themselves with the product/business as much as possible, and with the engineering team as much as possible. If they're expecting to be handed directives, that's a recipe for a ton of wasted time.

DS should never, ever be siloed into their own little DS world. They will be useless without a deeply intimate knowledge of the business goals, the needs of product, and the capabilities of the engineering team.

As they progress, they should become more and more "full-stack", otherwise they are stagnating.

A good data scientist should also be good at science. Otherwise, you can simply hire people with engineering skills - you don't need scientists. If you hire scientists and then are surprised they aren't good at engineering, the hiring process needs a reality check.

Statistics is a science as well. Unfortunately it’s overloaded in business terms and can mean anything from “knows means and regressions” to “has a copy of _Meyn and Tweedie_ on their shelf”.

Instead of sneering at "having to deal with" data scientists, consider that the data scientists themselves would often much rather have data engineers and dev ops people involved in the process.

Data scientists like to quip that 80% of the job is data cleaning, with the remaining 20% divided up arbitrarily among other tasks as suited the joke. In some shops nowadays, it's more like 45% data cleaning, 45% data engineering/ops/programming just trying to make your results available to the rest of your organization, and 10% research.

If I can spend less time learning/doing software engineering and devops and more time doing actual data science, that's great. At a previous job, my team was clamoring for more data engineer hiring, and part of the reason our projects were slipping and starting to fail was lack of data engineering support. Our tooling was shit, our processes were shit, our code was shit, and access to (and trust of) our data sources was especially wet and stinky shit.

It made the daily work of doing data science a miserable slog of ad-hoc duct-tape solutions, and it contributed to us being generally ineffective as a team.

All of this would have been fixed if we had one competent data engineer with some actual real-world data/ML engineering experience and good communication/advocacy skills. Let alone two or three!

If the DE tooling was shit and you couldn't hire more fast enough, why didn't your team members start addressing these problems? Surely spending half the time cleaning up the pipes would increase the value of what you do with the other half?

This implies a lack of rigorous training. In the physical sciences, one wouldn't become an applied scientist without conducting an experiment to test a phenomenon, and the teeth gnashing that goes with making that experiment work.

Those who have been fed pristine data without having to undergo the trials and tribulations of actually having to collect the data have missed a crucial part of scientific training. Like you, I find this lack of rigour is rather common among data scientists. Not all, but quite a few.

That's what I was wondering reading this thread. Much of science is dirty work in other fields and I think that is a good thing.

How ridiculous to assume that a scientist doesn't clean their tools and set up their experiments.

(Surely as one gets more experienced and older, the job likely becomes less manual, more about teaching and coordinating.)

I think that this is more of a problem with the specific people that you have worked with and it isn't inherent to the role of a data scientist.

It’s becoming more inherent, especially as the field is populated with people who have no experience with the “science” part. That is, with the very real and ubiquitous problem of collecting and cleaning data to make it fit for scientific study. Even theoretical physicists, for example, participate in and rely on empirical data collection, and understand deeply how messy and fraught with error it is.

I don’t see the same appreciation or consideration in general in the field of data “science.”

I remember working with someone who has a PhD in physics and who worked at CERN, and one comment I loved was: "a key skill is knowing how to place the legend, so it obscures that annoying outlier data point".

> with people who have no experience with the “science” part

It's interesting that you put it that way because a lot of the other complaints in this thread are that the people who expect their data to be ready for use are exactly the people with science experience but without the relevant technical background.

Doesn't sound like a modern Data Scientist, sounds more like a statistician with 30+ years of experience.

Genuine question: why is there so much pure teeming hatred for data scientists in this comment thread? Almost every comment comes off as full of snark and vitriol against data scientists.

I'm guessing just venting personal frustrations due to their own experiences, plus maybe poor hiring and guidance of data scientists in their own teams? I have definitely seen DS people in my experience that fit some of the descriptions here, but I think it's a mistake to trivialize the DS position itself. A good DS is a valuable asset, but depending on your company/data, maybe not worth the cost. Plus there is no "single" DS candidate or role, a lot of these roles (data engineer, DS, analyst, swe) blend together at times, and it's about finding the right balance of skills.

Sometimes I think a company (not having the DS experience themselves) mistakenly over-hires DS roles in today's hype of "AI" when their data is mostly run-of-the-mill and only requires simple linear models that can be architected and understood by a stats/math-savvy engineer. Even then, a good DS is still useful (even linear models can be complex: e.g. what priors do you want to use? Do you want a multi-task solution? etc.), but maybe not worth the cost.

It tends to happen any time something becomes trendy. Data science has/had a lot of hype over the last decade, and people seem to have an inherent tendency to want trendy things to fail. Combined with that, you have a bunch of people hopping on the data science bandwagon, so you get a lot of grifters, snake oil salesmen, or simply individuals whose output is poor quality. Seems to have created a feedback loop, where there is always a new example of some AI solution failing, or a data science initiative that didn't work out that everyone can point at and say "See! I always knew this trend was dumb!".

Reality is that data science is here to stay. It's coming out of the honeymoon period, and things may never be as hyped up as they have been the last decade, but that's probably a good thing for the field. Everyone will probably move on to hating the next up and coming thing. I have a hunch it could be something in data engineering because, while not exactly new, it is absolutely the next "data science" in terms of demand, and with products like Snowflake having so much hype behind them, it seems the backlash will be inevitable.

I think a lot of people feel like data scientists get all the credit and the fun work, while we have to do all the heavy lifting and boring stuff that they need. There's this idea that data science is a special and unique skillset that couldn't possibly be possessed by a simple software engineer. Despite being largely a subdomain of CS.

I remember an era before "data scientist" was a job title. When we (programmers) would analyze data to see if we had enough information available to identify the problem, if not, fix that, then come up with a strategy to solve it, test, and finally deploy the model. The fun part was trying different solutions and analyzing the data. It also felt awesome to deploy a product that worked like "magic." Product owners didn't know or care what a neural net was, they were just happy it worked.

Now there are tons of data scientists out there who take the easy, fun, rewarding work and try to skip over the nitty gritty implementation details. Then management thinks engineering is incapable of doing such work, and the only time we get the opportunity to do something fun is to do so behind the scenes.

I would guess that it's a reaction to job title hype.

There's a huge variety in DS responsibility and background between companies.

Probably just people blowing off steam and the target du jour is data scientists. If people had to sit down with their company's data scientists to air their grievances face to face I doubt they would be so condescending.

Yes, the tide is turning now... who came up with the term "data scientist" anyway? It's a made up profession. If you need someone who understands statistics, get a statistician, or maybe a mathematician. If you need someone that designs and writes computer programs, get a computer programmer. But a "data scientist"? No, thanks.


I came to this thread interested in the discussion, but I feel now like the Homer Simpson meme retreating into the hedge.

Maybe I'll come back in a few hours, but for now I'll stay away.

My view is from a small startup with little to no room for single purpose employees.

When I first started hiring and working with data scientists my view was this: If you can only manipulate data and run it through pipelines to generate models then you can't do enough to be highly valuable. You either need to have a strong enough background in CS to build the pipelines / tools or a strong enough mathematics background to be able to propose cutting edge new ideas. From my experience it is hard to find someone who has one of these skills just from a University "data science" program. At a small company (at least ones that I have worked with) being only proficient in R and basic Python isn't enough. That being said, I have met a handful of data scientists who were very smart and self-motivated enough to pick up the lacking skills when given the chance.

My question to HN is this: are there roles at these larger companies for a Data Scientist who primarily just crunches data in R and Python without the ability to actually build the pipelines / tools or conduct research?

I would be cautious about that. I've worked in the startup space for over 10 years now as a data scientist, often the first one hired on, working on the pipes.

From my experience, there are two types of data scientists who do infrastructure work: 1) Those who do not make the best data scientists because their skill set is too far in engineering land, leaving them weak where it counts. If the startup is relying on the data scientist to be profitable, I'd be cautious with these types. or 2) Someone who is senior, beyond senior really, who has worked both jobs, and doesn't mind doing both jobs. This unicorn is so rare it is mythical. The joke when the terminology was created is they're so rare no one has ever seen one, hence unicorn.

Me, I can not do the work I need to do if I'm on call. That is where I draw the line. That means hiring someone to monitor the infrastructure. Furthermore, I'm an okay architect, but you really do want to hire a specialist if you can help it for that. Do I help them with the infrastructure? Absolutely, but they're on call if a server is on fire. They have the admin login credentials, not me.

I get wearing multiple hats, but keep in mind to be a data scientist you're already wearing multiple hats. Being a data scientist is like double majoring and getting a phd. At what point are they stretched too thin? The consensus in the industry is they're already stretched too thin and should be broken up into different specialized roles.

> My question to HN is this: are there roles at these larger companies for a Data Scientist who primarily just crunches data in R and Python without the ability to actually build the pipelines / tools or conduct research?

That is the standard role, even at startups. However, the industry consensus these days is data scientists should have more responsibility when it comes to deploying models than previous standards.[1] So data scientists are being pushed in a more engineering direction, not with hosting sql servers and infrastructure, but with working with engineers to make sure the models are monitored properly. This change comes from model deployment being further automated as time goes on, making it easier for the data scientist to have more responsibility during this stage.

[1] source: https://www.dominodatalab.com/static/gfx/uploads/domino-mana... page 9. Suboptimal organization and incentive structures.

Thanks for the feedback! Seems like you and I both have had a bit of experience being first engineering hires at startups but have had very different experiences when it comes to the roles of a data scientist. I appreciate that.

Np. There is a common trend in the industry where a company hires on a data scientist, doesn't know the data prerequisites (specifically labeled data), the data scientist struggles, after a while the company fires the data scientist. This leaves the company with a bad taste in their mouth. In recent years I tend to get hired on as a specialist to help fix this. (And yes, I've been the first engineer hired on too.)

What's interesting is they tend to struggle in two different ways: 1) The data scientist that is gung ho about infrastructure work, jumps in, and then ends up doing a bad job, because it's not their strength. They end up getting let go for not being ideal at that work. 2) The data scientist who struggles with the idea of infrastructure work at all, jumps into other roles they're good at like data analyst work, helps the company in that way, but ultimately because they did not push to get an infrastructure engineer hired, they end up let go as well.

Me, I go out of my way to get an infrastructure engineer / data engineer hired early on. Also, I have worked as an engineer, so I tend to do a lot of the "hard" stuff most software engineers struggle with early on, if applicable. Eg, at one job I wrote a compression format to reduce battery drain on our devices that were collecting data.

Most data scientists struggle when it comes to CS/engineering skills (four-fifths of them), so it's not uncommon for them early on, while the pipes are being built, to do data analyst and BI work. BI work to automate reports, which management loves, and DA work to show some amazing future service the company might be able to provide to its customers. It's selling the sun and the moon really, but it gets management inspired, and helps them know what data to collect. It's not unheard of to need a minimum of two years of collected data before building a model that can be deployed becomes feasible. This can be hard on the data scientist, because there is a lot of down time before that. Many get fired during this time even when they're doing a good job. They have to wear multiple hats, but it's analyst roles (like BI work). Technically a data scientist is a kind of analyst, not engineer, so it makes sense that wearing multiple hats for them tilts in the analyst direction, not the engineering direction.

I've been writing code since I was 8 years old, so I'm one of the unusual ones that tilts in the engineering direction, but I think it is unreasonable to expect that from the average data scientist. Let them do what they do best, and hire someone else who can round everything out and you'll be in a good place. Unicorns aside, you'll need a minimum of two professionals for a data project to succeed.

Thank you for your comments! They are very insightful. To piggyback a bit:

Assuming you are a competent data "analyst" who wants to become a data engineer, how would you go about it? Is "go back to school and get a CS degree" the answer? I suppose this question is very broad, but I am curious if a practitioner like you has an opinion.


To give some context:

I recently graduated with a STEM PhD, and looking to move into data science. Reading the comments, I feel like I fall into the "pointless data scientist" cohort derided in this thread. Eg: I am very comfortable doing typical analytical work & occasionally training models inside a notebook, but I am neither a cutting-edge theoretical statistician nor a data engineer.

I've been trying to improve on the engineering side. For example, I did a project recently where I set up a rudimentary pipeline that continuously pings an API, uploads the data to a cloud database, then serves up the analysis via a Flask app. For me this was a big step up from just doing notebooks on a csv file :)
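
For readers who have not built one of these, the shape of such a pipeline is roughly the following. This is a stdlib-only sketch under loud assumptions, not the poster's project: `sqlite3` stands in for the cloud database, `fake_fetch` stands in for the real API call, and the `readings` table and all function names are made up for illustration. In a Flask app, `summarize` would be what a route handler returns.

```python
import json
import sqlite3
import time
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) readings table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (ts REAL NOT NULL, value REAL NOT NULL)"
    )
    conn.commit()


def poll_once(conn: sqlite3.Connection, fetch: Callable[[], str]) -> int:
    """Fetch one JSON payload, store its rows, and return how many were stored."""
    payload = json.loads(fetch())
    rows = [(float(r["ts"]), float(r["value"])) for r in payload]
    conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def summarize(conn: sqlite3.Connection) -> dict:
    """The 'analysis' step: a row count and mean a web endpoint could serve."""
    count, mean = conn.execute(
        "SELECT COUNT(*), AVG(value) FROM readings"
    ).fetchone()
    return {"count": count, "mean": mean}


def fake_fetch() -> str:
    """Stand-in for an HTTP call to the real API (assumption for illustration)."""
    now = time.time()
    return json.dumps([{"ts": now, "value": 1.0}, {"ts": now + 1, "value": 3.0}])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    poll_once(conn, fake_fetch)  # in production this would run on a schedule
    print(summarize(conn))
```

Passing `fetch` in as a callable keeps the network out of the storage and analysis code, which makes each piece testable on its own; that separation is a large part of what distinguishes a pipeline from a notebook.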

But moving beyond the basics, I am not sure what to study next. Hence my question. If you have any suggestions, I would greatly appreciate it!

We have colleagues who are similar to you, PhD. What we ended up doing is building an internal machine learning platform to reduce the number of "taps on the shoulder". They had trouble with setting up the environments, dealing with libraries, systems dependencies, etc. In addition to that, they relied on others to get data, fix their environment, deploy their models, or showcase their work to clients.

It wasn't optimal because we were having bottlenecks and variance: some people could move through the stack and do it all, but you either had them or you had to train them and it took time.

- [0]: https://iko.ai

I'm confused, do you want to become a data engineer or a data scientist? A data scientist is a type of senior data analyst. Data engineering is farther from data science than data analysis is.

I'm going to answer both questions just in case:

To become a data engineer / infrastructure engineer, there are multiple paths forward. I recommend doing BI work aka Business Intelligence Analyst Engineer. It's typically Tableau related work, so making dashboards and reports for management. It's still a data related role so you should feel comfortable and at home. However, it is also an engineering role. If you're the first BI at a company you'll often find yourself setting up an SQL server and doing certain data engineer-light type work to get data into the server. You'll need to set all of this up, so BI is a blend of data engineering and data analyst type work.

Once you've gotten familiar with BI work it's very easy to transfer to data engineer / infrastructure engineer type work. This is especially true if you end up setting up a data warehouse (as an alternative to MySQL) or data lake on AWS to do your BI work. You don't have to, but if you go that far, you're pretty much doing data engineering at that point. The line between the two is fuzzy. Data engineers and infrastructure engineers are expected to be architects, and by that I mean they are expected to future proof the schema of the SQL server / data warehouse (future proof setting up the database for new data so it doesn't become a mess). A BI is not expected to be an architect and imo the only way to gain that skill is through first hand experience playing with databases, so a BI is a good way to get that experience.

At my current company the infrastructure engineers are expected to do BI work, because one of the data engineers I currently work with was a BI at his previous job. (He was on the sales team, helping them with more than just dashboards, like helping with their Excel spreadsheet algorithms and whatnot.) This is unusual, as data scientists typically do it (roughly 60% of data scientists do BI work).

I'm sure others could paint another path forward. Data engineers are highly in demand so it could be as simple as applying. If you can pass a white board interview (leetcode style interview) you can skip this step and dive right on in. Just like any technical white collar job, you're expected to self-learn what is required for the job before going in, so absolutely read guides / take classes / read books / etc on the topic to learn more.


To get a job as a data scientist:

BI work is a good bridge here too, not for learning to set up databases but for learning to build dashboards. Around 60% of data scientists in the industry do BI work. Me, I've had to create internal dashboards for diagnosing problems, which stream in live data and display it visually. This is not BI work, but there is clearly a bridge between BI dashboards and internal diagnostics dashboards.

Technically a data scientist is a kind of data analyst so many people go from data analyst directly to data scientist. Around 30% or so of data scientists do only data analyst work but have the data science title. (This 30% number is a bit of an estimate.) It's that strong of an overlap.

Data scientists tend to specialize. There are sales data scientists, marketing data scientists, engineering data scientists, ops data scientists, and so on. Often, but not always, they sit on the team they specialize in instead of on a data science / data analyst team. Smaller companies tend to hire a single data scientist and expect them to do one kind of role. So it comes down to what kind of data science work you want to do. Sales data science roles tend to be BI heavy. Marketing data science roles tend to be data analyst heavy. Engineering data science roles tend to be the heavy model-building roles that are the most challenging of the bunch. Ops data scientists tend to specialize in malware detection and self-reporting. E.g., if someone is hacking the company's servers, they might get an alert, and then they analyze it and report on it. There are other kinds of data scientists too, like ones at supermarket companies and restaurants who specialize in forecasting warehouse items.

Me, I'm a specialist that specializes in robotics and sensor analysis. I'm not going to lie, it's probably the hardest out of every kind of data science role. It's very heavy on the engineering side, not data engineering, but software engineering, because there is a lot of advanced feature engineering.

Most feature engineering is simple stuff like deleting missing values, taking the median over the dataset, or other kinds of cleaning and minor modifications like normalizing the data. Then it gets fed into an ML library that identifies the pattern in the data, so when new data comes in it can tell whether it recognizes that pattern. Each pattern is called a category and most ML work is categorization, so maybe you're categorizing different kinds of customers, and if you can identify a pattern in their shopping habits, you might be able to predict what they will buy next.
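
A minimal sketch of the "simple stuff" described above, using only the Python standard library. The toy customer rows, the column meanings, and the function names are all made up for illustration, and reading the "median" step as median imputation is an assumption. A real project would more likely use pandas or scikit-learn for the same steps.

```python
from statistics import median

# Hypothetical toy data: (age, monthly_spend) per customer; None marks a missing value.
rows = [(25, 120.0), (31, None), (None, 80.0), (47, 200.0), (38, 160.0)]

def drop_missing(rows):
    """Keep only rows where every field is present."""
    return [r for r in rows if all(v is not None for v in r)]

def impute_median(rows):
    """Replace each missing value with the median of its column."""
    cols = list(zip(*rows))
    meds = [median(v for v in col if v is not None) for col in cols]
    return [tuple(m if v is None else v for v, m in zip(r, meds)) for r in rows]

def min_max_normalize(rows):
    """Rescale every column to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(r, lows, highs))
        for r in rows
    ]

cleaned = drop_missing(rows)            # option 1: throw incomplete rows away
imputed = impute_median(rows)           # option 2: fill gaps with the column median
features = min_max_normalize(imputed)   # scaled values ready for an ML library
```

Dropping rows and imputing are alternatives rather than steps to chain, which is why the sketch keeps them as separate functions.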

Advanced feature engineering might need to be used when your patterns are so complex ML can't pattern match it well, so you have to give it a helping hand and manually do some of the pattern matching. I've also had to invent new forms of ML too, but it's been a while since I've had to go that far. What I do is the farthest from normal for data science.
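
To make "giving it a helping hand" a bit more concrete, here is a generic sketch of a hand-built feature for sensor data. The comment does not say which features the author actually builds, so the windowed RMS and peak-to-peak summary, the `window_features` name, and the sample signal are all assumptions chosen for illustration.

```python
import math

def window_features(samples, size):
    """Summarize non-overlapping windows of a raw signal as (rms, peak_to_peak)."""
    feats = []
    for start in range(0, len(samples) - size + 1, size):
        w = samples[start:start + size]
        rms = math.sqrt(sum(x * x for x in w) / size)
        feats.append((rms, max(w) - min(w)))
    return feats

# Hypothetical signal: a quiet stretch followed by a strong vibration.
signal = [0.0, 0.1, -0.1, 0.0, 3.0, -3.0, 3.0, -3.0]
features = window_features(signal, 4)
```

The idea is that the model then sees two numbers per window that already separate "quiet" from "vibrating", instead of having to discover that from raw samples.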

Most data scientists do not know advanced feature engineering, but it's one of the bridges between software engineering and data science, so leveling up software engineering can help on that front. (Which is also why I bring it up.)

A data scientist shouldn't be expected to have much, if any, data engineering skill. Instead, learning to manage upward helps. Knowing how to write a SQL query to get data and how to write a join is enough. You should do fine; just focus on learning data science itself instead of data engineering, unless you're curious. (A lot of universities and bootcamps teach machine learning engineer skills and call it data science. If it doesn't have data cleaning and feature engineering, it's probably not data science. Likewise, if it has TensorFlow or PyTorch in it, it's not data science.)
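
As a rough picture of the "query plus a join" level mentioned above, here is a small self-contained sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical; the point is only the shape of the query.

```python
import sqlite3

# In-memory database with two made-up tables, just for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0);
""")

# Join each order to its customer and total the spend per customer.
query = """
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
totals = conn.execute(query).fetchall()
conn.close()
```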

Thank you for the detailed response. It makes things a lot clearer.

I notice that you mention "Machine Learning Engineer" as a separate role. If, in the idealized world, data scientists do analytics and train models, and data engineers take care of data, then what do Machine Learning Engineers do? Are they, basically, software engineers who specialize in putting other people's models into production?

And you are right in sensing my confusion. There seems to be an abundance of data-related titles, which seem to overlap in their functions a lot, but are also very different when you examine them closely. So thank you again for your responses, they are very helpful.

>Are they, basically, software engineers who specialize in putting other peoples' models into production?

It depends on the company. Traditionally, yes, but deployment into production can be automated, so typically today it is something different.

An MLE is someone who specializes in TensorFlow or PyTorch. They write deep neural networks, reinforcement learning, and more. Often the data scientist will make a model, specializing in feature engineering and domain experience, and use a generic ML model like a generic DNN or xgboost or whatever it may be. It then gets handed off to an MLE, who writes ML specific to the problem to get every last drop of accuracy out of the model. They then hand it off to prod. I don't think they're on call (I could be wrong on this), so today they're not really deploying models much. They're more of an in-between.

I work at small companies and startups so I've never worked with an MLE, but I do have friends who are managers at Google who told me about it, so that's where this information is coming from, telephone game. In other words, I'd take this with a grain of salt. ymmv.

Starting in 2018, big-name companies couldn't get enough MLEs, and MLEs are paid more than data scientists, but many bootcamps and universities center around ML skills, so companies started renaming MLE positions to DS positions. This way they get more applicants and pay them less. Win-win for them. Too bad it messes up the industry. Today about 1 in 3 data science jobs are ML heavy. They may be MLE exclusive, or a multiple-hats hybrid somewhere between light DS and light MLE.

You can identify which is which if they give you a white board coding problem. Traditional data science work will never have a white board problem.

>So thank you again for your responses, they are very helpful.

You're very welcome. I hope it helps.

There are certainly roles out there for a Data Scientist who just crunches numbers. A good friend of mine does exactly that for a large traditional retail corporation. Just by using standard ML tools he replaced a whole team of analysts for pricing items. Maybe not in cutting edge tech companies, but roles like that are all over the economy still.
