Yes, I do admit there can be some specialization in terms of time spent on science vs engineering.
But you really need people who understand both. Particularly if you have a strategist who thinks his job is just to dream up profitable models, he ends up carving that role out in a way that's detrimental to the rest of the team. You get people who just don't appreciate that there's other work to do than finding models, and that models depend on that other work to function.
You also get a huge prestige gap, because inevitably management will think that there's a magician and a blacksmith. One guy needs to be paid a lot, and the other guy needs to be paid enough.
These effects feed each other. Magician will say "where's my data" and expect blacksmith to make it, promptly. He won't do it himself, because spending time on mundane stuff makes the magic disappear. And not doing it yourself, or taking the time to understand it, will eventually lead to problems with the magic.
My god, this. These people make me bonkers. Especially because I feel like I have a bit of this tendency myself, the desire just to think big thoughts and do no actual work. Happily, I long ago learned that ideas were approximately worthless without labor, and that I anyway had much better ideas when laboring because it forced me to engage with the details.
And yes, those people can poison a team. My best working experiences have all been with people who a) all valued actual work and b) believed that everybody could have good ideas.
Implementation is a long, hard road. It's where you learn your idea was vague enough that it had almost no value, and that only through painstaking iteration can you turn it into something with value.
I doubt this.
Consider: if it were easy to bring an idea into this world, and the hard part were the initial thought, then writing a paper or article with painstaking rigor would be unnecessary.
Writing a book would be a breeze and no author would ever go through more than a single draft.
The idea would have been born beforehand, complete, correct, and perfect, so putting it all down in words would just be a matter of transcription.
An organization would not usually hire engineers with multiple degrees, but simply writers or an automated system that would listen and transcribe the idea.
An idea is truly born and comes to exist through a lot of effort, iteration, redefinition, and refinement.
P.S. There is a hard and subjective question of where the line is drawn between ideation and the minor, uninteresting, menial maintenance / get-the-money-in-the-bank work.
We have to recall, however, that a parent brings their child to life and tags along through all the effort and work. A child is the result of years of unpleasant work at every level, high and low. Drawing an arbitrary line for when you shall stop giving as a parent is naive and egotistical.
I also think I have a different definition of 'idea' than most people here. It is common, especially in software development, to treat the way as the destination and incremental change as development driven by some magical, happy-ending evolutionary process. This includes a quite unsubstantiated belief in getting the right ideas automagically along the way.
You need both things. A good idea does not descend from heaven; it has a history in the person creating it. That history is hard work in its own right, an inner fight against the common and against the environment. That jump out of the box is something all these pure implementers are unable to make.
I have been working in the industry for 25 years. I have written real code from the beginning. But I am also a mathematician and can say I had a few good ideas along the way. I am proud of that.
If it's greater than zero, let me know. I have notebooks full of them. Business ideas. Project ideas. Political ideas. Social ideas. I generally can't give 'em away, much less sell them. Why? Everybody has their own ideas, and they like 'em better. And the ones in my notebooks don't have what really matters: validation.
To demonstrate: try to imagine a new color you've never seen before that's not in any way associated with any of the colors that you've seen. (It's impossible.)
Or think of it this way: you could explain a car to someone who has never seen a car before, but only so long as the ideas used to explain the idea of a car (e.g. wheels, doors, windows) are already familiar to the other person. If those ideas weren't familiar, such as a wheel, you'd need to also explain what a wheel is. And if the concepts used to explain the wheel weren't familiar (and so on), you'd eventually hit a point where you must expose the idea(s) directly to their senses (e.g. show them), otherwise they will never understand what you're talking about.
So all ideas come from the senses, and from your mind's ability to combine these "pure" ideas that you've sensed. "Pure" ideas are cheap and easy because they're the simplest ideas - they're what you directly sensed. To have good ideas, you need to combine many "pure" ideas together, hence why those who have experience working closely and thoroughly with something will often have the best ideas associated with that something...
Idea people who just blurt out thoughts that go nowhere are all over the place, but there are a few who convince others that their idea is useful. They don't necessarily need to do the work to make their ideas a reality, but they need to convince others that their ideas have value.
We need both dreamers and workers.
If you are an idea person figure out how to get others to believe in what you are dreaming and the idea will become a reality.
Think Steve Jobs, he was the idea guy that made his dreams a reality. People want to believe that he was some kind of super engineer or programmer but he was the one that was able to get all the super engineers to do their best to develop his ideas.
What really bothers me, though, is not the randos with this attitude. It's that some of them will grab enough money or power that they'll be able to live out their fantasy. And woe be unto any who hop on board. Quibi being the latest big example.
1) They probably aren't going to produce good models since they're not sensitive to data nuances, but now they've taken over ALL the modeling work.
2) They bring down the job satisfaction of everyone else on the team who would like to be doing at least some modeling.
3) They're sucking up the prestige that should be distributed over the entire team, and management thinks they should be paid more for work that, it turns out, everybody thinks is more fun anyway.
My number one advice to entry level data scientists is to not be this guy. Don't give your interviewers the impression that you won't do your own engineering work because they won't want someone who brings negative value to the team.
I love your post; I agree with your post; but it takes a 90 degree turn at the end:
"My number one advice to entry level data scientists is to not be this guy. "
Everything most people are saying here indicates it's GREAT to be that guy. You're paid, you're respected, you get the fun parts, you love your job and it's pretty safe. It just happens to suck for everybody else, including the team and the business... but it feels that, in a practical sense, the gist of everybody's actual unwitting message is "BE that guy, if you can" :-<<<
In every other place, your job is on the line to be erased, because people will soon realize no one wants a wise-ass who doesn't actually contribute much to the bottom line in the end.
Depending on the work environment, it's not a stretch to see software engineers complaining to management, sometimes going as far as to create rumors to get the junior data scientist fired.
So, no the grass is not greener. It's best to not be that person. This is why I go out of my way to prevent that scenario when I lead a team.
You just get seen as the product owner/project manager.
I tend to be seen as a product lead / owner / stakeholder, so I feel like I'm being called out. lol
I think one difference is the software engineers see me as someone who is helping them by making their life easier. I'm not just throwing work at them blindly. I'm working with them. Also, they like it when I include them in the data science brainstorming sessions to solve difficult problems. I guess it's seen as exotic or something, but whatever the reason, they really love to be a part of it.
My personal data points are from folks on the buy side. Trading margins have been trending downward for years.
On the quant side bonuses are distributed to the team.
Companies understand that you can't hire five of that guy and get things done. If you have 5-8 years of experience as a technical product manager/data science combo then you are very happy as the magician. But very few magicians are being hired out of college, and a lot of "software engineers in data"
I went into DE because I was kind of forced into the space, but I'd strongly prefer doing full-stack DE. These days I still have the opportunity to build models; they just aren't client-facing stuff, but instead are kind of Data Plumber Bots that help me do my job better so I can waste more time building other fun bots that I can't otherwise be paid for.
Seems like a waste of resources, but my manager could have another DS tomorrow, whereas my role would take months to fill.
This process is horrible, and not just because it doubles the work, but because it introduces bugs. When the version up in the cloud does not work as intended, is it a bug in productionizing or is it in the original model? Fixing bugs in this space can take longer than the initial model development and the initial productionization. Many companies have failed over this.
So what's the solution? In recent years the industry has turned to deployment over productionization. The idea is you deploy the model to the cloud directly. Both engineers and scientists work together on the process. The scientist defines what cells in the notebook get called for the final algorithm (as there are EDA / plotting cells and documentation cells too). The engineer sets up the amazon IO stuff, database login stuff, and monitoring services. The scientist works with them to create tests and what to monitor so they get notified if there is a problem with the service.
No more mystery bugs. The model gets directly deployed, the work load is minimal, and it brings people together. The downside is often the engineers and scientists are on different teams, and sometimes companies will not let them merge for a while, so it becomes a telephone game instead of everyone feeling like they're on the same team working together. imo moving the scientist to the engineering team during this time can be helpful, or moving the engineer to the data team.
Some companies have services where entire notebooks get put up into the cloud and all of it gets called, so the scientist has to write the notebook in a way that works for the cloud. It's rarer, but how I prefer it is that a wrapper .py file is created that calls just the relevant parts of the notebook, kind of like a header file. This process works well for me, but as far as I know it is not standardized in the industry yet.
In short, if you end up in this situation, there is a better way. Import the notebook into a .py file or into the cloud, don't rewrite it. This (hopefully) will remove this scenario you're describing (comment this is replying to) so those issues will become a historical footnote.
The way I view it is friction and impedance mismatch. People lived in several universes and there were many "taps on shoulders". A data scientist tries to work on a project, but the system upgrade messed up their compute environment and their GPU isn't working anymore. Data scientists ssh'ing into a "powerful workstation" to have their notebooks run with more RAM or more powerful GPUs, having a certain convention to start their notebook servers with specific ports.
Building models, then wanting to show results to the client, so asking a colleague: set up a VM on GCP, write a small application, scp the model to the machine, create an environment with the same dependencies to load the model, set up authentication on the machine. Email the client. The client doesn't reply in time. You end up with a bunch of VMs.
Meanwhile the data scientist has produced another model with a notebook and they want the engineer to deploy it. Others want to reproduce it but have the same trouble with running the notebook (libraries, etc.).
A complete mess. We ended up building our own platform. We wanted our PhDs to do what they were good at, and we wanted to handle a lot for them. At the same time, we wanted our more engineering-inclined colleagues not to do that work themselves, and we let the platform do many of these things (building images, deploying, scheduling notebooks, etc).
https://iko.ai
This is great for making a dashboard, a report, or some other kind of analytics, but when it comes to a service the customer uses, you typically never want to load the whole notebook. This is where the industry standard way of loading the whole notebook tends to fall on its face.
What we do is the cells that will end up in prod are written as functions inside of the notebook. This helps reduce globals when writing the notebook, so it is good form when prototyping, but also it allows just those functions to be called from the notebook, instead of running the entire notebook.
You will probably want to write your own library to do this, but in the meantime there is one that works for this purpose: https://github.com/grst/nbimporter (Ironically, the author doesn't recognize this use case.)
Using nbimporter you can import a notebook without loading it. You can then call functions within that notebook and only those functions get loaded and called.
In my notebooks I have a process function which is like main(), but for feature engineering. On the prod side, the process function is called from the notebook. Process calls all of the necessary cells/functions for me in the correct order. This way the .py wrapper only has to call one function, then the ML predict function gets called, so it's pretty small on the .py wrapper side. There are tests written on the .py side, IO functions and whatnot too.
Data engineers love their classes, so it's easy to write a class that calls the notebook, and best of all, calling a single function this way does not load globals, so the data engineers are happy. It's a nice library, because otherwise you'd have to write your own (which you may end up wanting to do).
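Roughly, the wrapper ends up looking something like this sketch. All the names are made up for illustration, and it assumes the notebook is saved as model_notebook.ipynb and defines process() and predict() as functions rather than loose cells:

    # Thin .py wrapper around the notebook, as described above (illustrative names only).
    import nbimporter       # lets the next import resolve model_notebook.ipynb like a module
    import model_notebook   # the data scientist's notebook, written with functions inside

    class ModelService:
        """What the engineers call from prod; only the notebook's functions are touched,
        so EDA/plotting cells never run."""

        def predict(self, raw_record: dict) -> float:
            features = model_notebook.process(raw_record)  # the notebook's "main" for features
            return model_notebook.predict(features)        # the notebook's scoring function

    if __name__ == "__main__":
        # Example call with a made-up record; real IO, auth, and monitoring live on the .py side.
        print(ModelService().predict({"amount": 12.5, "country": "US"}))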
This way if the model doesn't work as intended in production it's my fault. We log everything, so I can run the instance prod caught on my local machine, figure out what is going on, update the model, and then it can be deployed instantly.
Version numbers on the engineering side I can't comment on, as they have their own method, but on my end, the second the model writes to a database I strongly push for having a version number column or a version number metadata table in the database, so it's easy for me to access for future analysis.
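As a sketch of the version-column idea (the table name, column name, and connection string are all placeholders):

    # Every prediction row carries the model version that produced it,
    # which makes later analysis and rollbacks much easier.
    import pandas as pd
    from sqlalchemy import create_engine

    MODEL_VERSION = "2024.03.1"  # bump whenever the notebook/model changes

    def write_predictions(df: pd.DataFrame, engine) -> None:
        out = df.copy()
        out["model_version"] = MODEL_VERSION
        out.to_sql("predictions", engine, if_exists="append", index=False)

    engine = create_engine("postgresql://user:pass@host/db")  # placeholder connection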
Or is it infeasible for some other reason?
I'd imagine many data scientists want to publish their work as python packages or libraries during their PhD, so they should be familiar with writing classes or functions that work at a bare minimum.
I've had projects where the model doesn't perform as intended. Because one person was making the model and another productionizing it, it was hard to identify where the performance difference was coming from. Was the bug in the model itself or in the productionization process itself? It took longer to figure it out than writing the model or productionizing it the first time.
It takes so long to deal with these bugs because the model gets changed, so then prod gets changed to match it. Changing prod (rewriting functions) has the potential to create a new bug, so you solved one but added another, and still can't identify if it is in the initial model or from prod. This continues over and over again, problem after problem.
It's worth mentioning that if one person is doing both the model building and the conversion to production, this problem is significantly reduced, but it is still a problem. The problem is exacerbated by the lack of domain knowledge, since both people are in the dark about the other person's process.
Furthermore, what if you need to update the model? Do you rewrite prod doubling or tripling your work? Do you take that risk to introduce another potential hard to diagnose bug, even if you're the one doing both roles?
Or do you automate the process, so the same code being developed on is the same code running on the server at the end of the day? No more bugs, half to a third the amount of work. Why not do it this way? It's so much easier to debug a problem in prod this way. You can take the log data, feed it into the local machine, and know what you're seeing is what the user saw. No more guessing where the problem is.
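A minimal sketch of that log-and-replay loop (file name and function names are just illustrative): prod appends every scored input as a JSON line, and locally you re-run the same inputs through the same code.

    import json

    def log_request(raw_record: dict, path: str = "prod_inputs.jsonl") -> None:
        # Called in prod next to each prediction, so every scored input is kept.
        with open(path, "a") as f:
            f.write(json.dumps(raw_record) + "\n")

    def replay(path: str, predict_fn) -> list:
        # Run locally: re-score the logged prod inputs with the same predict function.
        with open(path) as f:
            return [predict_fn(json.loads(line)) for line in f]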
One way to think of it is software engineers would think it is absurd to write their code, then hand it off to someone who doesn't completely understand it, to rewrite it in another language and put it up on a server. "Why would you ever want to do that?" they would think, and I agree with this sentiment. It is absurd to have someone (even you) rewrite your work unless you have no other option, and you do have other options. Transpilers are a thing if prod needs to be in another language. I've written models that have to go onto embedded environments. I know these challenges all too well.
>I'd imagine many data scientists want to publish their work as python packages or libraries during their PhD, so they should be familiar with writing classes or functions that work at a bare minimum.
It depends on whether you're writing a library, i.e. doing ML / machine learning engineer type work, or you're solving a domain challenge and writing an end-to-end solution for that problem, using standard cookie-cutter ML for your PhD, aka data science type work.
One leads to an engineering role, and not surprisingly, writing a library for it is ideal so other people can use it. The other leads to a data science type role, and not surprisingly, showing code snippets in your paper with plots / EDA and all, the same way you'd write a notebook at work, is ideal.
I'm a data scientist, not an ML specialist (though I have invented a new form of ML for work once, but that was just once and not my primary thing). I specialize in end-to-end domain problems. I'll write a notebook to solve them, not that I have to; I've been in the industry since before notebooks were a thing, so I'm fine doing it the old-fashioned way. What I am not is an MLE. I don't need to write libraries for other users to use. I don't need to write custom ML. I don't need to do that engineering bit. To be fair, I have, and I know when it's the right tool for the job. On Stack Overflow, all of my points come from helping people with the glue parts between C++ and R, so they too can write libraries for R. I'm proficient in modern C++ too. I can do the library ML type work, and I have enjoyed it, but I really do enjoy solving domain problems more, so it's what I'm doing, and it's what the previous comments in this chain you're responding to are all about.
Engineers are in high demand all over the world. But most companies do not profit enough from technology to justify paying SV-level salaries.
And, then there are those somewhat rare occasions where a project is not intended to increase revenue, and may even decrease it. At my last employer, we guesstimated that a project I worked on for months could possibly have ended up costing us $2M per year in revenue. That was both accepted and expected, because we were doing it to gain goodwill with users, but in such a way that it might end up pissing off a small minority of our customers.
I really wish, just once, I could work on a project and put underneath it on my resume “Increased revenue by X%,” because I’ve never worked on anything that was so easy to directly trace back to the top line.
Cost savings are another story, because engineers can fairly easily quantify how much less money is being spent by doing $THING a bit more efficiently....
I'm in industrial automation, but it's much the same. Projects where someone developed a strategy but has never been involved in the details of a machine are doomed to failure (or at best to be unreliable and producing low quality parts). Projects built by machine fabricators are over-engineered, frequently late, and sometimes unprofitable, but damn if they don't work well.
The main trouble, I think, is that when a shiny new contraption is brought to the king, it's too often the magicians doing the talking - whether they're speaking words of power or Common, their job is to talk. Meanwhile, the blacksmith is probably busy in his workshop doing some ornate scroll work for the next thing, or repairing the previous gizmo, because he'd rather be hammering away at his anvil than talking.
The higher you go in an org chart, the fewer the number of people who understand the work their company actually does, and the more voices you have between the workers and the decision-makers to take some of the credit for work as it passes up the chain.
One common issue I run into is that when the blacksmiths start talking, nobody listens.
Yet even when I do this, I somehow become the arbiter and authority for all problems and questions on X. 5 years go by and everyone thinks X was all my genius. And I hate it, because personally I do not like X created by Jim - even if everyone else does...
The best quants are 1/3 statistician, 1/3 developer and 1/3 trader, in my view.
The trick is to figure out how to work effectively with those people. Build infrastructure that keeps them on the rails, refactor their code, push them in the right direction, tell them when they've fucked up, teach them little things with high leverage. As long as that doesn't turn into being their slave, that's fine.
Researchers do not need to have deep programming experience, but they have to be comfortable enough to use an environment that can lend itself to the problem at hand. On the quant side, unlike on the data science side, the barrier to entry on the programming side is a bit higher. To solve this problem, many firms have their own internal programming language.
>On the quant side, *unlike on the data science side*,
A vision scientist is on the data science side. You're not dealing with monetary values where floating point error compounds on itself to the point your models become garbage. Quant work is its own unique field with its own unique prerequisites.
I’m not a quant and this isn’t my area of expertise, but, for example, I’m pretty sure various differential equation solving methods depend on variables taking on continuous values, so floating point basically must be used. Understanding the impact of that is definitely very important. Analogously, I frequently run into numerical precision issues in image processing. Understanding how numbers are represented on a computer isn’t unique to being a quant. Understanding how the choice of representation can impact prod is also not unique to being a quant. The dynamicness of the language isn’t particularly relevant, either.
You would be surprised. The second you use pandas with a custom data type (let alone any other library you'd want to use) it can randomly auto convert it to a float. Furthermore identifying when it randomly converts the type on you is a pain.
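Here's a tiny illustration with plain pandas (no custom dtypes even needed) of the kind of silent conversion I mean:

    import pandas as pd

    s = pd.Series([1, 2, 3])       # dtype: int64
    s = s.reindex([0, 1, 2, 3])    # the new row has no value...
    print(s.dtype)                 # float64 -- the whole column was quietly converted

    # The same happens with pd.Series([1, 2, None]); with money-like custom types
    # the conversion is much harder to spot, which is exactly the pain point.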
>so floating point basically must be used.
Quants tend to use fixed precision types. It is like a float in every way, except base 10 instead of base 2 so there is no floating point error.
That's a pandas (and maybe numpy) issue, not a dynamic language issue. (If you want to generalize from the specific libraries more accurately than "dynamic language", it's a "using a low-level library whose type system doesn't match the host language type system" issue.)
> Quants tend to use fixed precision types. It is like a float in every way, except base 10 instead of base 2 so there is no floating point error.
No, a type that is like binary floating point in every way except base 10 instead of base 2 would be decimal floating point, not fixed point. Decimal fixed point is different from binary floating point in more ways than base.
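For instance, Python's decimal module is that kind of decimal floating point, and it already avoids the classic base-2 surprise; true fixed point is a different thing again:

    from decimal import Decimal

    print(0.1 + 0.2)                        # 0.30000000000000004  (binary floating point)
    print(Decimal("0.1") + Decimal("0.2"))  # 0.3                  (decimal floating point)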
And yet, Q.
I think this is an inaccurate take. No one in finance is doing accounting or model estimation using Python's floats; they are using numpy's float32 (or float64) type instead. I think a more accurate version of what you're saying is that static type checking is useful when modeling complicated contracts; this might be true, but I think it's not that important, as those things aren't that liquid anyway.
Jane Street's decision to use OCaml is almost as much about hiring and history as it is about language features.
We are. When your input data only has five significant figures, and probably less than that of real information, numerical accuracy is the least of your worries.
Any examples other than Jane Street?
How is being a trader different from being a statistician? Curious as I've never worked in finance before.
1/3 product person who can learn the user's domain-specific needs
Magicians will be magicians, always hustling (bullshitting), but they will never have the value and job security of the blacksmith. The blacksmith can see the fruits of her own labour, whilst the magician must lie to herself and others in order to claim the blacksmith's value as her own.
If the blacksmith is good enough, she will earn the trust of management and management may consult the blacksmith in the selection of magicians. Management may ask the blacksmith to interview magicians and seek her advice on the final hiring decision.
The blacksmith may not carry the "prestige" of the hustling, bullshitting magician but she can command a high salary and dictate her own working conditions. This is only if management understands her value. What the magician thinks of the blacksmith is irrelevant.
Reliable blacksmiths are hard to find. Magicians are a dime-a-dozen.
> It is easy to find other magicians. It is not easy to find another blacksmith. Without the right blacksmith, there can be no magic.
Data engineers ("blacksmiths"): Blacksmiths are paid less. People think of them as less highly educated. Their work is less creative. When they are successful, their work is mostly invisible. They are interchangeable. People think of what blacksmiths do as more like scripting than writing code. Blacksmiths mostly work on configuring systems they didn't build. Blacksmiths do more troubleshooting than building. Their roles are focused on support.
Data scientists ("magicians"): Magicians are paid more. Much more. People think of them as more highly educated. By definition, what they do is magic. They work on prominent projects. Their successes are highly visible. They build large systems that only they can comprehend. They use support staff to clear away mundane obstacles so they can focus on unique, highly creative aspects of work.
Saying that we need more data engineers than data scientists is like saying that we need more janitors than CEOs. That's true, but it's true because we made it true by structuring projects around one prominent, well-paid person supported by a staff of invisible drudges.
This smacks of the positive self-talk that QA and software testers used to give each other: "We are indispensable! We take pride in our craft! Nothing can ship without our signoff!" And then lots of companies reduced their QA or eliminated it wholesale by focusing on continuous delivery and changing consumer expectations of what "broken" or "acceptable" means. The same fate awaits data engineers.
Anecdotally, the former head of QA for Palantir UK is now the head of data engineering for Palantir UK, and Palantir does have an out-of-the-box, end-to-end, it-just-works product that handles 80% of ML workflows. You're betting your career that they won't put it in a box and sell it at commodity software prices?
Caveat is that many scientists are expected to publish novel research to advance their career. "Infrastructure" and "data management" do not tend to produce the kinds of sexy projects which are attractive to publish.
There's a big incentive for companies that need to hire more people to give the "better" title out for the same kind of work. If managers do maintain a gold/silver role on their team, all of the folks in the silver role will look at the gold role as their next move. Worse, there is a net-negative productivity drag where the gold and silver roles constantly debate what's in scope vs. out of scope for their work.
I once saw a team where the scientists were meant to be equivalent to SDEs in coding skill, but the scientists could in practice only do some light Python/bash scripting. They tried to make the SDEs responsible for "productionalizing" the projects, which meant adding tests/etc. The engineers who could leave all left the team within 6 months; the ones who remained were also unable to perform more than light bash scripting/Python work.
This is usually the case only for companies that work on low risk applications (I.e. not safety related or critical industries) or have been lulled into complacency (sometimes, ironically, “we haven’t had a major issue so obviously QA isn’t needed” when strong QA is precisely why they didn’t see issues)
> low risk applications (I.e. not safety related or critical industries)
>management may consult the blacksmith in the selection of magicians
I mean when you put it like that, why hire a magician (bullshitter), if the magic relies on the blacksmith?
And, if management needs to consult someone (a blacksmith) on hiring (another blacksmith or for whatever reason a magician), then arguably management is made of magicians.
Don't get me wrong, I agree with the point you're making. It's just that the problem is with BS, and BS is rampant, or like you say: a dime a dozen.
Because magicians are better at the smoke and mirrors that drives funding rounds and closing big sales deals.
Inevitably, if you treat a job role as a support role, you'll attract weaker individuals into that role than you would if it weren't considered a support role. The problem with science-oriented teams is that all roles other than the science role morph into science support roles over time. The same pattern used to occur with engineers and QA, or engineers and ops.
Is this because it's easier (obviously) to teach a quant engineering than it is to teach an engineer quant finance? Or rather because it's expected now that traders will become the bridge between researcher models and implementation, and engineers will simply provide the underlying infrastructure to power these implementations?
The (repetitive) blacksmith role is not an interesting one; a digital revolution needs to come into place. Architects who build tools and self-service systems are much more interesting.
The data lifecycle is waaay overpopulated with Data Scientists who are not empowered or knowledgeable enough to work with product designers and engineers to do everything that empowers Data Science and ML.
We need more Data Engineers involved at time zero in projects to help:
1. Plan out what data should be produced/captured by the product
2. Instrument systems to actually generate data consistently and effectively
3. Build ETL pipelines and data management systems
4. Manage enterprise data sharing and resiliency
What ends up happening is you have a bunch of Data Scientists just handed a pg_dump or flat file from some ops team. That is typically missing data or poorly formatted and they spend 90% of their time cleaning it up then running some basic regression with numpy or whatever.
Need better understanding of the data lifecycle by organizations and investment in instrumentation and data management.
Not to disparage the amazing data scientists I've worked with, but I've been on teams where this is very much the approach to operationalizing models. It's basically, "Here's the sklearn model and some fragile featurization scripts we built. Can you take this to prod ASAP?"
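For anyone who hasn't lived it, the handoff often looks roughly like this sketch (names and data are purely illustrative): the data scientist serializes a fitted sklearn pipeline, and the engineers are left to wrap IO, monitoring, and the fragile featurization around it.

    from joblib import dump, load
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # A toy "model" standing in for whatever the DS team trained.
    X, y = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]], [0, 1, 1, 0]
    model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

    dump(model, "model.joblib")       # "Here's the sklearn model..."
    clf = load("model.joblib")        # "...can you take this to prod ASAP?"
    print(clf.predict([[0.5, 0.5]]))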
The problem I've seen is that DS & DE teams were in different parts of the org and had their own sprints that were in no way connected. So they kept chucking models over the wall and we kept trying to faithfully operationalize. Once we convinced leadership that we had to collaborate from the get-go, things went a whole lot better. It also improved the working relationship of engineers and scientists.
I learned a hell of a lot from the scientists; they learned how to write better code. They also learned what code they didn't need to write, because I could do it faster or better than them, leaving them to focus on more important things. It was pretty amazing to find what manual processes they would set up in lieu of proper (or even any) engineering support. Again, these are amazingly smart people, but they were being square-pegged into a lot of round-hole engineering tasks.
Now, the much more frustrating issue I had was being in a very data-heavy organization and being told by a distinguished engineer (my skip-level) plus my direct manager that, "data engineering isn't a real discipline." I left that org very shortly thereafter.
Assuming the OP meant "setting up a pipeline for moving data from one system to another" and not a one-time copy, it is definitely engineering.
Reading this thread has made me realize just how lucky I am to work very closely with a very strong Data Scientist, who is complemented by a very strong Data Engineer. Conversations with the Data Scientist are always about strategy, product alignment, and ensuring we're optimizing what we build for learning. The Data Engineer works very closely to ensure we're actually capturing the data we think we are, getting it to analysis systems, and making sure those data pipelines stay healthy.
And their alleged production "machine learning" systems (it's pretty much standard linear regression, but calling it ML is sexy) are slow-motion train wrecks. If the string and duct tape holds, then it works, but unfortunately it's continually breaking.
Hell, in Slack, I watch their data scientists continuously wrestle with how to actually make their Jupyter notebooks work in production.
Whereas my company has far more data engineers than data scientists. The plan from higher up the corporate food chain was always that we'd give them our data to do their data science voodoo on, but we ended up getting a few data scientists of our own for specific projects.
So, we focused on ensuring our data stream was reliable, consistent and sufficiently timely, for them to work on. But as soon as it hits their systems, it's a forest fire of hacks upon hacks, which inevitably break.
In the end, we had to send in our data engineers to stabilise their flagship "real time reporting" product that corporate was so amped about.
So yeah, I think there's probably a happy ratio of data scientists to data engineers of about, say, 1:5 or 1:10, because the maths generally scales O(1) (it's the beauty of maths), but the actual engineering to get clean data delivered on time without breaking anything scales very differently indeed.
Could you go into more details on what their struggles are? We had many problems as a company doing machine learning projects, and we built our internal platform (https://iko.ai) to keep our sanity. I'm always interested in problems others may be having.
Our belief was that there are some odd behaviors in every tool and we had to figure out a way around them.
Yes, there are Software Engineering degrees. But I think a minority of Software Engineers have a Software Engineering degree.
What this means in practice is that Computer Science majors need to learn the engineering skills on the job or on their own after they graduate, although some programs help students pick up some of those skills as part of the degree program.
Certainly, most of us with unemployable majors. ;-)
Another phenomenon is employers applying the "engineer" title to any technical worker, such as designers, programmers, technicians, and so forth.
Part of it is that businesses evolve towards having caste systems. When this happens, then folks in the lower castes will try to rearrange their job titles to resemble the upper castes, or change jobs.
Obviously this is a huge generalization but I think it's a useful way to think about it. And when I say scientist, I mean "Professor of CS" not "24 year old with a BS in CS".
Right now our product has accumulated a lot of technical debt on the data validation side because data scientists designed the test code in a way that dramatically slows the development process.
Many "data scientists" (not all, but many) have little to no ability to do anything other than apply "recipes" of algorithms or classification methods or logistic regressions, etc. Asking them to develop a "novel" method would be fruitless. Asking them to clean and scrub the source data set is like telling an amateur pie-baker the store was out of pie crusts, you'll have to make your own from scratch -- it's not going to happen, they just don't have that skill, the instructions on the box don't account for that possibility. As soon as the task diverges from the simple step 1, step 2, step 3 that they were originally taught, you realize they have very little ability to adapt. YMMV of course.
This is because they rarely hire people with scientific thinking ability. They just hire people who can code and program from set recipes. Once you hire such people, you cannot expect them to do non-recipe work. If you don't want recipe work, don't hire people with recipe skills. Do not have job interviews that select for recipe people. But that is exactly what most companies do.
In all fairness, it’s basically impossible for a new grad to have those skills. 4 years of a bachelors in any field isn’t enough to cover such a wide area. Even for people with graduate degrees it’s a stretch.
If your four-year degree didn't give you the ability to learn and expand your knowledge on your own, it's a colossal waste of your time and money.
I think you've been working with conmen/conwomen. I've never seen a data science project that doesn't involve data cleaning or wrangling of some sort.
I meant a data science project in terms of a project completed by data scientists. In my experience, all data scientists are accustomed to doing extensive cleaning etc.
It's very easy for an average engineer (like me) to start using ML using these tools, but a lot harder to explain how it works, or exactly which type of models to use.
In my mind a DS would be really useful to just point us in the right direction and check work. Like a super specialist QA...
This matches my observations as well. I'm an engineer (The non software kind) at an Industrial plant, I have noticed similar in my involvement with data scientists.
I think in a lot of cases it needs to be acknowledged that data scientists are not domain subject matter experts. Very often the data scientists we have worked with lack knowledge I take for granted as an engineer, such as knowledge of basic chemistry, physics, etc. I can sanity check plant data almost instantly. For example, I will know if a material reacts in an endothermic or exothermic manner and can verify that its effect on a temperature prediction model makes sense.
As a result I often feel like data scientists are not empowered to bring their full expertise to bear; they don't understand our process fully and lack a lot of the "engineering" knowledge needed to make value-added inferences about what their models are demonstrating. Often they can deliver a model and show that a particular term is significant, but they have a very shallow understanding of what the term actually represents and can't provide concrete recommendations as to how we could modify our plant to benefit from what their model is demonstrating.
Sometimes I feel like we need an additional translator sitting in between who can speak both "Data Science" and "Engineer". I don't think this is quite what "Data Engineer" as suggested by the parent article is, but possibly the role could be expanded to incorporate this.
I feel seen. At a previous job, our output after some cleaning and transforming was a pg_dump for the data scientists to load. We had little visibility of what they did to that database once they got it.
We've been through this with installer writers, database admins, test automation, operations people, and now 'devops' people who were supposed to be the answer to these problems. It never stops.
That doesn't mean you might not need to do transformations for different uses, but ideally you wouldn't need to, for example, change data types like turning a bool into an int.
Unfortunately, data engineers rarely deal with purely in-house data. You're gonna be pulling data from a variety of data sources. I can assure you that if you're pulling from government data sources, you're gonna have a hell of a time. Speaking from direct experience, my team is probably going to spend $10M/year just trying to keep a government dataset in order, because they won't do it themselves. I'm talking lawyers, legal analysts, data engineers, data scientists, data entry personnel, etc.. just to fix data that should have never been broken in the first place.
It shouldn't be a shock that cleaning the data is the path of least resistance for many.
On the point about the govt I literally built a completely new contract type and civilian hiring practices for the DoD to bring in Data Engineers so they could do exactly what I describe to make your life easier.
The scientific training and mindset (scientific method, hypothesis, experiment setup, etc.) to even create an accurate model is an undervalued skill here no? Even if data cleaning is automated, these skills cannot be easily learned.
There is a reason why so many PhDs get into the field, because they were trained in the exploratory/research mindset that no engineering or analytics skills can fill. Correct me if I am wrong.
> Do business analysts have good engineering skills?
Depends on the analyst.
> I don't think either of them can fill the data scientist role.
> The scientific training and mindset (scientific method, hypothesis, experiment setup, etc.) to even create an accurate model is an undervalued skill here no? Even if data cleaning is automated, these skills cannot be easily learned.
It's not about replacing data scientists with data engineers, it's about both roles working together to make everything more efficient.
The hiring rate for data scientists has plateaued. The industry doesn't need any more of them. Why? Because data scientists often can't solve problems fast enough. It's a commonly quoted statistic that 70% of any data science task is data cleansing and/or ETL. A data engineer's job is to take that 70% and turn it into 10%. The data engineer saves the data scientist time, meaning they can focus on what they're supposed to do -- build models.
- Actually big data (so, not something you could grep...) will trigger your code in every possible way. You quickly learn that with trillions of inputs, the probability of reaching a bug is either 0% or 100%. In turn, you quickly learn to write good tests.
- You will learn distributed processing at a macro level, which in turn enlightens your thinking at a micro level. For example, even though the orders of magnitude are different, hitting data over the network versus on disk is very much like hitting data on disk versus in cache. Except that when the difference ends up being in hours or days, you become much more sensitive to it, so it's good training for your thoughts.
- Data engineering is full of product decisions. What's often called data "cleaning" is in fact one of the important product decisions made in a company, and a data engineer will be consistently exposed to his company's product, which I think makes for great personal development.
- Data engineering is fascinating. In adtech, for example, logs of where ads are displayed are an unfiltered window on the rest of humanity, for better or worse. But it definitely expands your views on what the "average" person actually does on their computer (spoiler: it's mainly watching porn...), and challenges quite a bit what you might think is "normal".
- You'll be plumbing technologies from all over the web, which might or might not be good news for you.
So yeah, data engineering is great! It's not harder than other specialties for developers, but imo, it's one of the fun ones!
As application engineers build increasingly "stateless" code (e.g. pure functions, serverless deployments, etc), that state gets pushed elsewhere. Someone has to manage the queues, file versions/locations, logs, databases, configurations and so on. That is all "data".
State management is a tricky problem even in a single-threaded application. It's doubly so in distributed systems, where state can be inconsistent between all the moving pieces. This is the source of endless data integrity issues. I think data engineering is a great way to get some exposure to all of this.
Exactly. You can't magically make a stateful problem stateless, you can merely move that state around. Sometimes moving state around means moving it somewhere that is appropriate and capable of expertly handling that data. But if you make those choices wrong, it makes every aspect of your application more complex.
UI programming tried going down this path of stateless programming, and for a while it was trendy to do so with stuff like Redux. The problem is that UIs are state machines. That's not an analogy, that is a literal statement. And it is true of all UIs... it's just as true of the transmission lever in your car as it is for your SaaS dashboard. You can't program stateless UIs... they would cease to be a UI. So at best, you can move that state around. And with most of these solutions (e.g. Redux), you end up pushing that state into a massive global singleton, where even simple things like the state of a single radio button need to be fed through dozens of tightly coupled components in order to "statelessly" render. And even worse, you lose the extremely helpful distinction between UI state and domain state, mixing them both together into a gigantic shit stew.
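To make the "literally a state machine" point concrete, here's a minimal sketch of the transmission-lever example (plain Python, nothing framework-specific, all names illustrative): explicit states plus an explicit table of legal transitions.

    from enum import Enum, auto

    class Gear(Enum):
        PARK = auto()
        REVERSE = auto()
        NEUTRAL = auto()
        DRIVE = auto()

    # Which moves the lever allows from each position.
    TRANSITIONS = {
        Gear.PARK:    {Gear.REVERSE, Gear.NEUTRAL},
        Gear.REVERSE: {Gear.PARK, Gear.NEUTRAL},
        Gear.NEUTRAL: {Gear.REVERSE, Gear.DRIVE, Gear.PARK},
        Gear.DRIVE:   {Gear.NEUTRAL},
    }

    def shift(current: Gear, target: Gear) -> Gear:
        if target not in TRANSITIONS[current]:
            raise ValueError(f"illegal shift {current} -> {target}")
        return target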
It gets even more complicated. It’s not just the current state that matters, but also the history (sometimes the entire history) up to that state.
It's shockingly difficult, and something that only experience can teach.
I cold-emailed my current lab's P.I. and just asked for work. Search for "research software engineer" or "scientific computing professional" positions. Plenty of data engineering goes on in many fields (environmental science, climate modeling, high energy physics, physical chemistry, etc), and plenty of fields desperately need to develop an engineering culture (e.g., plant biology, my field), whatever interests you. Availability and compensation will vary by discipline.
This is obviously of lesser value to the topic at hand, and more about making sure you hire good people I think.
It has been my experience too. Basically, ML / DS engineers are thrown under the bus for being poor general software engineers, but in practice it’s totally the opposite.
Data science & engineering should be treated as a single collection of skill-sets. Lacking ETL experience is a major deficit, considering how prevalent that kind of work is.
This might just be my personal biases coming through. I consider myself a "full-stack" data scientist & engineer. But because data scientists who can work on the backends are rare, I always end up doing the plumbing while other people do the fun analysis work.
I think companies that are data "science" heavy are going to be at huge disadvantage soon. Tools like Rekognition and Google AI APIs are making the model training & deployment aspect almost trivial. At some point, the only real work involved in this space will be the data "engineering."
This can be tough because there could be a lot in that skill set. You can't realistically expect someone to have solid knowledge of statistics including specialising in the sub-field and type of algorithms that your product needs, and also be able to write good code and act as a developer, and also have solid knowledge of all the tools for data streaming/processing/ETL. There is a point at which you're just stretching yourself too thin if you try to do all of these at once.
Of course, stuff like knowing how to interact with a database or employing good software development practices should be a very basic prerequisite and some scientists certainly shift things too far in the other direction and use their academic knowledge as an excuse to write poor code and not learn new tools.
I guess what I'm trying to say is that they are distinct skills but you still need all of them to some extent and striking the correct balance in one's skillset is really difficult.
I would certainly hope that college courses are even more comprehensive after 10 years and an explosion in interest for the field.
Also, much like being a full stack developer, a full stack data engineer doesn't need to know everything at a master level. But that you can at least handle tasks at most points in the chain.
The actual process of collecting, aggregating, cleaning and verifying data is a hugely important skill, and not one I've really seen typical data scientists possess.
Then they are not scientists. They have the label "scientist" but lack the rigor of actual science.
I don't see why changing the label to "engineer" would suddenly make them have rigor.
This is sort of the meta failure of the argument. They are arguing that people's data skillsets are wrong. To make that argument they are analyzing based on the wrong variable in a data set.
And I would warn you, from my experience teaching statistics to undergraduate engineers... they are not going to be much better. I regularly get conversations like: 'hey, we have this data, what test can we run?' 'What are you trying to show?' 'We don't care, we just need to run a statistical test.'
Data in the classroom setting is pristine and beautiful; data in the real world is messy and buggy. You have to get burned by buggy data a few times (or maybe a bunch of times) in the real world to learn to look for bad data smells -- I don't think schools effectively teach this kind of intuition, regardless of whether the students are training as data engineers or data scientists.
If data scientists are spending more time in school getting advanced degrees, they're not getting as much exposure to buggy data, whereas data engineers with a BS and a few years of industry experience would already have built up this skill.
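A few of the quick "bad data smell" checks that experience eventually teaches you to run, as a rough sketch (pandas, nothing exhaustive):

    import pandas as pd

    def smell_test(df: pd.DataFrame) -> None:
        print(df.isna().mean().sort_values(ascending=False).head())  # columns with the most missing values
        print("duplicate rows:", df.duplicated().sum())              # silent duplicates
        print(df.describe())                                         # impossible ranges, constant columns
        # timestamps from the future or the epoch, negative ages/prices, etc. go here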
I got to take over our department's undergraduate statistics course a few years back.
The first change I made was that all homework, tests, and projects used real data sets. I intentionally have them collect bad data (they don't know it's bad beforehand). On the first day of class we collect data using the board game Operation... I give basic instructions and then halfway through ask everyone to stop and agree on how they are entering data for the variable of 'success or failure' of the surgery. Oops...
In my experience teaching the course, the reason the students (engineers) find statistical reasoning hard is:
* They have never been given anything 'broken', everything is curated to avoid things not working. The result is they think data has inherent meaning. A right answer.
* Their entire learning experience has been stripped of context and the need to make decisions with information. They can give me a p value but are terrified (not unable, just unwilling) to interpret it or give it meaning.
* They have never encountered the concept of variability...everything is presented as systems with exact inputs and outputs.
When I work with postdocs, I sometimes (less frequently) encounter many of the same challenges. Data is treated as sacred and external and inherent. It's wild to me.
If there is a question mark here, it's really: how much value are we deriving from all of these data people?
Where is all the ML that's changing our lives? Search, Alexa and TikTok, I can see it.
In the future obviously vision systems for autonomous cars etc..
But I'm really wary about the heavily decreasing marginal returns after that.
It will surely change the world, but I think in specific areas. Most of the entire field seems like an optimization on something rather than anything new.
Washing machines freed up an immense amount of labour and toil. Alexa telling me the weather does not.
Most software isn’t consumer facing but just because you don’t see it doesn’t mean it’s not changing things around you. ML tends to be overhyped but your assessment is too pessimistic.
However, finding defects in aluminum parts using computer vision would absolutely be an ML solution.
If the article is trying to make a point about skill development and diversification, I'm totally on board. Bifurcating the roles instead is going to be less effective.
To the value point... my sense has been we are seeing the Webcommerce 1.0 bubble, Machine Learning edition. Lots of uses of it, not all of them have value. I am excited for where we will be in 10 or 15 years, but I suspect the difference will be huge. If you pushed me to guess, I would say better data handling practices and ethics will likely be the linchpins of value creation vs. using tools for the sake of tools.
Everyone is buzzing about the latter, and few even realize what the former is.
You need architecture, you need backends, you need a front end, you need product design...all with data.
Why are computer scientists called computer scientists and not engineers? Why is computer science about the code side? Why did computer engineering end up being more on the hardware end of the spectrum?
Words, especially newly coined terms are pointers to meaning. That meaning is socially mediated, it is not inherent.
You're saying this (and I think the author is too) because there is a need for this group of people to look beyond titles to skillsets, and the existing titles carry the linguistic baggage of the difference between science and engineering that has existed for decades.
This guy would have you believe that Pytorch has Solved the entire, vast field of data analysis as inherited from Newton, de Moivre, Laplace, Bayes, Fisher, Neyman, Pearson, Wald, Savage, Jaynes, Breiman, Pearl.
This is a lot like saying that photography has Solved art, and now we need people who can climb ladders and glue the posters on them big billboards. It would be delusional if it didn't have a self-interested angle.
Meanwhile, we with math degrees are fully confident that the plumbing problem is easier to commoditize than the problem of making sense of data.
E.g. getting a model from 0% accuracy to 70% accuracy might be a couple of Pytorch library calls that any dummy who watched Andrew Ng's course can do. But getting that same model from 70% to 75% accuracy might be deeply mathematical and require the latest and greatest mathematicians and statisticians.
But in this hypothetical example, an engineer who stands up the 70% model and keeps it running 99.9% of the time, with high uptime, is more valuable to the bottom line of the business than the 75% accuracy model hacked together with scripts with 50% uptime.
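To make the first half of that hypothetical concrete, the "couple of library calls" baseline looks roughly like this (scikit-learn rather than PyTorch, purely to keep the sketch short; dataset and model choice are illustrative):

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Off-the-shelf data, off-the-shelf model: the easy first chunk of accuracy.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier().fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))  # the last few points beyond this are the hard part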
In the case of data scientists, I think the business folks that want them to understand the business domain better generally have the strongest argument, followed by the statisticians - good data scientists need to personally understand both of those things well, while the engineering and ops stuff that data scientists are also expected to do is easier to compartmentalize on other teams. So I agree that we should have more data engineers, but apparently for the opposite reason as most people in this thread.
I also end up having to be the one to talk to data vendors to understand their data feeds and essentially translate that for the data scientists. Having to sit in the middle is annoying for me and suboptimal for the business.
Companies were in a rush to hire "data scientists", and boot camps like Insight were more than happy to pump out very impressive PhDs with just enough understanding to build a Keras model.
I've worked in industry a while doing DS work and have been astounded at the number of PhDs who both don't know how to write Python that doesn't live in a notebook and throw away years of disciplined experimentation experience to just throw Keras models at data until the needle moves.
There do exist excellent data scientists out there who are both very solid software engineers and really know their stuff mathematically, but I've found most of these people can't reliably find jobs, because the people interviewing them know so little that good data scientists will be penalized for answering a stock question correctly.
The field has been so flooded with amateurs that have no idea what they're doing, that potential mentors have been driven out, and now it's just a mess. To get a job doing DS if you do know what you are doing you have to play a weird game where you guess the incorrect answer the interviewer has in mind.
It's kind of a fucked up field right now.
Of course I understand that YMMV, but after working here I will forever be skeptical of anyone with a PhD writing code.
I've found that Physics PhDs tend to have the highest probability of being good coders since a certain subset of them get bit by the software bug when they need to write non-trivial amounts of code to solve research problems.
Every physics student at my college had to take FORTRAN, plus programming was assumed in many of the other courses, and we also took an electronics course that included digital techniques. And maybe the main thing was simply that programming was interesting and fun.
We've also had a tradition of learning to do everything ourselves, for better or worse. I had no access to a professional programmer.
IMO this is the number one problem of our modern culture around education. Popular culture makes it fashionable to treat education as pointless, and this even affects students who are pursuing difficult degrees. "Why do I need to study humanities? Why should I learn to code if I think I am born to be someone else's boss?"
On the other hand, many teachers in K12 and early university have no ability to connect the "what" with the "why." "The curriculum is the curriculum. The test is the test."
If we can solve these problems, our societies will be much better off.
I am a PhD student in a non-engineering field. I've been taking as many math and stats courses as I can, but what other courses should I be trying to take if I want to excel as a data scientist? Software engineering CS type courses?
I've known a surprisingly large number of people who are mid-PhD and thinking about data science as a career. Don't pursue 5+ years of learning to master the world of academic research if your goal is to help people sell t-shirts or whatever.
Certainly there are some people pursuing specific PhDs, such as those in computer vision and NLP, where there are industry options that might offer more challenging/interesting research than academia. It makes sense if you're a PhD at NYU or Stanford in CS fields related to neural networks to go work for Yann LeCun at Facebook or Geoffrey Hinton at Google.
But if you're, say, a biologist who wants to sell clothes online... why spend 6 years working in academia to do that? Is your dream really to optimize clothing sales? If so, don't be a biologist. If your dream is biology, why in the world would you set your course on selling clothes?
I get it if your dream is biology but you can't find a tenure track job and so you pivot to industry... but if you are mid-phd, what are you doing there? If you love your subject, try to find a way to work in that and if you don't, don't waste your time.
Data science is not a glamorous job, and at the vast majority of companies it is literally bullshit. The people solving mind-bendingly hard problems are already in programs specializing in those problems, because that's what they are passionate about. On top of that, DS is way over-indexed at most companies. If you're mid-PhD now, I would expect a serious contraction in DS jobs in the next 5 years. DS will be a niche job after the next market "correction".
First, like you said, there are the stray PhDs who do it since they know research and some statistical applications.
Second, there are hordes and hordes of DS people who "learned" their skill with some bootcamps or online courses, which means they know enough to write notebooks and glue together functions. Their understanding of theory is often shallow.
In either case, it is hard to "blame" someone for taking an attractive job. But it isn't good for the discipline.
The appeal of DS is clear for companies. But the problems it promises to solve are much more complex than we collectively recognize - or are willing to admit. In my opinion, causal inference is a difficult, unsolved, and deep topic, and no single course would equip you to tackle it. It takes domain knowledge and multiple years of stats/math/ML (all of them, not one of them). And yet, causal inference is what 90% of people want ML to be. A model that works on some dataset is not a model that is useful in light of the true latent data-generating process (DGP). Yet, when we want to sell T-shirts, what do we really want?
Hence, when I look at the problems that ML is supposed to solve, I think that most people calling themselves DS on LinkedIn are not really equipped for it. And there is a case to be made that some fields where PhD researchers train to answer such causal questions are indeed better equipped to tackle the issue.
For example, if it's about selling shirts, I would take an econometrician with some data engineering skills over a Coursera superstar any day of the week.
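To illustrate why "a model that fits the dataset" and a causal answer can diverge, here is a toy confounding sketch - simulated data, made-up variable names, and statsmodels as a stand-in for whatever tooling you prefer. The naive regression of sales on discounts fits the data fine and still gets the discount effect badly wrong:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 10_000

    # Season drives both discounts and sales; the true discount effect is 1.0.
    season = rng.normal(size=n)
    discount = 0.5 * season + rng.normal(size=n)
    sales = 2.0 * season + 1.0 * discount + rng.normal(size=n)

    naive = sm.OLS(sales, sm.add_constant(discount)).fit()
    adjusted = sm.OLS(sales, sm.add_constant(np.column_stack([discount, season]))).fit()

    print(naive.params[1])     # ~1.8: biased, because season is omitted
    print(adjusted.params[1])  # ~1.0: recovers the true effect

Knowing which variables to adjust for (and which not to) is exactly the part that takes domain knowledge plus statistical training, not more layers.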
I think if you do a PhD in ML/Stats/Biostats/Econometrics/etc., it is reasonable to pursue a career in DS. It's what statistics _is_ now.
If you have some other PhD and know some ANOVA, OLS and Stata - or if you have a CS background but know some Jupyter and Keras - then it's essentially a career change. It might work, but probably not without a hitch.
So I agree with you, but I'd reframe it:
It's unclear to me whether we need a contraction, or whether we instead need a quality update.
I disagree with you in one point: I do not think we will make progress in DS (getting it to work in more use cases) by treating it like a solved problem, a skill like milling that needs talent and experience, but not academic education.
If we do that, I think DS will contract because it will stagnate in usefulness.
My point here is not to accuse anyone of being a bad DS. I am sure there are many ways to become efficient. But even the theory of causal inference with simple linear models goes far, far beyond what I saw in ML hiring tests, online courses and so forth. And solving the problems it tackles is not accomplished by throwing more layers at it. For other ML algos, we aren't even close to understanding these issues on a similar level.
In the end, what we need are actual ML scientists. They should neither be pure statisticians, nor pure subject-matter experts, nor pure computer scientists - as we mostly have now. We also need more than the current ML programs that are mostly cobbled together from other areas. For example, people who publish in ML research are probably very useful at a company that has to deal with that exact problem. Any scientist knows, of course, that even a fairly adjacent question may already require tons of different knowledge. DS is, will remain, and probably should be an academic field, because there are more open than solved problems right now.
No good data scientist should ever expect data to be pristine. And a good data scientist, even if they don't have quite the engineering chops necessary to build a production-quality ETL, should know enough about the process to help guide it. If they aren't a part of that process, they're not being a good DS. They can't expect someone not involved with their problem to know what tradeoffs to make, and if they don't know exactly how their data went from raw form to the ETL-ed form, they're probably going to make bad assumptions, and those assumptions may very well make their architected solution a complete pile of garbage. Not to mention, how can a DS offer suggestions for solutions if they aren't deeply familiar with the raw data that's available?
To me, a good data scientist should, at bare minimum, have several skills.
* They should first and foremost (but not solely) be an in house expert in statistics and machine learning to know what can be done with data, and what can't be done with data. They should arrive with that knowledge. Engineers I think have a tendency to trivialize this, but true expertise in this domain comes only with years of experience.
* They should strive to find modeling solutions that are right for a particular business problem. If they seem to be only applying the hottest research regardless of the tradeoffs for the particular business problem, that's a red flag.
* Their focus should be on integrating themselves with the product/business as much as possible, and with the engineering team as much as possible. If they're expecting to be handed directives, that's a recipe for a ton of wasted time.
DS should never, ever be siloed into their own little DS world. They will be useless without a deeply intimate knowledge of the business goals, the needs of product, and the capabilities of the engineering team.
As they progress, they should become more and more "full-stack", otherwise they are stagnating.
Data scientists like to quip that 80% of the job is data cleaning, with the remaining 20% divided up arbitrarily among other tasks as suited the joke. In some shops nowadays, it's more like 45% data cleaning, 45% data engineering/ops/programming just trying to make your results available to the rest of your organization, and 10% research.
If I can spend less time learning/doing software engineering and devops and more time doing actual data science, that's great. At a previous job, my team was clamoring for more data engineer hiring, and part of the reason our projects were slipping and starting to fail was lack of data engineering support. Our tooling was shit, our processes were shit, our code was shit, and access to (and trust of) our data sources was especially wet and stinky shit.
It made the daily work of doing data science a miserable slog of ad-hoc duct-tape solutions, and it contributed to us being generally ineffective as a team.
All of this would have been fixed if we had one competent data engineer with some actual real-world data/ML engineering experience and good communication/advocacy skills. Let alone two or three!
Those who have been fed pristine data without having to undergo the trials and tribulations of actually having to collect the data have missed a crucial part of scientific training. Like you, I find this lack of rigour is rather common among data scientists. Not all, but quite a few.
How ridiculous to assume that a scientist doesn't clean their tools and set up their experiments.
(Surely as one gets more experienced and older, the job likely becomes less manual, more about teaching and coordinating.)
I don’t see the same appreciation or consideration in general in the field of data “science.”
It's interesting that you put it that way because a lot of the other complaints in this thread are that the people who expect their data to be ready for use are exactly the people with science experience but without the relevant technical background.
Sometimes I think a company (not having the DS experience itself) mistakenly over-hires DS roles in today's "AI" hype, when their data is mostly run-of-the-mill and only requires simple linear models that can be architected and understood by a stats/math-savvy engineer. Even then, a good DS is still useful (even linear models can be complex: e.g. what priors do you want to use? Do you want a multi-task solution?), but maybe not worth the cost.
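As a small illustration of the "even linear models have knobs" point, here is a sketch with scikit-learn on synthetic data and arbitrary alphas: ridge regression's regularization strength acts like a Gaussian prior on the coefficients, and different choices give noticeably different models.

    import numpy as np
    from sklearn.linear_model import Ridge, BayesianRidge

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 30))
    y = 3.0 * X[:, 0] + rng.normal(size=200)

    # Loose vs. tight prior on the weights (small vs. large alpha).
    for alpha in (0.1, 100.0):
        print(alpha, Ridge(alpha=alpha).fit(X, y).coef_[:3])

    # Or let the prior scales be estimated from the data.
    print(BayesianRidge().fit(X, y).coef_[:3])

Someone has to decide which of these choices fits the business question, which is where the stats/math-savvy part comes in.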
Reality is that data science is here to stay. It's coming out of the honeymoon period, and things may never be as hyped up as they have been over the last decade, but that's probably a good thing for the field. Everyone will probably move on to hating the next up-and-coming thing. I have a hunch it could be something in data engineering because, while not exactly new, it is absolutely the next "data science" in terms of demand, and with products like Snowflake having so much hype behind them, the backlash seems inevitable.
I remember an era before "data scientist" was a job title. When we (programmers) would analyze data to see if we had enough information available to identify the problem, if not, fix that, then come up with a strategy to solve it, test, and finally deploy the model. The fun part was trying different solutions and analyzing the data. It also felt awesome to deploy a product that worked like "magic." Product owners didn't know or care what a neural net was, they were just happy it worked.
Now there are tons of data scientists out there who take the easy, fun, rewarding work and try to skip over the nitty-gritty implementation details. Then management thinks engineering is incapable of doing such work, and the only way we get to do something fun is to do it behind the scenes.
There's a huge variety in DS responsibility and background between companies.
Maybe I'll come back in a few hours, but for now I'll stay away.
When I first started hiring and working with data scientists, my view was this: if you can only manipulate data and run it through pipelines to generate models, then you can't do enough to be highly valuable. You either need a strong enough background in CS to build the pipelines / tools or a strong enough mathematics background to be able to propose cutting-edge new ideas. From my experience it is hard to find someone who has even one of these skills just from a university "data science" program. At a small company (at least the ones I have worked with) being only proficient in R and basic Python isn't enough. That being said, I have met a handful of data scientists who were very smart and self-motivated enough to pick up the missing skills when given the chance.
My question to HN is this: are there roles at these larger companies for a data scientist who primarily just crunches data in R and Python, without the ability to actually build the pipelines / tools or conduct research?
From my experience, there are two types of data scientists who do infrastructure work: 1) those who do not make the best data scientists because their skill set is too far in engineering land, leaving them weak where it counts - if the startup is relying on the data scientist to be profitable, I'd be cautious with these types; or 2) someone who is senior, beyond senior really, who has worked both jobs and doesn't mind doing both. This unicorn is so rare it is mythical. The joke when the terminology was created was that they're so rare no one has ever seen one, hence "unicorn".
Me, I can not do the work I need to do if I'm on call. That is where I draw the line. That means hiring someone to monitor the infrastructure. Furthermore, I'm an okay architect, but you really do want to hire a specialist if you can help it for that. Do I help them with the infrastructure? Absolutely, but they're on call if a server is on fire. They have the admin login credentials, not me.
I get wearing multiple hats, but keep in mind to be a data scientist you're already wearing multiple hats. Being a data scientist is like double majoring and getting a phd. At what point are they stretched too thin? The consensus in the industry is they're already stretched too thin and should be broken up into different specialized roles.
>My question to HN is this: are there roles at these larger companies for a data scientist who primarily just crunches data in R and Python, without the ability to actually build the pipelines / tools or conduct research?
That is the standard role, even at startups. However, the industry consensus these days is that data scientists should have more responsibility when it comes to deploying models than previous standards. So data scientists are being pushed in a more engineering direction - not hosting SQL servers and infrastructure, but working with engineers to make sure the models are monitored properly. This change comes from model deployment being further automated as time goes on, making it easier for the data scientist to take on more responsibility during this stage.
Source: https://www.dominodatalab.com/static/gfx/uploads/domino-mana... page 9, "Suboptimal organization and incentive structures."
What's interesting is they tend to struggle in two different ways: 1) The data scientist that is gung ho about infrastructure work, jumps in, and then ends up doing a bad job, because it's not their strength. They end up getting let go for not being ideal at that work. 2) The data scientist who struggles with the idea of infrastructure work at all, jumps into other roles they're good at like data analyst work, helps the company in that way, but ultimately because they did not push to get an infrastructure engineer hired, they end up let go as well.
Me, I go out of my way to get an infrastructure engineer / data engineer hired early on. Also, I have worked as an engineer, so I tend to do a lot of the "hard" stuff most software engineers struggle with early on, if applicable. Eg, at one job I wrote a compression format to reduce battery drain on our devices that were collecting data.
Most data scientists struggle when it comes to CS/engineering skills (4/5ths of them), so it's not uncommon for them, early on while the pipes are being built, to do data analyst and BI work: BI work to automate reports, which management loves, and DA work to show some amazing future service the company might be able to provide to its customers. It's selling the sun and the moon really, but it gets management inspired and helps them know what data to collect. It's not unheard of to need a minimum of two years of collected data before building a deployable model becomes feasible. This can be hard on the data scientist, because there is a lot of downtime before that. Many get fired during this time even when they're doing a good job. They have to wear multiple hats, but it's analyst roles (like BI work). Technically a data scientist is a kind of analyst, not an engineer, so it makes sense that wearing multiple hats tilts in the analyst direction, not the engineering direction.
I've been writing code since I was 8 years old, so I'm one of the unusual ones that tilts in the engineering direction, but I think it is unreasonable to expect that from the average data scientist. Let them do what they do best, and hire someone else who can round everything out and you'll be in a good place. Unicorns aside, you'll need a minimum of two professionals for a data project to succeed.
Assuming you are a competent data "analyst" who wants to become a data engineer, how would you go about it? Is "go back to school and get a CS degree" the answer? I suppose this question is very broad, but I am curious if a practitioner like you has an opinion.
To give some context:
I recently graduated with a STEM PhD, and looking to move into data science. Reading the comments, I feel like I fall into the "pointless data scientist" cohort derided in this thread. Eg: I am very comfortable doing typical analytical work & occasionally training models inside a notebook, but I am neither a cutting-edge theoretical statistician nor a data engineer.
I've been trying to improve on the engineering side. For example, I did a project recently where I set up a rudimentary pipeline that continuously pings an API, uploads the data to a cloud database, then serves up the analysis via a Flask app. For me this was a big step up from just doing notebooks on a csv file :)
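For anyone curious, in skeleton form the idea was roughly this (the endpoint, schema, and summary are placeholders, and sqlite stands in for the cloud database):

    import sqlite3, time, requests
    from flask import Flask, jsonify

    DB = "observations.db"

    def ingest_once():
        # Hypothetical endpoint; in practice this runs on a schedule.
        row = requests.get("https://example.com/api/latest").json()
        with sqlite3.connect(DB) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS obs (ts REAL, value REAL)")
            conn.execute("INSERT INTO obs VALUES (?, ?)", (time.time(), row["value"]))

    app = Flask(__name__)

    @app.route("/summary")
    def summary():
        # Serve a trivial aggregate of whatever has been collected so far.
        with sqlite3.connect(DB) as conn:
            (n, avg), = conn.execute("SELECT COUNT(*), AVG(value) FROM obs")
        return jsonify(count=n, mean=avg)

    if __name__ == "__main__":
        app.run()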
But moving beyond the basics, I am not sure what to study next. Hence my question. If you have any suggestions, I would greatly appreciate it!
It wasn't optimal because we were having bottlenecks and variance: some people could move through the stack and do it all, but you either had them or you had to train them and it took time.
I'm going to answer both questions just in case:
To become a data engineer / infrastructure engineer, there are multiple paths forward. I recommend doing BI work, aka Business Intelligence Analyst/Engineer. It's typically Tableau-related work - making dashboards and reports for management - so it's still a data-related role and you should feel comfortable and at home. However, it is also an engineering role. If you're the first BI at a company you'll often find yourself setting up an SQL server and doing certain data-engineer-light work to get data into the server. You'll need to set all of this up, so BI is a blend of data engineering and data analyst work.
Once you've gotten familiar with BI work it's very easy to transfer to data engineer / infrastructure engineer work. This is especially true if you end up setting up a data warehouse (as an alternative to MySQL) or a data lake on AWS to do your BI work. You don't have to, but if you go that far, you're pretty much doing data engineering at that point. The line between the two is fuzzy. Data engineers and infrastructure engineers are expected to be architects, and by that I mean they are expected to future-proof the schema of the SQL server / data warehouse (designing it so new data can be added without it becoming a mess). A BI is not expected to be an architect, and IMO the only way to gain that skill is through first-hand experience playing with databases, so BI is a good way to get that experience.
At the company I'm currently at, the infrastructure engineers are expected to do BI work. This is unusual - data scientists typically do it (roughly 60% of data scientists do BI work) - and it happened because one of the data engineers I work with was a BI at his previous job. (He was on the sales team, helping them with more than just dashboards, like helping with their Excel spreadsheet algorithms and whatnot.)
I'm sure others could paint another path forward. Data engineers are highly in demand, so it could be as simple as applying. If you can pass a whiteboard interview (LeetCode-style interview) you can skip this step and dive right in. Just like any technical white-collar job, you're expected to self-learn what is required for the job before going in, so absolutely read guides / take classes / read books / etc. on the topic to learn more.
To get a job as a data scientist:
BI work is a good bridge here too, not for learning database-setup skills but for dashboard-creation skills. Around 60% of data scientists in the industry do BI work. Me, I've had to create internal dashboards for diagnosing problems, streaming in live data in a visual way. That is not BI work, but there is clearly a bridge between BI dashboards and internal diagnostics dashboards.
Technically a data scientist is a kind of data analyst so many people go from data analyst directly to data scientist. Around 30% or so of data scientists do only data analyst work but have the data science title. (This 30% number is a bit of an estimate.) It's that strong of an overlap.
Data scientists tend to specialize. There are sales data scientists, marketing data scientists, engineering data scientists, ops data scientists, and so on. Oftentimes, but not always, they sit on the team they specialize in, instead of on a data science / data analyst team. Smaller companies tend to hire one data scientist and expect them to do one kind of role. So it comes down to what kind of data science work you want to do. Sales data science roles tend to be BI-heavy. Marketing data science roles tend to be data-analyst-heavy. Engineering data science roles tend to be the heavy model-building roles that are the most challenging of the bunch. Ops data scientists tend to specialize in malware detection and reporting; e.g., if someone is hacking the company's servers, they might get an alert, and then they analyze it and report on it. There are other kinds of data scientists, like ones at supermarket companies and restaurants who specialize in forecasting warehouse inventory.
Me, I'm a specialist that specializes in robotics and sensor analysis. I'm not going to lie, it's probably the hardest out of every kind of data science role. It's very heavy on the engineering side, not data engineering, but software engineering, because there is a lot of advanced feature engineering.
Most feature engineering is simple stuff like deleting missing values, taking the median over the dataset, or other kinds of cleaning and minor modifications like normalizing the data. Then it gets fed into an ML library that identifies the pattern in the data, so when new data comes in it can identify whether it recognizes that pattern. Each pattern is called a category, and most ML work is categorization - so maybe you're categorizing different kinds of customers, and if you can identify a pattern in their shopping habits, you might be able to predict what they will buy next.
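A sketch of that "simple stuff" pipeline, with hypothetical column names and a hypothetical customers.csv, using pandas/scikit-learn as stand-ins for whatever tools you actually use:

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import GradientBoostingClassifier

    df = pd.read_csv("customers.csv")            # hypothetical dataset
    df = df.dropna(subset=["bought_again"])      # drop rows with a missing label
    X = df[["age", "visits", "avg_basket"]]      # made-up feature columns
    y = df["bought_again"]

    model = make_pipeline(
        SimpleImputer(strategy="median"),        # fill missing feature values
        StandardScaler(),                        # normalize the data
        GradientBoostingClassifier(),            # let the library find the pattern
    )
    model.fit(X, y)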
Advanced feature engineering might need to be used when your patterns are so complex ML can't pattern match it well, so you have to give it a helping hand and manually do some of the pattern matching. I've also had to invent new forms of ML too, but it's been a while since I've had to go that far. What I do is the farthest from normal for data science.
Most data scientists do not know advanced feature engineering, but it's one of the bridges between software engineering and data science, so leveling up software engineering can help on that front. (Which is also why I bring it up.)
A data scientist shouldn't be expected to know much or any data engineering. Instead, gaining managing-upward skills helps. Knowing how to write an SQL query to get data and how to write a join is enough. You should do fine - just try learning data science itself instead of data engineering, unless you're curious. (A lot of universities and bootcamps teach machine learning engineer skills and call it data science. If it doesn't have data cleaning and feature engineering, it's probably not data science. Likewise, if it has TensorFlow or PyTorch in it, it's not data science.)
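For a sense of the SQL level I mean - one query, one join - something like this is plenty (table and column names are hypothetical, and pandas/sqlite just stand in for your actual warehouse):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("warehouse.db")   # stand-in for whatever database you use

    # Pull orders joined to customer segments in a single query.
    df = pd.read_sql(
        """
        SELECT o.order_id, o.amount, c.segment
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
        WHERE o.created_at >= '2020-01-01'
        """,
        conn,
    )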
I notice that you mention "Machine Learning Engineer" as a separate role. If, in the idealized world, data scientists do analytics and train models, and data engineers take care of data, then what do Machine Learning Engineers do? Are they, basically, software engineers who specialize in putting other people's models into production?
And you are right in sensing my confusion. There seems to be an abundance of data-related titles, which overlap in their functions a lot but are also very different when you examine them closely. So thank you again for your responses, they are very helpful.
It depends on the company. Traditionally, yes, but deployment into production can be automated, so typically today it is something different.
An MLE is someone who specializes in TensorFlow or PyTorch. They write deep neural networks, reinforcement learning, and more. Oftentimes the data scientist will make a model, specializing in feature engineering and domain experience, and use something generic like an off-the-shelf DNN or xgboost or whatever it may be. It then gets handed off to an MLE, who writes ML specific to the problem to get every last drop of accuracy out of the model. They then hand it off to prod. I don't think they're on call (I could be wrong on this), so today they're not really deploying models much. They're more of an in-between.
I work at small companies and startups so I've never worked with an MLE, but I do have friends who are managers at Google who told me about it, so that's where this information is coming from, telephone game. In other words, I'd take this with a grain of salt. ymmv.
Starting in 2018, big-name companies couldn't get enough MLEs, and they pay more than data scientists, but many bootcamps and universities center around ML skills, so companies started renaming MLE positions to DS positions. This way they get more applicants and pay them less. Win-win for them. Too bad it messes up the industry. Today about 1 in 3 data science jobs are ML-heavy. They may be MLE-exclusive, or hybrid light-DS-to-light-MLE jobs wearing multiple hats.
You can identify which is which if they give you a whiteboard coding problem. Traditional data science work will never have a whiteboard problem.
>So thank you again for your responses, they are very helpful.
You're very welcome. I hope it helps.