There's not much consistency across job postings and interviews for this kind of thing.
I just interviewed for one "Data Engineer" position which consisted of nearly 100% stored procedures. No one knew what else to call it, and they didn't want to advertise for a DBA, because there were no real DBA responsibilities. So "Data Engineer" was chosen.
Another "Data Engineer" position was almost entirely Spark. There was no SQL involved - they expected all applicants to be Spark experts, with a deep knowledge of Scala.
It's hard to know what to expect out of "Data Engineer" positions until you walk into a place and start asking questions in the interview.
Typically it was backend engineers building data pipelines to expose records from their own services, until the company got bigger and the analysts got more demanding - and then it became a dedicated role.
I don't see an issue with looking for a "data engineer with deep experience building and maintaining data pipelines based on spark".
What would you call it? "Spark operator"?
I couldn't agree more. That's the exact thing I'm saying is a problem here.
When stuff is wrong, and we don't have enough people, and don't have a well-defined job, but need someone to just handle it .. "Data Engineer".
You get there, and everybody has a different idea as to what you should be doing, and what skills you should have.
But often the hiring manager believes it's probably mostly SQL or mostly python, and advertises that. They really don't know - because someone left, and they are trying to replace them.
So you get there, and talk to the hiring manager, and they're like "are you good in SQL?".
Then you talk to someone closer to the problem, and they're like "most of it's in Java - how's your Java?".
Then someone even closer to the position says, "well, most of the work we need done is really in Scala".
It's your job to just take it over, and make it happen. "Fix the data" - using our existing patchwork of tools ...
This isn't always the case - but given the nature of the problem, it happens a hell of a lot more often than with software engineers.
Most people who consider themselves software engineers only want to write new code, not work with existing systems. It's mostly a mindset thing.
In the hiring market, many positions that were previously titled "Data Analyst"/etc were renamed "Data Scientist." Similarly "Database Engineer"/"Data Warehouse Engineer"/"DBA"/etc were renamed "Data Engineer."
Why? Various reasons:
One might be the need to give people a sexier title because they don't want to be "just a data analyst."
Another might be hiring managers and recruiting teams changing titles to capture more views from people searching with these terms.
And why are people searching for jobs using these terms? Articles like the ones in the original post, and "Data Scientist: The sexiest job of the 21st century" ;)
How about data or database programmer? I've also heard "Data Analyst", "SQL Developer", etc.
But I agree that "data engineer", "software engineer", "network engineer", etc are too general to really be meaningful.
In my experience data tools don't mesh well with "cloud"-y IAM, monitoring, or auditing frameworks. Data folks ssh to shared cloud workstations and of course use agent forwarding because that's what the tooling expects. They want to use EFS to share data sets even though NFS on machines where people have sudo is a bad idea / EFS is maybe a poor fit if you're thinking about governance / provenance. There's a mix of "notebooks" running locally (or on the shared workstations) and DAGs running in the cloud with bespoke access control that either doesn't map to IAM or else there's no access control so to get to the dashboard you forward a port with SSH.
It's enough to make me want to wall them off in a separate AWS account, but maybe I'm just being a grumpy old SRE. edit: as I mention downthread, this is a knee-jerk reaction and is not likely to "succeed" for whatever definition of "success" your business has.
"By the way, we are chartered to work separately from ____ infrastructure teams, to better enable our freedom and not hinder our own progress. We’re allowed to run our own computing environments, cluster(s), olap database(s), _____, and we’ve been doing it ourselves.
Can you run/manage/provision those for us too?"
From a talk I've given a few times called, "Life of a Data Engineer"
(Google slides link: https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6x...)
This is spot on. I don't mean to vilify the data folks - they were hired to do a job, and unless an explicit part of that job description is "make your stuff play nice with the existing stuff" (which it never is), why would they swim against the current and spend their cycles doing things that aren't going to get them promoted / that they're not naturally inclined to care about?
The sentiment of the quoted statement really highlights that they want to be in their own world and do things the way that makes sense to them.
Again, I can't fault them for it. It just turns into an absolutely staggering problem later when the company tries to get some certifications. Maybe engineering leadership needs to see this particular iceberg earlier.
The old, oft-feared "shadow IT" is a good term for what is actually an opportunity for IT to reframe itself as the trusted service/platform/infrastructure provider and advisor - though perhaps no longer the only one who implements and controls every aspect of using that infrastructure going forward.
The extent to which digital technology has become fused with work means people need more access to and control over systems and tools in their daily work, but solid infrastructure, governance, etc. is still required. Clearly the trick is finding the balance between centralization and federation.
I think directions such as “citizen developer” and “data democracy” are not just fluff, but a pointer in the right direction of how to think about the newer classes of customer for information systems and technology.
My problem as an infrastructure provider and data architect is how to provide a globally consistent, governed platform and model on top of which different classes of users have different levels of access rights to data in different forms and qualities, through different interfaces.
My 2 cents, I accept silver bars too lol :)
For what it’s worth, I don’t buy the argument that data folks should operate in an isolated infrastructure - we just need to adapt how we serve their needs, which can be quite extensive when you are talking about essentially anything ranging from someone writing highly complex algorithmic code to process large volumes of raw data (high level of support and access may be needed) vs. someone just designing a report or dashboard using just a graphical tool on top of a predefined data model (much lower infrastructure access needed).
* it's counter-productive in a number of ways to create a data ghetto where they can do whatever they need to do. It doesn't engender trust or communication between Infra/SRE/DataEng/DataSci teams, and leads to "throw it over the wall" behavior.
* we need to adapt to how we serve their needs, not the other way around. It's a lot more likely to be successful if we are the ones who start bridging the gap. Data engineers are pretty specialized at enabling data scientists to do their job, they don't necessarily share the same skillset as Infra/SRE engineers.
What more do you really need?
(Google slides link: https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6x...)
"We’re in the Middle of a Data Engineering Talent Shortage"
I am a data engineer working on a machine learning team with models actively used as part of our product(s).
From my experience working in various contexts (applied machine learning, analytics, policy research, academics, etc.), there are several factors that contribute to this shortage: (1) "data engineering" often requires a lot of breadth of knowledge, (2) "data engineering" is often (derisively and naively) referred to as the "janitorial work" of data science, (3) the spectrum of roles and requirements within the "data engineering" domain, in terms of job descriptions, can range from database systems administration, to ETL, to data warehousing, to curation of data services/APIs, to business intelligence, to the design/deployment/operation of pipelines and distributed data processing and storage systems (these aren't mutually exclusive, but job descriptions often fall into one of these stovepipes).
Some of my quick thoughts and anecdata:
Companies have made large investments in creating 'data science' teams, and many of those companies have trouble realizing value from those investments.
A part of this stems from investments and teams with no tangible vision of how that team will generate value. And there are several other contributing factors…
"Dirty work." People haven't learned how to do it, and more often don't want to. There's a vast number of tutorials and boot camps out there that teach newcomers how to "learn data science" with clean datasets -- this is ideal for learning the basics, but the real world usually does not have clean or ideal datasets -- the dataset may not even exist -- and there are a number of non-ideal constraints.
There are people that wish to call themselves “data scientists” that “don’t want to write code” and would “prefer to do the analysis and storytelling”
Engineering as the application of science with real world constraints: there are a number of factors that we take into account, often acquired through painful experience, that aren’t part of these tutorials, bootcamps, or academic environments.
Many “data scientists” I’ve met have a hard time adapting to and working with these constraints (e.g. we believe that the application of data science would solve/address __ problem, but: how do we know and show that it works and is useful? what are the dependencies, and costs of developing and applying that solution? is it a one-time solution, or is it going to be a recurring application? does the solution require people? who will use it? what are the assumptions or expectations of those operators and users? is it suitable? is it maintainable? is it sustainable? how long will it take? what are the risks involved and how do we manage them? is it re-usable, and can we amortize its costs over time? is it worth doing? This is part of a methodology that comes from experience, versus what is taught in data science)
Larger teams with more people/financial/political resources can specialize and take advantage of these divisions of labor, which helps recognize the process aspects of applying data science and address some of the above
Short story: if you view data engineering as "janitorial work" you're missing the big picture
Anyone else notice that the attributes of a 'unicorn' data scientist include the traits of a 'data engineer?'
I draw the divide in that data scientists discover the features and the methodology, while data engineers take those insights to production. One can argue that data scientists could do that themselves, but this is constrained by the domain expertise on tools (be that the depth of Spark internals or whatever) and the number of hours in the day. It's hard enough to deal with the variance of the models, let alone the variance of the system.
A good data engineer is a unicorn. I define three central competencies for a data engineer:
be a good coder: quality, maintainability, efficiency,
know how to explore the data: SQL, R, just eye the damn data feed,
know enough data science to interface with scientists
For a data engineer it's okay not to know probability theory and stats that much, but it's a must for a data scientist (running TensorFlow out of the box with no understanding of the underlying math doesn't make a data scientist, just a common butcher).
The main issue with good programmers is that you need to make sure that candidates know what the job entails and are onboard with it. There are definitely complexities involved but by and large it isn’t the type of work that CS programs glorify as “interesting work”.
Reading the comments maybe I was naive.
Data engineers would be like linemen for a utility company, setting up the power lines.
I think there are 4 buckets of data engineering problems, each with their own challenges and solutions.
Operational Data Engineering
This is the detritus that grows like weeds as parts of other projects and often isn't recognized as a data engineering problem. We need to pull a file off an FTP server or hit an API and do something with it. Next thing you know, there are dozens of these little things that are not individually hard, but having visibility into dependency trees and failure cases becomes difficult because they are spread out everywhere and it's not obvious where to look when things go wrong. Tools like Apache Airflow are a good solution even if you don't use them in other ways because they can centralize monitoring, logging, and graphs. Scaling isn't resource intensive for these tasks because they are discrete. You can fan out. The scaling challenge for this type of data engineering is really about tending your garden and keeping things coherently organized.
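A minimal sketch of what that centralization buys you, using only the standard library (the task names and toy registry here are hypothetical; a real deployment would use something like Airflow, which adds scheduling, retries, and a UI on top of the same idea):

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
log = logging.getLogger("pipeline")

# Hypothetical registry: task name -> (callable, set of upstream task names).
# The point is that every little fetch-a-file / hit-an-API job lives in ONE
# place, with its dependencies declared, instead of scattered cron entries.
TASKS = {
    "fetch_ftp_file":  (lambda: "raw bytes", set()),
    "hit_partner_api": (lambda: {"rows": 3}, set()),
    "merge_sources":   (lambda: "merged",    {"fetch_ftp_file", "hit_partner_api"}),
    "load_warehouse":  (lambda: "loaded",    {"merge_sources"}),
}

def run_all(tasks):
    """Run every task in dependency order, logging each outcome centrally."""
    order = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    results = {}
    for name in order.static_order():
        func, _ = tasks[name]
        try:
            results[name] = func()
            log.info("%s ok", name)
        except Exception:
            log.exception("%s failed", name)  # one log stream to check, not dozens
            raise
    return results

results = run_all(TASKS)
```

When one of these breaks at 3am, you look in one place and see exactly which node of the dependency tree failed.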
Business Logic Data Engineering
This is processing where the data is highly structured and sometimes even ordered or sequenced. It's hard to scale because you can't just throw things into a stream and apply multiple workers. You have to have a managed process and likely shared in-memory state that collects the worker results and applies strict rules to a process. This is the opposite problem from big data. It's small data, rigidly organized, and carefully managed.
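A toy illustration of why this bucket can't just fan out to parallel workers: each step reads and writes shared state, so strict ordering matters. (The ledger and overdraft rule here are invented stand-ins for real business logic.)

```python
from dataclasses import dataclass

@dataclass
class Txn:
    account: str
    amount: int  # cents; negative = debit
    seq: int     # strict ordering matters

def apply_ledger(txns):
    """Apply transactions in sequence order against shared in-memory state.

    Unlike a big-data stream, this cannot be handed to parallel workers:
    each step depends on the balance produced by the previous one, and a
    business rule is enforced mid-process.
    """
    balances = {}
    for t in sorted(txns, key=lambda t: t.seq):
        new = balances.get(t.account, 0) + t.amount
        if new < 0:  # the strict rule: never allow an overdraft
            raise ValueError(f"overdraft on {t.account} at seq {t.seq}")
        balances[t.account] = new
    return balances

balances = apply_ledger([
    Txn("acct-1", 500, seq=1),
    Txn("acct-1", -200, seq=2),
])
```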
Data Science Data Engineering
This is sort of classic ETL with a twist. ETL systems are typically pretty static once the E, T, and L are known quantities. But working with Data Scientists requires that your pipelines have to be pretty flexible because scientists are doing experiments. But they also have to be repeatable and comparable, which means your pipeline has to maintain version. This is also the area where you are most likely to encounter Big Data, so you have to be prepared to change your mental model and be able to use tools like Hadoop and Spark to bring compute to where your data is.
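One lightweight way to get that repeatability, sketched with a made-up experiment and a hypothetical code revision: tag every run's output with a version derived from the pipeline's config and code, so two runs are comparable only when their versions match.

```python
import hashlib
import json

def pipeline_version(config, code_rev):
    """Derive a stable version tag from the pipeline config plus the code
    revision, so an experiment's output can be reproduced and compared later."""
    blob = json.dumps(config, sort_keys=True) + code_rev
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

def run_experiment(records, config, code_rev="hypothetical-git-sha"):
    """A stand-in for a real pipeline stage: filter rows by a configured
    threshold, and stamp the output with the run's version tag."""
    kept = [r for r in records if r["value"] >= config["min_value"]]
    return {"version": pipeline_version(config, code_rev), "rows": kept}

# Same data, same config, same code -> same version tag: the runs are comparable.
a = run_experiment([{"value": 3}, {"value": 9}], {"min_value": 5})
b = run_experiment([{"value": 3}, {"value": 9}], {"min_value": 5})
```

Change the config or the code and the tag changes, which is exactly the signal a scientist needs to know two experiment outputs aren't directly comparable.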
Analytics Data Engineering
This is classic ETL pipelines that move data from point A to data lakes or data warehouses. The key thing to understand here is what you are modeling at the endpoint. If it's a legit data warehouse, you are modeling business processes. If you aren't doing that, you are--by definition--pushing data to a lake. Understanding your endpoint is key to choosing your reporting and analytics tools to lay on top of your data source. Data lakes are a good use case for ad-hoc, SQL-driven reporting tools like MetaBase. But if you are sitting on top of a well-structured fact/dimension type of warehouse, you will want more formal tools like Tableau, Pentaho, or Cognos.
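A tiny sketch of the fact/dimension distinction using an invented sales model in SQLite: facts reference dimensions through surrogate keys, which is the structure the formal tools mentioned above expect to sit on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
""")

def dim_key(name):
    """Look up the surrogate key for a customer, creating the dimension row if new."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    return conn.execute("INSERT INTO dim_customer (name) VALUES (?)", (name,)).lastrowid

def load_fact(name, amount):
    # The fact table stores only the key and the measure, never the raw text.
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (dim_key(name), amount))

for name, amount in [("Acme", 120.0), ("Globex", 75.5), ("Acme", 30.0)]:
    load_fact(name, amount)

total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
```

If you're just landing raw files without this kind of modeling step, you have - by the definition above - a lake, not a warehouse.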
Data Engineers are the folks that show up once the boss says 'yeah, that's good enough, we want to see the result of that process/model/algorithm on an ongoing basis'... now what was likely a pile of unsystemized Jupyter notebooks and Excel needs to get cleaned, systemized, and productionized, preferably in tools designed to handle pipelines, scheduled jobs, etc.
Our architect wrote some cute Datadog wrappers that I implement in every pipe I roll out (and they are instrumental in diagnosing bugs). He also programmed in a call to keys in a store that times out from too many requests and hangs the whole function the wrapper is decorated on - that one took me a while to pinpoint.
>>>are on call if one of your models breaks and have built the system where you "just drop a container"?
Our motto is ‘you wrote it, you fix it!’ If the container pipe works, and you dropped a bomb in it that doesn’t work, why should the DE have to pickup the DS’s garbage?
>>>If you are, then you are a data engineer too,
>>>otherwise you're standing on the shoulders of your data eng team and they make it look easy for you. Go buy them some donuts first time you go back to the office.
I actually support a few data scientists in the manner you described above where they generate some metrics notebooks or container and I have to diagnose their crap in a DE capacity (recent problem involved optimizing their pyspark code to be less memory intensive)
But Data Engineers will be able to make sure that data collection and data cleansing run optimally and successfully every day. And that's a real skill when you have upstream schemas changing, fluctuations in data volume, instabilities in systems, etc.
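One common way to survive that kind of upstream drift (a sketch; the expected schema here is invented) is to validate each record against the schema you expect and quarantine the ones that don't conform, so the daily run finishes either way:

```python
# Hypothetical expected schema: field name -> required type.
EXPECTED = {"user_id": int, "email": str, "signup_ts": str}

def cleanse(records):
    """Split incoming records into clean rows and a quarantine pile.

    Upstream schemas drift: unexpected extra fields are dropped, while a
    missing or untypeable field sends the whole record to quarantine for
    later inspection - and the daily load keeps running either way.
    """
    clean, quarantine = [], []
    for rec in records:
        try:
            row = {k: rec[k] if isinstance(rec[k], t) else t(rec[k])
                   for k, t in EXPECTED.items()}
            clean.append(row)
        except (KeyError, TypeError, ValueError):
            quarantine.append(rec)
    return clean, quarantine

clean, bad = cleanse([
    {"user_id": 1, "email": "a@x.com", "signup_ts": "2024-01-01", "new_col": 9},
    {"user_id": "oops", "email": "b@x.com"},  # bad type, missing field
])
```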
Next topic "HTML Programmer".
The infographic and article show what skills and tools are relevant for a job as a web developer (more specifically, Python web development) and compare them with similarly important skills and tools for data science. It includes average salary expectations and links to websites where you can learn, practice, and search for a job.
In data scientist job ads I often see companies that want a PhD, advanced stats skills, and, depending on the role, AI-related skills. They want to see a track record in these, but will happily take a fresh PhD who did a project that involved them. They don't want a software engineer who did some code camp courses; they want an academic.
Conversely, data engineering, I see ads wanting a cross over of big data ETL technologies and devops - i.e. pyspark, kubernetes, and depending on the role experience of scaling and productionising AI, without actually needing a deep knowledge of AI algorithms.
This could be more viable for a software engineer who did some online courses, as they specify tools experience not academic background. However, it would be difficult for an experienced software engineer to switch into an experienced data engineer role, because it is expensive to set up data infrastructure at scale, so you can't switch over with a hobby project in the same way you could e.g. switch from experienced front end to experienced full stack by showing a significant webapp. Actually might be affordable in the silicon valley bubble I guess.
You really need to know it inside and out especially how to do ETL at scale using it.
Data Engineer as a term came out of the Data Science space. Which means that you will be expected to have skills around Spark, Data Lakes, ETL at scale, validation, schema management and syncing, data catalogs etc.
It's not some general skill just like you wouldn't say every programmer is a Network Engineer because they use a HTTP client.
Data Lakes: Is this our new fancy term for "data"? What distinguishes a "data lake" from "data"?
ETL: I would guess 95% of application programs take input, parse it (extract), do some data wrangling (transform) and save the result somewhere else (load).
Validation: Again a broad term. Do you mean validation of statistical models? Without validation your predictions are worthless so I guess it is a standard thing to do if you want to do any kind of machine learning.
Schema Management and data catalogs: Standard DB stuff I would say.
We just like to define new job descriptions. It's the same with DevOps, which seems to be the new term for System Administrator.
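To the ETL point a few lines up - the shape really is the same as everyday programming, just named. A minimal extract, transform, load in plain Python (the CSV payload is made up, and a real load step would write to a database or warehouse rather than return a string):

```python
import csv
import io
import json

RAW = "name,age\nada,36\ngrace,45\n"  # pretend this came off an FTP server

def extract(text):
    """E: parse the raw input into records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """T: wrangle the data - fix casing, coerce types."""
    return [{"name": r["name"].title(), "age": int(r["age"])} for r in rows]

def load(rows):
    """L: save the result somewhere else (here, just serialize it)."""
    return json.dumps(rows)

output = load(transform(extract(RAW)))
```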
No surprise, it is hard to find people who want to do this job and are good at it. I see the demand in the academic world ('scholarly infrastructure' is a very niche place) where it is nearly impossible to hire someone who can do this work, so hearing that it is also impossible in industry means I guess it is time to start training the undergrads :/.
I have an idea for a curriculum that could teach some of the principles for this kind of work (give them the gentoo handbook for a start, and see if they can follow it to get a database up and running from a box of parts), but I suspect that mostly it would act as a way to filter out people who simply don't like the activity, and you also have to have some amount of interpersonal skills in order to understand the use cases of your colleagues ....
Anyone who cracks this problem will have solved a far more general one in the process.