We’re in the Middle of a Data Engineering Talent Shortage (stitchdata.com)
143 points by hankmh on Sept 8, 2016 | 159 comments



Whenever I see these posts I immediately translate them in my head to "we're in the middle of a talent shortage at a price I am willing to pay."

I've worked with very large amounts of data and high performance computing for most of my career; I mostly had finance-related jobs in the last decade or so. I have most of the skills you want, including some you don't know you want. However, when salary comes up, that is where we start to part ways. If you are really serious about a shortage, you should be really serious about making offers that can be competitive, but I keep seeing the same $150k offers. That isn't a "shortage" kind of offer.


Are they looking for someone who must have every box ticked or are they looking for someone with enough qualifications yet needing work so much they are willing to undercut themselves? Are they justifying their salary offer because you tick 90% of the boxes and not 100%?

I've been looking for work in data engineering and databases for 9 months, and while I'm certainly not as qualified and experienced as you are, I consider myself capable. I've definitely passed the take home and whiteboard tests I've been given, etc.

When I read about a "shortage," I wonder if this is more indicative of unicorn searching than anything else.


That to me is a classic recruiting problem in technical positions, data engineering included. Unless you have a manager handling it themselves, the person doing the initial screen really is ticking boxes because they may not know any better.

Once a resume gets to me, and I'm only speaking for myself here, I'm looking for the challenges you've faced and the problems you've solved. I actually care very little about what tech you used because odds are we'll have something different, but we'll need to solve problems. If someone is solid in some related technical skillset, can think critically, and communicate the details of what they've tackled in the past, learning our specific tech stack is going to be the easy part.

Let me put it another way - when I look for interns or entry level hires, the number of those that can do more than spell SAS or Teradata approaches zero very quickly. But if they've solved challenges of the magnitude that they'd be expected to solve with us initially, the tech is secondary to process and problem solving. As we look at more experienced candidates, I'd still be limiting myself to candidates from a set of "legacy" industries that prefer these sorts of tools if I insisted on checking those boxes at the outset. I'd prefer to teach a really smart person to use the things that they don't know yet if I have it my way.


Quite. I am sure my experience doing MR back in the early '80s for British Telecom would be useful today, but I suspect that I might struggle to get past the HR screen.

That was when 17 top-of-the-line superminis (Pr!me 750s) was a huge cluster (we were the largest non-bank user in the UK) - probably about the same as a 10-20k core Hadoop setup would be today.


Of course they must. By demonstrating that they cannot find candidates with 100% of the 'required' skills at the price they are willing to pay, the path is cleared to go the route of 'highly skilled' H1B applicants etc. with a small percentage of these skills. It is not, and has never been, about the skills.


I think it's definitely true. Functionally, I'm a director of data engineering (with a big company, so my real title is way more generic). Usually in the initial screen, we'll talk general dollars, and my number is always out of range. For my level and the fact that I'm reasonably happy where I live now, the number is 200k + relocation (more for Bay Area, but let's not go there), and I don't think that's unreasonable for a director level who is presumably going to also develop your more junior DEs.

I don't fancy it up too much, either. I build teams that make the data move and land it clean so that your PhDs can do the smaaaht stuff with it. I can stack BI and Analytics on top, but a lot of people can do that starting from clean data - and clean data is what I do. But I do get the impression that we're viewed as janitors and plumbers - who you'd be thrilled to see at 3am when your shit(ter) broke, right?


"director of data engineering"

This is already generic.


Oh boy. My HR title would drive you nuts then. It's just director information management. We also have info analysts and info managers. Our department color is gray.


Although your statement is technically true, it is basically meaningless.

Yes, you can always always always find somebody to do a job if you are willing to pay 10 million dollars. That means that "shortages" are impossible. It means that you can never have a shortage in any situation, because you can always pay 10 million dollars for a single visit to the doctor.

But this line of logic isn't very useful when talking about "shortages".

If you had to pay a million dollars for a loaf of bread, is there a shortage of bread? IE, billions of people will starve to death by next week, because they can't afford to buy food.

Most people would say "Yes, there is a shortage of bread".

When people talk about shortages, they are obviously talking about a shortage at a certain price point. There is no other definition of the word shortage that makes sense.

A good definition that I use for the term shortage is "If the government could snap its fingers and instantly produce large amounts of X overnight, would the world be a better place"?

If the answer is "Yes, the world would be in a much much better place", then that means there is a shortage of X. If the answer is "No, the world would only be a little better", then that means that there is NOT a shortage of X.


There is no analogy to bread or 10 million dollar salaries.

A company found something it could profit from more if it paid less than current market value. That is all. They are not saying there are no qualified applicants. They are not saying they want 10 million dollars.

What is the case is that a business finds a resource (the perfect hire) that they wish to profit from but do not want to pay the market value for it because that would reduce profits. Rather than be satisfied with what would be an erosion of profit (or an admission of an unworkable business model), articles are posted to demand that the government pressure wages downward.

If you want a bread analogy, it's as if I found a cheap source of bread I can sell elsewhere at a profit but then complain there's a shortage solely because the cheap stuff isn't even cheaper.


If they were offering $300k and still couldn't attract top talent from other industries, then we could have a discussion about a shortage, but the low six figures doesn't show a shortage situation. It shows a market with plenty of headroom for salaries to grow.


Haven't you just done a 180? I mean, I'm pretty sure the world would be in a much much better place if the government could snap its fingers and instantly produce large amounts of almost anything. Therefore there is a shortage of almost everything.


Given the millions of unemployed Americans, it seems this is not true for at least some occupations.

Wal-Mart greeters can be wonderful people and I'm not saying they aren't valuable as humans. But in labor market terms, there is clearly not a shortage of them.


Of course this is more or less always true - there are only shortages or excesses of things when prices don't or can't adjust freely.

If there was 1 gallon of water left on earth, Bill Gates would buy that gallon for $50 billion, and everyone else would die of dehydration.

There has always been a shortage of maids willing to do all my house work for $10.

And there is a shortage of data engineers at $x, but there wouldn't be a shortage at $1M/year (because fewer companies would want one, and more people would be willing to do the work).


> If there was 1 gallon of water left on earth, Bill Gates would buy that gallon for $50 billion, and everyone else would die of dehydration.

really? who would sell the last gallon of water on earth?


Someone with a bunch of hydrogen, oxygen and a rudimentary knowledge of chemistry.


someone with liquid assets. I'll get my coat...


Someone who just drank the second-to-last gallon. ;)


and what would he use the money for :)?


bullets =)


all the bullets would have been fired by then.


If they needed water to prime the last pump on earth?


Disney would.


Maybe you should start a Data Science & Engineering consultancy. The same people who would offer $150K to an employee often have bosses who would love to spend $500K for a person-year of (contract) work if it comes with a high probability of success.


I have thought about that. Many times. I have a couple of barriers, most of which are temporary:

1. I have student debt from my law degree, and I have a lower risk tolerance until that's paid off.

2. My daughter is 4, it's nice to be around for the early years, and the corporate gig is quite comfortable in terms of hours.

3. I'm in Maine. Most clients would require me to travel, which impacts #2.

I do have a former colleague here that started a data consultancy. I should grab a beer with him and see if we have common ground in the short term. It's not quite starting your own thing, but it might be fun.


Running an individual or small consulting company isn't that hard. Scaling a consulting company, on the other hand, is quite difficult.


+1 for making family important!


This argument comes up all the time on HN, but I don't think it means anything. It seems to me that the ability to fill an opening by offering more salary can't disprove a talent shortage, because it is always possible to do so.

Thought experiment: If 100 companies had openings for a skill set that only one person could deliver, all 100 companies could eventually fill their openings by sequentially outbidding each other for the services of that one person.

So how would we know if a talent shortage really exists for a certain job? I can think of a couple potential hints: if starting salaries are going up much faster than the national average, or if the unemployment rate for that job is much lower than the national unemployment rate. Either would seem to indicate that, relative to the job market as a whole, there was a greater demand than supply for that particular job.


In this case, though, there are way more than 6,600 people in the US that would be able to do that data engineering job, including:

1. physicists

2. Wall St. quants

3. game programmers

4. PhD statisticians

So, the problem is not that there aren't 6,600 people in the US that can do it, it's that the companies can't pay or don't want to pay the $200,000 + that would be required to hire them.


This comment will sound a bit self-serving, but it supports your point. I have most of the skills necessary to be a data engineer. My degree is in biology, but I nearly got a double major in computer science with a minor in math. (I wanted to work in bioinformatics, but it's nearly impossible to make more than a pittance without a PhD) I didn't pursue the double major because I felt taking classes outside of those three fields was more useful to my development.

Instead of working as a data engineer, I'm working at a non-profit doing pretty much everything involving data for them, as well as running their appeals, and doing almost all of the analysis. I'll lead off by saying the biggest downside of working for this particular non-profit is the salary. However, there are a lot of things I like about this job:

1) Location: I want to be located in Chicago. I have 0 interest in moving out to the West Coast. I'm up in the air about working remotely, because I feel like there is a lot of value in working with people in person.

2) The role is very broad. I get to do a lot of exciting things with data, but it is also a marketing and communication role as well. I am included in nearly every strategic discussion, not just those pertaining to data or technology.

3) Work life balance is very good. I am never expected to work more than 40 hours a week. My boss makes sure that everyone is focused on their lives, to the point where he basically kicked me out of the office for a week because I was waffling about taking a vacation. He makes sure that people know they aren't expected to check their email or do work on the VPN during off-hours.

4) The work I do makes a difference. Not in a "I make something people use" difference, but in a "my work has rescued people from being homeless and fed starving kids" difference. My first couple of jobs out of college were totally lacking this aspect, and I didn't realize how much it meant to me until I started working at a place like this.

I've been here a few years now, and so it's approaching the time where I should start looking for a new job if I want to continue to grow, but I'm having trouble visualizing what that would be. From my perspective, the problem with hiring is that job listings really focus on titles rather than roles, even in smaller organizations. I think my best bet of finding an organization that matches the first two points, if not all four, is through my network rather than through job postings. So, to your point, the only way I see myself in a narrow-title role like a "data engineer" is if I really need money.


>The work I do makes a difference. Not in a "I make something people use" difference,...

I'd just be happy with that. Most of the work I've done professionally hasn't gone anywhere; it's always "we missed the market window" or "upper management decided on a new strategy". I can't point to that many things I got to work on that actually made it into the market and were used by people for long. One place (a semiconductor company) had a successful though buggy product and large customers in place, with the product already deployed into the field, and the software I wrote got used by some customers, but then suddenly the company decided they weren't making a big enough profit margin on this part (even though the profits were guaranteed and extremely low-risk as the customers had the part designed-in), so they simply quit the market and laid off our entire team.

Making something people use would be a step up. Rescuing people and feeding starving kids is a pipe dream, but then again I work on embedded devices, not big data or analytics or anything like that so that's not exactly a position that'd be easy for me to find if I really wanted it.


> (I wanted to work in bioinformatics, but it's nearly impossible to make more than a pittance without a PhD)

Just wanted to comment on this part - sadly it's difficult to make more than a pittance even with a PhD.


Yeah, totally. The difference is a bioinformaticist with a PhD generally gets to choose what they research, whereas a bioinformaticist without a PhD has to work under someone else's grant. Biology actually has the lowest pay of any major for people working in their field with a four-year degree. You are lucky to make more than minimum wage with a four-year degree, especially if your interest is field ecology or something similar.

If you want to talk about a shortage of labor where it would matter, biology as a field is probably hurting way more for talented software engineers than any company that needs a data engineer. There are so many great applications for programming in biology, and unlike other sciences, say physics, researchers don't tend to pick up on any amount of programming skill on their way to their PhD.

I've tried getting involved in bioinformatics on the side, but it's really difficult to keep up with the field if you don't have thousands of dollars to drop on journal subscriptions. It's also really hard to get access to the data researchers use in general (in any field), but it is made even harder when dealing with research involving people due to concerns about privacy. I don't think a focus on privacy is a bad thing, but a lot of publicly available data is sanitized to the point where your sample size would need to be in the billions to draw any inferences. You can request access to less general data, but good luck doing that without the support of a research organization.

Anyways, unless you have a martyr complex, there really isn't any reason to go into bioinformatics.


I happily worked in a wetlab writing stats software to support breast cancer research. I now do better ad targeting. My salary tripled.


I'm going to guess most of those people couldn't set up and scale a Hadoop cluster. Are they smart enough that they could learn this stuff? Sure! But there's still a skill mismatch here.


So you find some real full-stack devs, i.e. layers 1-7, or you buckle down and do a CCNA or similar. Is there a CCIE track for big data?

And I would bet across all of the tech workers in the USA there are well more than 6k that could do this.


So, obviously a LinkedIn search for exact title of "data engineer" isn't exhaustive. And as I understand your point, there's certainly no agreed upon group of skills/certification that qualify someone to be a data engineer (or data scientist, or software engineer, for that matter).

But the GP was particularly amusing to me because of its assertion that 'smart, quantitative people, regardless of industry, can build data infrastructures for startups.' I guess we could also say, there's little incentive to pay to train them (or for them to pay to train) to become a data engineer.


Ah, the British Disease, i.e. we don't want to pay for training and we don't want those uppity engineers getting above themselves :-(


Or, offer the comparably-compensated part-time job that an academic physicist would accept in parallel with continuing to work in academia.

Source: Am physicist who'd love to find sustainable part-time work at market rates.


This is on-point. For comparison's sake, look at the number of economists, political scientists, and business professors who have side gigs in consulting.


A PhD statistician can write ETLs and data infrastructure?


There can be a lot of trickiness in this. I worked on an A/B testing framework at one of the big software cos, and my whole team and I had a master's or PhD in math or stats. While 95+% of our job was data infrastructure and ETLs, there is another dimension to making it work and be correct from a statistical point of view.


Bullshit. I have the skills for an intermediate-level data engineer, but I find it bland and I'd rather work in computer vision. However, offer me enough and I may reconsider, and I don't think I'm alone in this.


I basically wrote the same thing as a reply to your sibling comment. Data engineering would have to pay a lot more than I currently make for it to be an option, and even then I'd probably change fields once I paid off my student loans and saved some money.


You're not alone, but you're just reiterating that it's always possible to fill an opening by running the salary up high enough.


But the pool of people who wouldn't otherwise take the job grows as the salary increases, pulling the people away from careers where they clearly aren't providing as much value to their employers.

Put it this way, the company isn't going to pay the employee more than the value they provide. That is the ceiling on salary. So until that ceiling is reached it is indeed a case of highest bidder takes all, as your thought experiment demonstrated. But once that ceiling is neared the company will make the decision not to bid higher, thus reducing the demand.

Thus, there is no shortage, just a shortage of people willing to work at the lower salaries of companies with lower ceilings, because those companies aren't capable of leveraging the employee's talents sufficiently to draw from fields with related skill sets.


Shortage has a specific meaning in economics, and it pretty much only happens because of price controls.

However, if people start liking kale, the price goes up 20%, and you start telling people about the massive kale shortage, people will think you're being a little histrionic.


Yeah, this is a selling problem. It feels like you're far more likely to gain traction starting a data team than taking an IC-track DE role. It's easier for companies to justify $200k+ for your skillset in that case, even if it takes you away from pure engineering.

Alternatively, you can just join a large tech org. Netflix etc. have no problem paying good DEs north of $200k in total comp.


"we're in the middle of a talent shortage [and don't believe in upskilling]."


Upskilling is one of the most ineffective, costly ways to try and "re-program" workers, and it mostly doesn't work because it's not about skills, it's about talent.


Talent that occurs through the genetic/epigenetic process of having attained a Data Science masters degree after earning a Computer Science degree?

I am a believer in inherent talent but Data Engineering is a skill set.


Engineering is a talent skill. There is a world of difference between teaching someone starting from scratch and teaching someone who first has to unlearn what they learned and then learn what may be a completely new way of thinking.

Most of the reskill programs I have heard of failed miserably exactly because the skill isn't enough.


I sort of agree in that orgs can't simply create massive education programs to re-purpose skill-sets/talent. That might have been possible "back in the day" before project managers were breathing down people's necks, but not today.

But the bright side is that talented people will find a way to "upskill" themselves in whatever environment they find themselves in. It is then up to the candidates to sell themselves and for the potential employers to be flexible about considering different backgrounds and nurturing the development of cross-functional skills that are needed for so-called data-engineers.

The skills listed in the article are all fairly common, but it's hard to find enough of these skills within individuals. For example, it's not hard to find folks who can do the care and feeding of SQL Server databases, or skilled programmers, or analysts who understand the business domain intimately. The problem is getting all of these together in one individual at a "know-enough-to-be-dangerous" level.


Yeah, if up-skilling means "I used to do front end development, now I do back end development", the up-skilling is fairly easy to do.

But if you used to work as a plumber and want to up-skill to data analyst (or vice versa) it's not that simple.


That's not always the case. Talent doesn't exist in a vacuum.

Someone with a natural talent for picking up new development skills will still learn data engineering far faster when provided with proper resources and strong internal mentorship.

I can see how you might make this observation after observing a poorly conducted training program.


The problem is that there aren't that many people with natural talents. They exist but it's very hard to sync them up with where demand is.

Also this is not just one poorly conducted training program. Denmark spent billions up-skilling parts of their work force. The results were simply not there. Something like 6 out of every 1000 people or something like that.


How would you distinguish talent from experience?


I would say that talent needs experience to be valuable.


This has been my experience with any "senior" engineering / BI / DS role. There is a particularly high level of price sensitivity to anything above 200k.


In particular, employers whining about lack of X need to ponder raising wages to where employees can afford homes in a city where prices are now within spitting distance of $1k/ft². When your basic pitch is, "We desperately need [data engineers | machine learning engineers | computer vision engineers | what have you] so desperate to live in CA they'll accept never being able to afford a home unless our lottery tickets pay out", it should be unsurprising they have a hard time finding the talent they claim to need. Or, they could accept remote workers! Even remote workers near sfbay, who just don't want to burn 2.5 hours/day commuting in and out of sf...


My experience exactly: pinged by companies obsessively for my big data skills, all trying to pay me less than I am currently making.


We all like the $400K the investment bankers make. But the finance industry has developed a business where they can pay their workers $400K and still make a huge profit for their investors. Except for Googles and Facebooks, the average tech startup is not making Finance industry level profits.

Also, finance requires proper education and training. Not so much for app development. So for everyone who complains about getting $150K offers, there are a hundred thousand people right here in the US applying for $60K technical analyst jobs.


> Except for Googles and Facebooks, the average tech startup is not making Finance industry level profits.

And they don't have finance/Google/Facebook level needs for data engineers. They can't reasonably claim to need top-level skills and then beggar out on the cost.


>Whenever I see these posts I immediately translate them in my head to "we're in the middle of a talent shortage at a price I am willing to pay."

That's true for just about anything.

"there is no epipen crisis, only a crisis at what you are willing to pay"

"There is no poverty, only poverty at a given income level"

"there is no crime problem, only crime problem at a given crime level"

What you are saying is self-contradictory. If you (or others) are able to turn down 150K offers...you know what you are.


You must admit that the price of epipens is an artificially inflated one only possible due to government-imposed monopoly, not one driven by true market forces.

Poverty is simply a description of wealth and is always comparative. We can define poverty at any level we so desire.

One might argue that any crime is a problem, as long as it causes an issue for society or victims.


We should rename this job position to Data Sanity Engineers.

I have been thrown these projects at work before, where I'm the frontend engineer and I need to make some cool D3 visualization, but lo and behold the data is shit, and I have to help the backend team make the data usable. It's a mind-numbing job that nobody wants, because it sounds like a one month task to get a good REST API up and working, but it usually takes three months, because you have to go back and forth making sure the data is right, and there are always 10 tricky edge cases that you have to work some magic on. Not only that but you need to have smart people cleaning the data, so that you don't make some big mistake down the line or end up with a super slow REST API and have to add another couple weeks or a month to rework the data again. So that one month becomes three months, and most likely a year, because somebody will say that looks great but can we also add this, and it goes on and on. It's literally a mind-numbing job that almost nobody wants. I have found that products like Tableau are the best for this; you still have to clean the data, but it helps speed up the process.

Data cleaning is a super golden problem to solve.


As a contradiction to this point, some people (me) really enjoy working with data, from cleaning, munging, creating, sorting, pipelining, etc, and find front-end visualization production excessively boring and mind-numbing.

Give me emacs and a command line, and I have all the truth I need, which is far more honest, in my mind, than anything that can be created with D3 or Tableau. Beauty is in the eye of the beholder, and it doesn't really do anyone service to look down on the work others find enjoyable. If doing D3 makes you happy, that is awesome, and I can only congratulate you for your passion and your ability to look forward to work I don't "get," and I wish the feelings would be mutual.


So I guess you are a data engineer? What makes it fun for you? How do you work with your customers to give them what they need in a timely manner? I would be interested to know what stack you use to go from dirty data to customer consumption.


Closer to an aspiring data engineer, though I've done my fair share of ETL, cleaning, database building / rebuilding, admin. Prior jobs have been database engineer, probably closer to DBA.

I just enjoy working with raw data and raw code more than I enjoy writing something that launches a graphic. I enjoy writing a script that finds a bad piece of data, or a script that fixes up everything, or taking something that was once unable to run at all and converting it into something that runs in 500ms. Perhaps it is that journey of constant discovery, and seeing that every situation is a unique little puzzle. It is seeing the world as it is with no one reinterpreting what the data means for me. I can explore it and discover what it really means. It is hollow truth, a mess of ideas converted to sets of ideas layered on sets of ideas, and when it is finally drawn down, converted, and passing all tests, it is self-evident and self-reflecting, and true. Hard to explain, but I suppose I like all the things people hate about it.

The tools matter about as much as it matters what CSS framework you are using. You have the ability to logic through UI and UX, whereas I do not. I have zero hope of ever doing well at what you do, since I simply don't have the foundation, but if it matters, I know most jobs I've applied to and worked at tend to be more ad hoc, using PL, Python, Ruby, etc.


I'm not comparing frontend to backend. I also think data is fun and I don't mean to belittle the job, but in a real world scenario it's detail-intensive, underappreciated, full of edge cases and extremely complex if you plan to make it scalable and fast. So if you are an aspiring data engineer be aware of these pitfalls, because the first couple times you do it you will think it's fun to try something new and create some fun useful analytics, but customers will often complain at how long it takes and want more. It starts to wear away at one's drive and passion for data. It's not the data aspect, it's the job/deadline aspect.


You're getting very close to the root cause - customers and even colleagues don't really care about the work that goes into the data. They care about the end deliverable, because that's what creates value for them, and fairly so. That gets at why data engineering as a discipline isn't (IMHO) very well respected.

I know this isn't reddit, so I'll point you to reddit. Check out /r/datascience where those folks talk about what it takes to be a data scientist. Some folks are honest about data engineering, but most handwave past it, or talk about it like it's beneath them rather than a complementary and equally important discipline. Their role would not be possible without solid data engineering. Good luck doing "data science" or "analytics" or "machine learning" or every other buzzword without clean data, and for us data engineers, good luck ever demonstrating value without the analytics folks working with us.


There's nothing aspiring about what you wrote. I think you're fine calling yourself a data engineer if those are the types of challenges you've been solving.

Don't sell yourself short or select yourself out of an opportunity (within reason). That's someone else's job!


    sed -i 's/emacs/sublime-text/g' what_u_said.txt


more like Ctrl-H, tab, 'emacs', tab, 'sublime-text', tab, enter, esc, Ctrl-S


you are right that is more coherent.


> Not only that but you need to have smart people cleaning the data,

Which are difficult to find when you think of them as "janitors", and treat them accordingly.


Data Sanitation Engineers


I do it for a living. It seems underappreciated in the industry.


I agree. I enjoyed doing it the first couple times, but people would often complain why I wasn't done sooner and didn't appreciate the level of complexity that went into doing it. Once the appreciation was gone, I believe that's when it turned into a mind-numbing task for me. I don't mean to belittle the job, I think I have just become sour to it because of the lack of appreciation.


A big part of my role is getting out there in front of business partners to keep the things that we do well front of mind. If you manage this work in the traditional sense, you'll be invisible when things go well and shat-upon as soon as anything goes wrong. At my current organization, I've really had to work at this. Here's a story:

Once upon a time I managed (and, frankly, also wrote a lot of the code for) a project integrating half a dozen sources each managing a block of our business (billing, coverage, claims). The data was awful coming in and we managed to get a bunch of business processes changed in addition to some pretty heavy cleansing steps that we wrote. In any case, this big fragmented mess of monthly and weekly stacked data became my integrated, clean warehouse. For the first time ever at this organization, I had coverage and claims records tying up at a rate of 100% without any manual intervention. We did this so that we could implement a modern finance ops process on top (being intentionally vague) that would allow us to manage this block more efficiently, save time, and even let us better invest - it was a 2 year project including my data work. A handful of actuaries and analysts got promoted out of this as it was a BFD to the company. Yet, at the end of the year, when I got my review I got our equivalent of the average rating, 3 of 5, etc, and like a 3% raise, and a shitty budget for my people too. From then on, I spent almost as much time out there promoting our team's work as we did doing the work. We did considerably better the next year, and that's been the way I've operated ever since. I market the work.

This kind of work requires a manager who will actively market it within the organization.


From the article: "Data engineers are the janitors who keep your data clean and flowing."

Hm, I wonder why he's having problems hiring janitors.


Bizarrely, I remember a recent HN discussion where a poster was arguing that any software developer who is not working in machine learning is like a plumber.

I guess this means that the entire profession consists of janitors and plumbers.


Considering that plumbers and janitors have likely, in the entire history of human civilization, done more for health and longevity than doctors and scientists...I'm kind of ok with this analogy.


Doctors, maybe, but it was the scientists who told them about the germ theory of disease, for instance.

I've read, but not confirmed for myself, that in the US the biggest gains in health came in the post-Civil War period, when "plumbers and janitors" made the difference. Of course, that's really starting with, after the science, the civil engineers who designed the public works systems that supplied clean water and took away sewage, and let's not forget the politicians and the like who found it worthwhile to buy votes that way (now, they take our infrastructure for granted and buy votes more directly...).


Sure, that's true of recent (< 200 yr ago) history, but plumbing's contribution to health and longevity goes all the way back to ancient Rome. (Somewhat ironically, plumbing, from the Latin word for lead, "plumbum", may have also contributed indirectly to Rome's eventual decline.)



Thanks! There was a delay after the Civil War as you'd expect from all the chaos and disruption that caused (e.g. MIT got its charter before the outbreak, but wasn't able to start up until after), but it's pretty clear, and gets really dramatic the further you go forward.


I've been studying the period (mostly the Industrial Revolution and onward, though the acceleration of the late 19th / early 20th century is staggering), and it's pretty phenomenal.

There was a lot going on. Germ theory, of course, was part of it. But public health measures, especially sewerage systems, clean drinking water, and municipal waste removal, were all massive contributors. Note that the decline in mortality occurs well in advance of antibiotics and even most vaccinations.

For all the recent debate on vaccinations, it's interesting to note that the peak period of their impact (roughly 1930 - 1960) saw relatively little reduction in mortality, though there was a large decrease in disease incidence. It turns out that with septic control, antibiotics, food quality, and nutrition, many viral diseases weren't killers, but did present quality-of-life issues. And yes, often quite severe -- polio was no joke, and I know people who've suffered lameness from it myself. Measles and smallpox are similarly scarring and have long-term impacts.

But the major impacts of virtually all medicine are front-loaded to the period before 1950, with much of the gains since attributable to either greater access (especially for the disadvantaged) or removal of environmental agonists (lead, tobacco, alcohol, asbestos, miscellaneous poisons, safety hazards).


In the minds of middle management, I think this is precisely correct.


and as pointed out so much, is entirely why nobody wants to work for them. Respect these very bright people and you have a starting negotiation position.


I recently had a plumber do some work on a >100-year-old apartment. I was lucky: he's a very good plumber.

The job didn't involve too many "pipelines" but the knowledge and creativity required to make them work was well above what I see from most software developers.

"Plumber" is not the put-down that poster thought it was.


Janitors? They are certainly more than janitors! More like plumbers... getting your data safely from point a to point b without plugging things up while passing through [process] boundaries. How much does a plumber cost? $140 / hr? Sounds about right.


Data engineers are the janitors who keep your data clean and flowing.

In a boldface font, no less. The cockiness behind that language is really quite astounding.


It's really true, though. It's brutal, ugly work with no hope of an end.

Edit: Favorite paper on the topic: http://research.google.com/pubs/pub43146.html


So is the work that doctors, lawyers, and other highly-skilled people do, by and large. Everyone knows that day-to-day aspects of these jobs are hardly glamorous (or even cerebral), the vast majority of the time. Yet somehow we manage to accord these people with their due degree of respect, and wouldn't think of referring to them as "janitors".


I don't see what's so bad about janitors, though. They do very thankless jobs for not much money, whereas doctors and lawyers and other high-skilled individuals are often well remunerated or offered certain social prestige that your post shows is quite lacking when a humble janitor is considered.


There's nothing wrong with janitors, but the work they do can be done by any dumb monkey with almost no training. That's why the pay is so low for the job: anyone can do it, as long as they can lift trash cans and push a vacuum cleaner.

Plumbers are entirely different. They have to get their hands dirty working on some awful systems, but they actually have to know what they're doing, get specialized training, etc. Soldering a proper joint with copper pipes isn't that easy, and if you screw it up, it'll leak later and cause a lot of property damage. Knowing which pipes and fittings to use where is specialized knowledge. It's not something you can just grab someone off the street and train them to do in 30 minutes. Of course, plumbers also cost a lot too, and the ones who are self-employed (rather than their assistants) generally do pretty well financially.


A perfectly valid point. But getting back to the original article -- it pretty much takes a SV alpha-nerd (or aspiring CEO seeking to cater to them) to come up with language like that.


Hey I'm the author of this blog post and the CEO of the company that did the benchmark report. That was a very poor choice of words on my part, and I appreciate you flagging it. I reworked the paragraph to remove the janitor comment and (hopefully) make it clearer.


You should also not use "janitor" as a disparaging term. That would be another good takeaway from all of this.


I agree that it's a bad idea to use "janitor" as a disparaging term, and that was very far away from my intention. If that was what you took away from reading it, then that's more evidence that I didn't do a great job with writing the original draft.

Here's the original paragraph for reference:

Data engineers are the janitors who keep your data clean and flowing. Insights are great, and you need them. But to deliver insights at scale, you need data infrastructure. That’s delivered by data engineering. It’s not as fun to talk about as D3 visualizations and business intelligence dashboards, but it’s every bit as important.


Ignoring the breathless nature of the article, this is a buzzword label for a commodity skill set that pays a commodity salary in tech. It is also the commodity skill set that my employers have all paid me for.

There has been for a long time hype around new technology and labels for business intelligence, data warehousing, big data, and now data engineering/science. I'm not saying there are not some roles in this space that return huge value to organizations, but that these opportunities are much rarer than the buzz indicates.

I wonder if the perceived shortage is mainly hype as the shift to new cloud technologies makes many of the older ideas a little less useful - if you are plowing data into BigQuery, you probably aren't so worried about your star schema data model for reporting.

I would strongly advise people that look at these types of articles to look at the roles in question and ask "Is this role on the critical path to customers paying us?" My experience has been that the answer is often "No." This is bad. I have also seen situations where businesses that do rely on smart data integration can show that they are selling dollar bills for ten cents that still have trouble getting customers on board with spending that ten cents. Business is weird.


I'm trying to switch careers into "Data Engineering" now, as a full stack developer who is more interested in ML, and I've found almost no traction internally at my company or externally. It looks like I may just accept a full stack position at a good company that does a lot of data science for now, but thought I would ask - Where are all these jobs?


"Data Engineering" is most of the work that needs to be done, but I think companies haven't identified it as a category.

From my P.O.V., "Full Stack Engineer" is a place you don't want to be because it means putting out fires with whatever junk javascript is in the front end. It seems like everybody who's built a serious javascript application has invented their own Virtual DOM because none of the popular Virtual DOM libraries are good for much other than wasting time and CPU cycles.

"Data Scientist" is a bad title in its own way, in the sense that "Computer Science" is bad, but worse. To a lot of people there is a Brahmin kind of attitude associated with "Scientist" -- i.e. an aversion to getting your hands dirty. Real world data is pretty dirty and you aren't going to get far in getting value out of it unless you spend 80-90% of your time dealing with the dirt.


There are "Full Stack Engineers" doing pure native applications, which is what I have been doing the last three years after escaping the web back into native land.


You are correct. I used to think full stack meant building the app start to finish, but the reality is often closer to putting out other people's fires in every layer. It does pay well though and you learn a lot of what can go wrong.


The fact that it pays well makes it a job you're likely to get laid off from. Most managers would rather hire two junior developers so they can screw it up faster or better yet hire some people in another country who are really fast and cheap at screwing it up.


That may be true but I'm not worried about that, I worry more about getting comfortable doing useless work. If I got fired it would be so much easier to go back to school, as the dream of lots of money while learning on the side would evaporate.


My official title is "Data Scientist" although I'm closer to the "ML Engineer" someone else mentions in a child comment.

Frankly speaking, if your company doesn't need a data engineer, it won't hire one or move you into that role. They likely don't, either, if you're experiencing this pushback -- data engineers often develop ETL pipelines or data warehouses, both of which are very useful if your company has a data team and very useless if it does not.

That said, you may want to move closer to my role. There's actually a shortage of data-savvy people who can also write production software, and you would nicely complement a more research-inclined data scientist or analyst -- someone with far more experience with research/analysis than development.


> There's actually a shortage of data-savvy people who can also write production software, and you would nicely complement a more research-inclined data scientist or analyst -- someone with far more experience with research/analysis than development.

I experience the same problem with shortage-at-price-X in the field you describe. I'm a machine learning engineer with experience in MCMC methods, but I also have a lot of low-level Python and Cython experience, some intermediate experience with database internals, and lots of experience writing well-crafted code for production systems.

There are basically zero companies willing to pay what I'm seeking (which is a salary based on my previous job and a few offers I got around the time I took that job). In fact, in some of the more expensive cities, the real wage offered is far lower than other markets.

I've seen reputable, multi-billion dollar companies offering in the $140k range for this type of role in New York. That's wildly below anything reasonable for this sort of thing in New York. I've seen companies in Minneapolis offering $130k for the same kind of job -- and even that is still too low for Minneapolis! The same has been true in San Francisco as well.

Because these companies value you more for simply looking good on paper and looking good as a piece of office ornamentation when investors stroll through, and they view you as an arbitrary work receptacle closer to a software janitor than a statistical specialist, their whole mindset is about how to drive wage down.

Frankly, given the stresses of the job and the risk of burnout, I think it's actually a terrible time to be in the machine learning / computational stats employment field, despite all of the interesting new work and advances being made. The intellectual side is good, but the quality of jobs is through the floor.


"I've seen reputable, multi-billion dollar companies offering in the $140k range for this type of role in New York. That's wildly below anything reasonable for this sort of thing [in NY/SF]"

Man, do I ever agree. This is where the "shortage" argument falls apart.

This is why I'm so uninterested in the abstract arguments happening elsewhere on this topic about whether markets are failing and basic laws of supply and demand no longer apply at theoretical salary levels (10 million was offered as an example).

Why are we bothering with this debate, when it's so far from reality? I'd say that if you're trying to hire a very high skilled and critical tech worker in SF, and you just can't find one no matter how hard you try, and then I find out that you're only offering 140k a year?

In San Francisco and New York (and anywhere else in the US, really), that's nowhere close to the kind of pay where we should start scratching our heads about a shortage and start wondering why the usual laws of supply and demand aren't working anymore.


Yeah, I strongly believe companies haven't (or aren't willing to) figure(d) out the IC track problem for data people in the way they've figured it out for engineers. Part of me wonders if it even makes sense for them to figure it out, if they're not an Uber/Netflix/Amazon with a strong need for advanced ML abilities.

It sounds like you're a principal/lead/post-senior ML engineer; at that level, you can easily command more than $140k but you have fewer options to apply those skills at companies that really need them (because few companies actually need them).

I don't know. It's tough. I agree that it might be a terrible time to work in ML/computational stats because of stuff like this.


I suspect the reason is those companies offering $140k frankly don't need that level of expertise. With that kind of background it would be fairly easy to get 200-300k as an infrastructure engineer at a quant shop.


Oh, also: if you're in NYC I'd be happy to meet over a coffee/beer to swap stories. Feel free to use the contact info in my profile.


I think the company does need data engineers but wants someone with a graduate degree from Stanford or CMU in that position, even though the actual work is in building up infrastructure for those people. And I understand. I've only really got software engineering skills to contribute at this point and I'm picking up the ML from kaggles on the side; I am looking for a position that can increase my overlap between those, because learning at home while working on unrelated stuff is making me move slowly and painfully. Your experience sounds exactly like what I'm looking for - data-savvy writing production code, complementing a research-heavy team I can learn from. How did you get started in that?


I honestly fell into it by luck. I moved to NYC, studied machine learning in grad school, networked my ass off, and landed an internship.

From there I went full time as something of an ML engineer at a company with a strong tech culture, and learned as much as I could in both tech and ML/statistics. The rest is history (although I'm by no means a rockstar or whatever).

My path is hard to reproduce -- it starts with being in NYC or SF at a specific point in time, before the labor market became saturated with data science bootcamps and PhDs furiously learning Python while working on their dissertations.

Your best bet at this point is to produce a few data-related projects (maybe work on open source like scikit-learn and pandas?) and network like crazy. Someone somewhere will have a need for someone like you.


Thanks! I guess it's somewhat reassuring that it's hard to break into for everyone and I'm not just dumb :) I'll keep kaggin'


>There's actually a shortage of data-savvy people who can also write production software

Well no kidding, that's one person doing two jobs. That's easily a 5-10 year training time depending on how high a quality you demand from their production software.


We (Kaggle) run a data science jobs board (https://www.kaggle.com/jobs) that gets a few data engineer listings from time to time. Not all of these are active, but you may find a few interested companies via - https://www.google.com/#q=site:https://www.kaggle.com/jobs+%...


Thank you guys! Doing Kaggle competitions is what got me interested in seriously pursuing ML in the first place. You are all seriously awesome.

I'll look again at the board but, I didn't see anything there before that wanted software engineering skills (which I have with industry experience), and not a graduate degree (which I don't), and happened to be commutable from my place just south of the bay. But I will keep looking!


I see tons of them. If you're interested in ML, you're probably more looking towards data science. Data engineering (in general) is more about getting the data in a state where it can be used (extracted, cleaned, moved, transformed, etc.) at least from what I've commonly seen in the industry. A decent breakdown is here: https://blog.insightdatascience.com/data-science-vs-data-eng...


You might want to look at "Machine Learning Engineer" positions if you want to do ML in practice, it's starting to be a title I see somewhat often now.

As others have pointed out Data Engineering is more about building data pipelines, making architecture decisions for your ML stack, things like that. Less about model building, prototyping and training, which is what I think of when somebody says they 'do' ML.


Right, I'm not picky about the title. I'm looking at those positions too. The main thing is, I want to be able to contribute using my existing software engineering skills from day 1, while picking up the ML stuff. It's been really hard to basically work an unrelated job during the day and go home and do kaggles for practice, so I am hoping to get more of an intersection as a launching place. Anything touching the data or the models will do :)


ML falls more under a Data Science role than Data Engineering, although ML is much more difficult without proper Data Engineering.


You should put your email in your profile. If you're in Seattle, send me an email.


I've heard more than one CTO/Sr. Engineer refer to people in these roles as 'data grunts' or something similarly dismissive. Then they're mystified as to why solid engineers are so quick to move up or out, year after year.


Every time something comes up on HN about a talent shortage in a field related to software engineering, it hurts. I have been unsuccessfully looking for a full time position since my last start up (I was not a founder) folded six months ago. I have been on over 25 in person interviews and gone through untold degrading whiteboard interviews, code tests, trick questions, and take home projects; all have ended in rejection. This industry has a need to torture candidates because we are all considered to be liars by default. Much is said about combating impostor syndrome in ourselves but we are too eager to engender it in others.

It seems people in this industry refuse to understand that some people are not perfect. I never graduated college because I hated it with the very fiber of my being, so I am not particularly great at white boarding answers to algorithm questions off the top of my head in a high pressure environment. If I need them during my job, I look up answers and learn from people who are much smarter than I am.

My personal identity has been shattered, as I thought my ~5-10 year history of success in the industry indicated I was in demand and talented. I saw posts like this and thought that if the worst happened I'd still be able to find a job. The idea that there is a talent shortage is a lie, or candidates like me wouldn't be treated as I have been. I'm not asking for a free job, or a handout. I have had a successful career so far and am capable of doing good work. But I'm not a specialist in Big Data Machine Learning Neural Networks.

I have struggled with bipolar disorder and suicidal ideation most of my life. I've dealt with the death of my beloved grandmother and my father who was instrumental in my choosing to be an engineer with only minor lapses in control. Nothing has caused me to consider taking my own life as much as the past 6 months. It seems there is no future for me in the only career I have any skill in and which is a huge part of my identity. And to constantly be told that there is such a shortage of engineers only salts the wound.


" I have been on over 25 in person interviews and gone through untold degrading whiteboard interviews, code tests, trick questions, and take home projects; all have ended in rejection."

The fact that you pulled through 25 of them is already commendable. Unfortunately as a labor provider you'll be subjected to all kinds of crap for the privilege of working.

Every single person on here needs to have a secondary business going on right now. Doesn't have to be a highly skilled industry either, selling hand made stuff on Etsy can be a lifeline in these situations.


Hey, I'm going through something similar. I had to quit an amazing job because my wife and I pursued a dream and moved to Europe (no remote).

I had always had an easy time getting a job before but this time it was different. Granted I knew it'd be tougher since for remote jobs, the world is the competition. But it was a summer of endless shitty timed hackerrank-style tests (virtual whiteboard hazing). I would tell my co-workers about them and they'd laugh in bewilderment at the questions that were asked in what should be a technical screener, and these are extremely smart and productive software guys that have started companies, written books, give conference talks. One funny question I got for a frontend React job: write a function that takes a sequence of bits that represent a negative-binary number (not a base-2 number that is negative, but a base-(-2) number) and return its negated value in base-2. For a frontend job. It was one of 4 questions to be answered in 90 minutes. gtfo.
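
For the curious, here is a rough Python sketch of that base-(-2) question as I read it. Assumptions that were not in the original prompt as quoted above: the bits arrive least-significant first, and the function names are made up for illustration. It decodes to a plain integer, negates, and can give the answer back either in ordinary base-2 or re-encoded in base -2.

```python
def from_negabinary(bits):
    """Decode bits (least-significant first, an assumption) as a base -2 integer."""
    return sum(bit * (-2) ** i for i, bit in enumerate(bits))


def to_negabinary(n):
    """Encode an integer as base -2 bits, least-significant first."""
    if n == 0:
        return [0]
    bits = []
    while n != 0:
        n, remainder = divmod(n, -2)
        # Python's remainder takes the sign of the divisor, so it can be -1.
        # Shift it to 1 and carry into the quotient to keep digits in {0, 1}.
        if remainder < 0:
            n, remainder = n + 1, remainder + 2
        bits.append(remainder)
    return bits


def negated_in_binary(bits):
    """Negated value as an ordinary base-2 string, e.g. '-101'."""
    return format(-from_negabinary(bits), "b")


def negated_in_negabinary(bits):
    """Negated value re-encoded in base -2, least-significant bit first."""
    return to_negabinary(-from_negabinary(bits))
```

Not hard once you have seen the carry trick, but not something I would expect anyone to produce cold as one of 4 questions in 90 minutes.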

A few companies would reply, most strung me along while -- I realize now -- they were keeping me as a backup(-backup) in case their "A-player" turned them down. Countless interviews, hours on takehome projects, it was tough. I learned to cut bait if the company was slow to move forward, had weeklong periods of no communication, etc.

I (just very recently) found it's easier to land small contract gigs because the barrier to entry seems to be lower, demonstrate value, and keep getting work from those guys after the initial project was done. It is different but so far I actually like the freedom that comes with contracting. I haven't been at it long enough to experience the downsides.

There's definitely not a shortage of talent. It's that every company thinks they need "A-players", when the vast, vast majority are doing a damn basic CRUD app.

Just wanted to say I hear you brother and share my story in some solidarity. You will find something, just keep plugging away. Each "failed" attempt makes you better no matter how many attempts it takes. Cliche of course but it is true. I am very lucky in that I don't face the mental demons you do, even then this job search hit me pretty hard. Please be proactive and take care of yourself, body and mind (body goes a long way toward mind also).


anything and everything is marketed as "data science" and "data engineering" these days because this is the buzzword of the day.

I've been dealing with large data even before "big data" was a word but I don't call myself "data scientist" or "data engineer". I am still a software engineer working on what benefits my organization.

"Serial Entrepreneur" is the same these days, claimed by anyone who had a lemonade stand as a kid.


> I am still a software engineer working on what benefits my organization

But if you saw a nearby local maximum that's higher than your current local maximum, wouldn't you change what you call yourself, if it means being paid more but doing the same work?

This is similar to how the average "software engineer" makes about $30k/year more than the average "programmer".


It's digital Charlie Work [0], that's why.

I really enjoy that kind of work but it is difficult to articulate your business value in that environment. The best thing is working closely with a data scientist/front-end dev who can deliver products to the analysts and executives that need the data and make sure that you get the credit for enabling new streams of data. But most of the time you are putting out someone else's dumpster fire.

One advantage of data engineering: unlike front-end work, there are few non-technical people who will have an opinion on how you are doing things and burden you with bikeshedding.

[0] - http://www.avclub.com/tvclub/its-always-sunny-philadelphia-c...


The fact that the original, unmodified article referred to data engineers as "janitors" pretty much says it all.

It's very analogous to front-office and back-office work in Investment Banking. "Data Scientists" are the front-office, with all the prestige, and "Data Engineers" are the back-office, doing a lot of the heavy lifting without nearly as much recognition.

In my opinion there shouldn't be a delineation. You shouldn't be a data scientist if you can't gather, process, and clean up your own data.


Ideally you'd have a symbiosis, and each side would recognize the importance of the other.

Even if you require your data scientists to be able to do engineering work, it's probably way more efficient to have some good generalist Software Engineers doing all the "pre-math" work and freeing your statisticians up for what they're (hopefully) good at.

Plus as a side effect, your software will probably be better.


There are 6600 jobs listed and 6500 individuals on LinkedIn with that particular title, and therefore there's a shortage? Seriously?

* How many aren't on LinkedIn?

* Since the whole article is about how the job title is poorly defined and growing in prevalence, why would you assume that people who don't already have such a job would use the term?

* The "growth" charts on the full study are just as bad - how much of that is just from renaming existing generic developer positions, since "data engineer" is clearly a relatively new term?


6500 data engineers on all of LinkedIn, but 6600 job openings in the Bay Area. So there are more job openings in one area than there are data engineers on all of LinkedIn.


Data engineering sounds much better than "data plumbing", but in my experience the latter is a more accurate description of the work of a data engineer: Building -and often unclogging- pipes that transport data from A to B, and putting in filters to clean it and extract the useful bits.

So why not change your LinkedIn job title to "data plumber", which is sure to get you some serious recruiter attention ;)
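
To make the plumbing picture concrete, here is a toy sketch in Python of pipes and filters chained together. Everything in it (the function names, the `name` and `amount` fields, the in-memory list standing in for a real source and sink) is made up for illustration; real pipelines deal with far messier inputs.

```python
def extract(source):
    """Point A: yield raw records one at a time (here, from an in-memory list)."""
    yield from source


def clean(records):
    """Filter: drop records with no usable name and trim stray whitespace."""
    for record in records:
        name = (record.get("name") or "").strip()
        if name:
            yield {**record, "name": name}


def pick_useful(records, fields=("name", "amount")):
    """Keep only the fields downstream consumers care about."""
    for record in records:
        yield {field: record.get(field) for field in fields}


def load(records, sink):
    """Point B: append each record to the sink and report how many arrived."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_pipeline(source, sink):
    """Connect the pipes: extract, then clean, then pick fields, then load."""
    return load(pick_useful(clean(extract(source))), sink)
```

Each stage is a generator, so records flow through one at a time, and "unclogging" usually means working out which stage is silently dropping or mangling them.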


Ahh the ol' write a post about a not well understood distinction and then proceed to not explain the distinction.

Looks like we need more English engineers too.


I'm puzzled at the omission of Scala and Spark in this report.


I worked for about 10 years doing exactly what they want, but I ended up having to write a lot of the tools which means I'm not able to check the boxes on some tool you require which gets me punted by HR.

I'm starting to think that the message is if HR is going to do checklists then developers should really make sure they work mostly with contracts that use popular checklist items.


As a data person I would really like to put some numbers on how much the typical HR hiring process costs a business. I don't know anybody that says they are happy with how hiring works in the tech industry but I've also never seen an HR person try and improve the process.


That's because the system is already optimised for the needs of HR people.


Quick sidenote, anyone know where the databases / distributed systems engineering jobs are at? E.g. if one wanted to not use these tools but also go help build these tools?

I can think of Facebook, Google, Microsoft, IBM (which locations and groups within these companies / where?). I can also think of Confluent, CitusDB, Databricks, etc.


Market Research is a $40B industry that depends almost completely on these concepts. I'm not sure how prevalent distributed systems are with MR companies, but that's an implementation detail anyway.


> that's an implementation detail anyway.

Which is what the poster was asking for.


Weirdly the problem is most hires have it backwards.

Before going out to the market and discovering what talent exists, and consequently what salary it will take to get them to join (i.e. negotiate), most organisations decide on a salary range, usually reflecting the current internal structure rather than the current external market.

The longer an organisation has existed, the more out of whack with the market its internal setup is.

As such, companies decide on their price point first, then go looking, which is of course backwards.


Am I the only one who thinks there will be a ton of people changing their job title on LinkedIn to "Data Engineer" as a result of this article?


I am thinking about it. Actually a friend recommended that I change my title to Data Engineer a few months back.


We surely need data mechanics.


These "shortage" stories always make me roll my eyes, because they're usually about money more than anything. And money is usually about cost of living more than anything.

If you choose to locate your company in one of the highest cost of living regions in the world, then you are complicit in the "shortage". Supply and demand - pay up. Or don't.


It was only 20 years ago that companies hired a "web master" or a generalist to do everything. But pieces of those jobs became specialized. Now we need UX designers, UI programmers, general engineers, DevOps, data engineers, data scientists, etc.

And how many companies are still interviewing with fizzbuzz?


I am a data engineer working on a machine learning team with models actively used as part of our product(s).

From my experience working in various contexts (applied machine learning, analytics, policy research, academics, etc.), there are several factors that contribute to this shortage: (1) "data engineering" often requires a lot of breadth and knowledge, (2) "data engineering" is often (derisively and naively) referred to as the "janitorial work" of data science, (3) the spectrum of roles and requirements within the "data engineering" domain, in terms of job descriptions, can range from database systems administration, to ETL, to data warehousing, curation of data services / APIs, business intelligence, to the design/deployment/operation of pipelines and distributed data processing and storage systems (these aren't mutually exclusive, but often job descriptions fall into one of these stovepipes).

Some of my quick thoughts and anecdata:

Companies have made large investments in creating 'data science' teams, and many of those companies have trouble realizing value from those investments.

A part of this stems from investments and teams with no tangible vision of how that team will generate value. And there are several other contributing factors…

"Dirty work." People haven't learned how to, and more often don't want to do it. There's a vast number of tutorials and boot camps out there that teach newcomers how to "learn data science" with clean datasets -- this is ideal for learning those basics, but the real world usually does not have clean or ideal datasets -- the dataset may not even exist -- and there are a number of non-ideal constraints.

There are people who wish to call themselves “data scientists” but “don’t want to write code” and would “prefer to do the analysis and storytelling.”

Engineering as the application of science with real world constraints: there are a number of factors that we take into account, often acquired through painful experience, that aren’t part of these tutorials, bootcamps, or academic environments.

Many “data scientists” I’ve met have a hard time adapting to and working with these constraints (e.g. we believe that the application of data science would solve/address __ problem, but: how do we know and show that it works and is useful? what are the dependencies, and costs of developing and applying that solution? is it a one-time solution, or is it going to be a recurring application? does the solution require people? who will use it? what are the assumptions or expectations of those operators and users? is it suitable? is it maintainable? is it sustainable? how long will it take? what are the risks involved and how do we manage them? is it re-usable, and can we amortize its costs over time? is it worth doing? This is part of a methodology that comes from experience, versus what is taught in data science)

Larger teams with more people/financial/political resources can specialize and take advantage of these divisions of labor, which helps recognize the process aspects of applying data science and address some of the above.

Short story: if you view data engineering as "janitorial work" you're missing the big picture.

Anyone else notice that the attributes of a 'unicorn' data scientist include the traits of a 'data engineer'?


How does one get started with this? I suppose a lot of people who hang out at HN are competent devs, good at programming and databases, but probably beginners in math, ML, AI etc. How does such a person get started and find a job in this field?


in my mind the problem is really simple: most executives aren't smart enough to understand how any of this shit works, or build a compelling business case around it. they just know they need a 'big data' team, so it just dies on the vine.

someone with enough smarts to build/lead a team, sell to executive management, and have an actual business application is just too rare compared to the prevalence of the engineering talent.


So I know SQL, Python, Django, Java (though it's been a while), JavaScript, Linux, some cloud computing and a bit of devops. Am I a data engineer? Software engineer, with a lot of database background? What makes a data engineer different from a software engineer?


- The challenge for an organization is to recognize that there is a significant difference between the 'data engineer' working on a vertical project and the 'data engineer' responsible for integrating data across the enterprise.

- The project 'data engineer', in today's world, most likely will be a software developer responsible for ETL, etc. The data design will be more or less up to the software developer.

- An enterprise 'data engineer' is more concerned with data that affects the enterprise. This typically involves some sort of data integration. For example, how to integrate relevant data from N projects (e.g. A, B, C ... Z) where each project has its own idea of how to represent similar concepts (e.g. person, user, customer), with different provenance, truth assertions, access rules, data retention periods, granularity of metadata (e.g. at the attribute level vs entity level), etc. The enterprise is interested in questions like 'What did we know and when did we know it?', etc. The enterprise 'data engineer' will probably levy requirements on the project 'data engineer' to meet the enterprise's needs.
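
To make the integration point concrete, here is a rough Python sketch. Everything in it is made up for illustration: the two projects, their field names, and the `EnterprisePerson` record are assumptions, not anything from the article. It only shows the general shape of mapping each project's idea of a person onto one enterprise record that keeps provenance and an assertion date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EnterprisePerson:
    """Hypothetical enterprise-level record; every field name is an assumption."""
    full_name: str
    source_project: str  # provenance: which project asserted this record
    asserted_on: date    # supports "what did we know and when did we know it?"


def from_project_a(row: dict) -> EnterprisePerson:
    # Hypothetical project A calls the concept a "user" and splits the name.
    return EnterprisePerson(
        full_name=f"{row['first']} {row['last']}",
        source_project="A",
        asserted_on=row["created"],
    )


def from_project_b(row: dict) -> EnterprisePerson:
    # Hypothetical project B calls the concept a "customer" with one name field.
    return EnterprisePerson(
        full_name=row["customer_name"],
        source_project="B",
        asserted_on=row["signup_date"],
    )


def known_as_of(records, cutoff):
    """Return the records that had been asserted on or before the cutoff date."""
    return [r for r in records if r.asserted_on <= cutoff]


records = [
    from_project_a({"first": "Ada", "last": "Lovelace", "created": date(2016, 1, 5)}),
    from_project_b({"customer_name": "Ada Lovelace", "signup_date": date(2016, 6, 1)}),
]

# Only project A's record had been asserted by 1 March 2016.
early = known_as_of(records, date(2016, 3, 1))
```

A real enterprise model would also need access rules, retention periods, and attribute-level metadata, which this sketch leaves out on purpose.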


Just checked, the # of data engineers rose to 9,246 (42%) in the last six months. So, the shortage is at least being addressed by people changing their job titles on LinkedIn.


What I've learned from the comments: If something is valuable, there is a shortage of it.

I'm not even sure if I'm being sarcastic.


We hire only the best! We only hire the top 1% of candidates.

But only 1 out of 100 are qualified :(



