This was wonderfully written, and if you're gonna start a data team, this is how you do it. But I can see that I’m the only one who thought it was crazy to start a data team in the first place.
This company makes 10M and spends 3M on the team and infrastructure to make data a core competency?
The vast majority of the wins discussed were poorly differentiated web / mobile / supply chain analytics, which they could have gotten and set up with 3rd party software for an order of magnitude less.
I can only imagine what this hypothetical startup could have learned if they spent that money actually talking to customers, and running more experiments.
I’ve heard people talk about data as the new oil, but for most companies it’s a lot closer to uranium: hard to find people who can handle / process it correctly, nontrivial security/liabilities if PII is involved, expensive to store, and a generally underwhelming return on effort relative to the anticipated utility.
My takeaway was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought, not built. Obviously there are a couple of exceptions, such as regulatory reasons like HIPAA compliance, for which building in-house can be the right choice if no vendor fits your use case.
As someone who reaches for code if they need to blow their nose: what is a 3rd party vendor going to supply that “English-to-SQL translators” won’t?
(I have not finished the article, but the idea that devs / data scientists can be replaced by some vendors makes me wonder what I have missed)
So my assumption is that for a given business model, like e-commerce or a SaaS business, much of the highest-value analysis is fairly standardized and can be templated. For example, breaking down conversion rate by weekly cohort is something that can pretty easily be done in Google Analytics.
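To make "conversion rate by weekly cohort" concrete, here is a minimal sketch in plain Python. The function name, the input shape (signup date plus a converted flag) and the use of ISO weeks are all assumptions for illustration, not how Google Analytics computes it.

```python
from collections import defaultdict
from datetime import date

def weekly_cohort_conversion(users):
    """users: iterable of (signup_date, converted) pairs (hypothetical shape)."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [signups, conversions]
    for signup_date, converted in users:
        iso = signup_date.isocalendar()
        cohort = (iso[0], iso[1])  # (ISO year, ISO week number)
        totals[cohort][0] += 1
        totals[cohort][1] += int(converted)
    # Conversion rate per cohort, ordered by week
    return {c: conv / n for c, (n, conv) in sorted(totals.items())}

if __name__ == "__main__":
    sample = [
        (date(2021, 7, 5), True), (date(2021, 7, 6), False),
        (date(2021, 7, 7), False), (date(2021, 7, 8), True),
        (date(2021, 7, 12), True), (date(2021, 7, 13), False),
        (date(2021, 7, 14), False), (date(2021, 7, 15), False),
    ]
    print(weekly_cohort_conversion(sample))
```

The point is less the code than how little of it there is: the analysis is a group-by and a ratio, which is why it templates so well.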
The problem with English-to-SQL translators, or most coders in general, is the assumptions we make, in particular about the underlying data. For example, say we want to join two tables, so we write a query to join on two columns and call it correct, which it is from a logical or schema perspective. However, null values, defaults like 0, many-to-one vs one-to-one relationships, issues with instrumentation such as networking timeouts or bot detection, etc. can all impact the downstream metrics. My point is that when there are 500 lines of SQL in a query, such as those mentioned in the article, there are a lot of ways to be mostly correct but cumulatively wrong.
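A small sketch of the kind of thing I mean, using Python's built-in sqlite3. The `orders` and `shipments` tables are made up for illustration. The join is "correct" from a schema perspective, yet one order is double counted (many-to-one) and another silently disappears (NULL key).

```python
import sqlite3

def revenue_totals():
    """Return (true revenue, revenue computed through a schema-correct join)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, amount REAL);
        CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
        INSERT INTO orders VALUES (1, 100.0), (2, 50.0), (3, 25.0);
        -- Order 1 shipped in two parcels; shipment 13 lost its order key.
        INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2), (13, NULL);
    """)
    true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
    joined_total = conn.execute("""
        SELECT SUM(o.amount)
        FROM orders o
        JOIN shipments s ON o.order_id = s.order_id
    """).fetchone()[0]
    conn.close()
    return true_total, joined_total

if __name__ == "__main__":
    print(revenue_totals())  # (175.0, 250.0)
```

Neither query errors and both look reasonable in review. Stack a few dozen of these inside a 500-line query and the final number can be off by a lot without anyone noticing.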
Like many sufficiently popular open source tools, 3rd party vendors get battle tested: issues get found before you hit them, and vendors can justify devoting more resources to rigorously ensuring correctness than the average analyst has the time or energy to do, because their business depends on you trusting the outputs.
I’m not saying you couldn’t do all this yourself. But given the sheer number of analytics tools that are reasonably priced, you might have chosen to spend your time on something more specialized like a recommendation system.
Heap might be good but they are crazy expensive. We were quoted something like a quarter million dollars. Good luck getting that signed off, plus you still need quite technical analysts to run the thing.
I've found https://contentsquare.com/ to be much better received by juniors and seniors alike, and it's a fraction of the cost of Heap.
That’s generally how pricing works for SaaS products - most later stage customers have stricter or more customized needs. Think support SLAs, SSO, ACLs for their employees, etc.
So for example, the author saw that the supply chain team had difficulty managing the complexity and scale of their analysis, in large part due to the limited scalability of their spreadsheet solution. I would have pushed them to use Airtable, which is basically a more scalable spreadsheet. By choosing the data pipeline route, the people who understand how to improve the supply chain model and the history of decisions that went into it, as well as previous missteps, now have limited ability to experiment with improving it. In my experience, every rewrite of a system loses something in translation, which makes me think that in the author’s example the life of the analysts got better but the quality of the supply chain model may have gotten worse.
In the long run, there is plenty of useful logistics software that should do everything they want, but the most important thing is to empower the people with domain expertise in the data to be as close to the solution as possible. Better decisions are more often a result of better information/experience than of better analysis. Unfortunately I haven’t studied these vendors well enough to make any suggestions, though I believe the solutions are well defined enough to write textbooks on, which suggests to me that existing software and I would mostly implement similar methodologies.
On the marketing and product analytics tools, I think 80% of the problems boil down to measuring conversion rates and then comparing those rates across different contexts to select the contexts which improve those rates.
Another user mentioned Heap, which is a great product if you know you don’t know what contextual data is meaningful but you suspect that it’s partially in how users interact with other parts of your website. Personally I’d use Heap judiciously, since I suspect there will be limitations to how useful the historical data will be in the future, and collecting everything is expensive. One limitation is that site interactions are only part of the potentially important context. Another limitation is that startups change rapidly, so their historical data often depreciates in terms of providing insight into their current problems. For an extreme example, I’m sure Zoom’s conversion data before and during the pandemic look completely different. But even a small tweak to Google’s search algorithm could totally change what type of customer finds your site.
Personally I’d advocate talking to customers, potential customers, and other stakeholders to understand what is important, and then measuring that. Most companies currently do the opposite: they take a lot of measurements and then try to figure out what’s important. The first approach can probably be done in Google Analytics. For the second I might try Amplitude, which is what I imagine a tool like Heap will eventually try to evolve into.
The hardest person in the organization to help with data is the CEO, because really they use data as a form of sales tool and for reporting. The closest I have seen a tool come to doing this in a way the CEO could mostly self-serve is Sisu Data. Though it’s the CEO, so it’s probably reasonable to hire some help anyway.
Lastly, data warehouses were the gold standard in the early 2010s, but Presto is a better fit these days for companies whose data is distributed across many different places.
You're making a pretty big assumption on cost of team & infrastructure there. This company could have 100+ people with that kind of revenue (I've worked at a company this size before). The data team is only about 6 people. The cost of the data team & infrastructure is likely less than $1M
Having unique data is quite valuable. If your organisation can make decisions based on signals that other people can't detect then it can gain a decisive edge.
I do wonder at the anecdotes in this article though. In businesses that I've seen, the data team is usually the biggest impediment to a data-driven culture because they have databases full of numbers and no real grasp of how that links to the decision making process that makes the business money.
Beefing up the team doesn't help. In data, as in business more generally, the important thing is not to guess what job you're doing but to spend a lot of time talking to customers about what job they need done. If the data team is where that work happens in a business then that can be helpful - but the grunt work of SQL/reporting/basic analysis is almost never where the value comes from.
> My take away was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought not built.
I really like your takeaway about data teams at tech companies. They try to make "data" a core competency of their business, at huge cost for fixed value.
I also appreciated the very subtle implication that the OP is shrouding empire building under an otherwise informative growth story.
This is so eerily familiar I swear I've had many of these exact conversations word for word. The only way this doesn't turn into a complete nightmare of a cluster is if the exec team "gets it". If so, you just might stand a chance at building a data team that gels with the rest of the org.
But if the exec team simply hired you for window-dressing, expect to be treated like a scapegoat and a punching bag. Any mistakes will be your fault. Any wins will be to the credit of the business. The Director of Product will ask to "embed" dedicated DS headcount and you won't have any real power to shape the roadmap. If the exec team doesn't give you equal footing with Product (or Marketing, Finance, and Eng for that matter) then this will rapidly become a soul-sucking job. However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads, then you actually might be able to accomplish something really cool.
This applies to most specialties. Companies tend to have a few teams that lead the charge and expect everyone else to follow. Knowing which teams get the authority and which teams are along for the ride at a company is important for knowing what your job experience will look like.
> However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads
I know this was meant partially in jest, but if you reach the point where you're at odds with all of the teams and departments in the company, you may get a lot done in the short term, but long term it's going to be difficult if you don't have some allies in each of those departments. Obviously no one should roll over and take orders from other departments, but sometimes it's necessary to do some give and take to build rapport. It's a balance, not a war.
Thanks for the tips! One mantra I've tried when starting at a new job is "for the first 3 months say yes to everything, for the next 3 months say no to everything." The idea is you first immerse yourself in everything, to find out what works and what doesn't. Then you dedicate time to fix the broken processes so that hopefully when you hit 6 months your team is better positioned to be more efficient. Obviously you can't be too rigid, but it seemed to work for me when I had buy in. Curious if you think that approach sounds good.
Good advice as long as you don't take it too literally.
The most important thing is to work closely with your manager on expectations. If someone from another department comes to you with a proposal, an ask, or a directive, you don't want to say yes without first consulting with your manager. Depending on company politics, some managers might try to rope new employees into doing work that isn't actually part of their job description.
Discovering expectations and then proactively managing those expectations is key in any role.
Very good advice. I've also seen this from new ICs (incidentally from one of our new data guys). I bet he said yes but he shouldn't have.
New guy, knows nothing about the company and product yet but was asked to "get KPI X by end of day". He obviously has no idea how to get this done so goes to various people and throws around the "VP XYZ wants this by end of day, help me now or else!".
Needless to say I, as politely as I could, told him to shut it, look at his data and what he could get from it and stop interrupting dev with mid day, two days after start of a sprint, requests to do his work for him (dude I don't even have access to your data storage, don't know what data you have or don't etc). And do it by end of day. Sure.
The guy is burned for me now. He will have to do a LOT of sucking up to dev now for his try at "do my job for me or else"
In my experience much of this is a question of trust, political capital and soft power. Find out the problems that the key players in the business are actually having that you can solve and then solve them. Find out what the key KPIs are for the business and make a plan to improve them and then have a plan to publicize that improvement. And make sure to hire a team that covers your weaknesses rather than exposes them. Don't fight people if you can help it, either they're as competent as you on average or you shouldn't have taken the job. Figure out how to help them and what they need to work more efficiently and then give it to them. Sure there's a ton of politics involved in all of that but that's management in general.
> However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads, then you actually might be able to accomplish something really cool.
So what's the business case for having a data team independent of product, business and engineering?
Because as I see it, the data team is a support function, not a core part of the business. I'm sure it can be cool for you, but if you are at odds with all the people actually creating value, what exactly do you bring to the table?
Engineering builds some schema, creates and uses multiple data stores, message queues, etc. Eventually the queries no longer work properly as the company scales and gets more and larger customers, along with hundreds of other issues. Doesn’t engineering need a proper data engineering team/DBA/you name it to handle those?
This is probably the single best written and most realistic article I’ve read on HN ever, and I’ve been on HN for a long long time. It’s so realistic I wonder if the author took it from his diary or something. Everything about it is supersaturated with authenticity and teaches better than any other article I’ve read. Kudos to the author, and I would love to see this style of article take off.
> You notice a lot of the code starts with very complicated preprocessing steps, where data has to be fetched from many different systems. There appears to be several scripts that have to be run manually in the right order to run some of these things.
> “We need to focus on delivering business value as quickly as possible”, you say, but you add that “we might get back to the machine learning stuff soon… let's see”.
So so relatable. But the key insight is a really really key insight.
> What I think makes most sense to push for is a centralization of the reporting structure, but keeping the work management decentralized. Why? Primarily because it creates a much tighter feedback loop between data and decisions. If every question has to go through a central bottleneck, transaction costs will be high. On the other hand, you don't want to decentralize the management. Strong data people want to report into a manager who understands data, not into a business person.
I have the same role at a non-software company, and to me this is nothing short of a complete reimagining of IT. It’s not just, “make sure everyone’s computer works and help them install software,” it’s, “build a model of the business, determine what information flows and metrics are crucial to success, and build an IT and analysis infrastructure around that model.” The CIO will soon be better thought of as the Chief Optimization Officer.
> Is it possible to do the latter, but still end up with a well-curated source-of-truth for your data?
It's important to get the core centralised data infrastructure up and running (even if it's dirty af) as that helps with the bulk of the data work.
The oft quoted not completely true but kinda true statistic is that 70% of data work is finding, cleaning and storing the data. Analysis and modelling is the easy bit.
You could do it the other way around. Hire some data people in each team and get them to meet up every once in a while.
But I'd wager the central data stuff that makes everyone's life easier will get pushed back behind the "urgent" team work every time.
#ConwaysLaw
Edit: it's possible to do both btw. E.g. Have a bunch of centralised data engineers that do the heavy lifting stuff. With data scientist/analysts embedded in teams doing the fine grained modelling stuff. It's not a binary choice (once things are up and running).
> My (uninformed) impression is that data-driven is responsible for rather a lot of rot.
I agree! I was talking to someone else (not a tech head) the other week and realised why they hate tech so much... User interfaces that just... Don't work.
Showed him a terminal cli and he went nuts over it.
Then again, we're two kinda weird ye olde "back in my day" kinda people... So...
Interesting. I'm a bit of a hybrid CLI/GUI user. There are things that I find easier to do in a CLI (or with text in general) and things where a GUI is more natural.
CLIs are finicky and force you to think in terms of text, whether it is appropriate or not. GUIs can be more expressive and haptic, but are typically very idiosyncratic and can get in the way of things.
The data-driven approach to UI seems a bit crazy?
If I think about the problems of any UI, I think in terms of communication, intent, learning, psychology and aesthetics. All of those things are human to human or human to computer related issues.
I think data-driven (as in statistical data derived from user behavior) approaches are or can be useful in terms of "what" to present, prioritize and so on. But much less so on "how", because I think this should be based on experiences derived from direct interaction and needs to be induced by creativity.
And I mean creativity from both sides, the implementer and the user. One thing that CLIs generally do better is to provide composable tools within an adaptive and simple system (pipes, text etc.), whereas it is hard to impossible to let GUIs talk to each other and compose them into a user-tailored whole.
I think we should empower "non-technical" users with the freedoms and sound principles we have come to enjoy ourselves, instead of letting statistical data dominate their experience.
Is driving your business by the highest paid person’s opinion any different than driving it by A/B testing? I see those as two extreme end positions.
A/B testing can help you with optimizing existing processes for incremental improvement, but big bets, which can sometimes have data and sometimes don’t, help with step change improvements.
Even with big bets you need a way to show that it’s better than the previous way. Either by coming up with ways to cheaply test the hypothesis or committing to being “agile” (I hate that term) and continuing to iterate.
What is statistical significance anyways? If the p-value is 0.06 is that good enough? Practical significance is something that also needs to be accounted for.
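To put numbers on the "p = 0.06" question, here is a rough sketch of a pooled two-proportion z-test using only the standard library. The function name and the example counts are made up, and the normal approximation is an assumption that only holds for reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both rates are equal
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

if __name__ == "__main__":
    # Hypothetical test: 5.0% vs 5.6% conversion on 10,000 visitors each
    lift, p_value = two_proportion_test(500, 10_000, 560, 10_000)
    print(f"absolute lift = {lift:.3f}, p = {p_value:.3f}")
```

With these made-up counts the p-value lands just under 0.06. Whether a 0.6 point lift is worth shipping is the practical significance question, and the test cannot answer that for you.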
If something can’t be measured, is there a way to find some proxy metric for it?
If not, then you can try to negotiate a pilot study of the problem and have specific criteria to determine success.
Just because something can’t be measured with existing processes doesn’t mean it can’t be measured at all.
For example, there were complaints about systems crashing and having intermittent behavior, and the claim was that’s affecting sales. Technology said nothing in our logs shows any issues, our service center shows no reporting of issues, so we think they are overreacting. We put a team together and went to several different locations to observe the process and get feedback. From the feedback we put together a data collection sheet and went back for a week to collect more data. That finally convinced the Tech team that it was a problem they needed to investigate. They went to the stores, determined it’s true, and amended logging to capture what’s truly going on.
I share the frustration with how many A/B testing driven development processes end up. Leads to a very iterative process with lots of small changes, rather than big bets. Also, trying to get statistical significance from iterative changes when you don’t have a ton of data is problematic.
I think that’s just down to a lot of folks who think A/B testing is the answer to every problem not necessarily having a background in maths or stats. I see it all the time in marketing teams, where people are so conditioned to think of testing as the default that they don’t understand what they’re doing or why.
In my experience AB testing has a time and place - and that is after a certain level of traffic load and product/feature maturity and only to "validate" certain hypotheses.
For low volumes of traffic, AB testing would take ages to yield significant results, and for products still maturing and taking shape there is a lot of "wisdom of crowds" data already available to help make decisions faster (i.e. do you really need an AB test to know that offering a timely promotion to users helps convert?)
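A back-of-the-envelope sketch of why low traffic hurts, using the textbook normal-approximation sample size formula for comparing two proportions. The function name and the default alpha/power are my assumptions; real testing tools may use different formulas.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_arm(base_rate, lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in EACH arm to detect an absolute lift."""
    p1, p2 = base_rate, base_rate + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

if __name__ == "__main__":
    # Hypothetical: 5% baseline conversion, hoping to detect +0.5 points
    print(visitors_per_arm(0.05, 0.005))
```

Under these assumptions you need on the order of 30,000 visitors per arm. At a few hundred visitors a day that is months for a single test, which is the "takes ages" problem in a nutshell.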
If you've got a young product trying to grow fast, it's a lot more effective to rely on experienced product people and simple off-the-shelf analytics to iterate quickly and to take some bets, so that one day you get to a point where AB testing "optimisations" start to make sense.
It's quite an interesting topic! I agree with you too - A/B-test-driven sites tend to culminate in a terrible "cumulative experience" for users.
What would be the name of the position/profile of someone in charge of building the data warehousing architecture/ETL pipelines?
In my view, they need to make sure the warehouse model is a correct representation of the business and that it can be leveraged to answer basic or not-so-basic questions using SQL. They also need to promote its usage internally by ensuring it is accessible and easy to use, and guide other teams to a more data-oriented mindset.
I feel that this is a specialised position not exactly similar to a developer, but every time I look for "data scientist" I get guys that want to do machine learning prediction models, which is not exactly the same stuff either.
I would also vote for "data engineer" (it's my current job title).
You very likely don't want a data scientist to be doing a data engineer's job (and they probably don't want to be doing it themselves!). While there are similarities, data engineering tends to be a lot closer to software development than data science. If you're advertising for a data scientist role, don't expect them to be happy if 80% of their job is writing ETL scripts and cleaning datasets.
I think the reason there has been a flattening in data scientist job growth more recently is that lots of companies hired data scientists to build cool ML applications but had no infrastructure in place to support advanced data analysis. These companies didn't realize they needed to walk before they could run, and that what they really wanted was data analysts and engineers to build the foundation for a strong data science function.
Tools like dbt have been great for advancing an ELT approach to managing data pipelines, where modeling for BI tools, business users, and data scientists alike can all happen in the warehouse and ensure consistency in data usage across the company.
The one issue is that the gamut of experience and ability in a data engineer (and the salaries) is extremely wide, far wider than I’ve seen for any other role. Hiring a good DE is so hard!
I was a bit sad to not see any mention of a data engineer anywhere in the article.
Like, if you gave me access to all the prod tables and the warehouse I'd be having a whale of a time and (hopefully) delivering enough business value to automate some of the more regular "English to SQL" translations.
> You very likely don't want a data scientist to be doing a data engineer's job.
100%. This is one of those things that would make "disgruntled ML people" in the article want to leave.
This is spot on. As someone who has been looking for a data analyst role, I’ve actually read quite a few DS reqs that were geared more towards infrastructure and ETL. Then the flip side with the DE reqs wanting NumPy and Pandas along with the infrastructure and ETL. Weird, right?
I currently do that job as a Data Architect - kind of a mouthful lol but it covers the gamut of understanding the entire business as an abstract set of data flows, being responsible for the ingest and outflows of data, the level of quality in our overarching system, managing data engineers, developers, business folks all accessing said data, at the end of the day explaining what it all means to our clients and devs via standard modeling stuff and more targeted things as needed.
In our team it's mostly a difference of business focus and the overarching responsibility - most data engineers I work with manage a major leg of the business and are responsible for their domain, but I am responsible for all of them.
I certainly spend time coding (especially because, again, small-medium startups can't afford anyone in the data space who isn't able to heave ho), but much of it is translating pretty vague stuff into market research/a proof of concept/an initial design of what will bring value to the business and scale alright, and then often more people will throw in.
That being said, you can call me whatever you want, as long as it's not late for dinner :)
Yeah we would call this Data Engineer (likely Senior level or up for someone that has had experience building multiple data warehouses) plus the DevOps/SRE work required to stitch all the architecture together
In a traditional ETL infrastructure they are jumbled together but if you do ELT they are not. A data engineer can build the infrastructure but the transformations can be handled better by technical analysts. They're simply one view on the underlying data so the risk is minimal. Analysts query the data day in and day out so they know much better what they need than someone who doesn't.
The bigger issue is adaptability: can you migrate schemas while preserving older clients? Typically that’s done by providing a decent middleware. SQL views are one way, APIs are another, etc.
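A toy sketch of the "SQL view as middleware" idea, again with Python's sqlite3. The table and column names are invented. The physical table has been migrated to split name columns, while a view with the old name keeps legacy queries working.

```python
import sqlite3

def legacy_names():
    """Run an old client's query against a migrated schema via a view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- New physical schema after the migration (hypothetical)
        CREATE TABLE customers_v2 (
            id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
        );
        INSERT INTO customers_v2 VALUES (1, 'Ada', 'Lovelace'), (2, 'Grace', 'Hopper');
        -- The view keeps the old name and the old full_name column alive
        CREATE VIEW customers AS
            SELECT id, first_name || ' ' || last_name AS full_name
            FROM customers_v2;
    """)
    # This is the query an older client was already running before the migration
    rows = conn.execute("SELECT full_name FROM customers ORDER BY id").fetchall()
    conn.close()
    return [row[0] for row in rows]

if __name__ == "__main__":
    print(legacy_names())  # ['Ada Lovelace', 'Grace Hopper']
```

This only covers reads. Writes through a view, or clients that depend on the old physical layout, need more than this, which is where an API layer tends to come in.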
Analytics Engineer is a clear one for this, as teej said.
The title is strongly associated with the dbt community, so it could imply you’re using dbt for your data modeling (not necessarily a bad thing, as it sounds like it would be a good tool for your use case).
I’ve done this for the past 6 years and my title was “Big Data Infrastructure Engineer” but I don’t think there’s any consistency at companies from what I’ve seen
Great article. The confusion about what team does what is priceless...yet so common!
To provide some sympathy for the folks already working there: you always replace systems well after you've overrun them.
When the ad hoc system works (consider that google spreadsheet at a time when there were three support people and perhaps a dozen customers) you're not going to decide to replace it with something more complicated. Then you're busy growing so you just keep the system going through sheer force of will. You only replace it when the effort is unbearable; at that point you say, frustratedly, "I wish we'd done this sooner."
Thank you for writing this. I personally just walked into a very similar role and this rang really true. This article made me realize how much more effort I need to put into the data culture side of the role.
This is a perfect encapsulation of my career as a data-guy square peg in a round hole, filled with jargon and misplaced understanding of data in general.
Despite all that you read and hear about data science advancing, you’ll be surprised to see how poorly leveraged it is, or worse, how billions of dollars are sought to implement the latest tool that promises to change the world. Tech and data as we imagine them to be in the FAANG kind of companies are far different from how they are in older industries. It’s not just systems that need upgrading; company cultures do too, and that’s never an easy or fast process. I’ve been in the data analytics space for 16 years now and I still feel, more often than not, that I’m part of the minority, working to demonstrate true data use-cases.
Part of me wonders what the long term of a transition like this looks like. Would this company be able to keep its data consumption healthy, or would it drive product changes that might harm its users or lead to dark patterns?
When I started reading this article, I thought it would be a sad story about another startup failure. The blog post turned out to be a fascinating success story. I really liked it.
But after I finished reading it, I realized that it is a sad story if we look at it through the eyes of the data scientists on the team.
People were hired to do cool machine learning projects, but it turned out there was no infrastructure for them. After the new boss arrived, they had to work as analysts for months. What is sadder: the new boss dangled a carrot before them several times, but each time the carrot disappeared.
Very interesting perspective. As an early-mid stage startup, you definitely want to invest in generalists who are able to build infrastructure before hiring specialized ICs.
I honestly had flashbacks when the author mentioned the carrot dangling thing. I’ve personally experienced this and as a naive early career swe, I gave the manager the benefit of doubt for a year even though I knew there was no way they could guarantee it. This is just pure manipulation.
The worst part is that he wrote the job description himself and resorted to manipulation to cover up his mistake of hiring for the wrong job role.
Building a good process into your company to receive a query, execute it against a read-only database, and shovel the results back to the user as a CSV file will pay dividends and is, honestly, pretty trivial in most cases.
Funnily enough, this is what I did, except I built an app where I write the queries as "pre-built" parameterized ones (sanitized, of course).
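Roughly what that looks like, as a minimal sketch with sqlite3 and the csv module. The report catalogue, table and function names are all hypothetical, and `PRAGMA query_only` is SQLite-specific. With another database you would more likely point the app at a read replica or a read-only role.

```python
import csv
import io
import sqlite3

# Hypothetical catalogue of pre-built reports. Users pick a name and supply
# parameters; they never send raw SQL.
REPORTS = {
    "orders_by_country": (
        "SELECT country, COUNT(*) AS orders FROM orders "
        "WHERE created_at >= :since GROUP BY country ORDER BY country"
    ),
}

def run_report(conn, name, params):
    """Run a pre-built report and return the result as CSV text."""
    sql = REPORTS[name]  # unknown report names raise KeyError
    cursor = conn.execute(sql, params)  # values are bound, not string-formatted
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
    return out.getvalue()

def demo_connection():
    """Build a toy in-memory database, then switch the connection to read-only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (country TEXT, created_at TEXT);
        INSERT INTO orders VALUES
            ('US', '2021-07-01'), ('US', '2021-07-02'),
            ('DE', '2021-07-03'), ('FR', '2021-06-01');
    """)
    conn.execute("PRAGMA query_only = ON")  # SQLite-specific write guard
    return conn

if __name__ == "__main__":
    print(run_report(demo_connection(), "orders_by_country", {"since": "2021-07-01"}))
```

Binding parameters instead of formatting them into the SQL string is what does the sanitizing here. The fixed catalogue is what keeps people from running arbitrary queries against prod.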
People still do a bunch of stuff in Excel, though, and every once in a while, it breaks, and I have to dig through the mess. Excel is great when it's just for yourself and you can manage it... it's a pain when others have to figure out someone else's.
This is a good write-up, but for the sort of insights they’re getting they’re over staffed and overpaying. A combination of a cloud dw (big query, e.g), cloud etl (stitch, fivetran) and dbt for the T in ELT to build useful reporting tables, along with some sort of sql based BI (mode, in our case), could deliver the same insights for a fraction of the price. Throw in a sub to Heap or similar for ad-hoc product analytics as a cherry on top.
I concede, of course, that they’re rescuing a bad situation, not starting from scratch, but still.
Really enjoyed this narrative, but what about the next phase? Going from mid-stage to mature startup?
> Note that you took on a lot of “tech debt” earlier when you started dumping the production database tables straight into the data warehouse.
How do you manage expectations when the year-long honeymoon is over, the business grows tremendously, and the centralized data warehouse reaches a breaking point?
Excellent article. For me, the timing couldn't be better as I am about to step into a role not too dissimilar to the one described in the piece. It will be interesting to see if I run into many of the situations the author describes.
I really enjoyed reading this. Very well written. At the companies I've worked at, teams could never read data from the DW, btw.
My experience with A/B tests is that they are way overrated.
On the poor data quality: you sit on a product like a call center. Frontend developers think it is an excellent idea to store all data in some doc db blob. Then the business wants stats about the number of calls based on users...
Be careful when putting tabular data into doc dbs.
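A contrived sketch of why this hurts. The blob shapes below are invented, but they are the kind of drift that tends to creep in when nothing enforces a schema: the same fact ends up under three different keys, and some documents don't carry it at all.

```python
import json
from collections import Counter

def calls_per_user(blobs):
    """Count calls per user from JSON blobs; also return how many were unusable."""
    counts = Counter()
    unattributed = 0
    for blob in blobs:
        doc = json.loads(blob)
        # Every historical spelling of "who made this call" has to be handled here
        user = (
            doc.get("user_id")
            or doc.get("userId")
            or doc.get("caller", {}).get("id")
        )
        if user is None:
            unattributed += 1
        else:
            counts[user] += 1
    return dict(counts), unattributed

if __name__ == "__main__":
    blobs = [
        '{"user_id": "u1", "duration": 30}',
        '{"userId": "u1"}',
        '{"caller": {"id": "u2"}}',
        '{"duration": 12}',
    ]
    print(calls_per_user(blobs))  # ({'u1': 2, 'u2': 1}, 1)
```

With a plain `calls` table and a non-null `user_id` column, the same stat is a one-line GROUP BY and the fourth record could never have been written in the first place.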
It's such an interesting and valuable article on building a data team, esp. insightful for organisations starting out. I guess the challenges in traditional/larger companies starting a data team might look slightly different.
Such a great read. Have been in this position in a large public org. Over a year was spent just creating a catalog of all the data the company has and figuring out how to pull it into a data warehouse.
Can relate; the author is truly a genius. We had a company mandate to be ML first, we went through a lot of phases, and so many conversations happened just as described in this amazing piece. Thanks Erik.
This is a wonderful article, thank you for sharing. I really like the narrative of bringing people with you on the journey, and celebrating the small wins that lead to a good long term outcome.