This company makes 10M and spends 3M on the team and infrastructure to make data a core competency?
The vast majority of the wins discussed were lightly differentiated web / mobile / supply chain analytics, which they could have bought and set up with 3rd-party software for an order of magnitude less.
I can only imagine what this hypothetical startup could have learned if it had spent that money actually talking to customers and running more experiments.
I’ve heard people talk about data as the new oil, but for most companies it’s a lot closer to uranium: hard to find people who can handle / process it correctly, nontrivial security liabilities if PII is involved, expensive to store, and a generally underwhelming return on effort relative to the anticipated utility.
My takeaway was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought, not built. Obviously there are a couple of exceptions, such as regulatory requirements like HIPAA compliance, for which building in-house can be the right choice if no vendor fits your use case.
(I have not finished the article, but the idea that devs / data scientists can be replaced by some vendors makes me wonder what I have missed)
Edit: Also love the Uranium quote :-)
The problem with English-to-SQL translators, or most coders in general, is the assumptions we make, in particular about the underlying data. For example, say we want to join two tables, so we write a query joining on two columns and often call it correct, which from a logical or schema perspective it is. However, null values, defaults like 0, many-to-one vs. one-to-one relationships, issues with instrumentation such as network timeouts or bot detection, etc. can all impact the downstream metrics. My point is that when there are 500 lines of SQL in a query, such as those mentioned in the article, there are a lot of ways to be mostly correct but cumulatively wrong.
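To make the failure mode concrete, here's a tiny self-contained sketch (hypothetical tables and made-up numbers) of how a join that is perfectly valid per the schema still inflates a downstream metric when the relationship turns out to be many-to-one:

```python
import sqlite3

# Hypothetical schema: one order can have many shipments. The join
# below is "correct" from a schema perspective, but the many-to-one
# relationship silently inflates the revenue metric downstream.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, revenue REAL);
    CREATE TABLE shipments (order_id INTEGER, carrier TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 was split across two shipments.
    INSERT INTO shipments VALUES (1, 'ups'), (1, 'fedex'), (2, 'ups');
""")

true_revenue = conn.execute(
    "SELECT SUM(revenue) FROM orders").fetchone()[0]

# Logically valid join; each extra shipment duplicates its order row.
joined_revenue = conn.execute("""
    SELECT SUM(o.revenue)
    FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]

print(true_revenue)    # 150.0
print(joined_revenue)  # 250.0 -- order 1 counted twice
```

Multiply this kind of quiet error by a 500-line query and "mostly correct" stops meaning much.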
Like many sufficiently popular open source tools, 3rd-party vendors get battle tested: issues get found before you hit them, and they can justify devoting more resources to rigorously ensuring correctness than the average analyst has the time or energy to, because their business depends on you trusting the outputs.
I’m not saying you couldn’t do all this yourself. But given the sheer number of analytics tools that are reasonably priced, you might have chosen to spend your time on something more specialized like a recommendation system.
Or is this, for example, people taking Google Analytics and producing analysis on top of that?
Disclaimer: I was an early engineer at Heap.
I've found https://contentsquare.com/ to be much better received by juniors and seniors alike, and it's a fraction of the cost of heap.
Were you a later-stage startup by chance? The price point for pre-Series-C startups should be much, much lower.
Also love the Uranium analogy.
In the long run, there is plenty of useful logistics software that should do everything they want, but the most important thing is to put the people with domain expertise in the data as close to the solution as possible. Better decisions are often a result of better information/experience rather than better analysis. Unfortunately I haven’t studied these vendors well enough to make any suggestions, though I believe the solutions are well defined enough to write textbooks on them, which suggests to me that existing software and I would mostly implement similar methodologies.
On the marketing and product analytics tools, I think 80% of the problems boil down to measuring conversion rates and then comparing those rates across different contexts to select for the contexts that improve those rates.
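As a toy illustration of that 80% (the context names and numbers are made up), the core loop really is just counting conversions per context and picking the winner:

```python
from collections import defaultdict

# Made-up events: (context, converted). In practice these would come
# from your analytics tool's export.
events = [
    ("organic_search", True), ("organic_search", False),
    ("organic_search", True), ("paid_ad", False),
    ("paid_ad", False), ("paid_ad", True),
]

counts = defaultdict(lambda: [0, 0])  # context -> [conversions, visits]
for context, converted in events:
    counts[context][0] += converted  # bool counts as 0/1
    counts[context][1] += 1

rates = {c: round(conv / visits, 3) for c, (conv, visits) in counts.items()}
best = max(rates, key=rates.get)
print(rates)  # {'organic_search': 0.667, 'paid_ad': 0.333}
print(best)   # organic_search
```

The hard part is never this arithmetic; it's deciding which contexts are worth comparing in the first place.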
Another user mentioned Heap, which is a great product if you know that you don’t know what contextual data is meaningful, but you suspect it’s partially in how users interact with other parts of your website. Personally I’d use Heap judiciously, since I suspect there will be limitations to how useful the historical data will be in the future, and collecting everything is expensive. One limitation is that site interactions are only part of the potentially important context. Another is that startups change rapidly, so their historical data often depreciates in terms of providing insight into their current problems. For an extreme example, I’m sure Zoom’s conversion data before and during the pandemic look completely different. But even a small tweak to Google’s search algorithm could totally change what type of customer finds your site.
Personally I’d advocate talking to customers, potential customers, and other stakeholders to understand what is important, and measuring that. Most companies currently do the opposite: they take a lot of measurements and then try to figure out what’s important. The first approach can probably be done in Google Analytics. For the second I might try Amplitude, which is what I imagine a tool like Heap will eventually try to evolve into.
The hardest person in the organization to help with data is the CEO, because they really use data as a form of sales tool and for reporting. The closest I have seen a tool come to doing this in a way the CEO could mostly self-serve is Sisu Data. Though since it’s the CEO, it’s probably reasonable to hire some help anyway.
Lastly, data warehouses were the gold standard in the early 2010s, but Presto is a better fit these days for companies whose data is distributed across many different places.
You're making a pretty big assumption about the cost of the team & infrastructure there. This company could have 100+ people with that kind of revenue (I've worked at a company this size before), but the data team is only about 6 people. The cost of the data team & infrastructure is likely less than $1M.
I do wonder at the anecdotes in this article though. In businesses that I've seen, the data team is usually the biggest impediment to a data-driven culture because they have databases full of numbers and no real grasp of how that links to the decision making process that makes the business money.
Beefing up the team doesn't help. In data, as in business more generally, the important thing is not to guess what job you're doing but to spend a lot of time talking to customers about what job they need done. If the data team is where that work happens in a business, then it can be helpful, but the grunt work of SQL/reporting/basic analysis is almost never where the value comes from.
I really like your takeaway about data teams at tech companies. They try to make "data" a core competency of their business, at huge cost for fixed value.
I also appreciated the very subtle implication that the OP is shrouding empire building under an otherwise informative growth story.
Love this analogy!
But if the exec team simply hired you for window-dressing, expect to be treated like a scapegoat and a punching bag. Any mistakes will be your fault. Any wins will be to the credit of the business. The Director of Product will ask to "embed" dedicated DS headcount and you won't have any real power to shape the roadmap. If the exec team doesn't give you equal footing with Product (or Marketing, Finance, and Eng for that matter) then this will rapidly become a soul-sucking job. However, if the E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads, then you actually might be able to accomplish something really cool.
> However, if E-team does give you the authority to call Product's bullshit, and tell Finance to stuff it, and not take direction from Eng leads
I know this was meant partially in jest, but if you reach the point where you're at odds with all of the teams and departments in the company you may get a lot done in the short term, but long term it's going to be difficult if you don't have some allies in each of those departments. Obviously no one should roll over and take orders from other departments, but some times it's necessary to do some give and take to build rapport. It's a balance, not a war.
The most important thing is to work closely with your manager on expectations. If someone from another department comes to you with a proposal, an ask, or a directive, you don't want to say yes without first consulting with your manager. Depending on company politics, some managers might try to rope new employees into doing work that isn't actually part of their job description.
Discovering expectations and then proactively managing those expectations is key in any role.
New guy, knows nothing about the company and product yet, but was asked to "get KPI X by end of day". He obviously has no idea how to get this done, so he goes to various people and throws around "VP XYZ wants this by end of day, help me now or else!".
Needless to say I, as politely as I could, told him to shut it, look at his data and what he could get from it, and stop interrupting dev mid-day, two days after the start of a sprint, with requests to do his work for him (dude, I don't even have access to your data storage, and don't know what data you have or don't). And do it by end of day. Sure.
The guy is burned for me now. He'll have to do a LOT of sucking up to dev after his try at "do my job for me or else".
OTOH, if the execs don't have this priority, no one gets hired to lead and scale a data team and the story never starts.
So what's the business case for having a data team independent of product, business and engineering?
Because as I see it, the data team is a support function, not a core part of the business. I'm sure it can be cool for you, but if you are at odds with all the people actually creating value, what exactly do you bring to the table?
> You notice a lot of the code starts with very complicated preprocessing steps, where data has to be fetched from many different systems. There appear to be several scripts that have to be run manually in the right order to run some of these things.
> “We need to focus on delivering business value as quickly as possible”, you say, but you add that “we might get back to the machine learning stuff soon… let's see”.
So so relatable. But the key insight is a really really key insight.
> What I think makes most sense to push for is a centralization of the reporting structure, but keeping the work management decentralized. Why? Primarily because it creates a much tighter feedback loop between data and decisions. If every question has to go through a central bottleneck, transaction costs will be high. On the other hand, you don't want to decentralize the management. Strong data people want to report into a manager who understands data, not into a business person.
I have the same role at a non-software company, and to me this is nothing short of a complete reimagining of IT. It’s not just, “make sure everyone’s computer works and help them install software,” it’s, “build a model of the business, determine what information flows and metrics are crucial to success, and build an IT and analysis infrastructure around that model.” The CIO will soon be better thought of as the Chief Optimization Officer.
1. Is it definitely a good idea to build a separate data team, rather than embedding people with analytics knowledge in feature teams?
Is it possible to do the latter, but still end up with a well-curated source of truth for your data?
2. Is A/B testing and driving your business by metrics really a good idea?
My (uninformed) impression is that data-driven culture is responsible for rather a lot of rot:
- Extremely irritating websites.
- Businesses ignoring important things because they can't measure them. (Financialisation, hand-in-hand with the MBA types the author decries.)
It's important to get the core centralised data infrastructure up and running (even if it's dirty af) as that helps with the bulk of the data work.
The oft-quoted statistic, not completely true but kinda true, is that 70% of data work is finding, cleaning, and storing the data. Analysis and modelling is the easy bit.
You could do it the other way around. Hire some data people in each team and get them to meet up every once in a while.
But I'd wager the central data stuff that makes everyone's life easier will get pushed back behind the "urgent" team work every time.
Edit: it's possible to do both btw. E.g. Have a bunch of centralised data engineers that do the heavy lifting stuff. With data scientist/analysts embedded in teams doing the fine grained modelling stuff. It's not a binary choice (once things are up and running).
> My (uninformed) impression is that data-driven is responsible for rather a lot of rot.
I agree! I was talking to someone else (not a tech head) the other week and realised why they hate tech so much... User interfaces that just... Don't work.
Showed him a terminal CLI and he went nuts over it.
Then again, we're two kinda weird ye olde "back in my day" kinda people... So...
CLIs are finicky and force you to think in terms of text, whether it is appropriate or not. GUIs can be more expressive and haptic, but are typically very idiosyncratic and can get in the way of things.
The data-driven approach to UI seems a bit crazy?
If I think about the problems of any UI, I think in terms of communication, intent, learning, psychology and aesthetics. All of those things are human to human or human to computer related issues.
I think data-driven (as in statistical data derived from user behavior) approaches are or can be useful in terms of "what" to present, prioritize and so on. But much less so on "how", because I think this should be based on experiences derived from direct interaction and needs to be induced by creativity.
And I mean creativity from both sides, the implementer and the user. One thing that CLIs generally do better is to provide composable tools within an adaptive and simple system (pipes, text, etc.), whereas it is hard to impossible to let GUIs talk to each other and compose them into a user-tailored whole.
I think we should empower "non-technical" users with the freedoms and sound principles we have come to enjoy ourselves, instead of letting statistical data dominate their experience.
A/B testing can help you with optimizing existing processes for incremental improvement, but big bets, which can sometimes have data and sometimes don’t, help with step change improvements.
Even with big bets you need a way to show that it’s better than the previous way. Either by coming up with ways to cheaply test the hypothesis or committing to being “agile” (I hate that term) and continuing to iterate.
What is statistical significance anyways? If the p-value is 0.06 is that good enough? Practical significance is something that also needs to be accounted for.
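For the curious, here's a minimal sketch of where a number like p = 0.06 comes from: a standard two-proportion z-test (pooled standard error, normal approximation) using only the Python standard library. The traffic numbers are made up; the point is that the same observed lift can land on either side of 0.05 depending purely on sample size:

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for the difference of two conversion
    rates (pooled standard error, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)

# The same 1-point lift (10% -> 11%) at two different sample sizes:
p_large = two_proportion_pvalue(2200, 20000, 2000, 20000)  # well under 0.05
p_small = two_proportion_pvalue(55, 500, 50, 500)          # nowhere near 0.05
print(round(p_large, 3), round(p_small, 3))
```

Which is exactly why practical significance matters too: with enough traffic, even a commercially meaningless lift will clear any p-value threshold.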
If something can’t be measured, is there a way to find some proxy metric for it?
If not, then you can try to negotiate a pilot study of the problem and have specific criteria to determine success.
Just because something can’t be measured with existing processes doesn’t mean it can’t be measured at all.
For example, there were complaints about systems crashing and having intermittent behavior, and the claim was that’s affecting sales. Technology said nothing in our logs shows any issues, our service center shows no reporting of issues, so we think they are overreacting. We put a team together and went to several different locations to observe the process and get feedback. From the feedback we put together a data collection sheet and went back for a week to collect more data. That finally convinced the Tech team that it was a problem they needed to investigate. They went to the stores, determined it’s true, and amended logging to capture what’s truly going on.
For low volumes of traffic, A/B testing would take ages to yield significant results, and for products still maturing and taking shape there is a lot of "wisdom of crowds" data already available to help make decisions faster (e.g., do you really need an A/B test to know that offering a timely promotion to users helps convert?).
If you've got a young product trying to grow fast, it's a lot more effective to rely on experienced product people and simple off-the-shelf analytics to iterate quickly and take some bets, so that one day you get to the point where A/B testing "optimisations" starts to make sense.
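A back-of-the-envelope sample-size calculation (the textbook normal-approximation formula; the rates are made up) shows why low-traffic A/B tests take ages:

```python
import math

def visitors_per_arm(base_rate, lift):
    """Rough visitors needed per arm to detect `lift` over `base_rate`
    (two-sided alpha = 0.05, power = 0.80, normal approximation)."""
    z_alpha, z_beta = 1.96, 0.84  # critical values for alpha/power above
    p1, p2 = base_rate, base_rate + lift
    p_bar = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / lift ** 2
    return math.ceil(n)

# Hypothetical numbers: detecting a 2.0% -> 2.4% conversion lift.
n = visitors_per_arm(0.02, 0.004)
print(n)  # ~21,000 per arm; at 100 visitors/day that's over a year of traffic
```

At that point an experienced product person's judgment is simply a faster instrument than the experiment.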
It's quite an interesting topic! I agree with you too: A/B-test-driven sites tend to culminate in a terrible "cumulative experience" for users.
In my view, they need to make sure the warehouse model is a correct representation of the business and that it can be leveraged to answer basic or not-so-basic questions using SQL. They also need to promote its usage internally by ensuring it is accessible and easy to use, and guide other teams toward a more data-oriented mindset.
I feel that this is a specialised position, not exactly similar to a developer, but every time I search for "data scientist" I get people who want to build machine learning prediction models, which is not exactly the same thing either.
You very likely don't want a data scientist to be doing a data engineer's job (and they probably don't want to be doing it themselves!). While there are similarities, data engineering tends to be a lot closer to software development than data science. If you're advertising for a data scientist role, don't expect them to be happy if 80% of their job is writing ETL scripts and cleaning datasets.
I think the reason there has been a flattening in data scientist job growth more recently is that lots of companies hired data scientists to build cool ML applications but had no infrastructure in place to support advanced data analysis. These companies didn't realize they needed to walk before they could run, and that what they really wanted was data analysts and engineers to build the foundation for a strong data science function.
Tools like dbt have been great for advancing an ELT approach to managing data pipelines, where modeling for BI tools, business users, and data scientists alike can all happen in the warehouse and ensure consistency in data usage across the company.
I was a bit sad to not see any mention of a data engineer anywhere in the article.
Like, if you gave me access to all the prod tables and the warehouse I'd be having a whale of a time and (hopefully) delivering enough business value to automate some of the more regular "English to SQL" translations.
> You very likely don't want a data scientist to be doing a data engineer's job.
100%. This is one of those things that would make "disgruntled ML people" in the article want to leave.
1. kafka / streaming oriented software engineering
2. data warehouse and ETL/ELT development for analytics
They're both "data in, data out" mental models that are part of the Lambda architecture, which every data engineer should at least know about.
But if you want a specialist streaming person to optimise all the streaming pipelines, then sure hire a specialist.
This article by Claire Carroll describes the role and motivation for it https://www.getdbt.com/what-is-analytics-engineering/
I certainly spend time coding (especially because, again, small-to-medium startups can't afford anyone in the data space who isn't able to heave ho), but much of it is translating pretty vague stuff into market research, a proof of concept, or an initial design of what will bring value to the business and scale all right, and then often more people will pitch in.
That being said, you can call me whatever you want, as long as it's not late for dinner :)
* Building/defining the data infrastructure
* Building/defining the schemas
In a traditional ETL infrastructure they are jumbled together, but if you do ELT they are not. A data engineer can build the infrastructure, but the transformations can be handled better by technical analysts. They're simply one view on the underlying data, so the risk is minimal. Analysts query the data day in and day out, so they know much better what they need than someone who doesn't.
All of that while improving performance.
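A minimal sketch of that split, using an in-memory SQLite database and a hypothetical schema: raw data is loaded untouched ("EL"), and the analyst-owned transformation ("T") is just a view, so iterating on it never touches the underlying data:

```python
import sqlite3

# Hypothetical ELT layout: raw events land as-is; the transformation
# is a view an analyst can own and change with minimal risk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "EL": raw events land untouched in the warehouse.
    CREATE TABLE raw_events (user_id INTEGER, event TEXT, amount REAL);
    INSERT INTO raw_events VALUES
        (1, 'purchase', 20.0), (1, 'refund', -5.0), (2, 'purchase', 7.5);

    -- "T": an analyst-owned transformation, expressed as a view.
    -- Redefining it is low risk: the raw data is never mutated.
    CREATE VIEW user_net_spend AS
        SELECT user_id, SUM(amount) AS net_spend
        FROM raw_events GROUP BY user_id;
""")

rows = conn.execute(
    "SELECT * FROM user_net_spend ORDER BY user_id").fetchall()
print(rows)  # [(1, 15.0), (2, 7.5)]
```

Tools like dbt essentially industrialize this pattern: version-controlled, tested SELECT statements owned by the people who actually query the data.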
The title is strongly associated with the dbt community, so it could imply you’re using dbt for your data modeling (not necessarily a bad thing, as it sounds like it would be a good tool for your use case).
To provide some sympathy for the folks already working there: you always replace systems well after you've overrun them.
When the ad hoc system works (consider that google spreadsheet at a time when there were three support people and perhaps a dozen customers) you're not going to decide to replace it with something more complicated. Then you're busy growing so you just keep the system going through sheer force of will. You only replace it when the effort is unbearable; at that point you say, frustratedly, "I wish we'd done this sooner."
I don't think this is very cynical at all! Feels pretty accurate to me.
This sentence from the article resonated with me:
> You're starting to lay the most basic foundation of what is most critically needed: all the important data, in the same place, easily queryable.
Despite all that you read and hear about data science advancing, you’d be surprised to see how poorly it is leveraged, or worse, how billions of dollars are sought to implement the latest tool that promises to change the world. Tech and data as we imagine them at FAANG-type companies are far different from how they are in older industries. It’s not just systems that need upgrading; company cultures do too, and that’s never an easy or fast process. I’ve been in the data analytics space for 16 years now, and I still feel, more often than not, that I’m part of the minority, working to demonstrate true data use cases.
But after I finished reading it, I realized that it is a sad story if we look at it through the eyes of the data scientists on the team.
People were hired to do cool machine learning projects, but it turned out there was no infrastructure for them. After the new boss arrived, they had to work as analysts for months. What is sadder, the new boss dangled a carrot before them several times, and each time the carrot disappeared.
I honestly had flashbacks when the author mentioned the carrot dangling thing. I’ve personally experienced this and as a naive early career swe, I gave the manager the benefit of doubt for a year even though I knew there was no way they could guarantee it. This is just pure manipulation.
The worst part is that he wrote the job description himself and resorted to manipulation to cover up his mistake of hiring for the wrong job role.
I chuckled. Then cried, because at least his MBA types can use SQL. My MBA types use Excel.
OT: Good article. Like and agree with the push for centralizing data first, then building outwards so external teams can move towards self-service.
Pretty easy to set up and share queries, dashboards, whatever
People still do a bunch of stuff in Excel, though, and every once in a while, it breaks, and I have to dig through the mess. Excel is great when it's just for yourself and you can manage it... it's a pain when others have to figure out someone else's.
I chuckled too.
I concede, of course, that they’re rescuing a bad situation, not starting from scratch, but still.
> Note that you took on a lot of “tech debt” earlier when you started dumping the production database tables straight into the data warehouse.
How do you manage expectations when the year-long honeymoon is over, the business grows tremendously, and the centralized data warehouse reaches a breaking point?
Will wait for a follow up post on how decentralised data team created data silos and how we solve it using data discovery and data standardisation. :P
Disclaimer: I have built decentralised data teams and it scales well.
See also: The Algebra Project https://algebra.org/wp/
This was very fun to read, and an interesting window into the processes and inner workings of a startup that size.