This is an article from Jan 2022, when we were a company of 10; we're now a company of ~80.
Some observations worth making:
- We're still using Fivetran for the EL stages. Costs are much more significant than they were before, and we're looking (for the high-volume sources) into options like DataStream as cost savers, but it's not unmanageable.
- dbt is still working great, even if we've invested a lot in it, having now built a five-person data team (BI, DA, DE) around it.
- We still use Metabase but have some frustrations and are considering other options.
- We no longer use Stitch :tada:
There's a post that followed this on improvements we made to our setup that may be interesting: https://incident.io/blog/updated-data-stack
The OP is still full of relevant, useful information, though (imo, of course).
We do very similar things ourselves: our insights product (https://incident.io/learn) uses Metabase to power the dashboards.
The data that goes into those insights can be quite complex: the queries thread JSON parameters into BigQuery SQL using JavaScript UDFs to power the filters in the dashboard (e.g. show incidents with these custom field values). This works pretty well with signed Metabase dashboard links.
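A minimal sketch of that pattern, with hypothetical table, column, and parameter names: a JavaScript UDF parses the JSON filter passed through from the dashboard and applies it per row.

```sql
-- Hypothetical sketch of filtering by JSON-encoded custom field values.
-- @filter_params stands in for the JSON passed through from the signed
-- dashboard link; table and column names are made up.
CREATE TEMP FUNCTION matches_custom_fields(fields_json STRING, filter_json STRING)
RETURNS BOOL
LANGUAGE js AS r"""
  const fields = JSON.parse(fields_json || "{}");
  const filter = JSON.parse(filter_json || "{}");
  // Every custom field named in the filter must match the incident's value.
  return Object.keys(filter).every((k) => fields[k] === filter[k]);
""";

SELECT incident_id, severity, created_at
FROM analytics.incidents
WHERE matches_custom_fields(custom_fields_json, @filter_params);
```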
We have hit limitations with Metabase though. Performance of the instance can be a bit unpredictable and their support has been poor when things do go wrong, with very little willingness to take our feedback into account for new product features.
For that reason and more (such as wanting more flexible dashboards) we’re going to move to Omni (https://omni.co/) for internal business analytics use cases, and will reconsider Metabase for our customer-facing product dashboards when we do. Omni may work for those too, or we might build them bespoke; we’ll see at the time.
I recently ran a little shootout between Superset, Metabase, and Lightdash — all open source with hosted options. All have nontrivial weaknesses but I ended up picking Lightdash.
Superset is the best of them at data visualization, but I honestly found it almost useless for self-serve BI by business users if you have an existing star schema. This issue on how to do joins in Superset (with stalebot making a mess XD) is everything difficult about Superset for BI in a nutshell: https://github.com/apache/superset/issues/8645
Metabase is pretty great, and it's definitely the right choice for a startup looking to get low-cost BI set up. It still has a very table-centric view, but feels built for _BI_ rather than visualization alone.
Lightdash has significant warts (YAML, pivoting being done in the frontend, no symmetric aggregates) but the Looker inspiration is obvious and it makes it easy to present _groups of tables_ to business users ready to rock. I liked Looker before Google acquired it. My business users are comfortable with star and snowflake schemas (not that they know those words) and it was easy to drop Lightdash on top of our existing data warehouse.
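For anyone unfamiliar with why symmetric aggregates matter: without them, a one-to-many join fans out the rows and naive aggregates double-count. A toy SQL illustration with hypothetical tables:

```sql
-- Fan-out problem: each order row repeats once per line item, so a naive
-- SUM over the join double-counts order-level measures.
SELECT SUM(o.order_total) AS wrong_total
FROM orders AS o
JOIN order_items AS i ON i.order_id = o.order_id;

-- What symmetric aggregates effectively do: de-duplicate the "one" side
-- before aggregating, so the join fan-out doesn't inflate the sum.
SELECT SUM(order_total) AS right_total
FROM (
  SELECT DISTINCT o.order_id, o.order_total
  FROM orders AS o
  JOIN order_items AS i ON i.order_id = o.order_id
) AS deduped;
```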
I don’t think we considered this, probably because we have a preference to buy instead of build with tools like these, and prefer a team who can respond to our feedback.
It looks like a promising tool though! I’m sure we’ll blog about our experience with the new tooling once we’ve moved over, the team will no doubt have a lot to say about it.
I had a feeling from the article and your comments, which is why I mentioned the hosted service. :)
From their website[0]:
> Preset was founded by the original creator of Apache Superset™. Our team of experts contributes over 75% of all commits to the open-source software project.
I'd be interested to see your blog post, regardless of tool.
Curious, have you tried speeding things up with e.g. cube.js? We used it in a fully custom project and it was a performance lifesaver. It works quite well with Superset actually.
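In case it's useful context, the big win for us was pre-aggregations: cube.js builds and refreshes rollup tables and serves dashboard queries from them, roughly equivalent to maintaining something like this by hand (table and column names made up):

```sql
-- Roughly what a cube.js pre-aggregation maintains for you: a rollup
-- table that dashboard queries hit instead of scanning raw events.
CREATE TABLE rollups.incidents_daily AS
SELECT
  DATE(created_at) AS day,
  severity,
  COUNT(*) AS incident_count
FROM analytics.incidents
GROUP BY 1, 2;
```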
What's the business justification for spending this much effort (money) on data warehousing as a startup?
I've not worked at any startups that did data warehousing; the one place I did work at where we were /starting/ to get it set up was like 300+ employees and $100M+/year revenue.
Data like this allows us to be extremely customer-focused and helps direct investment for the business, such as which features we build and when.
We also use the same data pipeline to power a lot of our data product features which customers pay us for.
So it’s extremely worthwhile as an investment for us. It’s also why we have about five people hired into data-adjacent roles, as it’s so key to us running the business correctly.
Out of curiosity, would running dbt with output to a reporting schema on a Postgres read replica work? Or as a startup do you already have too much data for that?
That requires us to do some expensive cross-joining of every action ever taken in an incident, from messages sent to the channel to GitHub PRs being merged. We could make this incremental and optimise it for performance, but using BigQuery by default means we don't need to worry yet: we can leave the optimisations for when we're bigger, when the engineering resource we'd dedicate wouldn't detract as much from customer-focused work.
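A minimal sketch of what making that incremental in dbt could look like, assuming a hypothetical staging model for incident actions:

```sql
-- Hypothetical dbt incremental model: only reprocess actions created
-- since the last run, instead of cross-joining all incident history.
{{ config(materialized='incremental', unique_key='action_id') }}

SELECT
  a.action_id,
  a.incident_id,
  a.action_type,   -- e.g. channel message sent, GitHub PR merged
  a.created_at
FROM {{ ref('stg_incident_actions') }} AS a

{% if is_incremental() %}
  -- incremental runs only pick up rows newer than the existing table
  WHERE a.created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```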
Meta does it another way. Instead of one giant data warehouse or various DW silos, they build a data platform API stack with heterogeneous storage adapters, privacy policies, regional locality policies, and retention policies underneath, supporting heterogeneous D*L operations. This sidesteps duplicating and denormalizing data, and allows for maximum data discovery, reporting, and reuse. And while GraphQL can't be all things to all people, it's pretty damn good. If you need {MySQL,PostgreSQL,{{other_thing}}}-compatible or REST APIs, build them similarly.
ETL should be minimized (external data is the exception, and itself a bad sign, since it means the data is owned or managed by a third party) and replaced with the equivalent of dynamic or materialized "views". Prefer creating hygienic "views" over the original data to mutating and destroying that data with destructive transformations.
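Concretely, the hygienic-views idea looks something like this (a warehouse-agnostic sketch with hypothetical names): the cleaning lives in the view definition while the raw table stays untouched.

```sql
-- Hypothetical sketch: expose a cleaned view of raw events without
-- mutating or destroying the original data.
CREATE MATERIALIZED VIEW clean.incident_events AS
SELECT
  event_id,
  LOWER(TRIM(event_type)) AS event_type,  -- hygiene applied at read time
  occurred_at
FROM raw.incident_events
WHERE occurred_at IS NOT NULL;            -- raw rows remain intact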
Finally, have a deeply integrated, robust, enterprise-wide, fine-grained ACL system and privacy policy to keep everyone (and system users) from accessing anything without a specific business need and an approval audit record stored via some sort of blockchain-like tech.
This sounds really awesome. I will note that I put this data stack together by myself in about a week, when we were just ten people in the company.
Obviously very different resource constraints than Meta, so worth considering which situation you may be closer to when picking an implementation plan.
I’d be curious to know if you considered using something like Dagster for orchestrating these runs? Seems like a more natural choice than CircleCI for running what resembles a DAG. (And either way, thanks for sharing this.)
I'm not sure it's so steep: Starter is $16/user, Pro is $23/user (with the $10k minimum), and Enterprise expects 100+ users (into the many thousands), where companies often want to discuss pricing alongside support constraints etc.
The Pro plan contains a load of features like insights, audit logs, API access, etc. that you don't get in Starter, hence the increased per-seat bill.
Would be interested to know if that makes more sense to you? (I work there, always keen to hear feedback)
If I understand the pricing table correctly, Starter is $16/user, Pro works out to roughly $28/user ($10k minimum divided by 12 months and 30 included responders), and $23 for every user after that.
30 responders probably covers a company of 100+ people (viewers are free after all). If you are smaller but want things like an API or webhooks, the minimum payment for the Pro plan is a huge cliff. And it doesn't even give you audit logs or an SLA.
The product provides login and sign-up with Slack, which means all product plans include SSO under whatever scheme the Slack workspace is provisioned with.
The SAML feature in the Pro and up plans is about hooking us up directly to your identity provider to do things like RBAC and group syncing. We pay WorkOS a monthly fee of several hundred dollars per customer that uses this, so it makes sense to gate it in the higher plans.