It's a cool idea, but it looks incomplete for the production use case.
1. Usually you want some warehouse running all the time: bring data in through ETL, run transformations, and report on it. That goes against the local-environment idea. Ideally I would love a cloud warehouse that each engineer could easily fork to their laptop.
2. Almost all companies already have some data setup, and the migration path is very unclear; most likely this would be a secondary system for the majority of companies. Ideally I would love to see how it can be used alongside the big platforms (e.g. BigQuery or Snowflake).
I could add more, but here's my experience with a newish local stack I was trying: I spent a good number of hours on DuckDB this week processing personal data from data dumps (social networks, etc.), and now I'm back to the combo of PostgreSQL in containers + SQLite.
After the initial imports and some massaging with queries that felt awesome, I found it hard to step up my game and build the relationships I wanted. The last straw before switching was that there's no way to manage foreign keys without recreating entire tables again (sketch below). I can go over other examples.
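Concretely, the dance looks roughly like this (a sketch with made-up users/posts tables; as far as I can tell DuckDB only accepts foreign keys at CREATE TABLE time, so you rebuild and rename):

    import duckdb  # pip install duckdb

    con = duckdb.connect("personal.duckdb")  # hypothetical database file

    # No ALTER TABLE ... ADD FOREIGN KEY (at least in the versions I tried),
    # so retrofitting a constraint means rebuilding the whole table:
    con.execute("""
        CREATE TABLE posts_new (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users (id),
            body    VARCHAR
        )
    """)
    con.execute("INSERT INTO posts_new SELECT * FROM posts")
    con.execute("DROP TABLE posts")
    con.execute("ALTER TABLE posts_new RENAME TO posts")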
It can be done, but it just takes you out of the flow when you're analyzing and cleaning the data, especially because I know I can do it with psql and SQLite in the blink of an eye.
Since many ETL tools don't care about the target database being these old and trusty fellas, I felt like I was losing a lot of time just to get rid of a Postgres install that is right now consuming only 200 MB of RAM in a docker/podman container. Or working around some SQLite ingestion issues with simple notebooks + pandas/polars/etc. (roughly the sketch below).
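The notebook workaround is nothing fancy, roughly this (file and column names are made up):

    import sqlite3

    import pandas as pd  # polars works the same way for this

    # Load one of the data-dump exports and do some light cleaning in the dataframe
    df = pd.read_json("takeout/messages.json")     # hypothetical dump file
    df["sent_at"] = pd.to_datetime(df["sent_at"])  # hypothetical column

    # Then push it into the trusty old databases; sqlite here,
    # postgres works the same via a SQLAlchemy engine
    with sqlite3.connect("personal.db") as con:
        df.to_sql("messages", con, if_exists="replace", index=False)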
From my POV it seems like shaky ground for an entire new stack.
I appreciate duckdb taking me out of the comfort zone tho.
Hi, Aleks here, one of the authors, and thank you very much for your comment.
We have been running this stack in production for the last few months, and it has its downsides (I would argue due to the young ecosystem) and upsides, which we try to explain in the blog. We wanted to concentrate more on the concepts and the technology changes/improvements that allow us to run such a stack, and explain how we see the future steps forward.
1. Running a warehouse is not a bad idea, but you must always be careful to separate storage from compute in order to scale. I experienced the limitations of such a system as described in this blog post https://delta.io/blog/2022-09-14-why-migrate-lakehouse-delta... -> it's a tough position if your solution is good but doesn't scale.
Ideally, run it with external tables so that the data is visible without engine access (see the sketch at the end of this comment).
Another limitation is the metastore for your tables and metadata, which you usually have per workspace/environment in such a scenario. Databricks' Unity Catalog is an excellent way to solve this, but it is only compatible with some engines.
2. We do not think this stack exists to replace Snowflake or BigQuery, but to take part of the workload (data transformation) away and let the PaaS solutions be good at what they are made for -> user interface and interaction.
- for a great software/data engineering/creation experience, we recommend a stack like this that is only on when needed (when transformations occur)
- for a great data consumption experience, we suggest integrating with an established PaaS platform - not only to keep the data available (as a serving layer for end users) but also because the proposed transformation layer lacks fine-grained RBAC
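To make the external-table point from 1. concrete, a minimal sketch (bucket/table names and credentials are placeholders): DuckDB writes the transformation result as plain Parquet on object storage, so any engine or serving PaaS can read it without ever touching DuckDB.

    import duckdb  # pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_access_key_id='<key>'")
    con.execute("SET s3_secret_access_key='<secret>'")

    # The result lands as partitioned Parquet on the object store, i.e. an
    # external table readable by Spark, BigLake, Snowflake external tables, ...
    con.execute("""
        COPY (SELECT * FROM transformed_orders)  -- hypothetical model output
        TO 's3://analytics-lake/orders'
        (FORMAT PARQUET, PARTITION_BY (order_date))
    """)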
Definite[0] is the cloud version of this idea (data stack in a box). We have ETL, modeling, a data warehouse (Snowflake), and BI (dashboards) in one app.
We're experimenting with using DuckDB as the warehouse. Would be awesome to let people pull down parts of their warehouse locally for testing.
Isn't that case easily handled by something like a large stateless VM and cloud storage, i.e. EC2 + S3? It doesn't have to be local; the point is that it doesn't have to be distributed either - just one large instance that is only on when it's needed.
What stops you from running this in a container and dumping the results with a script into BigQuery or Snowflake, to be queried by a reporting solution?
In fact, this is exactly what we argue: to keep the data consumer experience high, a PaaS platform (Fabric, BigLake, Databricks, SF, ...) can make a lot of sense, whereas for the best data development/creator experience a high-code solution based on solid software engineering best practices should be preferred - at least in my/our (the authors') opinion.
If you use BigQuery in BigLake mode - i.e. with Parquet, Iceberg, or Delta files on GCS (or another object store) - you can easily pull the data into DuckDB as well.
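For example, something along these lines (a sketch assuming HMAC keys and the S3-compatibility endpoint of GCS; the bucket path is made up):

    import duckdb  # pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # GCS is reachable through its S3-compatible endpoint with HMAC credentials
    con.execute("SET s3_endpoint='storage.googleapis.com'")
    con.execute("SET s3_access_key_id='<hmac key>'")
    con.execute("SET s3_secret_access_key='<hmac secret>'")

    # Query the same Parquet files BigLake points at, straight from the laptop
    df = con.execute(
        "SELECT count(*) AS n FROM read_parquet('s3://my-lake/events/*.parquet')"
    ).df()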
I've helped multiple data engineering teams that suffer tremendously due to the lack of a local setup, so DuckDB and the right choice of tools look immensely promising.
Also for 99% of companies and data workloads the optimal choice should usually be a janky bash script, so this provides meaningful competition for annoyingly medium data.
A remote setup was enough, but our dev environments were never quite 100% the same as prod because copying all the data would have been too expensive, and the remote environments always had a little higher latency than local machines... just a little bit worse in ways that annoyed people.
If you're interested in interrogating dbt models using DuckDB, you may want to check out some new dbt features we've recently released at https://count.co
Dagster (an orchestration tool) is complementary to dbt Core (the open source CLI tool for defining data transformation jobs) but competitive with dbt Cloud (a SaaS that orchestrates dbt jobs).
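A minimal sketch of that complementary setup with the dagster-dbt integration (recent versions; the project path is made up and the dbt manifest is assumed to be compiled):

    from pathlib import Path

    from dagster import AssetExecutionContext, Definitions
    from dagster_dbt import DbtCliResource, dbt_assets

    DBT_PROJECT_DIR = Path("my_dbt_project")  # hypothetical dbt project location

    # Every model in manifest.json shows up as a Dagster asset
    @dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
    def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
        # Dagster just shells out to the open source dbt CLI - no dbt Cloud involved
        yield from dbt.cli(["build"], context=context).stream()

    defs = Definitions(
        assets=[my_dbt_models],
        resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
    )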
Hi, I am working on something similar and was looking for ways to host my open data. The approach seems interesting - can I reach out to you somewhere to discuss this more?
Tools like Dagster, dbt, and AWS Glue always strike me as built for BI/in-house analytics over user & app-generated data.
What do people use for ingesting data from end users, handling their transactional CRUD on it too, and then serving OLAP queries triggered (not written) by and for those same end users?
I only skimmed the article due to its length. But I didn't see anything like a comparison to other toolchains. Like, how is this better/different/worse than Airflow + dbt + Snowflake?
Snowflake isn't local; you have to pay for cloud... Airflow is Airflow: complexity and a steep learning curve. There's a whole industry trying to be as powerful as (or more powerful than) Airflow with less complexity and cleaner integration with modern things you might want to do, like k8s - although of course Airflow is still super popular and powerful.
The Dagster folks have some comparisons; of course there are popular modern alternatives other than Dagster too.
Far more helpful to me than these essays on the benefits of a particular paradigm is a simple, minimal example in a repo that I can dig into and explore.
It just seems like one choice among many. The dagster-dbt library in particular seems like a slightly pointless wrapper around the dbt CLI and the JSON artifacts it creates. I've been using Airflow and astronomer-cosmos, which isn't perfect, but I didn't enjoy using Dagster the last time I tried it.
We (Aleksandar and Georg) want to share our new blog post, "Dagster, dbt, DuckDB as new local MDS", here.
It re-introduces the local environment to enhance developer productivity for data pipelines by bringing back software engineering best practices.
We suggest that PaaS platforms should become an implementation detail, and we refine the new local stack with a great data consumer experience by combining the best of both worlds.
Hey Georg, thanks for posting. I've been working on building a thing with almost the same stack lately and the Dagster integration is my next large-ish step. In the last couple weeks I've kicked around Duck and Clickhouse for the backend. It's been a lot of fun. Ultimately I'd like something that can be run locally or easily installed and run on a PaaS.
The only trick with local is that data sets of any appreciable size take ages to pull down, at least here in the US with our terrible internet (in the average Italian mountain village this would probably work great).
Well, local can mean your laptop, but it could also mean your local server - or even a VM on your cloud provider of choice, colocated with the object store.
In that case the network transfer can (in many cases) be almost irrelevant - at least for up to medium-ish sized datasets.
I use DuckDB; it's a great piece of tech. The only way I could figure out how to use it, though, was to read the source (and I found a few interesting bugs while doing so).
As the other poster mentioned, DBT is an open-core product, and I'm not sure how you'd use the cloud service with a local DB install anyway. Airflow and DBT don't really do the same things either.
DBT is open source and can be used entirely for free. They sell a hosted version.
Also DBT and Airflow are different tools, aimed at solving completely different problems. The Airflow competitor in this article is Dagster, which is also open source.