The new local data stack: Integrating Dagster, dbt and DuckDB (georgheiler.com)
144 points by geoHeil 9 months ago | 50 comments



It's a cool idea, but it looks incomplete for the production use case.

1. Usually, you want some warehouse running all the time: bring data in through ETL, run transformations, and report on it. That goes against a purely local environment. Ideally, I would love a cloud warehouse that each engineer could easily fork to their laptop.

2. Almost all companies already have some data setup, and the migration path is very unclear. Most likely, this would be a secondary system for the majority of companies. Ideally, I would love a description of how I can use it alongside the big platforms (e.g. BigQuery or Snowflake).


I could add more, but on a newish local stack I was trying: I spent a good number of hours on duckdb this week processing personal data from data dumps (social networks, etc.), and now I'm back to the combo of postgresql in containers + sqlite.

After the initial imports and some massaging with queries that felt awesome, I found it hard to step up the game and build the relationships I wanted. The last straw before switching was not being able to manage foreign keys without recreating entire tables. I can go over other examples.
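To make that concrete, this is roughly the dance I mean (table and column names are made up; just a sketch of the workaround):

    import duckdb

    con = duckdb.connect("dumps.duckdb")
    # DuckDB (at least the versions I tried) has no ALTER TABLE ... ADD FOREIGN KEY,
    # so adding a constraint after the fact means rebuilding the whole table:
    con.execute("""
        CREATE TABLE posts_new (
            id        BIGINT PRIMARY KEY,
            author_id BIGINT REFERENCES users (id),  -- the FK I actually wanted
            body      VARCHAR
        )
    """)
    con.execute("INSERT INTO posts_new SELECT id, author_id, body FROM posts")
    con.execute("DROP TABLE posts")
    con.execute("ALTER TABLE posts_new RENAME TO posts")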

It can be done, but it just takes you out of the flow when you're analyzing and cleaning the data, especially because I know I can do it with psql and sqlite in the blink of an eye.

Since many ETL tools don't care whether the target database is one of these old and trusty fellas, I felt like I was losing a lot of time just to get rid of a postgres install that is right now consuming only 200 MB of RAM in a docker/podman container, or to work around some sqlite ingestion issues with simple notebooks + pandas/polars/etc.

In my view it seems like shaky ground for an entire new stack.

I appreciate duckdb taking me out of the comfort zone tho.


Hi, Aleks here, one of the authors. Thank you very much for your comment.

We have been running this stack in production for the last few months, and it has its downsides (I would argue due to the young ecosystem) and upsides, which we try to explain in the blog. We wanted to concentrate more on the concepts and the technology changes/improvements that allow us to run such a stack, and to explain how we see the next steps forward.

1. Running a warehouse is not a bad idea, but you must always be careful to separate storage from compute in order to scale. I experienced the limitations of such a system as described in this blog https://delta.io/blog/2022-09-14-why-migrate-lakehouse-delta... -> a tough position if your solution is good but does not scale. Ideally, run it with external tables so that the data is visible without engine access.

Another limitation is the metastore for your tables and metadata, which you usually have per workspace/environment in such a scenario. Databricks' unity catalog is an excellent way to solve it, but it is only compatible with some engines.

2. We do not think that this stack exists to replace Snowflake or BigQuery, but to take part of the workload away (data transformation) and let the PaaS solutions be good at what they are made for -> user interface and interaction.


Hi,

Georg, one of the authors, here.

What we argue is that:

- for a great software/data engineering/creation experience, we recommend a stack like this that is only on when needed (when transformations occur)

- for a great data consumption experience, we suggest integration with an established PaaS platform - not only for the sake of availability (as a serving layer of data to end users), but also because of the missing fine-grained RBAC in the proposed transformation layer


Definite[0] is the cloud version of this idea (data stack in a box). We have ETL, modeling, a data warehouse (Snowflake), and BI (dashboards) in one app.

We're experimenting with using DuckDB as the warehouse. Would be awesome to let people pull down parts of their warehouse locally for testing.

0 - https://www.definite.app/


Isn't that case easily handled by something like a large stateless VM plus cloud storage, i.e. EC2 + S3? It doesn't have to be local; the point is that it doesn't have to be distributed either, just one large instance that is only on when it's needed.


What stops you from running this in a container and dumping the results with a script into BigQuery or Snowflake, to be queried by a reporting solution?


Hi,

Georg, one of the authors, here.

In fact, this is exactly what we argue: to keep the data consumer experience high, a PaaS platform (Fabric, BigLake, Databricks, SF, ...) can make a lot of sense, whereas for the best data development/creator experience a high-code solution based on solid software engineering best practices should be preferred - at least in my/our (the authors') opinion.
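To answer the concrete question: nothing stops you. A minimal sketch of that kind of hand-off script (table, file, and dataset names are placeholders, not from the post) could look like this:

    import duckdb
    from google.cloud import bigquery

    # export the transformed DuckDB table to a parquet file ...
    con = duckdb.connect("warehouse.duckdb")
    con.execute("COPY (SELECT * FROM main.orders) TO 'orders.parquet' (FORMAT PARQUET)")

    # ... and push it into BigQuery as the serving/reporting layer
    # (assumes GOOGLE_APPLICATION_CREDENTIALS is configured)
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_TRUNCATE",
    )
    with open("orders.parquet", "rb") as f:
        client.load_table_from_file(
            f, "my-project.reporting.orders", job_config=job_config
        ).result()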


I already see companies with BigQuery set up.

I would like to quickly grab some data, queries, and dashboards and run them in my local warehouse.


If you use BigQuery in BigLake mode - i.e. with parquet, iceberg, or delta files on GCS (or another object store) - you can easily pull the data into DuckDB as well.
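Roughly like this, via DuckDB's httpfs extension against GCS's S3-compatible endpoint (bucket name and HMAC keys below are placeholders):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # GCS speaks the S3 protocol when you use HMAC keys
    con.execute("SET s3_endpoint = 'storage.googleapis.com'")
    con.execute("SET s3_access_key_id = '<hmac-key>'")
    con.execute("SET s3_secret_access_key = '<hmac-secret>'")
    df = con.execute(
        "SELECT * FROM read_parquet('s3://my-biglake-bucket/orders/*.parquet') LIMIT 100"
    ).df()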


I've helped with multiple data engineering teams that suffer tremendously due to a lack of local setup, so DuckDB and the right choice of tools looks immensely promising.

Also for 99% of companies and data workloads the optimal choice should usually be a janky bash script, so this provides meaningful competition for annoyingly medium data.


> I've helped with multiple data engineering teams that suffer tremendously due to a lack of local setup

Why is a remote setup not enough for those teams?


A remote setup was enough, but our dev environments were never quite 100% the same as prod because copying all the data would have been too expensive, and the remote environments always had just a little higher latency than local machines... just a little bit worse in ways that annoyed people.


If you're interested in interrogating dbt models using DuckDB, you may be interested in some new dbt features we've recently released at https://count.co


Your link probably doesn't lead where you want it to btw.


Thanks for the heads up - a page more focussed on dbt specifically is here:

https://count.co/product/data-modeling/

and associated blog post:

https://count.co/blog/how-we-made-data-modeling-multiplayer/


Thanks! I was logged into count so I just got sent straight to my homepage as well. Seems interesting


On Dagster’s website there are comparisons to dbt. Is it complementary (as positioned in this stack) or competitive?

Dagster versus dbt https://dagster.io/vs

Migrating off dbt cloud https://dagster.io/blog/migrate-off-dbt-cloud


Dagster (orchestration tool) is complementary to dbt core (open source CLI tool for defining data transformation jobs) but competitive to dbt cloud (SaaS, orchestrates dbt jobs).


you can run dbt core via a dagster job (so it'd replace dbt cloud)
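Roughly like this with the dagster-dbt integration (project and manifest paths are placeholders, and the exact API has moved around a bit between versions):

    from dagster import AssetExecutionContext, Definitions
    from dagster_dbt import DbtCliResource, dbt_assets

    # every model in the dbt manifest shows up as a Dagster asset
    @dbt_assets(manifest="my_dbt_project/target/manifest.json")
    def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
        yield from dbt.cli(["build"], context=context).stream()

    defs = Definitions(
        assets=[my_dbt_models],
        resources={"dbt": DbtCliResource(project_dir="my_dbt_project")},
    )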


I've been working with this stack (building Open Data Portals¹) for a few months and am super happy with how well everything plays together.

¹ https://github.com/davidgasquez/gitcoin-grants-data-portal


Hi, I am working on something similar and was looking for ways to host my open data. The approach seems interesting; can I reach out to you somewhere to discuss this further?


Sure! You can find my contact details on GitHub: https://github.com/davidgasquez.


Tools like Dagster, dbt, and AWS Glue always strike me as built for BI/in-house analytics over user & app-generated data.

What do people use for ingesting data from end users, handling their transactional CRUD on that data, and then serving OLAP queries triggered (not written) by and for those same end users?


There are a bunch of collector tools for capturing application-side events and sending them to your server.

The common stack is then to ingest them from Kafka into an OLAP datastore like ClickHouse or Pinot.

Then I think you are referring to what is sometimes called “user facing analytics” against that datastore.

We like to use a tool called Cube.dev as middleware between the app and OLAP database which gives APIs and a semantic layer.
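For the Kafka -> ClickHouse leg, the usual pattern is a Kafka-engine table feeding a MergeTree table through a materialized view; a minimal sketch (broker, topic, and column names are made up):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # Kafka-engine table: a consumer, not storage
    client.command("""
        CREATE TABLE events_queue (user_id UInt64, event_type String, ts DateTime)
        ENGINE = Kafka
        SETTINGS kafka_broker_list = 'localhost:9092',
                 kafka_topic_list = 'app_events',
                 kafka_group_name = 'clickhouse_ingest',
                 kafka_format = 'JSONEachRow'
    """)

    # the table user-facing queries actually hit
    client.command("""
        CREATE TABLE events (user_id UInt64, event_type String, ts DateTime)
        ENGINE = MergeTree ORDER BY (user_id, ts)
    """)

    # the materialized view moves rows from the queue into storage as they arrive
    client.command("""
        CREATE MATERIALIZED VIEW events_mv TO events AS
        SELECT user_id, event_type, ts FROM events_queue
    """)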



'user-facing analytics' and (via Cube.dev) 'embedded analytics' are helpful terms, thank you!


+1 for using a semantic layer. I built (and open-sourced) one for use in my own business but Cube looks well done.

https://github.com/totalhack/zillion


For the ingesting bit, we (Polytomic) have an API you can use: https://www.polytomic.com/connect.


Bunch of python code.


I only skimmed the article due to its length. But I didn't see anything like a comparison to other toolchains. Like, how is this better/different/worse than Airflow + dbt + Snowflake?


It does feel like it was lengthened with GPT. There are so many essay-style "now we propose to show that..." sentences, it's uncanny.

> we argue that we can rethink the current state of data transformation pipelines (intro)

> In the blog we will cover:

> We show a way how you can combine the best of both worlds

> We will tackle its impact and explain more in the following development section.

edit: yeah it's ChatGPT:

> Parts of this text were adeptly generated by ChatGPT but enhanced by real humans.

I guess it's the future, turn a tweet into a 5-page essay with your AI, so your readers can summarize it back to tweet-length with their AI.


Hi,

Georg, one of the authors, here.

Indeed, we used LLMs/GPT-4 for proofreading and enhancing the English (we are not native speakers).

We were thinking about breaking up the content - but decided that one long post is a better fit.

> > We show a way how you can combine the best of both worlds

In fact, that sentence I wrote by hand : )


Snowflake isn't local; you have to pay for cloud... Airflow is Airflow: complexity, a steep learning curve, and a whole industry trying to be as powerful as (or more powerful than) Airflow with less complexity and cleaner integration with modern things you might want, like k8s - although of course Airflow is still super popular and powerful.

The Dagster folks have some comparisons; of course there are popular modern alternatives other than Dagster.

https://dagster.io/blog/dagster-airflow

https://dagster.io/vs/dagster-vs-airflow


Far more helpful to me than these essays on the benefits of a particular paradigm is a simple, minimal example in a repo that I can dig into and explore.


It just seems like one choice among many. The dagster-dbt library in particular seems like a slightly pointless wrapper around the dbt CLI and the JSON artifacts it creates. I've been using Airflow and astronomer-cosmos, which is not perfect, but I didn't enjoy using Dagster last time I tried it.


Interesting - why? What were the pain points?


We (Aleksandar and Georg) want to share our new blog post, "Dagster, dbt, DuckDB as new local MDS", here. It re-introduces the local environment to enhance developer productivity for data pipelines by bringing back software engineering best practices. We suggest that PaaS platforms should become an implementation detail, and we refine the new local stack with a great data consumer experience by combining the best of both worlds.

Please take a read here: https://georgheiler.com/2023/12/11/dagster-dbt-duckdb-as-new...

We (Georg Heiler and Aleksandar Milicevic) are keen to discuss the proposed new stack with you. Do not hesitate to reach out.


Hey Georg, thanks for posting. I've been working on building a thing with almost the same stack lately and the Dagster integration is my next large-ish step. In the last couple weeks I've kicked around Duck and Clickhouse for the backend. It's been a lot of fun. Ultimately I'd like something that can be run locally or easily installed and run on a PaaS.

The only trick with local is that data sets of any appreciable size take ages to pull down, at least here in the US with our terrible internet (in the average Italian mountain village this would probably work great).


Well, local can mean your laptop, but it could also mean your local server. It can even mean a VM on your cloud provider of choice, colocated with the object store.

In such a case the network transfer can (for many cases) be almost irrelevant - at least up to medium-ish sized datasets.


Funny thing, I'm trying to get going with Dagster today like I said, and figuring out how to debug is a bit... annoying. The most helpful post I've come across so far just happens to be https://georgheiler.com/2022/02/02/interactive-dagster-debug...

So thanks!
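In case it helps anyone else, a minimal sketch of the in-process trick (the asset name is made up): materialize assets from a plain Python script so breakpoints in an IDE debugger just work:

    from dagster import asset, materialize

    @asset
    def my_table():
        # set a breakpoint here and run this file under your IDE's debugger
        return [1, 2, 3]

    if __name__ == "__main__":
        # executes everything in the current process instead of via `dagster dev`
        result = materialize([my_table])
        assert result.success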


I use DuckDB; it's a great piece of tech. The only way I could figure out how to use it, though, was to read the source (I found a few interesting bugs while doing so as well).


No code is fully free of bugs. Did you submit an issue or PR?


I didn't file all of them because I found them on the job.

I submitted one bug report, for the thing I didn't know how to fix.


Which bug?


Looks cool, but I don't like being pushed toward paid managed services like dbt. I'd rather use something like Airflow, but it's not as easy to use for most teams.


As the other poster mentioned, dbt is an open-core product, and I'm not sure how you'd use the cloud service with a local DB install anyway. Besides, Airflow and dbt don't really do the same things.


DBT is open source and can be used entirely for free. They sell a hosted version.

Also DBT and Airflow are different tools, aimed at solving completely different problems. The Airflow competitor in this article is Dagster, which is also open source.


I'm not sure those two solve the same problem. Regardless, Airflow is an amazingly useful tool.


dbt-core is open source and can be used locally.



