Show HN: Dlt – Python library to automate the creation of datasets (colab.research.google.com)
114 points by MatthausK on Oct 25, 2023 | 54 comments
Hi HN,

We're Anna, Adrian, Marcin and Matt, developers of dlt. dlt is an open source library to automatically create datasets out of messy, unstructured data sources. You can use the library to move data from just about anywhere into most well-known SQL and vector stores, data lakes, storage buckets, or local engines like DuckDB. It automates many cumbersome data engineering tasks and can be handled by anyone who knows Python.

Here’s our Github: https://github.com/dlt-hub/dlt

Here’s our Colab demo: https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-...

— — —

In the past we wrote hundreds of Python scripts to fit messy data sources into something that you can work with in Python - a database, a Pandas DataFrame or just a Python list. We were solving the same problems and making similar mistakes again and again.

This is why we built an easy-to-use Python library called dlt that automates most data engineering tasks. It hides the complexities of data loading and automatically generates structured, clean datasets for immediate querying and sharing.

— — —

At its core, dlt removes the need to create dataset schemas, react to changing data, generate append or merge statements, and move the data in a transactional and idempotent manner. Those things are automated and can be declared right in the Python code, just by decorating functions.

Add the @dlt.resource decorator, give it a few hints, and convert any data source into a simple pipeline that creates and updates datasets.
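For illustration, a minimal sketch of such a decorated function, using a toy in-memory source (the resource name and sample rows are made up; dlt.resource and dlt.pipeline are the library's public entry points):

    import dlt

    # a few hints: table name, how to write, and the primary key used for merges
    @dlt.resource(name="users", write_disposition="merge", primary_key="id")
    def users():
        # any iterable of dicts works - a generator hitting a real API would go here
        yield [{"id": 1, "name": "Anna"}, {"id": 2, "name": "Marcin"}]

    # load into a local DuckDB file; dlt infers and migrates the schema
    pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb", dataset_name="crm")
    info = pipeline.run(users())
    print(info)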

dlt gets the details out of your way:

1. You do not need to worry about the structure of a database or parquet files

dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put some data contracts and Pydantic models on top to keep your data clean.

2. You do not need to write any INSERT/UPDATE or data copy statements

dlt will push the data to DuckDB, Weaviate, storage buckets and many popular SQL stores. It will align the data types, file formats, and identifier names automatically.

3. You do not need to worry about adding new data or updating existing records.

dlt lets you declare how the data should be loaded and how to load it incrementally, and keeps the loading state together with the data so they are always in sync.

4. You keep the way you develop and test your code

Iterate and test quickly on your laptop or in a dev container. Run locally on DuckDB and just swap the destination name to go to the cloud - your code, schema and data will stay the same (see the sketch after this list).

5. You can work with data on your laptop.

Combine dlt with other tools and libraries to process data locally. duckdb, Pandas, Arrow tables and Rust-based loading libraries like ConnectorX work nicely with dlt and process data blazingly fast compared to the cloud.

6. You do not need to worry if your pipeline will work when you deploy it.

dlt is a minimalistic Python library, requires no backend and works wherever Python works. You can fine-tune it to work in constrained environments like AWS Lambda, or run it with Airflow, GitHub Actions or Dagster.
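As a rough sketch of points 3 and 4 together (the tickets resource, the fetch_tickets helper and the updated_at cursor column are hypothetical; dlt.sources.incremental and the destination swap follow the documented API):

    import dlt

    def fetch_tickets(since):
        # stand-in for a real API call; returns rows changed after `since`
        return [{"id": 1, "status": "open", "updated_at": "2023-10-01T12:00:00Z"}]

    @dlt.resource(write_disposition="merge", primary_key="id")
    def tickets(updated_at=dlt.sources.incremental("updated_at", initial_value="2023-01-01T00:00:00Z")):
        # dlt stores the cursor, so only records newer than the last run are requested
        yield from fetch_tickets(since=updated_at.last_value)

    # develop and test locally against DuckDB...
    pipeline = dlt.pipeline(pipeline_name="support", destination="duckdb", dataset_name="tickets")
    pipeline.run(tickets())

    # ...then swap only the destination name to go to the cloud, e.g.:
    # pipeline = dlt.pipeline(pipeline_name="support", destination="bigquery", dataset_name="tickets")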

dlt has an Apache 2.0 license. We plan to make money by offering organizations a paid control plane, where dlt users can track and set policies for what every pipeline does, manage schemas and contracts across the organization, create data catalogues, and share them with team members and customers.




All of this sounds very appealing, both as a standalone tool and as a complement to something like dbt.

However the following seems like kind of an anti-feature, which I at least would want the option to disable:

> You do not need to worry about the structure of a database or parquet files

> dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put some data contracts and Pydantic models on top to keep your data clean.

This is the opposite of what I want in 99% of projects. Most of the time, there is some kind of well-defined schema, even if it changes a little bit over time. If that schema is going to be depended upon by something like a data warehouse ELT pipeline, I want precise control over it. I do not want to hand that off to an opaque library.

Moreover, the work of actually writing out the schema is like 1% of the overall effort in consuming a new data source, and usually it turns out to be a constructive, useful exercise in pinning down assumptions, finding gaps in understanding, etc. So I see little benefit in hiding it.

A schema essentially forms a business-critical contract between two major sections of the overall data pipeline, and that is absolutely not something I want to be changing dynamically without my explicit understanding and consent.

This reminds me of the temptation I have seen in some developers (several of them ostensibly "senior") to use MongoDB for a straightforward CRUD-like application. The argument that it's schema-less to me is a striking anti-feature, something I explicitly do not want!

The only time I really want this is in the rare and atypical case where I truly have no schema at all, or the schema is changing erratically and frequently in ways that I cannot reasonably anticipate and/or cannot dedicate developer resources to accommodating. That's a niche case that most people flatly do not have. Of course it's nice when a tool supports the niche use case that is very hard to deal with by conventional means (see also: OpenRefine), but it should absolutely not be the default and our tools should not encourage us to lie to ourselves that it's something we want or need.

If you just want to reduce manual grunt work effort, consider something like generating a schema from an OpenAPI specification / JSONSchema.


Ahh, good old manual fine-tuning and maintenance. We are adding data contracts for things like event ingestion, where the schema needs to be strict, or cases where you know ahead of time what to expect.

Our experience comes from startups that usually do not have time to track down the knowledge and would rather go out and find/make their own. Here you definitely want evolution with alerts before curation - so load to raw, and curate from there. Picking data out of something without a schema is called "schema on read" and you can read about its shortcomings. So this is both robust and practical.

For fine-tuning, as I mentioned, data contracts are a PR review and some tweaks away. They will be highly configurable between strict, rule-based evolution, or free evolution. Definitely use alerts for curation of evolution events!


Fair enough, especially if explicit alerting is involved.

Have you considered a hybrid solution, something that generates a contract from a large corpus of data, which can then be deployed statically?

I consider "responding to change" as a somewhat different scenario from "heterogeneous but not changing". So statically generating a contract from an existing corpus supports the latter.

I could also envision some kind of graceful degradation, where you have a static contract, but you have dynamic adjustments instead of outright failures if the data does not conform to that contract.


I worked with the dlt guys on exactly that: using OpenAI functions to generate a schema for the data based on the raw data structure. You can check that work here: https://github.com/topoteretes/PromethAI-Memory It's in the level 1 folder.


we actually spent several weeks writing an openAPI -> dlt pipeline converter. You can check what we've got here: https://github.com/dlt-hub/dlt-init-openapi

we'll continue this project, but I learnt from it that most openAPI specs are a mess, with hundreds of endpoints, incomplete definitions, lack of relations between endpoints, unique constraints etc., so there's tons of heuristics needed anyway. but sometimes it works. and is quite amazing!

If your source has a well-defined schema, we support e.g. arrow tables natively and keep 100% of that schema: https://dlthub.com/docs/blog/dlt-arrow-loading

If you want to define your own schemas you can do it in many different ways:
- via pydantic models: https://dlthub.com/docs/general-usage/resource#define-a-sche...
- via json-schema like definitions: https://dlthub.com/docs/general-usage/resource#define-schema
- in a schema file: https://dlthub.com/docs/walkthroughs/adjust-a-schema
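A quick sketch of the pydantic option from those docs (the User model and its fields are invented for illustration):

    import dlt
    from pydantic import BaseModel

    class User(BaseModel):
        id: int
        name: str
        email: str

    # the model doubles as the table definition for the resource
    @dlt.resource(name="users", columns=User)
    def users():
        yield {"id": 1, "name": "Anna", "email": "anna@example.com"}

    pipeline = dlt.pipeline(pipeline_name="contacts", destination="duckdb", dataset_name="crm")
    pipeline.run(users())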

If you want to enforce schemas and data contracts:
- you can use pydantic models to validate data (if you use a pydantic model as a table definition, this is the default)
- we have a soon-to-be-merged schema contract PR: https://github.com/dlt-hub/dlt/pull/594

My observation is that more than 1% of people are fine with auto-generated schemas. But that could be selection bias (they use our library because they like it).


Tbh I found it confusing that you suddenly started using chatgpt in the middle of the example. It made it seem like this is a gpt-based tool, but it's not.


Thank you for the feedback! I can see now how it could be confusing.

The reason we used chatgpt is because it's an easy starting point - why read through examples when you can get the one you want in seconds?

Because dlt is a library, it's closer to how language works and gpt can just use it - from our experiments, we cannot say the same about frameworks.


That example looks completely opaque to me. Not only does it obfuscate what's actually happening by using some incomplete code from a chatbot, but it also isn't actually relevant to the task at hand, which is to demonstrate your library, not to demonstrate some beginner level API data access. Skimming over it, I couldn't tell where your library actually got involved at all; it just looked like a couple of functions to access data, followed by links to your documentation. I suggest dumping the whole thing and starting with a more coherent demo that focuses on the features of the tool you actually built, not on features of irrelevant systems.


Yes I agree. Based on what they show compared to what they say, I'm not really sure what this library actually does.


This is nice. Pulling data from an API and putting it in a SQL database should be a simple everyday task but the tools for this are 99% massive overkill. Great to see a simple library for a simple job.


Pulling from and into production databases is one of the early favourites from our dlt user base. Some reasons explained here in this MongoDB example (https://dlthub.com/docs/blog/MongoDB-dlt-Holistics)


This is a really cool project—congrats! A somewhat related project that I worked on at MongoDB is PyMongoArrow, it does some of the same transformations to take unstructured MongoDB data and convert it to tabular formats like Arrow data frames. I’m curious what the support for BSON types that do not map cleanly to JSON types looks like? One example I can think of off the top of my head is Decimal128


> dlt is a minimalistic Python library, requires no backend and works wherever Python works. You can fine-tune it to ... run with ... Dagster.

Relating to dagster in particular, this is in your docs:

dlt incorporates the concept of implicit extraction DAGs to handle the dependencies between data sources and their transformations automatically. A DAG represents a directed graph without cycles, where each node represents a data source or transformation step.

When using dlt, the tool automatically generates an extraction DAG based on the dependencies identified between the data sources and their transformations. This extraction DAG determines the optimal order for extracting the resources to ensure data consistency and integrity.

How do you think about tying this and Dagster together and running them?


There are multiple ways to run them together - we will show a few in a demo coming out soon.

We also consider a tighter integration, like the one with Airflow described here, as a possible next step: https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deplo...

We will gauge the interest incrementally so as not to build any plugins that don't end up being used.


For an example of prior art, you should look into Astronomer's Cosmos library to see how they integrate dbt into Airflow.


Thank you! That's the example we looked at for our dlt-airflow integration :) the dlt dag becomes an airflow dag.


We are one of the early adopters and really like the “weight” of DLT. It’s heavy enough to be adding substantial value over homegrown scripts for extraction and loading. At the same time, it is light enough to be easy to add to a pre-existing data stack.


Thanks for your vote of confidence & support Max!


Very cool project, @MatthausK!

What are your thoughts on reducing LLM cost?

We are also exploring LLM-based data wrangling using EvaDB and cost is an important concern [1, 2, 3].

[1] https://github.com/georgia-tech-db/evadb

[2] https://medium.com/evadb-blog/stargazers-reloaded-llm-powere...

[3] https://github.com/pchunduri6/stargazers-reloaded


In the same way that dbt has established a foundational format for relational data transformation, dlt seems poised to establish a more ubiquitous solution for batch data pipelines. Native upsert "relational-ization" with configurability is a huge challenge that data engineers have been fighting for decades. Obviously, this library will not replace Kafka setups or more intensive streaming jobs, but there are so many custom CRON scripts in the universe doing exactly this. Would be a big win to see some standardization to Python and an OSS tool like this.


I've been fiddling around with meltano and finicky singer taps for 2 weeks, and today I found dlt. What a breath of fresh air -- thank you for the work! I think this is the future over the meltano and singer ecosystem -- the code quality, simplicity and ease of integration into my project, and documentation really makes this project shine.


brilliant. This is exactly the solution I always hoped for for the exact problem you describe. My specific use-case has always been reading news articles from a set of heterogeneous websites and consolidating them into one db.

why choose duckdb over sqlite?


Duckdb is analytical and gained popularity with the analytics crowd. It has multiple features that make it play well with use cases in that ecosystem, such as aggregation speed, parquet support, etc.


Potential conflict with another quite popular Python library: https://dlthub.com/docs/intro


It's literally the same library though


A similar tool was discussed a few days ago. OpenRefine. https://news.ycombinator.com/item?id=37970800


dlt is a Python library that you could probably plug into the OpenRefine Java application to enable moving the data somewhere easily and into different formats, making OpenRefine more useful in a connected environment.

I would not say they are similar - rather OpenRefine is made for visual data cleaning, while dlt is made for automation of data movement with structuring and typing to enable crossing different format standards with ease.

Together they should give you a good combination of automation and manual tweaking options if needed.


I'd say they're complimentary. One could use dlt to load the data and then use OpenRefine to clean/transform it. dlt already does this when combined with dbt, for example.


For future reference, that's "complementary". By far one of the most common mistakes (at least, among relatively uncommon words).


Yes it's pretty common :). I did notice the funny looking spelling after I hit reply but there was no way to edit it.


Interesting. Usually editing is enabled for some window of time, but perhaps not for new accounts.


Nicely done! Looks lightweight and powerful - especially the schema auto-generation and flattening will make many raw data tables look much cleaner.


Nice, the demo is cool. How do you differentiate from DBT?


First you extract and load data with dlt, and then you transform it with dbt, so both tools work very well together. We did a really cool helper to make it easier (https://dlthub.com/docs/dlt-ecosystem/transformations/dbt/). (I'm one of the core devs at dltHub.)
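For reference, a rough sketch of how that helper is used, per those docs (the raw rows and the "dbt_shop" package path are hypothetical):

    import dlt

    # extract & load some raw data first
    pipeline = dlt.pipeline(pipeline_name="shop", destination="duckdb", dataset_name="raw")
    pipeline.run([{"id": 1, "total": 10.5}], table_name="orders")

    # then point the dbt helper at a dbt package and run its models on the loaded data
    dbt = dlt.dbt.package(pipeline, "dbt_shop")
    models = dbt.run_all()
    for m in models:
        print(m.model_name, m.status)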


Can the Destinations also natively be used as Sources?


1) Yes. We support all the databases and buckets as data sources as well. Some examples:
- get data from any sql database: https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_d... or https://dlthub.com/docs/getting-started#load-data-from-a-var...
- do it super quickly with pyarrow: https://dlthub.com/docs/examples/connector_x_arrow/
- get data from any storage bucket: https://github.com/dlt-hub/verified-sources/tree/master/sour...

2) Strictly technical answer: on the code level sources and destinations are different Python objects, so the answer is no :) but as a user you rarely deal with them directly when coding.
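As a sketch of the first option, assuming the sql_database verified source has been added to the project with `dlt init sql_database duckdb` (the connection string and table names are placeholders):

    import dlt
    from sql_database import sql_database  # module scaffolded by `dlt init sql_database duckdb`

    # read selected tables from a source Postgres database...
    source = sql_database(
        credentials="postgresql://user:password@localhost:5432/shop",
        table_names=["orders", "customers"],
    )

    # ...and load them, schema included, into a local DuckDB copy
    pipeline = dlt.pipeline(pipeline_name="replica", destination="duckdb", dataset_name="shop_copy")
    pipeline.run(source)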


Do you plan on integrating with metadata sources as well such as Amundsen or Datahub? Or is the plan that DLT will become the metadata source?


Since dlt generates a schema, tracks evolution, contains lineage, and follows the data vault standard, it can easily provide metadata or lineage info to other tools.

At the same time, dlt is a pipeline building tool first - so if people want to read metadata from somewhere and store it elsewhere, they can.

If you mean taking in metadata the way we integrate with arrow - it remains to be seen whether the community wants this or finds it useful. We will not develop plugins for collecting cobwebs, but if there are interested users we will add it to our backlog.


Thanks for the response. I also noticed there was a mention of data contracts or Pydantic to keep your data clean. Would it make sense to embed that as part of a DLT pipeline or is the recommendation to include it as part of the transformation step?


You can use pydantic models to define schemas and validate data (we also load instances of the models natively): https://dlthub.com/docs/general-usage/resource#define-a-sche...

We have a PR (https://github.com/dlt-hub/dlt/pull/594) that is about to be merged and makes the above highly configurable, between evolution and hard stopping:
- you will be able to totally freeze the schema and reject bad rows
- or accept data for existing columns but not new columns
- or accept some fields based on rules


You can request a source or a feature by opening an issue on the sources/dlt repos: https://github.com/dlt-hub


DataHub is AMAZING!!!! I can't believe it slipped under my radar! Thanks for sharing/prompting me.


Love this!


As an FYI, Databricks' "Delta Live Tables" product keeps being contracted to "DLT" at my employer, so you have some potential naming confusion in your future.


Thank you for the heads up!

It is unfortunate, and with 3-letter acronyms this will happen.

An easy way to remember is that we are the one you can pip install and that plays well in the ecosystem.

Databricks has interesting choices in marketing names, ngl:
- DLT, named after the competing dbt
- renaming standards like raw/staging/prod to Bronze, Silver, Gold


I had a related problem when searching for "dlt on aws lambda"... Google thinks "dlt" is an abbreviation of "delete" and returns results accordingly


for now :) Thanks for pointing it out - and it looks like we should add an aws lambda guide too :)

If you want to deploy to lambda, try asking in the slack community, some folks there do it.

Or if you wanna try it yourself, here is a similar guide that highlights some concerns about deploying on GCP Cloud Functions: https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deplo...
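In the meantime, a bare-bones, hypothetical handler could look roughly like this (the handler shape, the /tmp working directory and loading the raw event itself are assumptions, not an official recipe - a real deployment would point at a cloud destination):

    import dlt

    def handler(event, context):
        # Lambda only allows writes under /tmp, so keep dlt's working files there
        pipeline = dlt.pipeline(
            pipeline_name="lambda_demo",
            destination="duckdb",
            dataset_name="events",
            pipelines_dir="/tmp/dlt",
        )
        load_info = pipeline.run([event], table_name="raw_events")
        return {"status": "ok", "info": str(load_info)}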


We hear a lot about dlt & AWS Lambda. We currently have one user working on this use case (see our Slack https://dlthub-community.slack.com/archives/C04DQA7JJN6/p169...)


Sorry for not being interested in what your library does; I like looking at how different Python packages go about packaging / installation / testing and other infra stuff.

Having a Makefile is kind of... unusual. A hand-written one even more so.

Now, I haven't used Poetry a lot, but... your Makefile is all about using Poetry. Which is kind of fun since it alone is supposed to provide most of that infra stuff. Also, allegedly in a cross-platform way, but you kind of made it very Linux-exclusive by including some Bash, and, well, Make itself being a foreigner in the MS land.

Another funny part that I can see there is how the Python (not very smart) community was fighting against removing integrations from setuptools to kill stuff like "setup.py test" or "setup.py install", but Poetry does exactly that. I mean, it's not on you, and it's not bad. I actually believe that that's the better way to do it. But you'll find a lot of Python (not very smart) apologists foaming at the mouth when telling you how you are supposed to use different utilities on your project and how pyproject.toml is supposed to be the connecting link between all of them. Which, it turns out, you also have.

And, hey-ho! you are also using tox, while running tests from Poetry.

Another funny part is that you specify all your dependencies while including the patch in the version. You are definitely not the only one, and it's a common thing in the (not very smart) Python community, but it still cracks me up every time I see this. Like, the whole point of semantic versioning was that you are supposed to depend on the version of the public API. So, in principle, any package needs to depend only on the major version. Minor version if you've done something stupid (like depending on features that haven't been officially released). And patch -- well, that should never happen. The cherry on top is the caret (^) in your dependency specifications. That's another thing that should've never happened. The >= was supposed to work like that (but that's not on you, that's the not very smart community behind SemVer's fault).

Now, just to taunt you a little bit more: you do realize that the most sizable chunk of Anaconda Python users are in the research community, which is... sort of your target audience? Don't you think it's ironic that you don't make conda packages for your project? (And wait until you discover that virtually none of the infra code you wrote for your project will work well in that environment).

I'm not trying to disparage you. I see a lot of projects aimed at the science research world. Not being a scientist myself, ironically, I get credit in scientific papers for working on projects' infra :) I see a lot of struggle from the scientific community to keep up with the programming world, and I also believe I see some terrible choices (eg. Python) that have now put this whole group of people in front of a very difficult choice: either continue with Python because there's some knowledge garnered, even though it's not enough by any measure (some still want to clutch the pearls...), while a smaller part of this community wants to cut the rope before it's too late.

From what I see, in the research world, dealing with Python infra is an unmitigated disaster with lots of negative consequences. But the way forward is not very clear.


I think you have some interesting feedback and points, but I somehow think they get lost in some of the prose that arguably is against HN guidelines, notably:

>Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

>When disagreeing, please reply to the argument instead of calling names.

>Please don't fulminate. Please don't sneer, including at the rest of the community.


Where's "disagreeing"? I'm not disagreeing with the authors of dlt on anything...


We took at least one immediate practical good piece of advice out of this which is that we should release a conda package and make sure that dlt works in it.


I wouldn't make it a high priority. If there's one thing I know about conda users it's that "no conda package available" has never stopped them. In fact they prefer to pip install inside their conda environment, and the only conda packages they use are the ones that touch Nvidia drivers (e.g. pytorch).


> "no conda package available" has never stopped them.

Yes and no. They won't stop because they want to get things done, and the things usually don't involve honing the infrastructure. But installing packages with pip usually breaks the conda installation, not just a particular virtual environment. (Usually pip nukes the setuptools that come with conda, and then once you want to install / upgrade anything in the base environment, you discover that it's toast because conda itself depends on setuptools, but it's now broken and cannot be reinstalled.)

So, in practice, if you give up and use pip to install stuff, it means that for the next project you will be reinstalling conda (and you will probably lose all your previous virtual environments). Kinda sucks.



