Hi HN,
We're Anna, Adrian, Marcin and Matt, developers of dlt. dlt is an open source library that automatically creates datasets out of messy, unstructured data sources. You can use it to move data from just about anywhere into most well-known SQL and vector stores, data lakes, storage buckets, or local engines like DuckDB. It automates many cumbersome data engineering tasks and can be handled by anyone who knows Python.
Here’s our Github: https://github.com/dlt-hub/dlt
Here’s our Colab demo: https://colab.research.google.com/drive/1DhaKW0tiSTHDCVmPjM-...
— — —
In the past we wrote hundreds of Python scripts to fit messy data sources into something you can work with in Python - a database, a Pandas DataFrame or just a Python list. We were solving the same problems and making the same mistakes again and again.
This is why we built dlt, an easy-to-use Python library that automates most data engineering tasks. It hides the complexities of data loading and automatically generates structured, clean datasets for immediate querying and sharing.
— — —
At its core, dlt removes the need to create dataset schemas, react to changing data, generate append or merge statements, or move data in a transactional and idempotent manner. Those tasks are automated and can be declared right in your Python code, just by decorating functions.
Add the @dlt.resource decorator, give it a few hints, and any function that yields data becomes a simple pipeline that creates and updates datasets.
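To give a feel for it, here is a minimal sketch (made-up records, loading into a local DuckDB file) of a decorated generator plus a pipeline run:

    import dlt

    @dlt.resource(table_name="players", write_disposition="append")
    def players():
        # any generator or iterable of dicts works; nested fields and lists
        # are unpacked into child tables automatically
        yield {"name": "magnus", "rating": 2850, "titles": ["GM"]}
        yield {"name": "hikaru", "rating": 2780, "titles": ["GM"]}

    pipeline = dlt.pipeline(
        pipeline_name="chess_demo",
        destination="duckdb",      # a local DuckDB file, no backend required
        dataset_name="chess_data",
    )
    load_info = pipeline.run(players())
    print(load_info)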
dlt gets the details out of your way:
1. You do not need to worry about the structure of a database or parquet files
dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put some data contracts and Pydantic models on top to keep your data clean.
2. You do not need to write any INSERT/UPDATE or data copy statements
dlt will push the data to DuckDB, Weaviate, storage buckets and many popular SQL stores. It will align data types, file formats, and identifier names automatically.
3. You do not need to worry about adding new data or merging changes.
dlt lets you declare how data should be loaded and how to load it incrementally, and it keeps the loading state together with the data so the two are always in sync (see the sketch after this list).
4. You keep the way you develop and test your code.
Iterate and test quickly on your laptop or in a dev container. Run locally on DuckDB and just swap the destination name to go to the cloud - your code, schema and data stay the same.
5. You can work with data on your laptop.
Combine dlt with other tools and libraries to process data locally. DuckDB, Pandas, Arrow tables and Rust-based loading libraries like ConnectorX work nicely with dlt and process data blazingly fast compared to the cloud.
6. You do not need to worry whether your pipeline will work when you deploy it.
dlt is a minimalistic Python library, requires no backend and works wherever Python works. You can fine-tune it to run in constrained environments like AWS Lambda, or run it with Airflow, GitHub Actions or Dagster.
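To make the incremental loading and destination swap points concrete, here is a rough sketch of a merge-style resource. The API endpoint is made up; dlt stores the last seen updated_at value in the pipeline state, and the "duckdb" destination can be swapped for a cloud warehouse without touching the rest of the code:

    import dlt
    import requests

    @dlt.resource(primary_key="id", write_disposition="merge")
    def issues(
        updated_at=dlt.sources.incremental(
            "updated_at", initial_value="2023-01-01T00:00:00Z"
        )
    ):
        # made-up endpoint; dlt persists updated_at.last_value in the pipeline
        # state, so each run only fetches rows changed since the previous run
        response = requests.get(
            "https://api.example.com/issues",
            params={"since": updated_at.last_value},
        )
        response.raise_for_status()
        yield response.json()

    pipeline = dlt.pipeline(
        pipeline_name="issues_pipeline",
        destination="duckdb",   # swap for e.g. "bigquery" to go to the cloud
        dataset_name="issues",
    )
    pipeline.run(issues())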
dlt has an Apache 2.0 license. We plan to make money by offering organizations a paid control plane, where dlt users can track and set policies on what every pipeline does, manage schemas and contracts across the organization, create data catalogues, and share them with team members and customers.
— — —
However, the following seems like kind of an anti-feature, which I at least would want the option to disable:
> You do not need to worry about the structure of a database or parquet files
> dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put some data contracts and Pydantic models on top to keep your data clean.
This is the opposite of what I want in 99% of projects. Most of the time, there is some kind of well-defined schema, even if it changes a little bit over time. If that schema is going to be depended upon by something like a data warehouse ELT pipeline, I want precise control over it. I do not want to hand that off to an opaque library.
Moreover, the work of actually writing out the schema is maybe 1% of the overall effort in consuming a new data source, and it usually turns out to be a constructive, useful exercise in pinning down assumptions, finding gaps in understanding, etc. So I see little benefit in hiding it.
A schema essentially forms a business-critical contract between two major sections of the overall data pipeline, and that is absolutely not something I want to be changing dynamically without my explicit understanding and consent.
This reminds me of the temptation I have seen in some developers (several of them ostensibly "senior") to use MongoDB for a straightforward CRUD-like application. The argument that it's schema-less is, to me, a striking anti-feature, something I explicitly do not want!
The only time I really want this is in the rare and atypical case where I truly have no schema at all, or the schema is changing erratically and frequently in ways that I cannot reasonably anticipate and/or cannot dedicate developer resources to accommodating. That's a niche case that most people flatly do not have. Of course it's nice when a tool supports the niche use case that is very hard to deal with by conventional means (see also: OpenRefine), but it should absolutely not be the default and our tools should not encourage us to lie to ourselves that it's something we want or need.
If you just want to reduce manual grunt work, consider something like generating a schema from an OpenAPI specification or JSON Schema.
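To make that concrete, this is roughly the kind of explicit, hand-written contract I mean - plain Pydantic (v2 syntax), made-up field names, with unknown fields rejected instead of silently becoming new columns:

    from datetime import datetime
    from pydantic import BaseModel, ConfigDict

    class Order(BaseModel):
        # reviewed, version-controlled contract: unexpected fields raise
        # instead of silently turning into new columns downstream
        model_config = ConfigDict(extra="forbid")

        id: int
        customer_id: int
        total_cents: int
        created_at: datetime

    record = {
        "id": 1,
        "customer_id": 42,
        "total_cents": 1999,
        "created_at": "2023-06-01T12:00:00Z",
    }
    order = Order.model_validate(record)  # raises ValidationError on schema drift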