This is pretty exciting. DuckDB is already proving to be a powerful tool in the industry.
Previously there was a strong trend of using simple S3-backed blob storage with Parquet and Athena for querying data lakes. It felt like things have gotten pretty complicated, but as integrations improve and Apache Iceberg gains maturity, I'm seeing a shift toward greater flexibility with less SaaS/tool sprawl in data lakes.
Yes - agree! I actually wrote a blog about this just two days ago:
May be of interest to people who:
- Want to know what DuckDB is and why it's interesting
- What's good about it
- Why, for orgs without huge data, we'll hopefully see a lot more of 's3 + duckdb' rather than more complex architectures and services, and hopefully (IMHO) a lot less Spark! (rough sketch of the idea below)
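For context, the 's3 + duckdb' idea is basically just this - a minimal sketch, where the bucket, path, region and column names are all made up:

```python
# Sketch of "s3 + duckdb": query Parquet in S3 straight from a laptop,
# a CI job, or a small container. Bucket, path and region are placeholders.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("SET s3_region = 'eu-west-1';")
# Credentials go via SET s3_access_key_id / s3_secret_access_key
# (or a DuckDB S3 secret); omitted here.

daily = con.sql("""
    SELECT order_date, count(*) AS orders, sum(amount) AS revenue
    FROM read_parquet('s3://my-data-lake/orders/*.parquet')
    GROUP BY order_date
    ORDER BY order_date
""").df()
print(daily.head())
```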
I think most people in data science or data engineering should at least try it to get a sense of what it can do
Really, for me the most important thing is that it makes designing and testing complex ETL so much easier, because you're not constantly running queries against Athena/Spark to check they work - you can do it all locally, in CI, with proper tests, etc.
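To make that concrete, here's roughly the kind of test I mean - the transform SQL and tables are made up, but the whole thing runs in-memory under pytest with nothing to provision:

```python
# Hypothetical ETL transform tested locally with DuckDB (runs in pytest/CI).
import duckdb

TRANSFORM_SQL = """
    SELECT customer_id, sum(amount) AS total_spend
    FROM orders
    WHERE status = 'completed'
    GROUP BY customer_id
"""

def test_transform_ignores_cancelled_orders():
    con = duckdb.connect()  # in-memory database, no cluster needed
    con.sql("""
        CREATE TABLE orders AS
        SELECT * FROM (VALUES
            (1, 'completed', 10.0),
            (1, 'cancelled', 99.0),
            (2, 'completed',  5.0)
        ) AS t(customer_id, status, amount)
    """)
    result = sorted(con.sql(TRANSFORM_SQL).fetchall())
    assert result == [(1, 10.0), (2, 5.0)]
```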
I have the same thoughts. However, my impression is also that most orgs would choose e.g. Databricks or something for the permission handling, web UI, etc. - so what is the equivalent «full rig» with DuckDB and S3/blob storage?
Yeah I think that's fair, especially from the 'end consumer of the data' point of view, and doing things like row-level permissions.
For the ETL side, where whole-table access is often good enough, I find Spark in particular very cumbersome - there's more that can go wrong vs. DuckDB, and it's harder to troubleshoot.
from the blog: "This is a very interesting new development, making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data."
I don't think we'll ever see this, honestly.
Excellent podcast episode with Joe Reis - I've also never understood this whole idea of "just use Spark" or "you gotta get on Redshift".
Apache Iceberg builds an additional layer on top of Parquet files that lets you do ACID transactions, rollbacks, and schema evolution.
A Parquet file is a static file that holds all of the data for a table. You can't insert, update, delete, etc. - it just is what it is. That works OK for small tables, but it becomes unwieldy when you need to do whole-table replacements every time your data changes.
Apache Iceberg fixes this problem by adding a metadata layer on top of smaller Parquet files (at a 300,000 ft overview).
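To make the contrast concrete, here's roughly what an "update" to a bare Parquet file looks like with DuckDB (file and column names are made up) - a full read-and-rewrite, which is the step Iceberg's metadata layer turns into a transactional commit over many smaller files:

```python
# Illustrative only: "updating" a plain Parquet file means rewriting all of it.
import duckdb

con = duckdb.connect()
con.sql("""
    COPY (
        SELECT * REPLACE (upper(country) AS country)  -- the "update"
        FROM read_parquet('events.parquet')
    ) TO 'events_new.parquet' (FORMAT PARQUET)
""")
# ...then swap events_new.parquet over events.parquet yourself and hope nobody
# was reading it mid-swap. Iceberg's metadata layer is what gives you an atomic,
# versioned commit over many smaller Parquet files instead.
```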
I know you’re not OP, and while this explanation is good, it doesn’t make sense to frame all this as a “problem” for Parquet. It’s just a file format; it isn’t intended to have this sort of scope.
The problem is that "Parquet is beautiful" gets extended all the time to pointless things - Parquet doesn't support appending updates, so let's merge thousands of files together to simulate a real table - totally good and fine.
Well… when Parquet came out, it was the first necessary evolutionary step to solve the lack-of-metadata problem in CSV extracts.
So it's CSV++ so to speak, or CSV + metadata + compact data storage in a single file - but not a database table gone astray to wander the world on its own as a file.
It's basically a workaround for DuckDB's lack of native support. I am very happy with the PyIceberg library as a user; it was very easy to use, and the native Arrow support is a glimpse into the future. Arrow as an interchange format is quite amazing. Just open up the Iceberg table and append Arrow dataframes to it!
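Roughly what that looks like - catalog name, table identifier and schema here are placeholders, and the catalog config comes from the usual PyIceberg config file:

```python
# Sketch of the PyIceberg + Arrow append workflow; names are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # e.g. a REST or Glue catalog from config
table = catalog.load_table("analytics.events")  # existing Iceberg table

batch = pa.table({
    "event_id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["click", "view", "click"]),
})

table.append(batch)  # writes new Parquet data files and commits an Iceberg snapshot
```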
Arrow is quite spectacular, and it's cool to see the industry moving to standardize on it as a dataframe format. For example, the ClickHouse Python client also supports Arrow-based insertion:
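Sketch with clickhouse-connect - host, table and schema are placeholders, and the column names/types need to match the target table:

```python
# Arrow-based insertion via the clickhouse-connect client; names are placeholders.
import pyarrow as pa
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

events = pa.table({
    "event_id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["click", "view", "click"]),
})

client.insert_arrow("events", events)  # inserts the Arrow table into ClickHouse
```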
This is a great example of how simplicity often wins in practice. Too many systems overcomplicate storage and retrieval, assuming every use case needs full indexing or ultra-low latency. In reality, for many workloads, treating S3 like a raw table and letting the engine handle the heavy lifting makes a lot of sense. Curious to see how it performs under high concurrency—any benchmarks on that yet?
Haven't tried it. S3 Tables sounds like a great idea. However, I am wary. For it to be useful, a suite of AWS services probably needs to integrate with it. These services are all managed by different teams that don't always work well together out of the box and often compete with redundant products. For example, configuring SageMaker Studio to use an EMR cluster for Spark was a multi-day hassle with a lot of custom (insecure?) configuration. How is this different from other existing table offerings? AWS is a mess.
However, my issue is the need to introduce one more tool. I feel that without a single tool to read and write to Iceberg, I would not want to introduce it to our team.
Spark is cool and all, but it requires quite a bit of effort to work properly. And Spark seems to be the only thing right now that can read and write Iceberg natively with a SQL-like interface.
Check out Daft (www.getdaft.io) - we've been working really hard on our Iceberg support. Supports full reads/writes (including partitioned writes) and our SQL support is also coming along quite well!
Also no cluster, no JVM. Just `pip install daft` and go. Runs locally (as fast as DuckDB for a lot of workloads; faster if you're accessing data in S3) and also runs distributed if you have a Ray cluster you can point it at.
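A rough sketch of the Iceberg path, assuming you already have a PyIceberg catalog configured (catalog, table and column names are placeholders):

```python
# Reading an Iceberg table with Daft via a PyIceberg catalog; names are placeholders.
import daft
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
events = catalog.load_table("analytics.events")

df = daft.read_iceberg(events)              # lazy scan of the Iceberg table
df = df.where(daft.col("name") == "click")  # filters can be pushed down to the scan
print(df.count_rows())
```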
S3 Tables is designed for storing and optimizing tabular data in S3 using Apache Iceberg, offering features like automatic optimization and fast query performance. SimpleDB is a NoSQL database service focused on providing simple indexing and querying capabilities without requiring a schema.