
I often hear references to Apache Iceberg and Delta Lake as if they’re two peas in the Open Table Formats pod. Yet…

Here’s the Apache Iceberg table format specification:

https://iceberg.apache.org/spec/

As they like to say in patent law, anyone “skilled in the art” of database systems could use this to build and query Iceberg tables without too much difficulty.

This is nominally the Delta Lake equivalent:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md

I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.

Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).

My point is that I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!

Editing to append this GitHub history, which is unfortunately not reassuring:

https://github.com/delta-io/delta/commits/master/PROTOCOL.md

Random features and tweaks just popping up, PR’d by Databricks engineers and promptly approved by Databricks senior engineers…




I agree with all of this. Databricks are also holding back features from open-source Delta (like bloom filter indexes), which is their right. But then you can't claim it is a community-driven open format, unless it is an Animal Farm version of that, where one of the contributors is the pig (some animals are more equal than others).


Databricks has a lot of nice closed-source components, e.g., Unity Catalog, Delta Live Tables and Photon (a C++ rewrite of Spark's execution engine).

Delta itself seems fairly open-source: https://github.com/orgs/delta-io/projects/10/views/1 and hopefully someone will implement Liquid Clustering!


I've implemented Delta support from scratch for a component of Microsoft Fabric, and my feeling is that the "spec" is fairly inadequate without additional experimentation on the Spark implementation. It also requires you to be able to support Spark SQL expressions if you want to make use of features like computed columns and check constraints, and those are even more-poorly documented.
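To make the constraint problem concrete, here is a minimal sketch (the UUID, column name, and constraint name are invented for illustration): a check constraint travels inside the metaData action of a _delta_log commit as a raw Spark SQL expression string, so any independent writer has to parse and evaluate that string itself before committing rows.

```python
import json

# Hypothetical metaData action, roughly as it would appear in a
# _delta_log JSON commit. The check constraint is an opaque Spark SQL
# expression string stored under a "delta.constraints.*" config key.
action = {
    "metaData": {
        "id": "00000000-0000-0000-0000-000000000000",  # made-up table id
        "format": {"provider": "parquet", "options": {}},
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [{"name": "qty", "type": "integer",
                        "nullable": False, "metadata": {}}],
        }),
        "configuration": {"delta.constraints.positive_qty": "qty > 0"},
    }
}

# A non-Spark writer must evaluate the expression itself. This naive
# check handles only the trivial "<col> <op> <int>" shape; real Spark
# SQL expressions (functions, casts, nested columns) are far richer,
# which is the commenter's point about the documentation gap.
expr = action["metaData"]["configuration"]["delta.constraints.positive_qty"]
col, op, lit = expr.split()
row = {"qty": 5}
assert op == ">" and row[col] > int(lit)
```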


> I’ve yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!

Sentiment echoed.

I’m ultra-cautious about anything offered by Databricks in general. I think they’re only nominally open source, and shouldn’t be trusted.

I’ve also used Delta Lake before; there were some really frustrating shortcomings and a lot of “sharp edges” in its usage. We ended up dropping that project entirely, but we did investigate Iceberg at the time as well. Iceberg and Hudi had more coherently designed feature sets, but were less well supported. Really hoping this changes in future.


Thanks for this. I've been following this space for about a year or two and was wondering why Iceberg was more popular in open source.

Over the past six months I got the impression that Delta is pulling ahead in the race as Iceberg is struggling to provide tools for people not in the JVM ecosystem. Delta is a lot more accessible in that way.


DuckDB (lightweight, non-JVM, many language bindings) can query Iceberg tables now.

https://duckdb.org/docs/extensions/iceberg.html

You still need Spark to generate the Iceberg metadata though.


Snowflake is rolling out Iceberg support and not Delta support, I think that says a lot.


BigQuery too.


I guess you are referring to delta-rs (for Python in particular). An interesting detail here is that Databricks started delta-rs, but other companies are now driving it forward, not Databricks. I guess it is not in Databricks' interest to push the non-JVM ecosystem. PyIceberg is catching up; write support is almost there: https://github.com/apache/iceberg-python/pull/41


As I remember, delta-rs was started by Scribd, not by Databricks: https://youtu.be/2jgfpJD5D6U, https://youtu.be/scYz12UK-OY


I stand corrected, then.


The fact that they use JSON for delta changes is... just stupid. For contrast, SQL Server implements this much better: columnstore indexes (the in-engine equivalent of Parquet or ORC) are immutable, and deltas are stored in B-trees for compactness, ease of access and speed. At some point the columnstores get defragmented/merged/rebuilt, in part or in whole, and the B-tree is deleted and starts over as new changes accumulate. Doing it in JSON is, let me put it softly, a sign of bad times.
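For anyone who hasn't looked inside a Delta table: each commit is a newline-delimited JSON file of actions under _delta_log/, and readers reconstruct table state by replaying them. A stdlib-only sketch (the file names and contents here are invented for illustration):

```python
import json

# Hypothetical contents of one _delta_log commit file: one JSON
# "action" per line (real files carry commitInfo, add, remove,
# metaData and protocol actions).
commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"add": {"path": "part-00000.snappy.parquet",
                        "size": 1024, "dataChange": True}}),
    json.dumps({"remove": {"path": "part-00007.snappy.parquet",
                           "dataChange": True}}),
])

# Reconstructing table state means replaying every commit's
# add/remove actions to find the currently live data files.
live_files = set()
for line in commit.splitlines():
    action = json.loads(line)
    if "add" in action:
        live_files.add(action["add"]["path"])
    elif "remove" in action:
        live_files.discard(action["remove"]["path"])

print(sorted(live_files))  # prints ['part-00000.snappy.parquet']
```

The Parquet data files themselves stay immutable; it is only this log-replay layer (periodically compacted into Parquet checkpoint files) that lives in JSON, which is what the B-tree comparison above is reacting to.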

I suppose anything is better than Delta Lake. Especially Iceberg.


Microsoft is using Delta for their Fabric Lakehouse architecture, and it's also what OneLake is built around, so now you have another massive player choosing Delta.


That’s…not exactly a winning point for Delta lake IMO.

Massive corp, with their own opaque interests and endless bodies to throw at problems, has picked a favourite. That favourite being an “open” format controlled by another opaque enterprise company. I’d half expect M$ to just take it wholesale and start modifying it to suit their own ends, until eventually the “open source” component is some skin-deep façade that is completely and utterly dependent on M$ infra.


Yes, another massive player who has the resources and independent market pull to ride and steer a complex and ever-shifting “standard”.

Feels a bit like, “If Delta Lake did not exist, Microsoft would have to invent it.”


If you are a Spark shop then choosing Delta over Iceberg is a no-brainer. It's simpler and perfectly integrated. Not to mention that Spark's Delta connector can now generate Iceberg-compatible metadata too.

The choice between the two resembles the choice between Parquet and ORC circa 2016. Two formats of broadly the same power, initially biased by a particular query engine, eventually at feature parity and universally supported.

We have a decade of experience with OSS from Databricks, so doubting their "open ecosystem" status seems a little theoretical.


I feel like this is frankly uninformed. Many Iceberg shops rely heavily on Spark as a primary engine. And Databricks has a history of being a hostile OSS force, with the culture of the Spark project being toxic from the start and Delta’s questionable commitment to being a community project.



