Apache Iceberg V3 Spec new features for more efficient and flexible data lakes

amluto · 2025-08-11T19:25:08 1754940308

> ALTER TABLE events ADD COLUMN version INT DEFAULT 1;

I’ve always disliked this approach. It conflates two things: the value to put in preexisting rows and the default going forward. I often want to add a column, backfill it, and not have a default.

Fortunately, the Iceberg spec at least got this right under the hood. There’s “initial-default”, which is the value implicitly inserted in rows that predate the addition of the column, and there’s “write-default”, which is the default for new rows.
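To make the distinction concrete, here is a minimal sketch in Python of how the two defaults behave. The names `Column`, `read_row`, and `write_row` are hypothetical and only for illustration; this is not the API of any Iceberg library, and it ignores details such as required-field handling.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence


@dataclass(frozen=True)
class Column:
    # Hypothetical schema field, modeled loosely on the spec's two properties.
    name: str
    initial_default: Optional[Any] = None  # read for rows that predate the column
    write_default: Optional[Any] = None    # used when a writer omits the column


def read_row(stored: Dict[str, Any], columns: Sequence[Column]) -> Dict[str, Any]:
    """Project a stored row onto the current schema.

    A column missing from the stored row means the row was written before
    the column was added, so the initial default is filled in at read time.
    """
    return {
        c.name: stored[c.name] if c.name in stored else c.initial_default
        for c in columns
    }


def write_row(values: Dict[str, Any], columns: Sequence[Column]) -> Dict[str, Any]:
    """Materialize a new row, filling omitted columns with the write default."""
    return {
        c.name: values[c.name] if c.name in values else c.write_default
        for c in columns
    }


# "Add a column, backfill it, and not have a default":
# old rows read as version=1, new rows get no implicit value.
schema = [Column("id"), Column("version", initial_default=1, write_default=None)]

old_row = {"id": 1}                           # written before "version" existed
print(read_row(old_row, schema))              # old row sees the initial default
print(write_row({"id": 2}, schema))           # new row gets the write default (None)
```

The point of the sketch is that the two values are independent: an `ALTER TABLE ... DEFAULT 1` statement sets both at once, while the underlying metadata lets them differ.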

drivenextfunc · 2025-08-11T19:37:20 1754941040

Many companies seem to be using Apache Iceberg, but the ecosystem feels immature outside of Java. For instance, iceberg-rust doesn't even support HDFS. (Though admittedly, Iceberg's tendency to create many small files makes it a poor fit for HDFS anyway.)

hodgesrm · 2025-08-11T20:02:22 1754942542

Seems like this is going to be a permanent issue, no? Library level storage APIs are complex and often quite leaky. That's based on looking at the innards of MySQL and ClickHouse for a while.

It seems quite possible that there will be maybe three or four libraries that can write to Iceberg (Java, Python, Rust, maybe Golang), while the rest will at best offer read-only access. And those language choices will condition and be conditioned by the languages that developers use to write applications that manage Iceberg data.

ozgrakkurt · 2025-08-11T22:14:06 1754950446

This was the same with the Arrow/Parquet libraries as well. It takes a long time for all implementations to catch up.

hodgesrm · 2025-08-11T17:58:57 1754935137

This Google article was nice as a high-level overview of Iceberg V3. I wish that the V3 spec (and Iceberg specs in general) were more readable. For now the best approach seems to be to read the Javadoc for the Iceberg Java API. [0]

[0] https://javadoc.io/doc/org.apache.iceberg/iceberg-api/latest...

twoodfin · 2025-08-11T19:39:14 1754941154

The Iceberg spec is a model of clarity and simplicity compared to the (constantly in flux via Databricks commits…) Delta protocol spec:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md

eatonphil · 2025-08-11T21:35:43 1754948143

On the contrary, the Delta Lake paper is extremely easy to read, and its basics are easy to implement (I did). Iceberg has nothing so concise and clear.

twoodfin · 2025-08-11T22:38:50 1754951930

If I implement what’s described in the Delta Lake paper, will I be able to query and update arbitrary Delta Lake tables as populated by Databricks in 2025?

(Would be genuinely excited if the answer is yes.)

eatonphil · 2025-08-11T22:40:24 1754952024

Not sure (probably not). But it's definitely much easier to immediately understand IMO.

twoodfin · 2025-08-11T22:58:17 1754953097

OK, but at least from my perspective, the point of OTFs is to allow ongoing interoperability between query and update engines.

A “standard” getting semi-monthly updates via random Databricks-affiliated GitHub accounts doesn’t really fit that bill.

Look at something like this:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#wr...

Ouch.

sgarland · 2025-08-12T01:17:46 1754961466

I read this [0] (I also recommend reading part 1 for background) a few weeks ago, and found it quite interesting.

The entire concept of data lakes seems odd to me, as a DBRE. If you want performant OLAP, then get an OLAP DB. If you want temporality, have a created_at column and filter. If the problem is that you need to ingest petabytes of data, fix your source: your OLTP schema probably sucks and is causing massive storage amplification.

[0]: https://database-doctor.com/posts/iceberg-is-wrong-2.html

nojito · 2025-08-11T23:20:43 1754954443

It's odd that this is on the official blog when their implementation of Iceberg is still behind and doesn't have feature parity with the spec.

https://cloud.google.com/bigquery/docs/iceberg-tables#limita...

ahmetaltay · 2025-08-12T16:40:27 1755016827

(Disclaimer: I work on the BigQuery team at Google, but my opinions are my own.)

You're right — our current implementation in BigLake doesn't have full feature parity with the V3 spec yet. We're actively working on it.

The key context is that the V3 spec is brand new, having been finalized only about two months ago. The official Apache Iceberg release that incorporates all these V3 features isn't even out yet. So, you'll find that the entire ecosystem, including major vendors, is in a similar position of implementing the new spec.

The purpose of our blog post was to celebrate this huge milestone for the open-source community and to share a technical deep-dive on why these new capabilities are so important.

ahmetburhan · 2025-08-11T18:56:40 1754938600

Cool to see Iceberg getting these kinds of upgrades. Deletion vectors and default column values sound like real quality-of-life improvements, especially for big, messy datasets. Curious to hear if anyone’s tried V3 in production yet and what the performance looks like.

jamesblonde · 2025-08-11T21:07:59 1754946479

Is it out yet?

talatuyarer · 2025-08-11T17:45:48 1754934348

This new version has some great new features, including deletion vectors for more efficient transactions and default column values to make schema evolution a breeze. The full article has all the details.

jamesblonde · 2025-08-11T21:05:26 1754946326

When will open source v3 come out? It's supposed to be in Apache Iceberg 1.10, right?

talatuyarer · 2025-08-11T21:19:14 1754947154

Yes, 1.10 will be the first version to support the V3 spec. But not all features are implemented in engines such as Spark or Flink.

fabatka · 2025-08-11T21:42:46 1754948566

I thought 1.9.0 already had at least some of the V3 features, like the variant type and row lineage? https://iceberg.apache.org/releases/#190-release

Of course I haven't seen any implementations supporting these yet.

talatuyarer · 2025-08-11T22:29:30 1754951370

Yes, the specification will be finalized with version 1.10. Previous versions also include specification changes. Iceberg's implementation of V3 occurs in three stages: Specification Change, Core Implementation, and Spark/Flink Implementation.

So far only Variant is supported in Spark, and with 1.10 Spark will support nanosecond timestamps and the unknown type, I believe.

jamesblonde · 2025-08-12T05:18:08 1754975888

Any idea when 1.10 will be released?

talatuyarer · 2025-08-12T16:36:38 1755016598

I believe we are very close to a release candidate. We are waiting on unknown type support for Apache Spark, per the latest email:

https://lists.apache.org/thread/gd5smyln3v6k4b790t5d1vy4483m...

robertlagrant · 2025-08-11T21:31:03 1754947863

> default column values

The way they implemented this seems really useful for any database.