Here's an excerpt that may save some folks a click or three…
> "While storing analytical and ML tables together in the data lakehouse is beneficial from a management and integration perspective, it also imposes some unique challenges. For example, it is increasingly common for ML tables to outgrow analytical tables by up to an order of magnitude. ML tables are also typically much wider, and tend to have tens of thousands of features usually stored as large maps.
> "As we executed on our codec convergence strategy for ORC, it gradually exposed significant weaknesses in the ORC format itself, especially for ML use cases. The most pressing issue with the DWRF format was metadata overhead; our ML use cases needed a very large number of features (typically stored as giant maps), and the DWRF map format, albeit optimized, had too much metadata overhead. Apart from this, DWRF had several other limitations related to encodings and stripe structure, which were very difficult to fix in a backward-compatible way. Therefore, we decided to build a new columnar file format that addresses the needs of the next generation data stack; specifically, one that is targeted from the onset towards ML use cases, but without sacrificing any of the
analytical needs.
> "The result was a new format we call Alpha. Alpha has several notable characteristics that make it particularly suitable for mixed Analytical nd ML training use cases. It has a custom serialization format for metadata that is significantly faster to decode, especially for very wide tables and deep maps, in addition to more modern compression algorithms. It also provides a richer set of encodings and an adaptive encoding algorithm that can smartly pick the best encoding based on historical data patterns, through an encoding history loopback database. Alpha requires fewer streams per column for many common data types, making read coalescing much easier and saving I/Os, especially for HDDs. Alpha was written in modern C++ from scratch in a way that allows it to be extended easily in the future.
> "Alpha is being deployed in production today for several important ML training applications and showing 2-3x better performance than ORC on decoding, with comparable encoding performance and file size."
So Nimble/Alpha (which are both seriously terrible names, btw) is basically Parquet++, is that right?
> Apache Parquet is a column-oriented data storage format. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
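For anyone who hasn't touched it, the basic round trip through pyarrow looks like this (a minimal sketch; file names are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar storage: each column is encoded and compressed independently,
# so readers can prune down to just the columns they need.
table = pa.table({"user_id": [1, 2, 3], "score": [0.1, 0.9, 0.5]})
pq.write_table(table, "scores.parquet", compression="zstd")

# Column pruning on read: only the "score" column is decoded.
print(pq.read_table("scores.parquet", columns=["score"]))
```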
Parquet + Arrow hopefully seem to be emerging as standards. I would much rather see those standards improved than new formats emerge. Even within those existing formats, there is enough variation that some platforms only support a subset of the functionality. On top of that, the performance and size of the libraries are poor.
Meta's own ORC is quite popular, too, in addition to Parquet, Arrow, Iceberg, Delta, Velox, Lance, or Avro. So I assume the new one will find its way into lakehouses/data warehouses as well. Because we need a bigger mess and more bloat.
I wonder how much overlap, if any, there is here, and whether it was intentional or accidentally similar. Ah, "return efficient Velox vectors" is on the list, but there still seems likely to be some overlap in encoding strategies etc.
The four main points seem to be: a) encoding metadata is part of the stream rather than fixed metadata, b) nulls are just another encoding, c) there is no stripe footer; only stream locations are in the footer, d) FlatBuffers! Shout out to FlatBuffers, wasn't expecting to see them making a comeback!
I do wish there were a lot more diagrams/slides. There are four bullet points, and Yoav Helfman talks to them, but there isn't much that shows what he's talking about.
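To make points (a) and (c) above concrete, here's a toy sketch of the general idea. This is my own illustration, not Nimble's actual layout: each stream carries its own small encoding header, and the footer holds nothing but stream locations.

```python
import struct
from dataclasses import dataclass

# Toy illustration only -- NOT Nimble's real layout. Each stream starts with
# its own encoding header, so encoding metadata travels with the data (a);
# the footer stores only stream locations (c).

@dataclass
class StreamLocation:
    offset: int
    length: int

def write_toy_file(path, streams):
    """streams: list of (encoding_id, payload_bytes) tuples."""
    locations = []
    with open(path, "wb") as f:
        for encoding_id, payload in streams:
            offset = f.tell()
            f.write(struct.pack("<H", encoding_id))   # per-stream encoding header
            f.write(payload)
            locations.append(StreamLocation(offset, f.tell() - offset))
        footer_start = f.tell()
        for loc in locations:                          # footer = locations only
            f.write(struct.pack("<QQ", loc.offset, loc.length))
        f.write(struct.pack("<IQ", len(locations), footer_start))  # fixed trailer

def read_stream(path, index):
    """Seek straight to one stream via the footer, without touching the rest."""
    with open(path, "rb") as f:
        f.seek(-12, 2)                                 # trailer: count + footer offset
        count, footer_start = struct.unpack("<IQ", f.read(12))
        f.seek(footer_start + 16 * index)              # each footer entry is 16 bytes
        offset, length = struct.unpack("<QQ", f.read(16))
        f.seek(offset)
        (encoding_id,) = struct.unpack("<H", f.read(2))
        return encoding_id, f.read(length - 2)
```

The point of the layout is that a reader can fetch one column stream with two seeks (trailer, then footer entry) and learn how it is encoded from the stream itself, rather than from a big central metadata blob.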
> Modern columnar data format for ML. Convert from Parquet in 2-lines of code for 100x faster random access, a vector index, data versioning, and more.
> Compatible with pandas, DuckDB, Polars, and pyarrow
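For reference, the conversion the Lance README advertises looks roughly like this (a sketch assuming the `pylance` package; paths are placeholders):

```python
import pyarrow.dataset as ds
import lance  # pip install pylance

# The advertised "2-line" conversion: point Lance at an existing Parquet dataset.
parquet = ds.dataset("data/my_table.parquet", format="parquet")
lance.write_dataset(parquet, "data/my_table.lance")

# Random access is the selling point: `take` fetches arbitrary rows by index
# instead of scanning whole row groups.
dataset = lance.dataset("data/my_table.lance")
rows = dataset.take([1, 42, 10_000])   # returns a pyarrow.Table
print(rows.to_pandas())                # Arrow interop covers pandas/Polars/DuckDB
```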
Can the optimizations in lance be ported to other formats without significant redesign?
The StringView/ListView and REE optimizations are already part of Arrow, and should be usable with Datafusion (Datafusion just uses Arrow datatypes everywhere).
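For the curious, both are already exposed through pyarrow. A minimal sketch, assuming a reasonably recent pyarrow (roughly 16+ for the string_view cast):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Run-end encoding: repeated values are stored once, alongside the index at
# which each run ends, instead of once per row.
arr = pa.array(["a"] * 1_000 + ["b"] * 1_000)
ree = pc.run_end_encode(arr)
print(ree.type)          # run_end_encoded<run_ends: int32, values: string>

# StringView: strings are addressed through fixed-size views (length, prefix,
# buffer offset), so short strings are inlined and take/slice operations
# avoid copying the character data.
sv = arr.cast(pa.string_view())
print(sv.type)           # string_view
```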
What do you mean by Arrow moving towards Velox? Arrow is a standard for in-memory columnar data. At most, Arrow might adopt innovations made in Velox into its spec, which Datafusion would then adopt?
I meant the Arrow ecosystem. Datafusion is the query processing project therein at present, so I was curious to know what the future is for that. As you said, it's possible Datafusion will adopt some stuff from Velox.
I was really hoping to see Cap'n Proto used for the format, since it has fast access without decoding and reasonable backwards compatibility with old files. Anyone know why FlatBuffers were used?
If I put an n-dimensional array in, do I get the same array out, or do I have to serialize/deserialize it myself, with some custom schema/code containing hacks like stuffing data (coordinates) into the column name?
By the time data has been preprocessed for ML, it is numerically encoded as floats, so .npy/npz is a good fit and `np.memmap` is an incredible way to seek into ndim data.
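A minimal sketch of that pattern (file names and shapes are made up):

```python
import numpy as np

# Write the preprocessed float features once...
features = np.random.rand(1_000_000, 256).astype(np.float32)
np.save("features.npy", features)

# ...then memory-map the file: np.load with mmap_mode="r" returns an np.memmap
# over the .npy payload, so pulling a random batch of rows only faults in the
# pages that actually hold those rows.
mm = np.load("features.npy", mmap_mode="r")
batch_idx = np.random.randint(0, mm.shape[0], size=4096)
batch = np.asarray(mm[batch_idx])   # fancy indexing copies just these rows into RAM
```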