Nimble: A new columnar file format by Meta [video] (youtube.com)
121 points by aduffy 9 months ago | 30 comments



I learned that "Nimble" is the new name for "Alpha", discussed in this 2023 report: https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf

Here's an excerpt that may save some folks a click or three…

> "While storing analytical and ML tables together in the data lakehouse is beneficial from a management and integration perspective, it also imposes some unique challenges. For example, it is increasingly common for ML tables to outgrow analytical tables by up to an order of magnitude. ML tables are also typically much wider, and tend to have tens of thousands of features usually stored as large maps.

> "As we executed on our codec convergence strategy for ORC, it gradually exposed significant weaknesses in the ORC format itself, especially for ML use cases. The most pressing issue with the DWRF format was metadata overhead; our ML use cases needed a very large number of features (typically stored as giant maps), and the DWRF map format, albeit optimized, had too much metadata overhead. Apart from this, DWRF had several other limitations related to encodings and stripe structure, which were very difficult to fix in a backward-compatible way. Therefore, we decided to build a new columnar file format that addresses the needs of the next generation data stack; specifically, one that is targeted from the onset towards ML use cases, but without sacrificing any of the analytical needs.

> "The result was a new format we call Alpha. Alpha has several notable characteristics that make it particularly suitable for mixed Analytical nd ML training use cases. It has a custom serialization format for metadata that is significantly faster to decode, especially for very wide tables and deep maps, in addition to more modern compression algorithms. It also provides a richer set of encodings and an adaptive encoding algorithm that can smartly pick the best encoding based on historical data patterns, through an encoding history loopback database. Alpha requires fewer streams per column for many common data types, making read coalescing much easier and saving I/Os, especially for HDDs. Alpha was written in modern C++ from scratch in a way that allows it to be extended easily in the future.

> "Alpha is being deployed in production today for several important ML training applications and showing 2-3x better performance than ORC on decoding, with comparable encoding performance and file size."


Alpha has got to be one of the worst names I have ever heard for a new product. Did they want to make it impossible to find?


How could a company called Meta be so shortsighted?


Well played.


You Bet!


Alpha was also the name of the virtual assistant owned by the bad guy in Extrapolations.

https://www.imdb.com/title/tt13821126/


Yes, but before that he was helping Zordon and the Power Rangers.

https://www.imdb.com/title/tt0106064/


Haha, good times!


So Nimble/Alpha (which are both seriously terrible names, btw) is basically Parquet++, is that right?

> Apache Parquet is a column-oriented data storage format. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

https://parquet.apache.org/

https://en.wikipedia.org/wiki/Apache_Parquet
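
For reference, the sort of thing that blurb is describing, in pyarrow terms (a minimal sketch; the file name, column values, and codec choice are mine, not from the talk):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # toy table; Parquet lays out and compresses each column independently
    table = pa.table({
        "user_id": [1, 2, 3, 4],
        "country": ["US", "US", "DE", "US"],  # low cardinality -> dictionary encoding
    })

    # per-column compression and dictionary encoding are the "efficient
    # compression and encoding schemes" the Parquet docs mention
    pq.write_table(table, "toy.parquet", compression="zstd", use_dictionary=True)
    print(pq.read_table("toy.parquet", columns=["country"]))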


It’s really ORC++ as they mention in the video.

I have a thread with Andy Pavlo on the salient bits, but basically it grew out of ORC modifications:

https://x.com/andreweduffy/status/1778054712517857458?s=46


Parquet and Arrow hopefully seem to be emerging as standards. I would much rather see those standards improved than new formats emerge. Even within those existing formats there is now enough variation that some platforms only support a subset of functionality. That, and the performance and size of the libraries are poor.

E.g. DuckDB / ClickHouse Parquet nanosecond-timestamp compatibility: https://github.com/duckdb/duckdb/issues/9852. Or the Arrow SQL driver being 70+ MB in Java.
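
The nanosecond issue is easy to hit from pyarrow, for what it's worth (a hedged sketch; the file names are made up): nanosecond timestamps are legal in the file, but not every reader round-trips them, so writers often down-cast at write time.

    import pyarrow as pa
    import pyarrow.parquet as pq

    t = pa.table({"ts": pa.array([1], type=pa.timestamp("ns"))})

    # written as-is: nanosecond precision, which some engines won't preserve
    pq.write_table(t, "ts_ns.parquet")

    # common workaround: coerce to microseconds, losing sub-microsecond precision
    pq.write_table(t, "ts_us.parquet",
                   coerce_timestamps="us",
                   allow_truncated_timestamps=True)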


Meta's own ORC is quite popular too, in addition to Parquet, Arrow, Iceberg, Delta, Velox, Lance, and Avro. So I assume the new one will find its way into lakehouses/data warehouses as well. Because we need a bigger mess and more bloat.


There's already been some interesting column format optimization work at Meta, as their Velox execution engine team worked with Apache Arrow to align their columnar formats. This talk is actually happening at VeloxCon, so there's got to be some awareness! https://engineering.fb.com/2024/02/20/developer-tools/velox-... https://news.ycombinator.com/item?id=39454763

I wonder how much overlap, if any, there is here, and whether it was intentional or accidental. Ah, "return efficient Velox vectors" is on the list, but there still seems likely to be some overlap in encoding strategies etc.

The four main points seem to be: a) encoding metadata is part of the stream rather than fixed metadata, b) nulls are just another encoding, c) there is no stripe footer; only stream locations are in the footer, d) FlatBuffers! Shout out to FlatBuffers, wasn't expecting to see them making a comeback!

I do wish there were a lot more diagrams/slides. There are four bullet points, and Yoav Helfman talks to them, but there's not a ton of showing what he's talking about.


How do Parquet, Lance, and Nimble compare?

lancedb/lance: https://github.com/lancedb/lance :

> Modern columnar data format for ML. Convert from Parquet in 2-lines of code for 100x faster random access, a vector index, data versioning, and more. Compatible with pandas, DuckDB, Polars, and pyarrow
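
The "2 lines" look roughly like this (a sketch against Lance's Python bindings; the paths and row indices are placeholders):

    import lance
    import pyarrow.parquet as pq

    # read an existing Parquet file and rewrite it as a Lance dataset
    lance.write_dataset(pq.read_table("features.parquet"), "features.lance")

    # the random-access pitch: fetch arbitrary rows by index without a full scan
    rows = lance.dataset("features.lance").take([0, 12_345, 987_654])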

Can the optimizations in lance be ported to other formats without significant redesign?


What's the future for datafusion (https://arrow.apache.org/datafusion/) if Arrow is moving towards Velox?


The StringView/ListView and REE optimizations are already part of Arrow, and should be usable with Datafusion (Datafusion just uses Arrow datatypes everywhere).

https://docs.rs/datafusion/latest/datafusion/common/arrow/ar...

https://docs.rs/datafusion/latest/datafusion/common/arrow/da...
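
In pyarrow terms (the Rust crates linked above expose the equivalents), run-end encoding looks like this; a minimal sketch, assuming a recent pyarrow with the run-end kernels:

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array(["a", "a", "a", "b", "b", "c", "c", "c", "c"])

    # run-end encoding stores run ends plus one value per run instead of 9 strings
    ree = pc.run_end_encode(arr)
    print(ree.type)                            # run_end_encoded<...>
    assert pc.run_end_decode(ree).equals(arr)  # lossless round trip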


What do you mean by "Arrow is moving towards Velox"? Arrow is a standard for in-memory columnar data. At most, Arrow might adopt innovations made in Velox into its spec, which Datafusion would then adopt?


I meant the Arrow ecosystem. Datafusion is currently the query-processing project therein, so I was curious what the future holds for it. As you said, it's possible Datafusion will adopt some things from Velox.


I was really hoping to see Cap'n Proto used for the format, since it offers fast access without decoding and reasonable backwards compatibility with old files. Anyone know why FlatBuffers were used?


I would love to see support in Apache Arrow to read this format. Parquet is already supported.


This makes me assume it also doesn't have proper support for multidimensional arrays.


Just curious, how would you decide if an “it” did have proper support for multidimensional arrays?


If I put an n-d array in, do I get the same n-d array out, or do I have to serialize/deserialize it myself, with some custom schema/code containing hacks like stuffing data (coordinates) into the column name?
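
The "hack" version of that round trip, for concreteness (a pyarrow/Parquet sketch; the column name, metadata key, and shapes are all mine): flatten each row, stash the inner shape yourself, reshape on the way out.

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    batch = np.random.rand(100, 16, 16).astype(np.float32)   # 100 rows of 16x16

    # flatten each row into a fixed-size list column and carry the inner shape
    # as schema metadata -- the kind of custom convention being complained about
    col = pa.FixedSizeListArray.from_arrays(pa.array(batch.reshape(-1)), 16 * 16)
    table = pa.table({"tensor": col}).replace_schema_metadata({"tensor_shape": "16,16"})
    pq.write_table(table, "tensors.parquet")

    # the reader has to know about that convention to get the ndarray back
    read = pq.read_table("tensors.parquet")
    h, w = map(int, read.schema.metadata[b"tensor_shape"].split(b","))
    flat = pa.concat_arrays(read["tensor"].chunks).flatten().to_numpy()
    assert np.array_equal(batch, flat.reshape(-1, h, w))

(pyarrow does ship a fixed_shape_tensor extension type these days that papers over some of this, but the storage underneath is still a flattened list.)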


Is there a quick description of the structure of it anywhere?


How does this compare with the Parquet format?


By the time data has been preprocessed for ML, it is numerically encoded as floats, so .npy/npz is a good fit and `np.memmap` is an incredible way to seek into ndim data.
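
For example (a minimal sketch; the file name and shapes are made up):

    import numpy as np

    # write a big (num_rows, feature_dim) float32 matrix once
    features = np.random.rand(1_000_000, 256).astype(np.float32)
    np.save("features.npy", features)

    # later: memory-map it and pull arbitrary rows without loading the whole file
    mm = np.load("features.npy", mmap_mode="r")
    batch = mm[[7, 123_456, 999_999]]   # only the touched pages get read
    print(batch.shape)                  # (3, 256)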


Curious about ClickHouse's approach to this compression structure.


Hmm, another contender in the open table format space. Nice.


Fwiw, the name clashes with Nim's package manager nimble: https://github.com/nim-lang/nimble



