Parquet is the most popular columnar format, owing to support in Spark and various other big data tools, as well as local tools like pandas, polars, and duckdb.
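As a minimal sketch of that cross-tool support, the same file can be read by all three local tools mentioned; the file name "events.parquet" here is just a hypothetical example:

```python
import pandas as pd
import polars as pl
import duckdb

# The same Parquet file is readable by each library's native reader.
df_pd = pd.read_parquet("events.parquet")                   # pandas
df_pl = pl.read_parquet("events.parquet")                   # polars
df_db = duckdb.sql("SELECT * FROM 'events.parquet'").df()   # duckdb queries Parquet directly

print(len(df_pd), len(df_pl), len(df_db))  # same row count from all three readers
```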
It's technically not the very best format (ORC has some advantages), but it's so ubiquitous and good enough -- still far better than CSV or the next best competing format. I have not heard of Carbon -- it sounds like an interesting niche format; hopefully it's gaining ground.
"Good enough" makes it sound like barely a step up from a CSV file. I'd say its support for various encodings [1] including a great default (dictionary + run length encoding on the indices) and compression algorithms that can be set for each individual column, columnar access, partitioning, a parallelized reader out of the box, in-memory filtering and other ops concurrently with loading in the data (thanks to Arrow) etc. etc. are all really wonderful when working with medium-sized data.
Agreed. On a scale of 10 in terms of current technology, CSV is a 1 while Parquet is a 7. ORC is maybe a 7.2. But Parquet is far more ubiquitous than ORC (I've never seen ORC in prod, though I also have a limited sample size).
Parquet is kind of winning the OSS columnar format race right now.