This is all very interesting (and makes me want to check out Parquet), but it's painful to see how the text goes to great lengths to describe which common problems it avoids, yet doesn't spend a single word on how it actually solves them. What is the actual boolean type? Which encoding is it actually using? There's nothing concrete beyond an implication that it also compresses data.



You make a great point and I'm sorry about that. To answer some of your questions and try to add more detail:

Strings are UTF-8. Internally Parquet has the concept of "physical types" (i.e. how the bytes are arranged) and "logical types". 64-bit ints are just a physical type with no logical type. DateTimes are a 64-bit int physical type (millis since the unix epoch - aka "Javascript time") with a logical type of "datetime" so you know to read them that way.

Bool columns are bit-packed: one bit per value, laid out in a row.
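
Here's a minimal pyarrow sketch (pyarrow is just one implementation, and the file name is made up) that shows the physical/logical split and the bit-packed booleans in the resulting schema:

    import datetime
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "name": pa.array(["alice", "bob"], type=pa.string()),   # BYTE_ARRAY + String logical type
        "count": pa.array([1, 2], type=pa.int64()),              # plain INT64, no logical type
        "seen_at": pa.array([datetime.datetime(2024, 1, 1),
                             datetime.datetime(2024, 1, 2)],
                            type=pa.timestamp("ms")),            # INT64 + Timestamp(milliseconds)
        "active": pa.array([True, False], type=pa.bool_()),      # BOOLEAN, bit-packed
    })
    pq.write_table(table, "example.parquet")

    # Prints the Parquet schema: physical types plus logical annotations.
    print(pq.ParquetFile("example.parquet").schema)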

The "data bearing" parts of a file can be compressed and usually are. Most are using snappy, a fastish, moderate compression codec. I believe some are also using zstd. One of the common questions people ask is: "how does it compare against .csv.gz" and the answer is favourably. Gzipped csv is even slower to read and write and usually about 30% bigger on disk.


One important intuition about gzip vs snappy is that gzip isn’t generally parallelizable, while snappy is. If your Hadoop cluster is storing enormous files, this shows up very quickly in read performance.


That, however, only affects CSVs and other text files, where the compression is the envelope around the whole file. In Parquet it's the internal chunks that are compressed individually, so reads stay just as parallelizable whatever the codec.
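
A sketch of what that buys you in practice, using pyarrow (the file name is carried over from the made-up example above): each row group decompresses independently, so they can be read concurrently.

    import pyarrow.parquet as pq
    from concurrent.futures import ThreadPoolExecutor

    def read_group(i):
        # Open per worker; each row group is an independently compressed chunk.
        return pq.ParquetFile("example.parquet").read_row_group(i)

    n = pq.ParquetFile("example.parquet").num_row_groups
    with ThreadPoolExecutor() as pool:
        tables = list(pool.map(read_group, range(n)))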

Also, generic CSV is not parallelizable anyway, because it allows (quoted) newlines inside fields, and then there is really no way to split it on newlines: you never know whether you're at the beginning of a row or in the middle of one.

So parallel reading of CSVs is more of an optimization for specific CSVs that you know don't have newlines inside fields.
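
A tiny illustration of why naive splitting breaks (plain Python, made-up data):

    import csv
    import io

    raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

    print(raw.count("\n"))                     # 4 physical lines...
    print(list(csv.reader(io.StringIO(raw))))  # ...but only 3 records (header + 2 rows)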


Decompression of arbitrary gzip files can be parallelized with pragzip: https://github.com/mxmlnkn/pragzip


Thus the words “isn’t generally” as opposed to “cannot be”.


The actual representation doesn't much matter. When you read the data into something like a dataframe, the boolean values are translated into something appropriate for the host language. The actual encodings are documented, but generally don't matter for anybody interacting with data encoded as Parquet.

Regarding compression, yes, that is very much supported. And if you can sort your data to get lots of repetition or near-repetition in columns, it can work very well. For instance, I had a case with geo-hashes, latitudes, longitudes, times and values (which were commonly zero). When the data were sorted by geohash and then time, the ZStandard compression algorithm gave 300:1 compression relative to raw CSV due to the very high repetition (geohash, lat and long are constant for long stretches; time and value barely change).
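
The recipe is basically just "sort, then write with zstd". A sketch in pandas (the column and file names are made up to mirror the dataset described above; the 300:1 figure is specific to that data):

    import pandas as pd

    df = pd.read_csv("readings.csv")  # columns: geohash, lat, lon, time, value

    # Sorting groups repeated/near-repeated values together, which is what
    # lets the columnar zstd compression do its job.
    df.sort_values(["geohash", "time"]).to_parquet(
        "readings.parquet", compression="zstd", index=False
    )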

What the article didn't mention is that tools like DuckDB eat parquet and interface with packages like pandas in an incredibly symbiotic fashion. This works so well that my favorite way to consume or produce CSV or Parquet files is via duckdb rather than the native readers and writers.
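
For the flavour of it, a minimal sketch with the duckdb Python package (the file names are just examples):

    import duckdb

    # Query a Parquet file directly and get a pandas DataFrame back.
    df = duckdb.sql(
        "SELECT geohash, avg(value) AS mean_value "
        "FROM 'readings.parquet' GROUP BY geohash"
    ).df()

    # Or go the other way: DuckDB can see local DataFrames by name,
    # so you can write one back out as Parquet with plain SQL.
    duckdb.sql("COPY (SELECT * FROM df) TO 'out.parquet' (FORMAT PARQUET)")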


> The actual representation doesn't much matter.

This is a site for hackers. The actual representation is probably the only interesting point for many readers.


You may find it easy to write parquet from one tool and read it into duckdb, but it isn't always that simple. There are multiple variants of parquet, and they are not always compatible. I have to be careful about which features of parquet I use when writing from spark (which uses a Java implementation of parquet), because I have a colleague who uses matlab (which uses parquetcpp).
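
As one sketch of that kind of care (assuming PySpark; the output path is made up): Spark's legacy INT96 timestamp encoding is a knob people commonly adjust for cross-tool compatibility, since some non-JVM readers handle it poorly.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Write standards-compliant int64 timestamps instead of Spark's
    # legacy INT96 encoding.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")

    spark.range(10).withColumn("ts", F.current_timestamp()) \
        .write.mode("overwrite").parquet("/tmp/example_ts_parquet")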


E.g., in duckdb, partitioned parquet doesn’t work over http, only over the s3 transport.


The only real issue with Parquet, if you think a column-based file format is the right fit for you (which it may not be), is that it was written to work on HDFS and around the specifics of HDFS block storage. A lot of the benefits of Parquet are lost when you use it on other file systems.


A lot of the benefits are nicely present on my macbook. Compact size, fast querying with duckdb (columnar format + indexing), fast read/write for pandas, good type support… would recommend for local tabular data!



