> Come on... who does tests like this? Why is anyone supposed to believe these numbers? What is even the point of making such claims w/o any evidence?
I’m not sure I understand the criticism you’re making. I think people do test like this all the time — they have some data in CSV, they benchmark some operations, they do the same with Parquet. I find the results very believable, given what I know about Parquet and CSV.
What if the tester made some silly error, like reading 1K records from Parquet and 10K records from CSV? What if there were other confounders, e.g. the Parquet file was compressed but the CSV file wasn't, or a particular Pandas setting made it unnecessarily hard for Pandas to read the CSV (for instance, you can specify column types in Pandas before reading a CSV, which can speed things up significantly), and so on?
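To make the dtype point concrete, here's a rough sketch (the file name and columns are made up, and this is only one of the confounders I mean):

```python
import pandas as pd

# Hypothetical columns for a hypothetical data.csv.
dtypes = {"id": "int32", "price": "float64", "category": "category"}

# Without dtypes: pandas has to infer column types while parsing,
# which can be slower and may pick wider types than necessary.
df_inferred = pd.read_csv("data.csv")

# With dtypes known up front: parsing is typically faster and
# the resulting frame uses less memory.
df_typed = pd.read_csv("data.csv", dtype=dtypes)
```

Whether this changes a benchmark by 5% or 5X depends entirely on the data, which is exactly why the methodology matters.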
In general, I'd expect I/O for flat files to dominate any processing of the file format. So if the CSV file is 10X the size of the Parquet file, I'd expect the CSV reader to be roughly 10X slower. That is, unless some complicated seeking or memory handling is necessary, etc.
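If I were running this test myself, I'd at least compare the size ratio against the time ratio, something like this (a sketch, assuming pandas with a Parquet engine such as pyarrow installed, and two hypothetical files containing the same data):

```python
import os
import time
import pandas as pd

def timed_read(path, reader):
    start = time.perf_counter()
    df = reader(path)
    return len(df), time.perf_counter() - start

csv_rows, csv_secs = timed_read("data.csv", pd.read_csv)
pq_rows, pq_secs = timed_read("data.parquet", pd.read_parquet)

# Guard against the "1K records vs 10K records" kind of error.
assert csv_rows == pq_rows

print("size ratio (CSV/Parquet):",
      os.path.getsize("data.csv") / os.path.getsize("data.parquet"))
print("time ratio (CSV/Parquet):", csv_secs / pq_secs)
```

If the two ratios are wildly different, that's a hint that something other than raw I/O is going on.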
Finally, the kind of data is also quite important. Encoding 1-2 digit unsigned integers is quite efficient as ASCII text, while encoding very big integers that way is a lot less efficient. Encoding string data is going to be about equally efficient whether a (simple) binary encoding or a text encoding is used. The information content (entropy) of the data being processed matters a lot too. Imagine reading a run-length-encoded column of a billion nulls vs. a same-size column of varied integers, and so on.
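A quick back-of-the-envelope illustration of both points (again a sketch, with made-up sizes and output file names, assuming pandas/numpy and a Parquet engine):

```python
import os
import numpy as np
import pandas as pd

n = 1_000_000

# Small integers: as ASCII text ("7\n") each value takes ~2 bytes,
# while a fixed-width int64 binary layout takes 8 bytes per value.
small = np.random.randint(0, 10, size=n)
text_bytes = len("\n".join(map(str, small)).encode())
binary_bytes = small.astype("int64").nbytes
print(f"text: {text_bytes:,} B, binary: {binary_bytes:,} B")

# Entropy matters too: a column that is all nulls compresses far
# better (run-length / dictionary encoding) than a column of
# varied integers of the same length.
pd.DataFrame({"x": pd.array([None] * n, dtype="Int64")}).to_parquet("all_nulls.parquet")
pd.DataFrame({"x": np.random.randint(0, 10**9, size=n)}).to_parquet("varied.parquet")
print(os.path.getsize("all_nulls.parquet"), os.path.getsize("varied.parquet"))
```

So a single CSV-vs-Parquet number tells you mostly about that particular dataset, not about the formats in general.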
I agree with your criticism, and in previous years I would have been more critical of their claim as well. Nowadays everything is relatively fast for users' needs, so unless we're splitting hairs... I think we can let a potential silly error slide.
If everything is fast, why do you even bother measuring?
The answer is that no, not everything is fast. Measuring is as important as ever: measurements translate into the money users pay for hardware, rented or purchased.