Obviously text-oriented grep doesn't work. But table-oriented DuckDB can work very well indeed (and it basically combines a lot of awk and grep together).
Parquet is column-oriented, grep is line-oriented. You cannot directly map one onto another, much like you cannot directly bitblt a JPEG image: you have to properly unpack it first.
Normally `parquet-tools cat` turns it into greppable lines. But chances are high you have pandas, DuckDB, or even Spark installed nearby, if you actually need things like Parquet.
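For what it's worth, a grep-ish query over a Parquet file with the DuckDB Python package might look something like this (the file name and column names are made up for illustration):

```
import duckdb

# Grep-like search over a Parquet file: DuckDB scans only the columns referenced
# and filters rows on a pattern. 'events.parquet', 'ts' and 'message' are placeholders.
duckdb.sql("""
    SELECT ts, message
    FROM 'events.parquet'
    WHERE message ILIKE '%timeout%'
""").show()
```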
I feel like there's some unsaid intrinsic value to having text-based files versus a compiled format like Parquet. Size and speed aren't nearly as critical when the data is small, and this seems like adding unneeded complexity for all but a few specialized use cases.
"Unsaid"? I have the opposite perspective: This is repeated ad nauseam. Every time a binary format or any other complex optimization is mentioned, the performance-doesn't-matter people have to come out of the woodwork and bring up the fact that it "literally doesn't matter" to "$INVENTED_HIGH_PERCENTAGE of use cases".
I wonder, do people on the Caterpillar forum have this problem, where people just show up to ask, "Yeah but can you take it on the freeway? Because I can take my Tacoma on the freeway."
I've come here to ask why CSV is being used. CSV is easy for humans to read, and it works in almost any tool. It's Excel-friendly (yes, Excel has some terrible CSV/data-mangling habits).
At the end of the day, if the format needs to be used by arbitrary third parties who probably have no technical experience beyond Excel, then CSV is the best option.
If human readability and Excel support are not required, then by all means Parquet is the winner.
TLDR: The best format depends on how you need to use the data.
If you are under the impression that CSVs-and-friends are "less complex" because they're text, I would like to assure you that any "simplicity" in the existing text formats is a heinous lie, and the complexity gets pushed into the code of the consuming layer.
I use parquet for small stuff now, because it’s standardised and so much nicer and more reliable to use. It’s faster. It ships a schema. There’s actual data type support. There’s even partitioning support if you need it. It’s quite literally better in every useful way.
Long-winded way of saying: the complexity has always been there. CSVs act as if it's not there; Parquet acknowledges it and gives you the tools to deal with it.
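To make the "it ships a schema" point concrete, here's a minimal sketch with pandas (assuming pyarrow is installed; the columns are invented): types survive a Parquet round trip, while a CSV round trip throws them away and leaves the reader to guess.

```
import pandas as pd

# A tiny, made-up table; the point is the dtypes, not the data.
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "amount": [19.99, 5.25],
    "country": ["DE", "US"],
})

df.to_parquet("tiny.parquet")                  # the schema travels with the file
print(pd.read_parquet("tiny.parquet").dtypes)  # when: datetime64[ns], amount: float64

df.to_csv("tiny.csv", index=False)             # types are flattened to text
print(pd.read_csv("tiny.csv").dtypes)          # 'when' comes back as a plain object/string
```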
> CSV is also supported by decades of software, much of which either predates parquet or does not yet support parquet.
CSV cannot be supported by any software, because it's not really a format. If you don't understand what I'm saying, please write down an algorithm for figuring out which kind of quoting is used in a given CSV file. Also, please explain how you would deal with newlines embedded in text fields.
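To make the ambiguity concrete, here's a small illustration with Python's csv module (the data is invented): the same bytes come out differently depending on whether the reader honors quoting, and an embedded newline only survives if it does.

```
import csv, io

# One header row plus one record whose second field contains a comma,
# escaped quotes, and an embedded newline.
raw = 'id,note\n1,"hello, ""world""\nsecond line"\n'

# A quote-aware reader recovers one record with the newline intact.
print(list(csv.reader(io.StringIO(raw))))
# [['id', 'note'], ['1', 'hello, "world"\nsecond line']]

# A naive line-and-comma split (the grep mentality) tears the field apart.
print([line.split(",") for line in raw.splitlines()])
# [['id', 'note'], ['1', '"hello', ' ""world""'], ['second line"']]
```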
The angle by which CSV is simple is the flexibility of input... that is, it's a very decent authoring format when sent into a system that knows how to import and validate "that kind of CSV" and treat it the way all user interfaces should treat input: with an eye towards telling the user where they screwed up.
Once you've got the data into a processing pipeline you do need additional semantic structure to not lose integrity, which I think Parquet is fine for, but any SQL schema will also do.
How could anybody advocate for "complexity regardless of need"? What would that even look like?
Maybe you were being sarcastic, I'm not sure. But I hope it's obvious that advocating for complexity at all is not the same thing as advocating for complexity regardless of need.
There are also significant disadvantages to text formats, most obviously in the area of compatibility. They're easy enough to hand-roll that people have a dreadful habit of ignoring specs (if those even exist) and introducing incompatibilities, which not infrequently turn out to be security vulnerabilities (e.g. HTTP request smuggling).
CSV is a very good demonstration of these problems.
Binary formats tend to be quite a bit more robust, with a much higher average quality of implementation, because they’re harder to work with.
TFA never claimed that parquet is superior to csv in all cases. If you have a small file or you just want to be able to read the file, then by all means use a csv. Is your argument that we shouldn't bother developing or talking about specialized file formats because they're overkill in the most common case?
Sure, it's just that a lot of the people using parquet aren't using command-line tools for this, because it's too slow and impractical at scale. Scaling down what these tools do is of course possible, just not really a priority. Tools that do this probably exist already, and if they don't, you could script something together in Python pretty quickly. It's not rocket science: you could put together a grep-like tool in a few minutes, and the odds are you're a quick Google search away from a gazillion such tools.
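For instance, a rough pyarrow-based sketch of such a tool (the name and behavior are just assumptions, not an existing utility):

```
#!/usr/bin/env python3
"""pqgrep: a crude grep-alike for Parquet files.

Usage: pqgrep PATTERN file1.parquet [file2.parquet ...]
"""
import sys
import pyarrow.parquet as pq

def pqgrep(pattern, paths):
    for path in paths:
        table = pq.read_table(path)  # for huge files you'd iterate row groups instead
        for batch in table.to_batches():
            columns = [col.to_pylist() for col in batch.columns]
            for row in zip(*columns):
                if any(pattern in str(value) for value in row):
                    print(f"{path}: {row}")

if __name__ == "__main__":
    pqgrep(sys.argv[1], sys.argv[2:])
```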
But the whole point of parquet is doing things at scale on clusters, not one-off quick-and-dirty processing of tiny data on a laptop. You could do it, but grep and CSV work well enough for those use cases. The other way around is a lot less practical.
Tbh these days I'd use parquet even down to small files; duckdb is a single binary and gives me grep-like searching, but faster and with loads more options for when grep itself isn't quite enough.
The utility of grep depends on the amount of data and how it's stored. Grepping files large enough that you might consider parquet can take many hours; it's almost certainly faster to sit down and write a program that extracts the data from a parquet file.
On the other hand if grepping is fast, parquet doesn't really make sense in that scenario. You're better off with a structured format like JSON or XML if grep finishes in less than a few minutes.
Being able to do it efficiently on remote data is fantastic. Not only is the data as a whole dramatically smaller, but I can read just the partition and just the columns I care about.
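Something like this pandas sketch, for example (the bucket, the hive-style date partitioning, and the column names are all hypothetical, and s3fs/pyarrow need to be installed):

```
import pandas as pd

df = pd.read_parquet(
    "s3://my-bucket/events/",               # a partitioned dataset, not one giant file
    columns=["user_id", "status"],          # only these columns are fetched
    filters=[("date", "=", "2024-06-01")],  # only the matching partition is read
)
print(df.head())
```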
Sorta? In that I can think of several ways to make that work.
That said, if you are using parquet, I think it is safe to assume large data sets: large structured data sets, specifically. As such, a one-liner script that can hit them is doable, but you probably won't see it done much because there isn't much call for it.
Text-based files like CSV can be `grep`-ed en masse, which I do often.
e.g., to find some value across a ton of files.
Is that possible with parquet?