
Can you `grep` it?

Text-based files like CSV can be `grep`-ed en masse, which I do often.

eg, to find some value across a ton of files.

Is that possible with parquet?




Yes. Ish.

Obviously text-oriented grep doesn't work directly. But table-oriented DuckDB can work very well indeed (it basically combines a lot of awk and grep into one tool):

  $ duckdb <<EOQ
  select count(*) from 'archive.parquet' where sensor = 'wifi'
  EOQ
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │      4719996 │
  └──────────────┘
  $
You can change a lot of aspects, such as the output format, but the point remains that you can do grep-like things with Parquet files quite easily.
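
For instance, a minimal sketch of switching the shell to CSV output so the result can be piped onward (same hypothetical archive.parquet as above; the device_id column is made up):

  $ duckdb <<EOQ
  .mode csv
  -- plain CSV rows instead of the boxed table
  select device_id, sensor from 'archive.parquet' where sensor = 'wifi' limit 5;
  EOQ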


Parquet is column-oriented, grep is line-oriented. You cannot directly map one onto another, much like you cannot directly bitblt a JPEG image: you have to properly unpack it first.

Normally `parquet-tools cat` turns it into greppable lines. But chances are high you have pandas, DuckDB, or even Spark installed nearby, if you actually need things like Parquet.


Which is a very long way of saying -- "no".


Rather "yes, but".


This sounded like a "yes" to me, with only the slightest qualification:

> `parquet-tools cat` turns it into greppable lines

After all, you can't grep gzipped CSV either without some ancillary tool other than grep.
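
(For the record, that ancillary tool is itself a one-liner; a sketch assuming a hypothetical data.csv.gz:)

  $ zgrep 'wifi' data.csv.gz        # grep through the compressed file in place
  $ zcat data.csv.gz | grep 'wifi'  # or decompress to stdout and pipe as usual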


I feel like there’s some unsaid intrinsic value to having text-based files versus a binary format like Parquet. Size and speed aren’t nearly as critical when the data is small, and this seems like adding unneeded complexity for all but some specialized use cases.


"Unsaid"? I have the opposite perspective: This is repeated ad nauseam. Every time a binary format or any other complex optimization is mentioned, the performance-doesn't-matter people have to come out of the woodwork and bring up the fact that it "literally doesn't matter" to "$INVENTED_HIGH_PERCENTAGE of use cases".

I wonder, do people on the Caterpillar forum have this problem, where people just show up to ask, "Yeah but can you take it on the freeway? Because I can take my Tacoma on the freeway."


I can answer why CSV is still being used: it’s easy for humans to read and works in almost any tool. It’s Excel-friendly (yes, Excel has some terrible CSV/data-mangling habits).

End of the day, if the format needs to be used by arbitrary third parties who maybe/probably have no technical experience beyond excel, then CSV is the best option.

If human readable and Excel support are not required, then by all means Parquet is the winner.

TLDR: The best format depends on how you need to use the data.


I’m confused. Are you advocating for complexity regardless of need?


If you are under the impression that CSVs and friends are “less complex” because they’re text, I would like to assure you that any “simplicity” in the existing text formats is a heinous lie; the complexity just gets pushed into the code of the consuming layer.

I use parquet for small stuff now, because it’s standardised and so much nicer and more reliable to use. It’s faster. It ships a schema. There’s actual data type support. There’s even partitioning support if you need it. It’s quite literally better in every useful way.

Long-winded way of saying: the complexity has always been there. CSVs act as if it isn’t; Parquet acknowledges it and gives you the tools to deal with it.


Parquet is not better than csv in every useful way.

CSV, being row-oriented, makes it much easier to append data than parquet, which is column-oriented.
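
For example (hypothetical file names), adding rows to a CSV is a single shell append, whereas adding rows to a Parquet file generally means rewriting it or writing an additional file into the dataset:

  $ tail -n +2 new_rows.csv >> data.csv   # drop the header line, append the rest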

CSV is also supported by decades of software, much of which either predates parquet or does not yet support parquet.

CSV, being a newline-delimited text-based format, is much better suited for inclusion in a git repository than parquet.

I use parquet wherever I can, but I still reach for csv whenever it makes sense.


> CSV is also supported by decades of software, much of which either predates parquet or does not yet support parquet.

CSV cannot really be supported by any software, because it's not really a format. If you don't understand what I am saying by this, please write down an algorithm for figuring out which type of quoting is used in a given CSV file. Also please explain to me how you would deal with newlines inside the text data.
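
To make that concrete, a made-up snippet: both records below are "valid" CSV under some dialect, and a parser has to guess whether quotes are doubled or backslash-escaped, and where each record actually ends:

  name,comment
  "Smith, John","She said ""hello""
  and left"
  Smith\, Jane,He said \"hi\"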


The sense in which CSV is simple is the flexibility of input: it's a very decent authoring format when fed into a system that knows how to import and validate "that kind of CSV" and treats it the way all user interfaces should treat input, with an eye towards telling the user where they screwed up.

Once you've got the data into a processing pipeline you do need additional semantic structure to not lose integrity, which I think Parquet is fine for, but any SQL schema will also do.


How could anybody advocate for "complexity regardless of need"? What would that even look like?

Maybe you were being sarcastic, I'm not sure. But I hope it's obvious that advocating for complexity at all is not the same thing as advocating for complexity regardless of need.


There are also significant disadvantages to text formats, most obviously in the area of compatibility. They’re easy enough that people have a dreadful habit of ignoring specs (if they even exist) and introducing incompatibilities, which are not uncommonly even security vulnerabilities (e.g. HTTP request smuggling).

CSV is a very good demonstration of these problems.

Binary formats tend to be quite a bit more robust, with a much higher average quality of implementation, because they’re harder to work with.


> because they’re harder to work with

Or rather because parsing human-readable text comes with many more edge cases than decoding binary does.


It’s easy enough to write an exact spec. People are just much less inclined to, and even if they do, others are much less inclined to follow it.


TFA never claimed that parquet is superior to csv in all cases. If you have a small file or you just want to be able to read the file, then by all means use a csv. Is your argument that we shouldn't bother developing or talking about specialized file formats because they're overkill in the most common case?


I learned about parquet2json from HN recently, and would use it for this:

```bash
parquet2json cat <file or url> | grep ...
```

https://github.com/jupiter/parquet2json


Sure, it's just that a lot of the people using Parquet aren't reaching for command-line tools, because at their scale that's too slow and not practical. Scaling these tools down is of course possible, just not really a priority; tools that do it probably exist already, and if they don't, you could script something together with Python in a few minutes. It's not rocket science, and the odds are you're one quick Google search away from a gazillion such tools.

But the whole point of Parquet is doing things at scale on clusters, not one-off quick-and-dirty processing of tiny data on a laptop. You could do it, but grep and CSV work well enough for those use cases. The other way around is a lot less practical.


Tbh these days I'd use Parquet down to quite small files; DuckDB is a single binary and gives me grep-like searching, but faster and with loads more options for when grep itself isn't quite enough.
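
A grep-ish sketch across a whole directory of files (the path, column, and search term here are made up):

  $ duckdb <<EOQ
  select * from read_parquet('logs/*.parquet') where message ilike '%timeout%';
  EOQ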


Also the 5-10x compression turns «big data» into «laptop size data» very nicely!


The utility of grep depends on the amount of data and how it's stored. Grepping files of the size where you might consider Parquet can take many hours; at that point it's almost certainly faster to sit down and write a program that pulls the data out of a Parquet file.

On the other hand, if grepping is fast, Parquet doesn't really make sense in that scenario; you're better off with a structured text format like JSON or XML if grep finishes in less than a few minutes.


Slightly different use case, because of structured vs unstructured query, but you can do that fast with duckdb: https://duckdb.org/2021/06/25/querying-parquet.html


Not exactly a one-liner, but you can do something like `parquet-tools cat <your file.parquet> | grep ...`


The article explicitly mentions that Parquet is not streamable, because its index is at the end of the file and the data is split into multiple parts.


Who cares about grep when you can quite literally execute SQL across one or many files that might even be remote?


Being able to do it efficiently on remote data is fantastic. Not only is the entire data dramatically smaller, but I can read just the partition and just the columns I care about.
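
A sketch of that with DuckDB's httpfs extension (the URL and the ts column are made up; sensor reuses the example above). DuckDB fetches roughly just the footer plus the row groups and columns the query needs:

  $ duckdb <<EOQ
  install httpfs;
  load httpfs;
  select ts, sensor
  from read_parquet('https://example.com/data/archive.parquet')
  where sensor = 'wifi';
  EOQ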


Sorta? In that I can think of several ways to make that work.

That said, if you are using Parquet, I think it is safe to assume large data sets; large structured data sets, specifically. As such, a one-liner that can hit them from a script is doable, but you probably won't see it done much because there's not much utility in it.


That's not useful at all when your file contains hundreds of columns.



