Parquet: An efficient, binary file format for table data (csvbase.com)
283 points by calpaterson on April 3, 2023 | 186 comments



There is a point which one needs to be aware of with Parquet: if you are not stepping into it with tools well suited for it (that is: Go, C++, Java or Python codebase/runtime, mainstream platform...) you are going to have a bad time. For all its advantages, Parquet is _very_ complicated (IMO overcomplicated, on multiple fronts), and uses binary dependencies which you absolutely need to carry around (Thrift?..). Up to a certain data size the advantage of CSV is not that it is a better format (it is decidedly, absolutely worse/more ambiguous than Parquet) but the fact that you can write a valid CSV in a couple of lines of anything, and read one in a couple of lines of anything. If you ask me, Parquet - for all its advantages - is a product of the Cambrian explosion of Apache big-data projects. Very little in its design warrants (or requires) using things like Thrift, so I wouldn't rule out that something less overcomplicated (and more welcoming, and requiring fewer dependencies) will appear on the scene at some point.


To be honest any file format that would do 1) strong typing/binary data, 2) string deduplication would capture most of the advantages of parquet, except being a column store format which helps in some cases and hurts in others.


And even using a columnar format but without all the complexity of Parquet would be comparatively easy (when writing using buffers)


At that point just start using an SQLite database as a data format.


SQLite is a row-oriented store, not a column-oriented store, ie, there are fundamental tradeoffs between the two.


But sqlite has no compression by default.


In addition, very few of these kind of serialization formats are useful for mutable data.

One thing that I have been struggling with is to find an in-memory, MUTABLE data format/data structure that can be used between different processes in shared memory. I'm not sure one exists.


Use Redis. It's an in-memory data structure server, first created to have the same data structures usable and accessible across processes and languages.

Or plenty of other in-memory database products (Apache Ignite) or fast key/value (ScyllaDB) or just use RDBMS (Postgres or SQLite).


I have been using SQLite DBs to that end with some amount of success. If I remember well a lot of software of yore used BDB to the same effect.


Apache Arrow?

As far as I understand it, it's more for cross-process in-memory fast access (e.g. numbers crunching), but it's mutable.


As far as I can tell, Apache Arrow is not very mutable. It basically has to reallocate the universe.

I don't need large data between processes. I need short-lived medium data ... large enough and short-lived enough to make serialization painful but mutable enough to make "stores" kind of annoying.


> but the fact that you can write a valid CSV in a couple of lines of anything, and read in a couple of lines of anything.

I seriously hope you are implying the usage of a library.

Hand rolled CSV implementations are generally wrong and ruin your data.


The beauty of CSV is that you don't -need- a library to access the data.

It's just there, right in front of you.

> Hand rolled CSV implementations are generally wrong and ruin your data.

Yeah sure - asymptotically. Once you've gotten a handle on your data, you can take care of all the bulletproofing later.

None of this is meant to be cheeky, BTW. The learning curve and general cognitive overhead of Parquet is very considerable.


This is all very interesting (and makes me check out Parquet) but it's painful to see how the text goes to great lengths to describe how it avoids common problems and what they are, but doesn't lose a single word on... how they actually solved these problems. What is the actual boolean type? Which encoding are you actually working in? There's nothing concrete but an implication that it's also compressing data?


You make a great point and I'm sorry about that. To answer some of your questions and try to add more detail:

Strings are UTF-8. Internally Parquet has the concept of "physical types" (ie how bytes are arranged) and "logical types". 64 bit ints are just physical with no logical type. DateTimes are 64 bit int physical types (millis since unix epoch - aka "Javascript time") with a logical type of "datetime" so you know to read it that way.

Bool columns are bits in a row.

The "data bearing" parts of a file can be compressed and usually are. Most are using snappy, a fastish, moderate compression codec. I believe some are also using zstd. One of the common questions people ask is: "how does it compare against .csv.gz" and the answer is favourably. Gzipped csv is even slower to read and write and usually about 30% bigger on disk.


One important intuition about gzip vs snappy is that gzip isn’t generally parallelizable, while snappy is. If your Hadoop cluster is storing enormous files this shows up very quickly in read performance.


That however only impacts CSVs or other text files, where the compression is the envelope. In parquet, it is the internal chunks that are compressed, so it is as parallelizable as ever.

Also, generic CSV is not parallelizable anyway, because it allows you to have (quoted) newlines inside fields, and then there is really no way to split it on newlines because you never know if you're at the beginning of a row or in the middle.

So parallel reading of CSVs is more of an optimization for specific CSVs that you know don't have newlines inside fields.


Decompression of arbitrary gzip files can be parallelized with pragzip: https://github.com/mxmlnkn/pragzip


Thus the words “isn’t generally” as opposed to “cannot be”.


The actual representation doesn't much matter. When you read the data into something like a dataframe, the boolean values are translated into something appropriate for the supporting language. The actual encodings are documented, but generally don't matter for anybody interacting with data encoded as Parquet.

Regarding compression, yes, that is very much supported. And if you can sort your data to get lots of repetitions or near repetition in columns, it can work very well. For instance, I had a case with geo-hashes, latitudes, longitudes, times and values (which were commonly zero). When the data are sorted by geohash and then time the ZStandard compression algorithm gave 300:1 compression relative to raw CSV due to the very high repetition (geohash, lat, long are constant for long stretches, time and value barely change).

What the article didn't mention is that tools like DuckDB eat parquet and interface with packages like pandas in an incredible symbiotic fashion. This works so well that my favorite way to consume or produce CSV or Parquet files is via duckdb rather than native readers and writers.
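For anyone who hasn't seen it, here's roughly what that looks like from Python, assuming the duckdb package and a hypothetical readings.parquet with geohash/value columns; DuckDB can query Parquet files and local pandas DataFrames interchangeably.

    import duckdb

    # Query a Parquet file directly and get the result back as a pandas DataFrame
    df = duckdb.sql(
        "SELECT geohash, avg(value) AS avg_value FROM 'readings.parquet' GROUP BY geohash"
    ).df()
    # DuckDB can also see local DataFrames by name, so writing back out is one COPY
    duckdb.sql("COPY (SELECT * FROM df) TO 'summary.parquet' (FORMAT parquet)")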


> The actual representation doesn't much matter.

This is a site for hackers. The actual representation is probably the only interesting point for many readers.


You may find it easy to write parquet from one tool and read it into duckdb, but it isn't always that simple. There are multiple variants of parquet, and they are not always compatible. I have to be careful which features of parquet I use when writing from spark (which uses a Java implementation of parquet), because I have a colleague who uses matlab (which uses parquetcpp).


E.g., in duckdb partitioned parquet doesn’t work with http, only s3 transports.


The only real issue with Parquet, if you think a column based file format is the right fit for you (which it may not be) is that it was written to work on HDFS, and the specifics for block storage on HDFS. A lot of the benefits of Parquet are lost when you use it on other file systems.


A lot of the benefits are nicely present on my macbook. Compact size, fast querying with duckdb (columnar format + indexing), fast read/write for pandas, good type support… would recommend for local tabular data!


“Parquet has both a date type and the datetime type (both sensibly recorded as integers in UTC).”

What does it mean for a date to be utc? my date, but in the utc timezone? Usually when I write a date to a file, I want a naive date; since that’s the domain of the data. 2020-12-12 sales: $500. But splitting that by some other timezone seems to be introducing a mistake.

Often I want to think in local naive time too, especially for things like appointments that might change depending on dst or whatever. Converting to utc involves some scary things. Timestamps are also useful but I don’t want to transcode my data sometimes as the natural format is the most correct.


Lot of confusion here. UTC is a time standard, not a particular time zone. An instant written down in, say, the Pacific Standard Time Zone can be a UTC-scaled time.

An example of non-UTC time is TAI, which is International Atomic Time. The difference is that UTC has leap seconds to deal with changes in the rate of rotation of the earth, while TAI marches on without any discontinuities.

So for a date to be “in UTC” really just means it uses the leap seconds published by IERS. This article says “integers in UTC” which is a little ambiguous, but probably means “integer UTC seconds since the Unix Epoch.”


> Lot of confusion here. UTC is a time standard, not a particular time zone. An instant written down in, say, the Pacific Standard Time Zone can be a UTC-scaled time.

UTC is very much a timezone, that’s why it has a timezone designator (Z).

If you record a future PST date as UTC and PST changes, your recorded date is now wrong.

> So for a date to be “in UTC” really just means it uses the leap seconds published by IERS. This article says “integers in UTC” which is a little ambiguous, but probably means “integer UTC seconds since the Unix Epoch.”

And that’s guaranteed to fuck up for future local events.


> If you record a future PST date as UTC and PST changes, your recorded date is now wrong.

Damn. That's a very good point. From now on, I'll be always recording also timezone, especially when it comes to future dates!


Any timezone you want, as long as it's UTC. (If above scares you, think about how to record an event that should happen 5th of November 2023 1:30 AM PST.)


> UTC is very much a timezone, that’s why it has a timezone designator (Z).

This is just not correct. The "Z" is a historical maritime designation for the zero-point timezone, GMT, and predates UTC by decades.

UTC is defined by the International Telecommunications Union in recommendation TF.460-6 (pdf link: https://www.itu.int/dms_pubrec/itu-r/rec/tf/R-REC-TF.460-6-2...). It's brief and you won't find any mention of time zones.

The symbolism "UTC+0" or the "Z" timezone come from ISO-8601, which is a specification for how to represent times in strings. You can find that specification here: https://web.archive.org/web/20171019211402/https://www.loc.g...

See section 2.1.12 Note 3:

> UTC is not a time zone, it is a standard. UTC is also not GMT (Greenwich Mean Time), rather, UTC has replaced GMT. UTC is more precise; the term 'GMT' is ambiguous.

That document goes to great lengths to keep this distinction between time scales and time formats. They're different, and conflating them will get you very confused. When you describe a time in PST, you are almost certainly using UTC.


I think you may have missed the idea: usually dates (not datetimes) are the legal fiction "date" NOT "an instant." I.e. timezones are irrelevant.

Birth dates, contract dates, sale dates, billing dates, insurance coverage dates, etc.

---

EDIT: UTC still plays a role -- as there is still the choice of calendar (though you'd be forgiven for assuming Gregorian) -- but it's an odd statement to decipher.


Ah! You're right, I totally missed that the commenter was griping about UTC dates as opposed to datetimes. I agree, a "UTC date" is not a clean concept. We can talk about Julian or Gregorian dates, but those are independent of time scales like UTC or TAI or UT1.


How are time zones irrelevant? The current date, at this very moment, depends on the time zone.

I think of date more like a date time simplified to day precision.

Still, I agree UTC date is unclear.


> How are time zones irrelevant?

Nobody is going to adjust your birthdate when you move abroad.

If you move east, you’ll be able to drink a little earlier than if you move west.

Nobody is adjusting Christmas Eve against the time zone of Bethlehem.

There are many situations where a date is just a number in a calendar and not a specific time on planet earth.


> Nobody is going to adjust your birthdate when you move abroad.

But that’s just a convention, right? The day you are born is still dependent on the time zone. If you are born in the US at 11pm EST then someone born at that same moment in the UK has a different birthday.

Dates have boundaries. These boundaries are dependent on time zone. We can talk about dates irrespective of time zone but day periods cannot be understood without reference to time zone.


You are mixing two things up into one conversation. You are adding time into the conversation, in which case yes you need timezones, but if you don't add time into the conversation and just have dates, then you don't have timezones.


> But that’s just a convention, right?

Yes, "just."

Dates are "just" a convention.


As you say, timezones are relevant if you need to covert an instant to a date, or vice versa.

But you can store, operate, and query on dates without the foggiest clue about instants or timezones.


But if you have a date, and a timezone...how do you do anything useful with those 2 things? If you add 8 hours to a date due to PST, what do you get?


UTC isn’t a time zone, it’s a specification for how many seconds are in a day. In UTC, there are 86400 seconds on most days, but the IERS may announce a “leap second” which makes some particular day either 86401 seconds or 86399 seconds (historically always the former).


That was already clear


A bit of an aside: by 2035, UTC will be a static offset from TAI, with the addition of leap seconds being phased out


That will be a wonderful change. Unfortunately we will still have to deal with leap seconds retrospectively, but it is certainly a step forward, and it will be nice to have a static list of historical leap seconds rather than all computers everywhere polling IERS every month or two.


Agreed. I think this is where Java’s latest date/time libraries shine (heavily based on Joda time lib).

It makes a solid distinction between Dates, times, instants, and zone dates/times.

And Instant is a moment in time. Exposed as a UTC time stamp.

Birthdays etc are best expressed as LocalDates. There’s also LocalDateTime for zone-agnostic features.

If I were writing software like you describe I’d want an explicitly zone-agnostic date(time) to represent it.


>If I were writing software like you describe I’d want an explicitly zone-agnostic date(time) to represent it.

I would love to see this as a data type (CalendarDate) baked into languages & databases which would prevent you from applying timezone conversions to it.


that would be the goal, but it's hard. I went down a long blog-post road about a "clockchain" which includes a way to reconcile zone-less time. tl;dr all time is relative between a minimum of two objects. all but certain there is an elegant way to do it beyond my current thinking.

https://www.seanmcdonald.xyz/p/the-clockchain-protocol-the-l...

cal.com is solving some of these problems with handling of all zones.


> The DATE logical type annotates an INT32 that stores the number of days from the Unix epoch, January 1, 1970.

So it's a naive date.

Not sure how to support naive time, though, like 23:59:59, (or even leap-second one, 23:59:60). Probably have to store it as integers, and deal with conversion on application side.
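A quick sketch of what that INT32 actually holds, in plain Python (no timezone involved anywhere):

    from datetime import date, timedelta

    EPOCH = date(1970, 1, 1)

    def date_to_parquet_days(d: date) -> int:
        return (d - EPOCH).days          # what goes into the INT32

    def parquet_days_to_date(n: int) -> date:
        return EPOCH + timedelta(days=n)

    print(date_to_parquet_days(date(2023, 4, 3)))   # 19450
    print(parquet_days_to_date(19450))              # 2023-04-03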


Parquet doesn't really do timestamp with time zone. Also relevantly for processing Parquet, Apache Spark doesn't really do timestamp with time zone. Meaning that if your data go through parquet (e.g. in a data lake) you have to store timestamps as strings, or lose the time zone.

It sucks.


We went for the third option: store the timezone in a separate column for data where the timezone is needed.


Date is confusing with a timezone (UTC or otherwise) and the doco makes no such suggestion.

The Parquet datatypes documentation is pretty clear that there is a flag isAdjustedToUTC to define if the timestamp should be interpreted as having Instant semantics or Local semantics.

https://github.com/apache/parquet-format/blob/master/Logical...

Still no option to include a TZ offset in the data (so the same datum can be interpreted with both Local and Instant semantics) but not bad really.
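If I remember correctly, that flag is driven by whether the writer's timestamp type carries a timezone at all; through pyarrow it looks roughly like this (a sketch, not gospel):

    import pyarrow as pa

    # No timezone on the Arrow type -> written with isAdjustedToUTC = false (Local semantics)
    local_ts = pa.timestamp("ms")
    # A timezone on the Arrow type -> written with isAdjustedToUTC = true (Instant semantics)
    instant_ts = pa.timestamp("ms", tz="UTC")
    print(local_ts, instant_ts)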


For anyone interested in the design approach to Parquet, here's a good interview with the Parquet designers. It covers a lot of the issues they were trying to address with the format. https://www.youtube.com/watch?v=qe0SeC0Hr_k


What is a "naive" or "natural" date? A date in your current timezone? That's often a source of bugs if the tz is not explicit.


It can be a source of bugs if the tz is explicit, as well. Consider storing people's birthdates. If you accidentally store it as UTC and use some implicit tz handling, suddenly you've got off-by-one birthdates.

More succinctly, dates don't have timezones because they don't have times.


An example I like to use is medication reminders. For some, you need them to be at absolute time intervals: UTC is likely appropriate, and omitting time offsets could be disastrous. Others are more appropriate at periods of the experienced waking day, and using UTC could be extraordinarily disruptive eg while traveling.


Another case that pops up a lot is dealing with dates from third party systems that aren't timezone aware. Trying to convert them without knowing what timezone they were supposed to be in in the first place doesn't work, you have to just display them with no conversion.


Oh yeah I hesitate to even go into this because it’s been a huge source of traumatic burnout dealing with the fallout of it being mishandled.


The best real-world example I've heard of comes from a friend who works for a travel company.

You're creating a schedule for a tour. The tour occurs some time in the future. You have a daily itinerary for the tour group.

Now the destination country changes their time zone in some fashion. You don't want to have ever stored an instant, you want the date and time to be more abstract. 9am, as it will be understood at that date and time in the future.

The same also occurs for past events, but we don't tend to encounter that too often, although the calendar for September 1752 in the UK (at the time known as the Kingdom of Great Britain) shows another wrinkle


I swear that some variant of this bug happened to me in gCal. I set myself some yearly recurring events, I think a birthday or a names-day, and clearly marked it as an “All Day” event and I set it to repeat Yearly.

I don’t remember if it was in 1 or 2 or even more years, but Google calendar absolutely did make the event off-by-one day.


Naive dates break down as soon as your domain spans the globe and interacts with local entities. It becomes important to normalize on some "0" index (UTC) and convert to local times (which are often naive) as needed.


Storing (UTC, latitude, longitude, altitude) is the holy grail, I guess. Everything else is too leaky.

I mean, that's literally the four dimensions of space-time.


UTC is non-unique because of leap seconds, so TAI + lat + long + altitude is actually required.

I work on software for astronomy, and that quadruplet is what’s used to describe an observing location. You can actually get a little in trouble because of changes in the shape of the earth over time, so latitude and longitude and altitude need to be treated as time-dependent values, which matters once you are accounting for relativistic effects.


UTC times are unique: when a positive leap second is inserted, 23:59:60 is added; when a negative leap second is inserted, 23:59:59 is skipped.

Unix timestamps, on the other hand, are not unique.


That sort of assumes there's one time zone that's being used per spacetime coordinate, which isn't guaranteed. You can get political situations where de facto and de jure time diverge, or where different authorities nominally in charge of the time in a place disagree.

Lebanon seems to have experienced exactly this recently:

https://www.bbc.com/news/world-middle-east-65079574

It does unambiguously give you a spacetime coordinate (useful) but it doesn't unambiguously tell you what local time you should use for an occurrence, and the answer would really depend on who was asking.


> Storing (UTC, latitude, longitude, altitude) is the holy grail, I guess.

It’s not.

If I set up a meeting next year in NYC at 10 and the legislature decides to change the timezone’s offset, the meeting remains at 10 NY time on that date, it does not shift in NY time. It’s the UTC which shifts.

And UTC alone is sufficient for past events, as they are fixed instants in the time-stream. Unless you’re at a stage where you need to take relativistic effects in account, then you need to add the referential.


One of the main interests of CSV is that it's human readable and usable with common text tools (such as grep, cat or any text editor), so I'm not sure the comparison should be done between Parquet and CSV, but between Parquet and other formats like HDF5 and NetCDF. What is the advantage of this new format compared to the older ones?


STOP.

People insist on making data exchange formats used to transfer data between software 99.99% of the time "easy for humans" that 0.001% of the time, at the expense of the common case. This inevitably results in squishy, inefficient formats like HTML and JSON that become performance quagmires and security land mines: http://seriot.ch/projects/parsing_json.html

CSV is not even easy for humans to process, with or without tools like grep. Can you successfully handle quote escaping and line-wrapping?

    This ,"is ",    valid 
    "Hello",123,"''from the
    not,what,you,
    ""think"",next line"""
That's one record, by the way.

Your level one challenge is: Extract the unescaped value of the "valid" column. Write your robust, tested, "grep" command-line below.

Level two: Make that parse consistently across a dozen tools from a dozen vendors. Try ETL tools from Oracle, Sybase, and SAP. Try Java, PowerShell, and Excel. Then import it via an ODBC text driver, and also into PowerBI. I wish you luck.
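For what it's worth, here is what a conforming parser makes of that snippet, using Python's csv module purely as an illustration: the last three physical lines come back as a single record, which is exactly what line-oriented grep cannot see.

    import csv, io

    data = (
        'This ,"is ",    valid \n'
        '"Hello",123,"\'\'from the\n'
        'not,what,you,\n'
        '""think"",next line"""\n'
    )
    for i, row in enumerate(csv.reader(io.StringIO(data))):
        print(i, len(row), row)
    # 0 3 ['This ', 'is ', '    valid ']
    # 1 3 ['Hello', '123', '\'\'from the\nnot,what,you,\n"think",next line"']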


This is disingenuous. There isn't really a CSV standard that defines the precise grammar of CSV.

But, suppose there was such a standard, then some of the problems you've described would become non-problems (e.g. support across multiple tools). Potentially, if such a standard disallowed linebreaks, it would also make the use of grep easier.

I actually really wish there was such a standard...

Your other assertion about "squishy" formats is also... well, not really true. The formats you listed became popular due to the data they encoded, not because of their quality. It's very plausible that better formats exist, and they are probably out there. The problem is that we set the bar too low with popular formats, and that they often seem easy to implement, which results in a lot of junk formats floating around.


> There isn't really a CSV standard that defines the precise grammar of CSV.

Did you read the link going to a page literally titled: "Parsing JSON is a Minefield."?

JSON has a "precise" grammar intended to be human readable. The end result is a mess, vulnerable to attacks due to dissimilarities between different implementations.

Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

Why bother, you ask? Why would anyone bother to make floating point number parsing super efficient?

JSON.


> Google put in significant engineering effort into "Ryu", a parsing library for double-precision floating point numbers: https://github.com/ulfjack/ryu

It's not a parsing library, but a printing one, i.e., double -> string. https://github.com/fastfloat/fast_float is a parsing library, i.e., string -> double, not by Google though, but was indeed motivated by parsing JSON fast https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...


> Did you read the link going to a page literally titled: "Parsing JSON is a Minefield."?

No, I didn't. But... why is this relevant to CSV?

I think I might have read an article with a similar name before. And, in general, am not a big fan of JSON, but I didn't think this link contained a standard for CSV...

Also, JSON has a defective grammar, with bad semantics, so, yeah, having a standard didn't save it... but it could have saved it. Not having a standard would've been even worse.


JSON is well defined but hard to parse. CSV is not well defined but you still have to parse and print in such a way that is compatible with every other CSV implementation otherwise what is the point? Use xlsx Spreadsheets instead, those actually work even if they are a bit bloated.


>There isn't really a CSV standard that defines the precise grammar of CSV.

If you don't implement the pseudostandard that pretty much every library uses then whatever format you have can be called CSV but you can bet that it will be compatible with nothing else as you have implemented a bespoke file format that has no guarantee of compatibility. If it works at all it is an accident and it probably shouldn't.


One problem that needs to be solved and standardized is escaping.

It's extremely common for text formats to store text. Some characters always need to be escaped, but the set of characters is always unique to the host format.


I downvoted you, not because I disagree with your point (although I do disagree with it), but because of your peremptory and dismissive usage of "STOP" and "level one challenge."

This issue is not so straightforward that you should be rudely stating that people with other views should just stop and go back to the drawing board. Not that I think one should ever do that, even if the issue were simple.


Excel is ubiquitous and makes csv easy for humans to read.

Machines do the 99.99% of the time that things are working. The .01% when humans need to get in the loop is when something goes wrong and life is better when you can read the data to figure out what went wrong. JSON is amazing for this. CSV not so much but it's still better than a binary format.

Machine efficiency is not the most valuable metric. Machines are built for human use. The easier a system is for humans to use the better. Even if it's orders of magnitude less efficient.


Excel does things to CSV data that shouldn’t be done.

Encodings are troublesome. (See ftfy, a miracle library for fixing them.)
Everything might be a date. Even genes.
Leading zeros get dropped. Even when significant. (Like routing numbers.)
Long numeric strings get turned into scientific notation. (Like account numbers.)


Excel doesn't make CSV easy to read.

Type information has to be communicated out of band with CSV. If you want to view the data in Excel just use xlsx files. Compared to CSV it is a joy to work with.


Accounting wants to work with a portion of the Parquet database. Your options are: give them data in CSV/Excel formats, or hire a dev to work for accounting full time and still have them constantly complaining about having to wait on the dev to get them the data they need.

The programmers, data scientists, and other technical fields should never process data in Excel, but you can’t expect that of the whole organization, so exporting to it is important. At this point, I’d recommend XLSX as the export format because it’s harder to mess up the encoding, but libraries exist for CSV and others as well.
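As a concrete (if hypothetical) sketch of that workflow with pandas: read just the columns accounting cares about from the Parquet data and hand them an XLSX, so Excel never gets a chance to mangle the encoding. Assumes pyarrow and openpyxl are installed; the file and column names are invented.

    import pandas as pd

    # Pull only the slice accounting needs from the (hypothetical) Parquet data
    df = pd.read_parquet("sales.parquet", columns=["invoice", "amount", "booked_on"])
    # Ship it as XLSX rather than CSV so types and encoding survive the trip
    df.to_excel("sales_for_accounting.xlsx", index=False)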


After being constantly frustrated by the way Excel was mangling data I'd output as CSV, but being unwilling to implement real XLSX, I discovered that I can actually output a UTF-8 encoded web page which contains nothing but a charset declaration, a title, and a HTML table, and lie that it's application/vnd.openxmlformats-officedocument.spreadsheetml.sheet and Excel will open it without problems. So will Libre Office.


You can give them anything from SQL interface to BI tool like Superset to no-code tool like Thoughtspot. They might still export some to CSV, but having GUI is much better than having CSV.


I think the issue is more of recovering misformatted or incomplete data when something goes wrong, or otherwise understanding what's in the file if reading it in one format doesn't work as intended (e.g if the file format goes out of use and maintainers are unfamiliar with it). CSV has issues for sure, and yet you have posted content that's readable at some level as text. It's not a quasirandom selection of unicode characters that looks like some private key.

I think these types of examples can be misleading because they take an example of pathological CSV and don't present the comparison, which is typical binary. Sure your example is a problem but what if it were in a nested binary format? Would that be easier for humans to read if the file were mangled? (It's also often more difficult to resolve formatting errors from just one record too, so that's also misleading.)

I disagree with the idea that humans might want to look in the file using text or other software "0.001%" of the time. Sure in some settings it's rare but in others it's pretty common.

CSV is a bit of a misnomer anyway because the field separators can really be anything so you should be using a separator that is extremely unlikely to appear in a field.

Anyway, I'm not actually trying to defend CSV as the ideal as I think it is way, way overused, in part because of the issues you raise. But I also think making a format that's easy for humans to read in a failsafe file format (text usually) is an important criterion.

I guess in general I haven't been convinced that any of the data format proposals I've seen get some level of publicity are especially better than any other. For some use cases, probably, but not for others. Sure, that's how it works, but I wish there were better options getting traction or attention. It seems like when I do see something promising it often doesn't get off the ground, and there isn't a robust competition of ideas in the space. Often the reading-writing software drives data format use — that's not necessarily my preference but it seems to often be the case.


> recovering misformatted or incomplete data when something goes wrong

For CSV, what goes wrong is usually people editing the file as text or tools consisting of "a couple of lines of anything" that don't handle escaping and delimiters correctly.


HTML and JSON are not good examples of text formats designed with security or performance in mind.


That honor belongs to XML


Sarcasm, I hope


I would say it is less about being human readable directly (sorry, but CSV is NOT human readable), more about the availability of the tools to read it. Anyone can load a CSV file into a spreadsheet, and that's a big advantage! Visualizing parquet files is painful. Developers can install VisiData, that's cool, but it is much less convenient to share with less tech-savvy people.


https://www.visidata.org/

Is a great tool to bridge that gap. Store your data in efficient formats but still revert to csv and have human readable when needed.


Parquet is not really a new format - it has been around for years and years in the Hadoop world. It is mostly good for storing large data in a way that makes it effective to process it for OLAP workload, and that is it.


And if you are interested in encoding geospatial data there is a format called GeoParquet that is on the way to standardization: https://geoparquet.org/ Essentially adding metadata on the extensible schema metadata structure that Parquet supports. Think of WKB in a column. With more exciting stuff coming that way.


The main issue with GeoParquet at the moment is there's no spatial indexing. When that comes it becomes a lot more viable as a replacement for other GIS file types like GeoPackage and, in some cases, even for database storage like PostGIS.

Very much looking forward to it though, hoping it eventually covers point cloud and TIN data well too.


Oh, very interesting. I didn't expect so much scepticism about Parquet. We use it to store around 100 TB of data in a partitioned way on GCS and query it using BigQuery. Absolutely fantastic solution and incredibly cheap too.


Can you `grep` it?

Text-based files like CSV can be `grep`-ed en masse, which I do often.

eg, to find some value across a ton of files.

Is that possible with parquet?


Yes. Ish.

Obviously text-oriented grep doesn't work. But table oriented duckdb can work very well indeed (and it basically combines a lot of awk and grep together)

  $ duckdb <<EOQ
  select count(*) from 'archive.parquet' where sensor = 'wifi'
  EOQ
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │      4719996 │
  └──────────────┘
  $
You can change a lot of aspects around such as the output format, but the point remains that you can do grepy things with Parquet files quite easily.


Parquet is column-oriented, grep is line-oriented. You cannot directly map one onto another, much like you cannot directly bitblt a JPEG image: you have to properly unpack it first.

Normally `parquet-tools cat` turns it into greppable lines. But chances are high you have pandas, DuckDB, or even Spark installed nearby, if you actually need things like Parquet.


Which is a very long way of saying -- "no".


Rather "yes, but".


this sounded like a "yes" to me with only the slightest qualification

> `parquet-tools cat` turns it into greppable lines

After all, you can't grep gzipped CSV either without some ancillary tool beyond grep.


I feel like there’s some unsaid intrinsic value to having text-based files versus a compiled format like parquet. Size and speed aren’t nearly as critical when size is small, and this seems like adding unneeded complexity for all but some specialized use cases.


"Unsaid"? I have the opposite perspective: This is repeated ad nauseam. Every time a binary format or any other complex optimization is mentioned, the performance-doesn't-matter people have to come out of the woodwork and bring up the fact that it "literally doesn't matter" to "$INVENTED_HIGH_PERCENTAGE of use cases".

I wonder, do people on the Caterpillar forum have this problem, where people just show up to ask, "Yeah but can you take it on the freeway? Because I can take my Tacoma on the freeway."


I’ve come here to ask why CSV is being used? CSV is easy for humans to read, and works in almost any tool. It’s Excel friendly (yes Excel has some terrible CSV/data mangling practices).

End of the day, if the format needs to be used by arbitrary third parties who maybe/probably have no technical experience beyond excel, then CSV is the best option.

If human readable and Excel support are not required, then by all means Parquet is the winner.

TLDR: The best format depends on how you need to use the data.


I’m confused. Are you advocating for complexity regardless of need?


If you are under the impression that CSV’s-and-friends are “less complex” because they’re text, I would like to assure you that any “simplicity” in the existing text formats is a heinous lie, and the complexity gets pushed into the code of the consuming layer.

I use parquet for small stuff now, because it’s standardised and so much nicer and more reliable to use. It’s faster. It ships a schema. There’s actual data type support. There’s even partitioning support if you need it. It’s quite literally better in every useful way.

Long winded way of saying: the complexity has always been there. CSV’s act as if it’s not there, parquet acknowledges it and gives you the tools to deal with it.


Parquet is not better than csv in every useful way.

CSV, being row-oriented, makes it much easier to append data than parquet, which is column-oriented.

CSV is also supported by decades of software, much of which either predates parquet or does not yet support parquet.

CSV, being a newline-delimited text-based format, is much better suited for inclusion in a git repository than parquet.

I use parquet wherever I can, but I still reach for csv whenever it makes sense.


> CSV is also supported by decades of software, much of which either predates parquet or does not yet support parquet.

CSV can not be supported by any software because it's not really a format. If you don't understand what I am saying by this, please write down an algorithm for figuring out which type of quoting is used in a given CSV file. Also please explain to me how you would deal with new lines in the text data.


The angle by which CSV is simple is in the flexibility of input...that is, it's a very decent authoring format, when sent into a system that knows how to import and validate "that kind of CSV" and treat it the way all user interfaces should treat input - with an eye towards telling the user where they screwed up.

Once you've got the data into a processing pipeline you do need additional semantic structure to not lose integrity, which I think Parquet is fine for, but any SQL schema will also do.


How could anybody advocate for "complexity regardless of need"? What would that even look like?

Maybe you were being sarcastic, I'm not sure. But I hope it's obvious that advocating for complexity at all is not the same thing as advocating for complexity regardless of need.


There are also significant disadvantages to text formats, most obviously in the area of compatibility. They’re easy enough that people have a dreadful habit of ignoring specs (if they even exist) and introducing incompatibilities, which are not uncommonly even security vulnerabilities (e.g. HTTP request smuggling).

CSV is a very good demonstration of these problems.

Binary formats tend to be quite a bit more robust, with a much higher average quality of implementation, because they’re harder to work with.


> because they’re harder to work with

Or rather because parsing human readable text comes with many more edge cases than encoding stuff into binary.


It’s easy enough to write an exact spec. People are just much less inclined to, and even if they do people are much less inclined to follow it.


TFA never claimed that parquet is superior to csv in all cases. If you have a small file or you just want to be able to read the file, then by all means use a csv. Is your argument that we shouldn't bother developing or talking about specialized file formats because they're overkill in the most common case?


I learned about parquet2json from HN recently, and would use it for this:

```bash
parquet2json cat <file or url> | grep ...
```

https://github.com/jupiter/parquet2json


Sure, it's just that a lot of the parquet using people aren't using command line tools for this because it's just too slow and not practical to do that at scale. Scaling down what these tools do is of course possible but not really a priority. Tools probably exist that do this already though. And if they don't you could probably script something together with things like python pretty quickly. It's not rocket science. You could probably script together a grep like tool in a few minutes. Not a big deal. And the odds are you are a quick google search away from finding a gazillion such tools.

But the whole point of parquet is doing things at scale in clusters, not doing some one-off quick and dirty processing on tiny data on a laptop. You could do it. But grep and csv work well enough for those use cases. The other way around is a lot less practical.


Tbh these days I'd use parquet down to small files; duckdb is a single binary and gives me grep-like searching but faster and with loads more options for when grep itself isn't quite enough.


Also the 5-10x compression turns «big data» into «laptop size data» very nicely!


The utility of grep depends on the amount of data and how it's stored. Grepping very large files of the size where you might consider parquet, that may take many hours. It's almost certainly faster to sit down and build a program that extracts data from a parquet.

On the other hand if grepping is fast, parquet doesn't really make sense in that scenario. You're better off with a structured format like JSON or XML if grep finishes in less than a few minutes.


Slightly different use case, because of structured vs unstructured query, but you can do that fast with duckdb: https://duckdb.org/2021/06/25/querying-parquet.html


not exactly a one liner, but you can do something like parquet-tools cat <your file.parquet> | grep etc


The article explicitly mentions that Parquet is not streamable because its indexes are at the end of the file and are split into multiple parts.


Who cares about Grep when you can quite literally execute SQL across one/many files that might even be remote.


Being able to do it efficiently on remote data is fantastic. Not only is the entire data dramatically smaller, but I can read just the partition and just the columns I care about.


Sorta? In that I can think of several ways to make that work.

That said, if you are using parquet, I think it is safe to assume large data sets. Large structured data sets, specifically. As such, getting a one liner that can hit them with a script is doable, but you probably won't see it done much due to utility.


That's not useful at all when your file contains hundreds of columns.


S3 Select supports parquet, which means you can push query to individual S3 blobs. I think there is a possibility we might see someone build a postgres foreign data wrapper which pushes the querying via S3 select to S3 blobs and use postgres only for the final aggregations. Someone on HN had done this with sqlite files in S3 queried by lambdas using sqlite, but pushing it wholly to S3 would eliminate a lot of overhead.
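A rough sketch of what that per-blob push-down looks like with boto3 (the bucket, key and column names are all made up); only the rows matching the WHERE clause come back over the wire:

    import boto3

    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket="my-bucket",                      # hypothetical bucket and key
        Key="lake/sales/part-0000.parquet",
        ExpressionType="SQL",
        Expression="SELECT s.id, s.cost FROM S3Object s WHERE s.region = 'EU'",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"CSV": {}},
    )
    for event in resp["Payload"]:                # an event stream, not a plain body
        if "Records" in event:
            print(event["Records"]["Payload"].decode())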


Effectively a compute engine on top of S3 storage that is smart enough to push-down to S3Select where possible. Well the general idea of a compute engine on top of S3 storage where data is partitioned efficiently with the appropriate prefixing is quite common. Commonly known as an external table. Hive is an example that comes to mind.

Couple that with a meta-index/catalogue and a modern file format like parquet and you've got yourself a more modern flavor of an external table. For instance Iceberg or Deltalake.

I'm unsure whether the technologies/libraries built around interacting or managing these external table formats can actually push-down to S3Select, but I can't imagine why they can't.


While people indeed love reinventing the wheel you can literally do that today with clickhouse or duckdb. select * from a parquet file wherever it resides.


Casually querying an S3 bucket with either of those is easily a half day's work for any meaningful size of workload. S3 Select and Athena make this very straightforward, and in the case of Athena+parquet, obscenely cheap given the capability on offer. I wanted to dislike parquet as overengineered masturbation, but that's really hard when getting billed fractions of a cent for chewing through terabytes of data in a few seconds with no setup costs.


> Casually querying an S3 bucket with either of those is easily a half day's work for any meaningful size of workload.

Clickhouse and DuckDB you just type: select * from (s3 path and access credentials)

Let's see what it takes on AWS Athena:

1) "Before you run your first query, you need to set up a query result location in Amazon S3."

2) What the fuck is a AwsDataCatalog? Why can't I just point it to a file/path?

3) What the fuck is AWS Glue Crawler? Oh shit another service I have to work with.

4) Oops. Glue can't access my bucket, now I have to create a new IAM role for it.

5) Okay, great I can finally run a query!

6) It wrote my results to a text file on S3. Awesome. I feel so goddamn efficient. I'm so fucking glad you suggested this rube goldberg disaster.

And if I don't have elevated access to our AWS account, it's not even that simple, instead there's slack messages and emails and jira tickets to get DevOps to provision me and/or set it up for me (incorrect the first time of course). A week later I finally got a text file on a fucking S3 bucket. Awesome.


If someone were following the setup steps to execute a query in a serious setting (as opposed to reviewing them just for the sake of writing a sarcastic forum comment), they might find CREATE TABLE .. MSCK REPAIR TABLE quickly becomes finger memory, and the only real usability complaint about Athena is that its interactive query editor only allows 10 tabs to be open at a time


Nope. It's a pain in the ass and easily the least ergonomic thing I've ever encountered. Parquet has a schema. AWS Athena forces you to provide the schema for your parquet file, or Glue to "discover" it instead.

I'm sure your happy little setup works great for you, but to imply or suggest that it's somehow an easy solution is misleading to the point of outright lies.


Jesus, why are you so angry at the prospect of having to execute a CREATE TABLE statement? Are you okay?


Because someone needs to be angry. The dev world is full of complacent fucks who are okay with this constant backslide of building perpetually worse software.


TIL PTSD from using AWS is real.


Why bother doing any of that. Just set your bucket up with open access, and dump your files anywhere on S3, choose whatever format, compression and schema you want. Don't bother using prefixes or naming convention since that's a whole thing. Searching for what you're looking for is now really easy. Make sure you do a recursive list at the root of the bucket so you know what you're working with. It might take a while to respond, if so just recursively list again and again. Pick the files you want, download to your laptop and voila! Once you've finished munging your results drop it back in the bucket and head home. You're done!

Okay sorry for the sarcasm, but what does everyone want? Everyone makes fun of the DataSwamp (DataLake, Lakehouse, etc), but that's what you're gonna get if you're not going to put any effort into learning your data storage strategy or discipline into maintaining and securing it.

If someone wanted you to have access to this data, then have them take a few minutes to set you up with an external table. It's probably as easy as `CREATE EXTERNAL TABLE ...`.

After all, Data Engineering is just a fancy name for Software Engineering. ;)


Using a blob storage platform that charges you every time you read, write, or access your data is a shit strategy both for cost and speed. The entire premise is flawed.

Use it for backup, sharing, and if you must, for at-rest data. But don’t use it for operational data.


Right, i don't think anyone is suggesting to use this as an operational data store.

Use it for infrequent/read/analytical/training access patterns. Set your bucket to infrequently accessed mode, partition data and build out a catalogue so you're doing as little listing/scanning/GETting as possible.

Use an operational database for operational interaction patterns. Unload/ELT historical/statistical/etc data out to your data warehouse (so the analytical workloads aren't bringing down your operational database) as either native or external tables (or a hybrid of both). Cost and speed against this kind of data are going to be way cheaper than most other options, mostly because this is columnar data with analytical workloads running against it.


> Right, i don't think anyone is suggesting to use this as an operational data store.

Databricks, Snowflake, AWS, Azure, GCP, and numerous cloud-scale databases providers are 100% suggesting people do precisely that (even without realizing it, e.g. Snowflake). It's either critical to their business models or at least some added juice when people pay AWS/GCP $5 per TB scanned. That's why these shit tools and mentality keep showing up.


K, I'm seeing the crux of your contention. It seems to revolve around either using cloud services in general, or specifically using S3 (or any blob storage) as your data substrate. That's fair.

Out of curiosity, what are you suggesting one should use to store and access vast amounts of data in a cheap and efficient manner?

Regarding using something like Snowflake as your operational database, I'm not sure anyone would do that. Transactional workloads would grind to halt and that's called out all the time in their documentation. The only thing close to such a suggestion probably won't be seen until Snowflake releases their Hybrid (OLTP) tables.


S3 Select, Athena, Redshift Spectrum all doing nearly identical things. Classic AWS.


Interesting, why would it take so long? I'm genuinely curious so that we can make it better. (Disclaimer: I work at ClickHouse)

Here's an example of a query against a Parquet file you can run on your laptop:

    ./clickhouse local -q "SELECT town, avg(price) AS avg_price FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/ho...') GROUP BY town ORDER BY avg_price DESC LIMIT 10"

From: https://clickhouse.com/docs/knowledgebase/parquet-to-csv-jso...


Big fan of ClickHouse, but in the case of S3 there doesn't seem to be much point fighting city hall. I was the author of the original clickhouse-local-in-Lambda ticket FWIW. Even if that approach worked well, it'd still only be <50% cheaper than Athena's fractions-of-a-cent for queries on efficiently partitioned data, and still there is setup work involved. I think CH-local-in-Lambda might still easily compete on cost with Athena for raw CSV queries though


Interesting, thanks for sharing this feedback! I didn't realise initially that it was about running clickhouse-local in Lambdas.

Fyi, we recently improved the Parquet support in 23.2 https://github.com/ClickHouse/ClickHouse/pull/45878

Also, we still have optimizations for reading Parquet from S3 coming so that might improve


It doesn't seem to be the same thing though. Like duckdb doesn't seem to use s3 select. The difference is that duckdb seems to be doing the filtering here. If you use s3 select s3 will give you result of the sql query. Would be good to know if any existing solution does it.

> select * from a parquet file wherever it resides.

I am sure any tool can do this. The point was to push the query to the source if you are issuing tons of them.


DuckDB pulls a clever trick with SQL and Parquet where it uses the HTTP range header to fetch back only the data it needs to answer the query.

The design of Parquet supports this. If you run `select sum(cost) from sales` and the cost column is all in the same byte range, DuckDB will need to fetch just the header that tells it the offset of that cost column and then just the bytes that form the cost column - saving a huge amount of bandwidth.

Running `select * from sales` will still have to scan over a much larger section of the file though.
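Roughly what that looks like from Python, assuming the duckdb package with its httpfs extension and a hypothetical remote file; the footer is fetched first, then only the byte ranges for the cost column:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")   # enables HTTP(S) range requests
    con.execute("LOAD httpfs;")
    total = con.execute(
        "SELECT sum(cost) FROM read_parquet('https://example.com/sales.parquet')"
    ).fetchone()[0]
    print(total)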


Unless the file is very small, the column would most likely not be in the same byte range. DuckDB also can't filter on multiple columns without fetching both. Being able to fully preselect data at the source is much more powerful than the parquet tricks duckdb uses. Don't get me wrong, the duckdb impl is good and generic and will support any parquet store, but it is not as powerful as S3 Select.


> the cost column is all in the same byte range

Parquet is hybrid, so you have random chunks of contiguous bytes scattered throughout the file.


we are pretty much literally doing that :D


Parquet fits well into a system like Apache Arrow for good query performance, when considering partitioning (via directories), push-down predicates, and read-only-the-columns-you-need: https://arrow.apache.org/docs/python/dataset.html "Tabular Datasets"


I was surprised lately to realize there is not a standard way to import a parquet file into MSSQL. Happy to be proven wrong.


If you need a quick tool to convert your CSV files, you can use csv2parquet from https://github.com/domoritz/arrow-tools.


I think this remains a usability problem for parquet: most of the tooling sucks. Starting to get better, but parquet has been out for so many years and there are still few tools that natively let someone interact with the format. For years, most of the utilities were exclusively JVM based.

My favorite way to explore parquet is to make a virtual link through DuckDB where you can seamlessly treat it like a SQL table. Even browse it in DBeaver.


I mostly ignored Parquet until I realized DuckDB could handle it - now I'm much more likely to poke around with a Parquet file because I've got a solid toolkit available in both the CLI and as a Python library.

A couple of notes I've made about that:

Using DuckDB in Python to access Parquet data: https://til.simonwillison.net/duckdb/parquet

Use DuckDB to convert parquet to JSON and then open it in Datasette Lite: https://til.simonwillison.net/duckdb/parquet-to-json


you can also use pandas in Python to access parquet data, and then e.g. convert it to sqlite


Naive pandas is going to limit you to RAM sizes when parquet is deliberately meant for huge datasets.


Anyone reading who is currently using pandas and running into this issue: polars is generally faster, and now supports (still new though) out-of-core workloads.


We support exporting any query from Splitgraph to Parquet, try it here [0] (click the dice for a random query). You can also upload CSV or SQLite file(s) [1]. Each one will load into a table that you can query with a Postgres client, or export it (or just some subset of it you select with a query) as Parquet.

Then you can load it into Seafowl [2]. We also have an API to export from Splitgraph to Seafowl, just need to add the button (this week).

[0] https://www.splitgraph.com/query

[1] https://www.splitgraph.com/upload

[2] https://seafowl.io/docs/guides/uploading-csv-parquet


I've used Parquet here to filter/map 400 Million Twitter geotagged tweets [1]. The main advantage was that it automatically partitions data so it can be streamed and chunked during processing.

[1]: https://ad.vgiscience.org/twitter-global-preview/00_Twitter_...


> On a real world 10 million row financial data table I just tested with pandas I found that Parquet is about 7.5 times quicker to read than csv, ~10 times quicker to write and a about a fifth of the size on disk. So way to think of Parquet is as "turbo csv" - like csv, just faster (and smaller).

Come on... who does tests like this? Why is anyone supposed to believe these numbers? What is even the point of making such claims w/o any evidence?

---

PS. Also, buried somewhere in the middle of the article is the admission that the format isn't possible to stream.

And, if we expand on the file structure, then it becomes apparent that it was inspired by typical relational database storage, except simplified, which made the format awful for insertion / deletion operations. Of course, if you compare to CSV, then there's no loss here, but there are other formats which can handle limited, or even unlimited insertion / deletion with the resource use similar to writes.

Even truncation in this format is complicated (compared to CSV).

Seems like the format was trying to be the middle ground between proper relational database storage and something like CSV, but not sure if such middle ground is actually necessary...


Well, it was me who did that test and frankly I only included it to give some context for people who've never heard of Parquet. If you want evidenced benchmarks I'm sure you can google for those yourself or indeed just run your own - the benefit differs as the data does (financial data obviously is mainly numerical).

I'm not, in this post, actually talking about Parquet being faster than csv but instead about how it makes data interchange easier (mostly that is about having an embedded schema).
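If anyone does want to rerun something like it, here is a rough sketch with pandas (pyarrow engine assumed, synthetic numeric data, so the exact ratios will differ from real financial data):

    import os, time
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10_000_000, 5), columns=list("abcde"))
    for path in ("test.csv", "test.parquet"):
        write = df.to_csv if path.endswith(".csv") else df.to_parquet
        read = pd.read_csv if path.endswith(".csv") else pd.read_parquet
        t0 = time.time(); write(path)
        t1 = time.time(); read(path)
        t2 = time.time()
        size_mb = os.path.getsize(path) / 1e6
        print(f"{path}: write {t1-t0:.1f}s, read {t2-t1:.1f}s, {size_mb:.0f} MB")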


So why are you posting this? If I have to do the work of benchmarking?

More importantly, why should I trust you, if you didn't bother making your article trustworthy?

Working many years in storage, I can engineer a benchmark that will show that CSV is faster than Parquet. So what?

> having an embedded schema

CSV also has embedded schema, that's what the header row is. Your arguments are just poor / show insufficient research / insufficient ability to interpret results.

If you wanted to make that argument stronger / believable, you should've at least claimed that Parquet has better schema than CSV. But this is also a problematic claim, because your idea of "better" comes from a perspective of making something more universal. But, in particular cases, I could, for example, come up with a CSV header that would be able to communicate more type information than what's available in Parquet, and subsequently, beat Parquet on that metric.

Anyways. My bigger point: maybe you are right, but you don't make a compelling case.


> Come on... who does tests like this? Why is anyone supposed to believe these numbers? What is even the point of making such claims w/o any evidence?

I’m not sure I understand the criticism you’re making. I think people do test like this all the time — they have some data in CSV, they benchmark some operations, they do the same with Parquet. I find the results very believable, given what I know about Parquet and CSV.


What if the tester made some silly error, like reading 1K records in Parquet and 10K records in CSV? What if the tester had other confounders, like, eg. Parquet file was compressed, but CSV file wasn't, or their particular setting for Pandas made it very hard for Pandas to read CSV properly (eg. there's a way to specify column types in Pandas prior to reading in CSV, which may significantly speed up things) and so on.

In general, I'd expect I/O for flat files to dominate any processing of the file format. So, if CSV file is 10X the Parquet file, I'd expect the performance of the reader to be 10X slower for CSV. Well, that is unless some complicated seeking is necessary, or some complicated work with memory is necessary etc.

Finally, the kind of data is also quite important. Consider that encoding 1-2 digit unsigned integers is quite efficient using ASCII text, while encoding very big integers is going to be a lot less efficient. Encoding string data is going to be almost as efficient no matter if (simple) binary encoding is used, or text encoding etc. The information contents (entropy) of the information being processed is also very important. Imagine reading length-encoded column of billion of nulls vs same size column of various integers and so on.


Sir...this is a Wendy's \s

I agree with your criticism and would have been more critical as well of their claim in previous years. Nowadays, everything is relatively fast for users' needs, so unless we're splitting hairs... I think we can let a potential silly error slide.


If everything is fast, why do you even bother measuring?

The answer is no: not everything is fast. Measuring is as important as ever. Measuring translates into money users pay for hardware, rented or purchased.


On the other hand, because it is column storage, if you want to do the sum of one column grouped by another column, and there are 100 other columns, it is way faster than csv since you only read the parts of the file you need.


This is an oversimplification. You will have to read the index to figure out where the parts you need to read are. And if your columns don't align well with the blocks / pages of the storage you are using, this "optimization" may turn out to be worthless, or, perhaps even harmful, if your system is starving for memory.


Parquet is cool. What isn't so cool is that, e.g., partitioning isn't supported on all platforms out of the box. Hence, reading a set of files created in one language may require extra work in another language. Partitioning should be supported in all languages.


I wanted to dive into Parquet internals a couple of times to see how it works but there seems to be no tool to see page level data. The only option seems to be to use Arrow’s C++ library. Is anybody aware of a higher level option?


You can query the metadata of a Parquet file using DuckDB's parquet_schema and parquet_metadata functions [1]

[1] https://duckdb.org/docs/data/parquet/metadata
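To go a level deeper than the schema, the same trick covers row-group and column-chunk metadata (the file name here is hypothetical):

    import duckdb

    # Row group / column chunk level detail: offsets, encodings, compressed sizes, stats
    print(duckdb.sql("SELECT * FROM parquet_metadata('example.parquet')"))
    # Physical and logical types per column
    print(duckdb.sql("SELECT * FROM parquet_schema('example.parquet')"))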



i wonder why the words 'apache', 'twitter', 'cloudera', 'cutting', 'rcfile', 'orc', 'trevni', 'thrift', 'hadoop', 'julien', 'jco', 'tianshuo', and 'deng' don't appear in this page

i mean i understand not mentioning all of them but it's pretty disappointing to not mention any of them


Interestingly enough, there are also articles out there talking about JPEGs and SQL databases that never mention IBM.


Should they also mention Dremel? I could go on.

Within the context of the article, mentioning any of these names is pretty obviously unwarranted. It sounds like you want a different article. Write it!


crediting the creators of parquet is not unwarranted, no


Why is this necessary every time someone wants to talk about Parquet? It's perfectly fine to talk about something without wasting the top third of your article recounting history like some recipe blog.


you seem to be intentionally strawmanning my position

if you are going to write about parquet, say something like 'apache parquet' or 'parquet, originally written for hadoop' or 'parquet, originally by julien le dem' instead of making it sound like you invented it yourself

this is only the top third of your article if the article is six words long

like 'apache parquet, for free, never used'


He links to the Wikipedia article. If someone is interested enough, they're welcome to do their own research. The article would not have been improved with your recommendation in any appreciable way for the reader.


Should it also include a section on the history of csv?


I'm interested in a good tool for converting csv to pq. Seems many encoding features are not supported by most.


I've still yet to figure out where feather fits into this.


My understanding: they're like pickle files, but for programs that use the arrow standard.


Why is the picture of Sutjeska war memorial in the article?


Because Sutjeska war memorial was "Quicker, but also more convenient" :)


I reserve the right to use my holiday snaps as I want on my own blog :)


   apt-cache search parquet
   <nada>


Why not use a simple database instead?



