Fast CSV Processing with SIMD (nullprogram.com)
159 points by ingve on Dec 4, 2021 | 29 comments



CSVs are nice for data interchange, or even for storage if compressed (storage is interchange between the past and the future). But the very first thing one should do when working with CSV data is convert it to a binary representation. It forces you to make a full pass over all the data, understand it, and think about the types of the columns, and it will increase your iteration speed.

More related to TFA: can someone recommend a good book or other resource about modern code optimization? I have internalized several lessons, but it would be nice to dig deeper.


Some years ago I got a huge performance increase by simply converting a CSV dataset to Parquet before processing it with Apache Spark.


Also, you will sleep better at night knowing that your column dtypes are safe from harm, exactly as you stored them. Moving from CSV (or god forbid, .xlsx) has been such a quality of life improvement.

One thing I miss though is how easy it is to inspect .csv and .xlsx. I kinda solved it using [1], but it only works on Windows. More portable recommendations welcome!

[1] https://github.com/mukunku/ParquetViewer


I really like Visidata for "exploring" csv-type data.

It's a vi(m)-inspired tool.

It also handles xls(x), sqlitedb and a bunch of other random things, and it appears to support parquet via pandas:

https://www.visidata.org/docs/loading/


Nice one!


I used to use Zeppelin, a kind of Jupyter Notebook for Spark (that supports Parquet). But there may be better alternatives.

https://zeppelin.apache.org/


The "real" format being binary with debugging tools is absolutely the best way to go. For example, you can use `nio print` (or even just `nio p`) in the Nim multi-program https://github.com/c-blake/nio to get "text debugging output" of binary files.


> recommend a good book or other resource about modern code optimization

https://www.agner.org/optimize/


Thanks, and I must say, very old-school web page :)


Agner Fog's writings are all well worth the read if you care about optimization. Some of the techniques he uses in his optimized versions of common C functions are really interesting.

For example, he uses SIMD in several of these, but one of the performance problems associated with SIMD is the last 'n' bytes. If you're processing, say, 16 or 32 bytes at a time, what do you do when you get towards the end of your buffer and that process would over-read past the end of it? Most code I've seen just uses a simple linear (byte-at-a-time) loop for the last few bytes, but this is often very slow. Agner's solution was to just read past the end of the buffer, but use some assembly tricks to ensure there is always valid memory available after the end of the buffer, so that reading past the end would (almost) never cause a problem.


> but use some assembly tricks to ensure there is always valid memory available after the end of the buffer

Isn't rounding up your buffer's size an allocation technique rather than "an assembly trick"?


For the specific case of substring/byte searching, another useful trick here is to do an unaligned load that lines up with the end of the haystack. Some portion of that load will have already been searched, but you already know nothing is in it.
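A minimal sketch of that final end-aligned load (assumes the haystack is at least 16 bytes long; SSE2):

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>  /* SSE2 */

    /* Sketch: when fewer than 16 bytes remain, issue one last unaligned load
     * that ends exactly at the end of the haystack. It overlaps bytes that
     * were already searched, but those are known not to contain a match, so
     * any duplicate hits in the overlap can be ignored. Assumes len >= 16 so
     * the load never starts before the buffer. */
    static __m128i final_block(const uint8_t *haystack, size_t len)
    {
        return _mm_loadu_si128((const __m128i *)(haystack + len - 16));
    }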


Another useful trick for reading the last few bytes, especially if suitable buffer padding is not possible and the read would cross a memory page boundary, is to use PSHUFB aligned against the memory page boundary to pull the last few bytes into a register. Safe and fast.
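Roughly like this untested sketch (assumes 4 KiB pages and SSSE3, and that the valid tail does not itself cross into the next page; the name load_tail is mine):

    #include <stdint.h>
    #include <string.h>
    #include <emmintrin.h>   /* SSE2 */
    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

    /* Sketch: load the tail bytes starting at p (fewer than 16 valid bytes
     * remain) without touching the next, possibly unmapped, page. Assumes
     * 4 KiB pages and that the valid tail does not itself cross the page
     * boundary; if it did, the next page would be mapped anyway and a plain
     * unaligned load would already be safe. Lanes past the valid tail come
     * back as zero or in-page garbage and must be masked by the caller. */
    static __m128i load_tail(const uint8_t *p)
    {
        if (((uintptr_t)p & 4095) <= 4096 - 16) {
            /* A 16-byte load at p stays inside the current (mapped) page. */
            return _mm_loadu_si128((const __m128i *)p);
        }
        /* Aligned loads never cross a page, so load the aligned 16-byte chunk
         * containing p, then PSHUFB the wanted bytes down to lane 0. Shuffle
         * indices with the high bit set produce zero. */
        size_t off = (uintptr_t)p & 15;
        __m128i chunk = _mm_load_si128((const __m128i *)((uintptr_t)p & ~(uintptr_t)15));
        uint8_t idx[16];
        for (size_t i = 0; i < 16; i++)
            idx[i] = (uint8_t)(off + i < 16 ? off + i : 0x80);
        __m128i mask;
        memcpy(&mask, idx, sizeof mask);
        return _mm_shuffle_epi8(chunk, mask);
    }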


I have a one-liner alias/script csv_to_sqlite that just starts an sqlite3 shell with the csv converted to a table. It’s a lifesaver.


You can probably get another "few x" faster still:

https://branchfree.org/2019/03/06/code-fragment-finding-quot...

Based on a cursory scan this is the same problem so Geoff's trick should be directly applicable.

The ideas are quite similar: using bit manipulation to detect the in and out regions. Geoff's trick does the "smearing" of the bits in a single SIMD "carryless multiply", plus some supporting stuff. It is constant time for an input word: there is no dependence on the number of bits. So it could be many times faster for inputs with very dense quotes.

For an actual implementation, you'd want to handle the case where you don't have any quotes at all, or where they are very sparse (which is common in a lot of CSV), with a fast path that bails out when it sees a quote (possibly to be resumed later if there is another quote-free run).
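For reference, a minimal sketch of the carryless-multiply "smearing" step (the block-to-block state carry and the SIMD compare that produces the quote bitmask are omitted; assumes x86 with PCLMULQDQ):

    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 */
    #include <wmmintrin.h>   /* PCLMULQDQ: _mm_clmulepi64_si128 */

    /* Sketch: quote_bits has one bit per input byte, set where that byte is a
     * '"'. A carryless multiply by all-ones computes, for each bit position,
     * the XOR (parity) of all quote bits at or below it, i.e. a prefix XOR.
     * The result marks the bytes lying inside a quoted region, in constant
     * time regardless of how many quotes there are. Real code must also XOR
     * in the "still inside a quote" state carried over from the previous
     * 64-byte block. */
    static uint64_t quoted_region(uint64_t quote_bits)
    {
        __m128i q    = _mm_set_epi64x(0, (int64_t)quote_bits);
        __m128i ones = _mm_set1_epi8((char)0xFF);
        return (uint64_t)_mm_cvtsi128_si64(_mm_clmulepi64_si128(q, ones, 0));
    }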


Langdale also has a state-machine SIMD idea that can be used for the full CSV transition table: https://branchfree.org/2018/05/25/say-hello-to-my-little-fri...

I implemented this at some point (the "noisy" version, in Langdale's words) and got a CSV tokenizer that took about 1.5 cycles per byte.


I started to build "simdcsv" - based on similar principles to "simdjson", but succumbed to lassitude. There's still a fork kicking around.

Doing the CSV version of the quote convention is actually wildly easier than the JSON version, as you can just treat a record like "foo""bar",hi,1,"hatstand,teakettle" as having 'left the quotes and re-entered the quotes' at the double-quote spot when you're busy looking for ','. This isn't, of course, much help for normalization, but for the bit where you're hoping to simply find which ',' characters are separators, it's fine to pretend that there's a gap in your "quoted stuff bitfield" at the "" in "foo""bar", as of course that gap isn't going to land on a ',' anyhow. So it's much cheaper than the tedious shenanigans needed to handle some crafty user who has hit you with 100 \ characters in a row (as opposed to 99 or 101), a la JSON.

IMO the cost of doing the CLMUL all the time vs taking a conditional to handle the 'I have no quotes today' case is pretty low. It makes for shorter and more easily understood (in performance terms) code. I wasn't allowed to take this to extremes in simdjson - we did handle the "common case" for UTF-8 validation (although I wonder if that's a very culturally determined notion of "common case").


I really should write up how we did delimiter and quote detection in this library:

https://github.com/capitalone/DataProfiler

It turns out that delimited files, IMO, are much harder to parse than, say, JSON, largely because they have so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We've even seen @-separated with ' for quotes.

Given the above, it should still be possible to use the method described. I'm guessing you'd have to detect the separators and quote chars first, however. You'd also have to handle empty rows and corrupted rows (which happen often enough).


> It turns out delimited files IMO are much harder to parse than say, JSON.

I have to agree, having recently done a deep-dive into the new Utf8JsonReader in .NET. Tracking state in a JSON parsing context can actually be quite elegant. CSV is a minefield of edge cases in my experience.


I get ~50% of the speed of the article's variant with no SIMD at all in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim

While it's in Nim, the main logic is really just about one screenful and should not be too hard to follow.

As commented elsewhere in this thread, but it bears (much!) repeating: a better approach is to bulk-convert text to binary and then operate off of that. One feature you get is fixed-size rows and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though the struct type needs to match the file's layout). When DBs or DB-ish file formats are faster, being in binary is the 0th-order reason why.

The main reason not to do this is if you have no disk space for it.
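A minimal sketch of the mmap-and-cast approach (the Row layout and the input file are invented for illustration; uses a GCC/Clang packed attribute):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical fixed-size row; the field layout is made up. */
    typedef struct __attribute__((packed)) {
        int64_t id;
        double  price;
        int32_t qty;
    } Row;

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) return 1;
        size_t nrows = (size_t)st.st_size / sizeof(Row);
        if (nrows == 0) return 1;
        const Row *rows = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (rows == MAP_FAILED) return 1;
        /* Random access to any row: no index and no parsing needed. */
        const Row *r = &rows[nrows / 2];
        printf("%lld %g %d\n", (long long)r->id, r->price, (int)r->qty);
        munmap((void *)rows, (size_t)st.st_size);
        close(fd);
        return 0;
    }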


Isn't casting to a struct pointer full of Undefined Behaviour? I believe you have to memcpy or bitcast and hope that the compiler optimises it out.


No.

To be more clear: you can mmap read-only and just cast the pointer. You probably want the struct type to be "packed" unless you are intimately familiar with your compiler's struct padding behavior. Every C compiler I am aware of has some way to specify "packed", but it is admittedly not the most portable construct. In any event, if you are ever doing memcpy calls then you still must worry about struct layout, so the memcpy is not buying you anything.

You also need to not exceed the size of the "file array", just like any array in C, but pre-mmap you usually open & fstat, and the latter tells you the size of the file, while sizeof(structType) tells you the size of the struct.

So, there is an implicit assumption that the underlying file is not shrunk between said fstat and whenever you access it by virtual memory indirection. Outright file deletion would actually be OK, but truncating it would leave the kernel unable to satisfy the page fault.

Other code/entities shrinking files on you is really very far from what one usually means by "casting pointers and undefined behavior", though. It applies to any programming language. Even "safe Rust" doesn't block this problem.

Another, similar aspect of your follow-up belief really depends upon the OS/sysadmin. For example, many Linux systems default to overcommitting memory. This means you could malloc and memcpy and still cast a (packed) struct, but if that memory were swapped out and the system is low on real physical RAM when the swap-in happens, you might segfault. Heuristics like out-of-memory killers are just that: heuristic, not precise. If the data is on an actually persistent backing store, though, the kernel can always page it in. So, in some ways the mmap read-only approach is safer than the alloc & copy approach (if you know that nothing will shrink the file on you post-fstat).

Cheers


Seems a bit overly simplistic and not realistic. Most real-life data wouldn't be ASCII-based. I deal with a lot of large CSVs and do a lot of processing (multi-terabyte CSVs). Null bytes, invalid Unicode, all kinds of issues are very common in real-world data, not to mention improperly formatted CSVs (unclosed quotes, etc.).


I don't know, huge amounts of real-world data are ASCII in my experience, but it is also easy to program and cheap at runtime to scan ahead in your input to check if the high bit is universally clear. Current computers can do this at absurd speeds (~64B/cycle or ~200GB/s).
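Something like the following sketch (SSE2; a real implementation would use wider vectors, alignment, and unrolling to approach those speeds):

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>  /* SSE2 */

    /* Sketch: returns nonzero iff every byte in the buffer has its high bit
     * clear (pure ASCII). ORs 16-byte blocks together and checks the sign
     * bits once; a scalar loop handles the tail for simplicity. */
    static int is_ascii(const uint8_t *buf, size_t len)
    {
        size_t i = 0;
        __m128i acc = _mm_setzero_si128();
        for (; i + 16 <= len; i += 16)
            acc = _mm_or_si128(acc, _mm_loadu_si128((const __m128i *)(buf + i)));
        uint8_t tail = 0;
        for (; i < len; i++)
            tail |= buf[i];
        return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
    }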


The author seems to assume that the bit patterns representing commas or quotation marks won't show up in any multibyte (not just Unicode) encodings and that each byte can be interpreted in isolation, but is that actually safe?


Yes, assuming the input is valid UTF-8.

In UTF-8, bytes less than 0x80 are only used for the first 128 Unicode codepoints. Characters that need multiple bytes in the UTF-8 representation encode into bytes with the high bit set, i.e. >= 0x80: https://en.wikipedia.org/wiki/UTF-8#Encoding


For an arbitrary multibyte representation, no. However, I'd say it's safe to just say "all encodings must be valid UTF-8" and require the user/service to validate that first.

There are a number of algorithms out there that can validate UTF-8 with significantly less than 1 instruction per byte. I'd imagine the overhead of pipelining the two passes is significantly cheaper than trying to handle all the cases in a single pass.


This is an excellent post and I'm excited that my post on handling CSVs in AWK[1] at least somewhat inspired it.

Some responses here are missing the idea that csvquote enables the use of CSV files with standard UNIX tools like AWK, head and so on. If you are building some data processing pipeline then converting to a binary format makes sense, but I'm not sure that is where tools like this shine either. It's for a different case.

[1] https://earthly.dev/blog/awk-csv/


That was a really fun and accessible read, thanks



