More related to TFA: does someone recommend a good book or other resource about modern code optimization? I have internalized several lessons but it would be nice to dig deeper.
One thing I miss though is how easy it is to inspect .csv and .xlsx. I kinda solved it using , but it only works on Windows. More portable recommendations welcome!
It's a vi(m)-inspired tool.
It also handles xls(x), sqlitedb and a bunch of other random things, and it appears to support parquet via pandas:
For example, he uses SIMD in several of these, but one of the performance problems associated with SIMD is handling the last 'n' bytes. If you're processing, say, 16 or 32 bytes at a time, what do you do when you get toward the end of your buffer and a full-width read would run past it? Most code I've seen just falls back to a simple scalar loop for the last few bytes, but this is often very slow. Agner's solution was to just read past the end of the buffer, using some assembly tricks to ensure there is always valid memory available after the end, so that over-reading would (almost) never cause a problem.
Isn't rounding up your buffer's size an allocation technique rather than an "assembly trick"?
Based on a cursory scan this is the same problem so Geoff's trick should be directly applicable.
The ideas are quite similar: using bit manipulation to detect the in and out regions. Geoff's trick does the "smearing" of the bits in a single SIMD "carryless multiply", plus some supporting stuff. It is constant time for an input word: there is no dependence on the number of bits. So it could be many times faster for inputs with very dense quotes.
For an actual implementation, you'd want to handle the case where you don't have any quotes at all, or where they are very sparse (which is common for a lot of CSV), with a fast path that bails out when it sees a quote (possibly to be resumed later if there is another quote-free run).
I implemented this at some point (the "noisy" version, in Langdale's words) and got a csv tokenizer that spent about 1.5 cycles per byte.
Doing the CSV version of the quote convention is actually wildly easier than the JSON version, as you can just treat a record like "foo""bar",hi,1,"hatstand,teakettle" as having 'left the quotes and re-entered the quotes' at the double-quote spot when you're busy looking for ','. This isn't, of course, much help for normalization, but for the bit where you're hoping to simply find which ',' characters are separators, it's fine to pretend that there's a gap in your "quoted stuff bitfield" at the "" in "foo""bar", as of course that gap isn't going to land on a ',' anyhow. So it's much cheaper than doing a tedious shenanigan to handle some crafty user who has hit you with 100 \ characters in a row (as opposed to 99 or 101), a la JSON.
IMO the cost of doing the CLMUL all the time vs taking a conditional to handle the 'I have no quotes today' case is pretty low. It makes for shorter and more easily understood (in performance terms) code. I wasn't allowed to take this to extremes in simdjson - we did handle the "common case" for UTF-8 validation (although I wonder if that's a very culturally determined notion of "common case").
Whereas in Nim, the main logic is really just about one screenful for me and should not be so hard to follow.
As commented elsewhere in this thread, but it bears (much!) repeating: a better approach is to bulk convert the text to binary and then operate off of that. One feature you get is fixed-size rows and thus random access to rows without an index. You can even mmap the file and cast it to a struct pointer if you like (though you need the struct pointer to be the right type). When DBs or DB-ish file formats are faster, being in binary is the 0th-order reason why.
The main reason not to do this is if you have no disk space for it.
To be more clear, you can mmap read-only and just cast the pointer. You probably want the struct type to be "packed" unless you are intimately familiar with your compiler's struct-padding behavior. Every C compiler I am aware of has some way to specify "packed", but it is admittedly not the most portable construct. In any event, if you are ever doing memcpy calls then you still must worry about struct layout, so the memcpy is not buying you anything.
You also need to avoid exceeding the size of the "file array", just like any array in C, but pre-mmap you usually open & fstat; the latter tells you the size of the file, and sizeof(structType) tells you the size of the struct.
So there is an implicit assumption that the underlying file is not shrunk between said fstat and whenever you access it through the virtual-memory indirection. Outright file deletion would actually be OK, but truncating it would leave the kernel unable to satisfy the page fault.
Other code/entities shrinking files out from under you is really very far from what one usually means by "casting pointers and undefined behavior", though. It applies to any programming language; even "safe Rust" doesn't block this problem.
Another, similar aspect of your follow-up belief really depends on the OS/sysadmin. For example, many Linux systems default to overcommitting memory. This means you could malloc and memcpy and still cast a (packed) struct, but if those pages were swapped out and the system is low on real physical RAM when the swap-in happens, you might segfault. Heuristics like out-of-memory killers are just that: heuristics, not precise. If the data is on actually persistent backing store, though, the kernel can always page it in. So, in some ways the mmap-read-only approach is safer than the alloc-and-copy approach (if you know that nothing will shrink the file on you post-fstat).
It turns out delimited files IMO are much harder to parse than, say, JSON, largely because they have so many different permutations. The article covers CSVs, but many files are tab- or null-separated. We’ve even seen @-separated with ‘ for quotes.
Given the above, it should still be possible to use the method described. I’m guessing you’d have to detect the separators and quote chars first, however. You’d have to also handle empty rows and corrupted rows (which happen often enough).
I have to agree, having recently done a deep-dive into the new Utf8JsonReader in .NET. Tracking state in a JSON parsing context can actually be quite elegant. CSV is a minefield of edge cases in my experience.
UTF-8 bytes less than 0x80 are only used for the first 128 Unicode codepoints. Characters that need multiple bytes in the UTF-8 representation encode into bytes with the high bit set, i.e. >= 0x80: https://en.wikipedia.org/wiki/UTF-8#Encoding
There are a number of algorithms out there that can validate UTF-8 with significantly less than 1 instruction per byte. I'd imagine the overhead for pipelining the two is significantly cheaper than trying to handle the cases in the same pass.
Some responses here are missing the idea that csvquote enables the use of CSV files with standard UNIX tools like AWK, head and so on. If you are building some data processing pipeline then converting to a binary format makes sense, but I'm not sure that is where tools like this shine either. It's for a different case.