Hacker News new | past | comments | ask | show | jobs | submit login
Loading CSV File at the Speed Limit of the NVMe Storage (liuliu.me)
198 points by todsacerdoti on Oct 10, 2020 | hide | past | favorite | 66 comments



Thank you for this article!

I wrote a CSV parser in Javascript for my site https://www.csvplot.com.

Even the "industry standard" PapaParse JS library was a lot slower than my no-frills implementation, so I thought I was onto something.

Then I read the "So You Want To Write your own CSV Code" [0]. I realized if I supported every corner case I would lose my entire speed advantage, which in hindsight is obvious. That newlines can exist inside a column if quoted ( Something like "my \n value with newline") is specifically what caused the most issues. Incidentally the article also points at that this ruins the possibility of processing each row independently.

I started reading about highly efficient text parsing, but most of what I could find was dedicated to JSON parsing, with simdjson as one example. I briefly considered using webassembly to leverage existing fast implementations, but that never happened. My motivation for this specific side project ran out the minute it was fast enough for my own needs.

[0] http://thomasburette.com/blog/2014/05/25/so-you-want-to-writ...


I wonder if you can speculatively assume there are no line ends in quoted blocks and then try to go fast and then fall back to a slower method if you detect that to be the case.

Sort of like branch prediction where a failed prediction is costly but on average you are right.


How do you detect it without ruining the performance of the fast path? Seems like you need to make some kind of assumption. Perhaps in the number of columns in each row. But then what happens if you need to parse data with a different number of columns in some row? (Which is not that uncommon.)

The problem with csv is that many people consuming it don't actually choose that format. They have to parse what they are given.


Assume all linebreaks aren't part of quoted strings.

Split the data in linebreaks.

Process each line in parallel (massive speedup, can use many cores or even GPU).

Record "errors" (ie. Where a quoted block doesn't end before the newline)

Remove and reprocess all records containing errors or following a record with errors. Repeat until no errors exist.


Yeah, you kind of fell right into the trap I was talking about. Your approach either requires being able to memory map the CSV data or storing the CSV data in memory. And to be honest, it's not clear to me that your approach would end up being faster anyway. There's a lot of overhead buried in there, and it sounds like it's at least two passes over the data in common cases. And in cases where there are line breaks in the data, your approach will perform very poorly I think.


The parser from the article's performance is closer to 360mb/sec, the 2gb/sec is only achieved by over-allocating an incredible amount of hardware. CSV parsing is such a simple problem there is no reason to waste a whole machine on it, and in many contemporary scenarios (AWS Lambda), it's simply not possible to waste a whole machine on it.

I never finished packaging and writing tests, but https://github.com/dw/csvmonkey hits 1.9gb/sec on certain data sets with one thread, and that implementation is not even representative of the limits of single thread performance. https://github.com/geofflangdale/simdcsv follows a similar approach to simdjson, where multiple passes over the input are used, but with lower latency vector instructions (csvmonkey used one of the slowest). Character set decoding is also fused into the parser. There are no benchmarks published in that repo, but I expect the approach trounces my effort even if the current implementation doesn't.

Why be satisfied saturating one NVMe drive when a few threads could saturate the PCIe bus instead?


> In C++, this can be done in zero-copy fashion with string_view. In C, every string has to be null-terminated. Thus, you need to either manipulate the original buffer, or copy it over. I elected the latter.

Is the null-termination requirement due to returning the value of cells as C strings? I couldn't find an easy answer looking through the provided code. Otherwise if its just an internal detail, I would think a (pointer,length) string type would be ideal to allow you to use the buffer without modification or copying.

> It is about 2x slower than csv2. This is expected because we need to null-terminate strings and copy them to a new buffer.

I think it is strange to settle for a 2x slowdown in order to not modify the original buffer. Substituting a null byte for the separator / newline should be simpler than copying and null-terminating. So is keeping the original buffer free from modifications important? In most cases preserving the original buffer seems unimportant.


My assumption is that support for C strings (null terminated) was a design requirement.

Note that just replacing separators and newlines with nulls isn't enough to parse CSVs because the file format supports escape sequences and optional quotes around values. If you're modifying a buffer in place, you'll still be stuck copying data to replace (longer) escape sequences with their (shorter) final values.

I don't understand the line you've quoted from the article regarding why this code is slower than csv2. The csv2 code seemingly does copy from the original buffer rather than manipulating in place. Indeed, the csv2 parser copies data from a buffer into a C++ STL container object, which I would expect to be quite an expensive operation. Something does not add up here, either in my understanding or in the mechanics/description of the benchmark.


> Note that just replacing separators and newlines with nulls isn't enough to parse CSVs because the file format supports escape sequences and optional quotes around values. If you're modifying a buffer in place, you'll still be stuck copying data to replace (longer) escape sequences with their (shorter) final values.

I had forgotten that escape sequences needed to also be changed. Unescaping in place does make it non-trivial.

> I don't understand the line you've quoted from the article regarding why this code is slower than csv2. The csv2 code seemingly does copy from the original buffer rather than manipulating in place. Indeed, the csv2 parser copies data from a buffer into a C++ STL container object, which I would expect to be quite an expensive operation. Something does not add up here, either in my understanding or in the mechanics/description of the benchmark.

I'm not familiar with the csv2 parser, but if what you say is true, I guess the slowdown would come from doing a csv2 doing a larger copy(s) and the given code doing many smaller copies.


> I had forgotten that escape sequences needed to also be changed. Unescaping in place does make it non-trivial.

Assuming escapes are rare, backshift unescaping is totally an option, though it scales poorly (interpreting escapes is linear, backshifting is quadratic) and is thus susceptible to slowdowns with malicious inputs. It does need very little code and no extra memory at all, though.


Backshifting is not quadratic :-)

  dst = src = ptr;
  while(*src != '\0') {
    if(src[0] == '"' && src[1] == '"') {
      src++;
    }
    *dst++ = *src++;
  }


This bothered me a little, so I went and dug out the only instance were I remember using this technique, and it's quite literally what you wrote. All these high level immutable languages must be getting to me. Thanks :)


Re csv2 code: if you look at the benchmark, they really just loop through each cell, without accessing the content: https://github.com/p-ranav/csv2/blob/master/benchmark/main.c...

Thus, the code triggered in the benchmark path is on: https://github.com/p-ranav/csv2/blob/master/include/csv2/rea... and https://github.com/p-ranav/csv2/blob/master/include/csv2/rea...

It doesn't access any of the cell content during the benchmark.


"If you're modifying a buffer in place, you'll still be stuck copying data to replace (longer) escape sequences with their (shorter) final values."

You can pad, no? What comes after the NUL is ignored in C.


If you don’t modify the original buffer then it can me an mmapped file..?


When reading mostly sequentially, is a memory mapped file any faster than reading into a buffer? The data is presumably still copied into memory, it's just done on a page fault rather than an explicit function call.


Apparently there is one less memory copy when using mmap, since read() copies from the intermediate buffer that mmap provides access to. Neat!

https://medium.com/@sasha_f/why-mmap-is-faster-than-system-c...


I was thinking about using MAP_PRIVATE to prevent write-back. I not sure that doesn't come with a performance penalty on write, but it should at least be better copying in userspace.


That would likely trigger the “copy” part of copy-on-write and not be particularly fast.


Do you have any benchmarks that show that? It seems plausible, but a lot of COW systems optimize for the case where there is only 1 reference and modify in-place. I don't MMAP much, but it would be nice to know if Linux doesn't do the optimization I think is possible.

In this particular case, I would hope it only copies to memory from disk. If only one program is accessing a file, I would expect the "copy" of copy-on-write to be the copy to memory.


Sadly I do not, and I’m on a device running Darwin that I can’t readily do benchmarks on :( But yes, if you are the exclusive owner of a page you won’t get an extra copy. However, you’re still not really going any faster, as this is just the same as the kernel giving you the file in a buffer that you can freely modify…


It would be great if a comparison to the rust csv library or xsv [1] could be added - it's single threaded, but it has a zero copy option as far as I know. I've used it to parse / filter large PostgreSQL log files (20GB+) with great success.

[1] https://github.com/BurntSushi/xsv


As the author of xsv, I was surprised to see this comparison missing too.

The other bit to mention is `xsv index`. If you can afford to make one single threaded pass over the CSV data, then future operations truly do become embarrassingly parallel. It's a simple low-tech solution, although the UX is worse because you need to take a manual step.

As for zero copy, could you say where you saw that? I don't think it makes sense to describe xsv that way. The csv parser does have a lower level zero allocation API, but there is no zero copy API.


Yeah, I meant both the ByteRecord/read_byte_record API, as well as using Serde with a struct containing &str or &[u8] references.


Ah yeah, those are zero allocation. A copy is always made to account for quoting/escaping. You can implement a zero copy csv parser, but at some point you have to deal with quoting. The csv crate chooses to eagerly deal with it, always.


Many people wouldn't have fuzzed such a simple parser...yet libFuzzer found issues that were fixed.

Parsing is hard folks. All parsers should be analyzed statically and fuzzed at a bare minimum.


I love seeing that as the standard-of-care. Really, really good practice.


Unfortunately every time I have seen someone write their own "CSV" parser or serializer all they did was split(',') or ','.join()...


My desktop computer has 64GB of memory, 1TB NVMe (SUPER fast sequential read) storage, and my Ryzen has 12 cores. Yet most of this sits unutilized, even while doing heavy workloads. There's a lot of extra performance just sitting out there if we throw the book of default assumptions about hardware away.


What do you mean by throw default assumptions about hardware away? It is just sitting unused, but don't we need apps that need or can use that performance? I only ever use full cpus while rendering and a lot of memory can be needed there, but really not so much. The stuff that can use full usage is designed that way and the stuff that doesn't can't probably get much faster spread over cores?


To give a concrete exemple in gaming.

Mantle, followed by Vulkan and all, threw out the assumptions about cpu/gpu starving that were made in the 90s to adapt to modern hardware, extracting sometime massive performance boost "simply" by changing the API.

Right now you have RTX IO (nvidia) and DirectStorage (direct x) who aime to extract massive disk IO performance boost from modern NVMe drive for gaming by throwing away assumptions made about them in pre-ssd times and providing a new API.

These change often come down to the same thing: the way we call them, the amount of locking and round trips we do, synchronous or not, the connecting interfaces and their own speed limits, ... Rules made in an era were drive were X orders of magnitude slower than RAM start limiting us when X has been divided by two or three.


I followed Vulkan benchmarks in the beginning and therw weren't massive speedups in shipping games, has this changed?


The fact that it is sitting idle means it is doing a good job at handling your processing needs. Imagine if it wasn't fast enough (e.g. in deep learning). Waiting for your computer isn't fun.


shameless insertion:

"raw csv processing at ~20GB/s" has been demonstrated in one my project as the tooling byproduct[1] based on Rayon(great Rust data parallelism library)[2] and wrapping of modified simdcsv[3] into one simple .rs.

just a little more:

1. simdcsv has severe bugs, so do not use it beyond demo.

2. the processing model of simdcsv still has rooms to good improvements(estimated 2x more, a.k.a. ~40GB+/s in memory in single modern socket should be achievable in some scenarios(no heavy string to complex language object conversions)).

[1] https://tensorbase.io/2020/08/04/hello-base.html#benchmark

[2] https://github.com/rayon-rs

[3] https://github.com/geofflangdale/simdcsv


> simdcsv has severe bugs, so do not use it beyond demo

Are these bugs documented anywhere? A glance at the issue tracker doesn't seem to reflect this claim.


Neat! I worked on a project where CSV parsing was a significant bottleneck (https://openaddresses.io/, it processes street addresses). Roughly 50% of the processing in the time is spent parsing CSV and decoding to Unicode. I dinked around a bit and tuned it to read at about 30 MB/second and figured that was good enough. So this is a 100x speedup in reading (assuming all the stars align) which would cut that 50% down to nearly zero. Just as soon as it works transparently with a single Python import statement I'll use it ;-)


Would a change of representation (away from CSV) significantly change parsing speed ?


That's a good question! The new V4 platform https://batch.openaddresses.io/data outputs data as line delimited GeoJSONs, but haven't gotten time to test throughput in any common tools other than jq


Modern NVMe reads can reach over 7GB/s (like Kingston DCP1000 or 980 Pro), so this is not at the limit yet.


I'm a little surprised that the IO performance in the article is so poor. I think it is because they effectively have at most 1 operation outstanding per thread. Thus there is no benefit to having the MD setup. I think they would get the same performance with a single SSD.


I put together some hacky shell scripts as a poc for grep. I wouldn’t use it for anything other than a test of concept.

https://github.com/gpapilion/pgrep

So in order to make it work I used temp files to paper over some correctness and ordering issues.

So the initial tests showed a 2-3x speed up on a crappy laptop ssd. Of course my usage of temp files made this much slower on multiple runs because of the page cache where traditional grep beat the approach I took because of memory speed. It wasn’t able to beat pre spout files though.

Most of these tools were designed expecting disks to roughly behave serially. With Ssds and raid devices you can achieve more parallelism than was imagined in the 70s 80s and 90s.

I think it’s an area that is ripe for more innovation.


The article talks about processing the CSV file in memory, not about loading it.

What am I missing?


It does load it - mmap doesn't copy the file content into a buffer, it merely allows you to operate on a file as if it were in memory. Memory reads correspond to file read operations.


Sort of. mmap absolutely copies the file contents into the kernel file system cache which is a buffer, it just lets you map the filesystem cache into your address space so you can see it. And memory reads don't translate to file reads unless not in the cache already.


> mmap absolutely copies the file contents into the kernel file system cache which is a buffer

Isn't this a bit misleading? mmaping a file doesn't cause the kernel to start loading the whole thing into RAM, it just sets things up for the kernel to later transparently load pages of it on demand, possibly with some prefetching.


The GP's comment on the other hand seemed to imply it was completely unbuffered.


"Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference. About the best you can do in general is not do any buffering yourself, and usually explicitly bypass whatever buffering is going on at the next level of abstraction down. Ensuring you've cut out all buffering in the entire IO stack takes real effort.


> "Completely unbuffered" is almost unattainable in practice, so I'm not sure that's a reasonable inference.

While absolutely true, I've found that fact to be very surprising to a lot of engineers.


But I think the important part is that the file starts on disk and ends parsed. The rate of that was NVME limited (per article).


> The rate of that was NVME limited (per article).

The article shows that he's getting half the throughput of parsing a CSV that's already in RAM. But: he's using RAID0 of two SSDs and only getting a little more than half the throughput of one of those SSDs. As currently written, this program might not be giving the SSDs a high enough queue depth to hit their full read throughput. I'd like to see what throughput is like with an explicit attempt to prefetch data into RAM (either with a thread manually touching all the necessary pages, or maybe with a madvise call). That could drastically reduce the number of page faults and context switches affecting the OpenMP worker threads, and yield much better CPU utilization.


I thought queue depth related to supporting outstanding/pending reads. For a serial access such as csv parsing, what would you do other than having a readahead - somehow, see my other question - which would presumably maintain the queue depth at about 1.

Put another way, what would you do to read in the CSV serially to increase speed that would push the queue depth above 1?


For sequential accesses, it usually doesn't make a whole lot of difference whether the drive's queue is full of lots of medium-sized requests (eg. 128kB) or a few giant requests (multiple MB), so long as the queue always has outstanding requests for enough data to keep the drive(s) busy. Every operating system will have its own preferred IO sizes for prefetching, and if you're lucky you can also tune the size of the prefetch window (either in terms of bytes, or in terms of number of IOs). Different drives will also have different requirements here to achieve maximum throughput; an enterprise drive that stripes data across 16 channels will probably need a bigger/deeper queue than a consumer drive with just 4 channels, if the NAND page size is the same for both.

However, optimal utilization of the drive(s) will always require a queue depth of more than one request, because you don't want the drive to be idle after signalling completion of its only queued command and waiting for the CPU to produce a new read request. In a RAID0 setup like the author describes, you need to also ensure that you're generating enough IO to keep both drives busy, and the minimum prefetch window size that can accomplish this will usually be at least one full stripe.

As for how you accomplish the prefetching: the madvise system call sounds like a good choice, with the MADV_SEQUENTIAL or MADV_WILLNEED options. But how much prefetching that actually causes is up to the OS and the local system's settings. On my system, /sys/block/$DISK/queue/read_ahead_kb defaults to 128, which is definitely insufficient for at least some drives but might only apply to read-ahead triggered by the filesystem's heuristics rather than more explicitly requested by a madvise. So manually touching pages from a userspace thread is probably the safer way to guarantee the OS pages in data ahead of time—as long as it doesn't run so far ahead of the actual use of the data that it creates memory pressure that might get unused pages evicted.


WILLNEED really just reads the entire file asynchronously into the page cache, at least on Linux.


Is that still true if the size_t parameter to the madvise call is less than the entire file size? I would think that madvise hints could be issued at page granularity and not affect the entire mapping as originally allocated.


With no limit? What if the file is huge, will it evict other things in the cache?


Yes, probably. With hindsight, it is probably a mistake to use mmap. I probably can do better to just read file myself, since I have to make a mirror buffer later for some data manipulation anyway.


That makes sense, thanks!


Well, it copies into the kernel buffer as you access it as a sort of demand paging that isn’t actually all that bad depending on what you’re doing. It’s dramatically different from a typical “read everything into a buffer” that most programs do.


General question: if mmap pulls in data as you ask it and not before, you're going to have CPU waits on the disk, followed by processing on the CPU but no disk activity, alternating back and forth. I'd assume that to be optimal is to have them both working at once, so to have some kind of readahead request for the disk. How is this done, if at all?

Edit: just seen this which kind of touches on the same https://news.ycombinator.com/item?id=24737186


Generally the OS should see if you’re doing a long sequential access and prefetch this data before you access it.


Not sure if you know how mmap works, but regardless you can't say that memory reads correspond to file reads.

There is literally no io being done on your data access paths. Synchronising mapped pages with file contents happens in background write back threads.


I wrote something like this but used a GPU to do all the heavy lifting. It was mostly just to learn how Numba's GPU support worked so I never benchmarked it. It was very fast.

I think most people don't appreciate that CSV parsing is actually a compute-bound problem because the rows and column positions are known in advance.

Overall, I'm not sure the complexity justifies the performance gains unless parsing is in your critical path.


Storing larger data sets in CSV format is a recipe for disaster. As tech industry we should really come together With a standard binary format for data exchange. Maybe Arrow?


Arrow is designed for in-memory processing. It can be saved on disk so you can open it directly (memory map) but it's not a great storage format. Parquet or ORC is a better choice, but they don't have as much tooling for import/export. CSV is just the simplest way to transfer data.

You might be interested in DuckDB though which trying to create a new standard for passing datasets: https://duckdb.org/


why not just pass around sqlite databases?


how does this compare to elite tooling like https://SHAKTI.COM ?




Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: