

Show HN: Parsing CSV files with GPU - antonmks
https://github.com/antonmks/nvParse

======
andrewguenther
This title is incredibly misleading.

* This isn't parsing a CSV; this is a program written to split this exact dataset. (The code is filled with hard-coded values.)

* You're comparing a single-threaded run on a low-end CPU to a top-tier GPU.

* Your dataset can fit into GPU memory.

* There is a pull request for a missing semicolon, which means the posted version of the code won't even compile, so couldn't have been the version used to generate the benchmarks.

* The amount of branching in the GPU code makes it hard for me to believe that it actually ran that fast. GPU parallelism does not work well with branching, since all cores in a warp must execute in lock-step; if you branch, the hardware has to go back and execute each branch path separately (see the sketch below).
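
A minimal CUDA sketch of the issue (illustrative kernel, not code from
nvParse):

    // Threads in a warp execute in lock-step. If buf[] mixes delimiters
    // and payload bytes within one warp, the warp runs BOTH branch
    // bodies, masking off the threads that didn't take each path.
    __global__ void classify(const char *buf, int *field_id, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (buf[i] == ',') {
            field_id[i] = -1;            // delimiter: no field
        } else {
            field_id[i] = buf[i] - '0';  // payload: stand-in for real work
        }
    }

(Trivial branches like this one may compile down to predication, but with
heavier per-branch work a diverged warp pays for both paths.)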

~~~
antonmks
Sorry if you were misled :-) The program does parse selected fields, and there
are string-to-binary conversion procedures.

I tested this approach on multi-terabyte files; take a look at my Alenka
project, which uses the same method to load large CSV files into databases. It
just has to be done in chunks.

The program compiles fine; that pull request was referring to an incorrect
earlier version of the test.cu file.

Test it for yourself, see if you get similar results.

------
castratikron
I'm skeptical of the 8x speedup for several reasons, the main one being that
this particular problem does not fit the paradigm of problems that work well
on the GPU; the GPU cache is not used at all, and there are also many
branches. You need to be able to use the cache of the GPU in your application,
otherwise your performance is guaranteed to be memory-bound. The reason you
want to avoid branches is that there is only one control unit per group of
cores on the GPU, which means that if some threads follow one branch they will
have to stall until the other threads complete. Generally the only code that
maps well to the GPU is that which contains large for loops and has good
spatial locality (e.g. matrix multiplication).

The author is comparing a GPU to a CPU, yet the CPU is only running a single
thread (supposedly, the author did not provide the CPU code used in the
comparison). For a true comparison the full capability of the CPU should be
exposed by means of a multithreaded application (and, as someone else has
already mentioned, vector instructions such as SSE). Think performance per
socket, not performance per thread.

~~~
pcwalton
> Generally the only code that maps well to the GPU is that which contains
> large for loops and has good spatial locality (e.g. matrix multiplication).

You also need high _arithmetic intensity_ (the ratio of arithmetic/logical
operations to memory loads). Of common CPU-bound tasks, CSV parsing has one of
the lowest arithmetic intensities imaginable.
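
To put rough, illustrative numbers on that:

    arithmetic intensity ~= arithmetic ops / bytes moved to and from memory

    n x n matrix multiply: ~2n^3 flops over ~3n^2 values -> grows with n
    CSV delimiter scan:    ~1 compare per byte loaded    -> stuck near 1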

~~~
castratikron
Right, maybe I should have said that instead. In my mind locality implies
arithmetic intensity, but intensity may not necessarily imply locality.

------
kazinator
Though there is no standard definition of CSV, de facto processing it properly
requires recognizing quotes, and also escapes of literal quotes using double
quoting:

        this, "is, like, CSV", "with three so-called ""fields"""

Note that unquoted leading and trailing whitespace, and whitespace around the
commas, is deleted, too.

(See the CSV page on Wikipedia.)

A GPU-accelerated string split could be useful but it's not quite "parsing
CSV".

~~~
haberman
This is a perfect example of how text parsing is really inherently non-
parallelizable. It's very rare that you can do anything useful with a buffer
of text without knowing the precise state of the parse at the beginning of
that buffer.

The kinds of patterns that would make parsing more parallelizable, like
marking the beginning of a delimited region with its length, are human-
unfriendly, so they would never be part of an actual text format. Who would
ever want to write this?

        # Update "19" whenever string length changes.
        x = {19}"String of length 19"

~~~
pcwalton
> This is a perfect example of how text parsing is really inherently non-
> parallelizable. It's very rare that you can do anything useful with a buffer
> of text without knowing the precise state of the parse at the beginning of
> that buffer.

There are two mechanisms that are usually used to get around this:

(1) Perform a fast, sequential "skeleton parsing" pass before the main parse
that scans _just enough_ to find "split points" that are consumed by the
parallel parser. This is what some of the parallel XML parsing work [1] did.

(2) Guess the state you're in based on some heuristics, and roll back on
failure. This actually works surprisingly well in practice for many grammars,
for example HTML [2].
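
For CSV, a sketch of idea (1) might look like this (assuming the cheap pass
only has to track the in-quotes state):

    #include <string>
    #include <vector>

    // "Skeleton" pass: a cheap sequential scan that finds record
    // boundaries (newlines outside quotes). Each range between
    // consecutive split points can then be parsed in parallel.
    std::vector<size_t> find_split_points(const std::string &buf) {
        std::vector<size_t> splits;
        bool in_quotes = false;
        for (size_t i = 0; i < buf.size(); ++i) {
            if (buf[i] == '"')
                in_quotes = !in_quotes;  // "" toggles twice, a no-op
            else if (buf[i] == '\n' && !in_quotes)
                splits.push_back(i);
        }
        return splits;
    }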

[1]:
[http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=410047...](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4100476&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4100476)

[2]:
[http://www.cs.wm.edu/~xshen/Publications/taco14.pdf](http://www.cs.wm.edu/~xshen/Publications/taco14.pdf)

~~~
haberman
> (1) Perform a fast, sequential "skeleton parsing" pass before the main parse
> that scans just enough to find "split points" that are consumed by the
> parallel parser.

I'm not able to access the full text of this paper. But from the description I
wouldn't really consider this "parallel parsing." For the "skeleton parser" to
be correct, it must transition through a state machine that is exactly as
complex as the real parser. I suspect (again, not being able to read the paper
right now) that what makes the "skeleton parse" faster than the "real parse"
is not the speed of the parser itself, but the speed of the "load" on the
parser.

For the skeleton parse, the "load" on the parser is just finding split points
(cheap). At the application level, the "load" on the parser in many cases is
building a tree of some sort. The tree-building is often significantly more
expensive than the parse itself because it usually involves a lot of dynamic
memory allocation.

So yes, if you do a preliminary parse that chunks up the document, and then a
second parse that has a heavier load on it like tree-building, the second
parse can indeed be parallelized. But I wouldn't consider this parallelizing a
parser, I would consider it parallelizing the tree-building. In raw terms you
have probably spent _more_ CPU on the actual parsing logic than in the single-
threaded case.

I don't mean for this to be a semantic quibble. I'm really interested in
parsing architectures that decouple the parser itself from its "load." Event-
based parsers like SAX parsers do this. I'm interested specifically in the
parser part itself, and the limits of how it can be optimized.
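
A tiny sketch of that decoupling (hypothetical SAX-style interface, unquoted
CSV only):

    #include <cstddef>

    // The parser only emits events; the "load" (tree building, counting,
    // ...) lives entirely in the handler.
    struct CsvHandler {
        virtual void on_field(const char *data, size_t len) = 0;
        virtual void on_record_end() = 0;
        virtual ~CsvHandler() {}
    };

    void parse(const char *buf, size_t len, CsvHandler &h) {
        size_t start = 0;
        for (size_t i = 0; i < len; ++i) {
            if (buf[i] == ',' || buf[i] == '\n') {
                h.on_field(buf + start, i - start);
                start = i + 1;
                if (buf[i] == '\n') h.on_record_end();
            }
        }
        if (start < len) {                   // trailing field, no newline
            h.on_field(buf + start, len - start);
            h.on_record_end();
        }
    }

Swapping the handler changes the load without touching the parsing loop,
which is what makes the parser measurable in isolation.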

> (2) Guess the state you're in based on some heuristics, and roll back on
> failure. This actually works surprisingly well in practice for many
> grammars, for example HTML [2].

Looks like an interesting paper, I'll have to dig more into that.

------
paulmd
I kinda suspect he might be measuring the time it takes to launch a kernel
rather than the time it takes the kernel to complete.

Thrust device calls, like those of the underlying CUDA library, are
asynchronous by default. The only exception is calls that result in a memcpy,
which are synchronous. To wait until an async call is completed you need to
call one of the synchronize commands, like cudaDeviceSynchronize.

Looking through his test.cu file, he snaps a timestamp using std::clock right
after doing the kernel launch with for_each. Ignoring the fact that this is
not an accurate way to benchmark a GPU (you need to use events to accurately
benchmark the kernel) what you're capturing will just be the processor time it
takes to make the async kernel launch. std::clock measures CPU time, which is
(rightly) close to 0 for a program that runs on the GPU.

It's entirely possible that you're not even getting valid results out of the
other end - note that you don't show output. I don't know if thrust's magic
device memory access function triggers a synchronization or not. I kinda
remember having to make an explicit call when I did a GPU simulation.

I don't have access to a CUDA box at the moment; I'd have to add those
cudaDeviceSynchronize calls after the for_each invocations to be sure.
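
For reference, the usual way to time a kernel with CUDA events (sketch with a
stand-in kernel, not the repo's code):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(int *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2;               // stand-in for the real work
    }

    int main() {
        const int n = 1 << 20;
        int *d;
        cudaMalloc(&d, n * sizeof(int));

        cudaEvent_t start, stop;            // events are timestamped on the GPU
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        work<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);         // block until the kernel finishes

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel: %.3f ms\n", ms);

        cudaFree(d);
        return 0;
    }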

~~~
antonmks
Thrust CUDA calls are synchronous with respect to each other. You can add an
explicit synchronization call, cudaDeviceSynchronize(), and there won't be any
difference in the results.

------
roel_v
I don't understand the benchmark. How can a 750GB file be read from disk in
1.5 seconds, let alone be parsed? He mentions it's a 2TB drive, so presumably
it's not even an SSD?

~~~
bbrks
The README has just been updated to change 750GB to MB.

[https://github.com/antonmks/nvParse/commit/fab4c4728096003bc...](https://github.com/antonmks/nvParse/commit/fab4c4728096003bcbe2fa98ffa2e10aa50e5603)

------
victorNicollet
Very interesting. From my experience, the hard part about parsing CSV isn't
identifying the individual cells, but rather parsing those cells afterwards
(as numbers, dates, etc).

What is the performance of those operations (e.g. parsing YYYY-MM-DD dates to
Unix timestamps) when performed on the GPU?

My company actually picked another optimization strategy: the tokenization
takes significantly longer, but it de-duplicates the tokenized cells so that
each distinct cell value (a date, a number, a string) is parsed exactly once.
We have seen some fairly good results from this, compared to the naive
stream-token-parse approach:

[https://github.com/Lokad/lokad-flatfiles](https://github.com/Lokad/lokad-flatfiles)
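
A minimal sketch of that idea (illustrative only, not the lokad-flatfiles
implementation):

    #include <cstdlib>
    #include <string>
    #include <unordered_map>

    // Memoize the string -> value step so each distinct cell is parsed
    // exactly once.
    double parse_cell(const std::string &cell,
                      std::unordered_map<std::string, double> &cache) {
        auto it = cache.find(cell);
        if (it != cache.end()) return it->second;  // seen before: reuse
        double v = std::strtod(cell.c_str(), nullptr);
        cache.emplace(cell, v);
        return v;
    }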

~~~
TheLoneWolfling
Makes sense for low-entropy data. Though I can see that approach choking on
some datasets. What happens if every entry has a GUID, for example?

May be better to do a best-effort deduplication instead of an exhaustive
approach.

~~~
victorNicollet
At some point, we considered a tweak: dropping any strings longer than a
certain length from the deduplication (it also helps with memory usage when
streaming the data).

Our method makes most sense for many-to-many data (several orders per product,
several orders per day), which happens to be the largest data sets we
manipulate (by 3 orders of magnitude). I can certainly see situations where
this would not be the case (e.g. web crawler logs).

------
mholt
I gotta admit, having written a high-speed, multi-threaded streaming CSV
parser for the browser[1], I'm quite impressed by your project. I've done CUDA
programming before and it's not easy (although libraries do help). Good work!

[1] [http://papaparse.com](http://papaparse.com)

~~~
pestaa
I just used Papa at the office today. Remembered it from a while ago, it
really turned out to be effortless. Excellent work.

(While I'm at it, may I suggest to provide CDN URLs on the website? IMO it
would make the otherwise awesome page perfect.)

Thank you again.

------
hannibalhorn
For file sizes where parsing speed really makes a difference, loading the
entire file into memory doesn't seem feasible. Maybe some sort of a hybrid
approach (load chunks into memory and parse them via the GPU) would provide
some real benefits though.
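
A rough sketch of the chunking (hypothetical sizes; the subtlety is carrying
the partial last line over to the next chunk):

    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Read fixed-size chunks, keeping the trailing partial line as a
    // carry-over so each chunk handed to the GPU ends on a '\n'.
    // (Assumes no single line exceeds the chunk size.)
    void read_in_chunks(const char *path, size_t chunk = 64 << 20) {
        FILE *f = fopen(path, "rb");
        if (!f) return;
        std::vector<char> buf(chunk);
        size_t carry = 0, n;
        while ((n = fread(buf.data() + carry, 1,
                          buf.size() - carry, f)) > 0) {
            size_t total = carry + n;
            size_t end = total;
            while (end > 0 && buf[end - 1] != '\n') --end;
            // upload_and_parse(buf.data(), end);  // hypothetical GPU step
            carry = total - end;
            memmove(buf.data(), buf.data() + end, carry);
        }
        fclose(f);
    }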

~~~
bane
And if you're going to load it, you may as well parse it on the way in.

~~~
victorNicollet
A streaming CSV parser is very difficult to get right, once you get past a
certain level of complexity.

Sometimes, you are not lucky enough to have perfect control over the encoding,
number format and date format of the input, so you need to look ahead at a
value sample to try and find out what those are.

Sometimes, you cannot even assume that the software that produced the file
didn't mangle the quotes around fields, and you have to detect that you are
40960 bytes into a quoted field, decide that it's probably an error, and
backtrack.

If you have enough memory to load the entire file, you will save a lot of time
by giving up on streaming processing.

~~~
maxhou
If you use mmap(), there is virtually no difference between the streaming and
non-streaming parsing code.
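
e.g. (POSIX, minimal error handling, "data.csv" as a placeholder):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("data.csv", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        // The kernel pages the file in on demand, so the same
        // pointer-based parsing code works streamed or fully cached.
        const char *data = (const char *)mmap(nullptr, st.st_size,
                                              PROT_READ, MAP_PRIVATE,
                                              fd, 0);
        // parse data[0 .. st.st_size) as one big buffer here
        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }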

------
avarsheny
I don't think CPU vs. GPU is going to make much difference here. Your speedup
in the GPU run could be due to a hot file system cache. Try running the GPU
version first and the non-GPU version after that, and please post the results.

~~~
antonmks
I used the average time over multiple runs; both the GPU and CPU benchmarks
ran from memory.

------
WhitneyLand
I like your creativity in applying the GPU to a less typical task.

Is it the case that the CPU version could be sped up dramatically by using
multiple cores and a variation on the line splitting technique?

~~~
hvidgaard
You could use a lock-free queue, fill it, and spawn a suitable number of
consumers that put the parsed data back into a new queue. If the data order
matters, then some extra boilerplate is needed, and I do not know how it's
done on the GPU, or if it's needed at all.
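
A rough shape of that pipeline, with an ordinary mutex-guarded queue standing
in for a lock-free one:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    std::queue<std::string> q;       // raw lines from the reader
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void consumer() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            if (q.empty()) return;   // done and drained
            std::string line = std::move(q.front());
            q.pop();
            lk.unlock();
            // parse(line);          // hypothetical per-line work
        }
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(consumer);
        for (int i = 0; i < 1000; ++i) {   // stand-in producer
            { std::lock_guard<std::mutex> lk(m); q.push("a,b,c"); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        for (auto &t : pool) t.join();
        return 0;
    }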

Seeing that it takes 14.5s with the hand-rolled code to parse 750GB, however,
I doubt the optimization is needed: that's more than 50GB every second, and
you need some really exotic hardware to generate that much data. You still
need to do something with the data, and that is likely significantly more
computationally intensive; it should be on a different thread.

That said, using the GPU would free up the CPU to do meaningful work with the
data. But to be fair, assuming a 1GB/s data rate, it still takes 12.5 minutes
to read it all, and you only gain 14.5s, or less than a 2% speedup, if you're
CPU bound.

~~~
antonmks
It should be 750MB, not 750GB. There is no way a single hard drive can read
750GB in 14 seconds.

~~~
cjg_
Heh, you can't even DMA 750GB from memory to the graphics card (assuming
16GB/s, using 32 lanes) in 14s.

------
rdc12
"However this method of parsing is CPU bound because * it doesn't take
advantage of multiple cores of modern CPUs. * memory bandwidth limitations"

So not CPU bound at all then.

------
snissn
I think it would be really cool to integrate a GPU into a database to speed up
certain operations if an optimizer can decide that certain parts of a query
will benefit from it. The thrust docs [0] indicate that the C++ gpu library
can be used effectively for sorting, I wonder if sorts on non-indexed fields
can be sped up by attaching a GPU to my database!

[0] [http://thrust.github.io/](http://thrust.github.io/)
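
e.g., sorting a column on the device is a one-liner with Thrust:

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>

    int main() {
        thrust::device_vector<int> col(1 << 24);   // stand-in for a column
        thrust::sequence(col.begin(), col.end());  // fill with 0..n-1
        thrust::sort(col.begin(), col.end(),
                     thrust::greater<int>());      // descending, on the GPU
        return 0;
    }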

------
maxhou
In the real world, the slow part of "parsing" a CSV file is IO: reading the
file content from disk to memory, and from memory to CPU cache.

You would want to avoid reading the file content more than once when parsing
it.

> The first line counts the number of lines in a buffer (assuming that file is
> read into memory and copied to gpu buffer d_readbuff).

But this is what is done here: first a search to find all \n characters, then
multi-core GPU work on each line's content.
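
That first step might look something like this (a sketch; the repo's actual
code differs):

    #include <thrust/count.h>
    #include <thrust/device_vector.h>

    // Count lines by counting '\n' bytes in the device-side read buffer
    // (a stand-in for nvParse's d_readbuff).
    size_t count_lines(const thrust::device_vector<char> &d_readbuff) {
        return thrust::count(d_readbuff.begin(), d_readbuff.end(), '\n');
    }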

~~~
azurezyq
Things have changed in a world of SSDs and machines with lots of memory. In
practice, parsing a CSV in a single thread will never reach several hundred
megabytes per second.

------
ape4
In the real world... read the CSV into a database (doesn't really matter how
fast or slow it is). Access the data from the database.

~~~
dilap
In the real world, sometimes you have an external data source, in CSV (which
you can't control), which is changing, which you want to reread as quickly as
possible. :)

~~~
ldng
You then use the PostgreSQL Foreign Data Wrapper (aka SQL/MED).

------
aftbit
Am I crazy? With a hot disk cache, the cut command he gave takes 1.2 seconds
on my machine.

