
ParaText: CSV parsing at 2.5 GB per second - flashman
http://www.wise.io/tech/paratext
======
justinsaccount
This is impressive, but...

"A fast reader exploits the capabilities of the storage system"...

The graphs show that their storage system is doing 4.00 GB/sec.

I wonder what processor this is running on and what their storage system
is... multiple PCIe SSDs?

I tried running a quick test but only succeeded in OOMing my 8 GB laptop.

Even just doing

    
    
      import paratext
      it = paratext.load_csv_as_iterator("/dev/shm/tmp/c.log", expand=True, forget=True)
      x = next(it)  # it.next() in Python 2
    

Starts eating up all my RAM after about a minute of spinning the CPU... so I
think they have a slightly different definition of an iterator than everyone
else.

Compared to

    
    
      cut -d , -f 5 < c.log > /dev/null
    

which runs in a few seconds, or a slightly more domain-specific and optimized
version of 'cut' [1] that runs even faster (300-500 MB/s on a single core,
depending on which fields you want).
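
For reference, a naive pure-Python version of that cut invocation (a rough
sketch: like cut, it has no quoting support, and it will run far slower than
the C tools):

    
    
      import sys
      
      # naive equivalent of `cut -d , -f 5`: split each line on commas,
      # print the fifth field
      for line in sys.stdin:
          parts = line.rstrip("\n").split(",")
          if len(parts) >= 5:
              print(parts[4])
    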

    
    
      $ du -hs c.log ;wc -l c.log 
      2.1G	c.log
      16197412 c.log
    

I also wonder if that is 2.5 GB/s per core.

[https://github.com/BurntSushi/rust-csv](https://github.com/BurntSushi/rust-csv)
does 241 MB/s in raw mode, so I find it a little hard to believe that this is
10x faster... unless that is while maxing out multiple cores.

[1] [https://github.com/bro/bro-aux/blob/master/bro-cut/bro-cut.c](https://github.com/bro/bro-aux/blob/master/bro-cut/bro-cut.c)

~~~
ygra
Does cut work with quoted fields correctly? My understanding was that it was
just a dumb line tokenizer and CSV is a little more complex than that, e.g.,
CSV rows can span more than one line when fields contain line breaks.
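
For comparison, Python's csv module does handle quoting and embedded line
breaks; a quick sketch of the difference:

    
    
      import csv, io
      
      data = 'a,"hello, world","line 1\nline 2"\r\nb,c,d\r\n'
      
      # a real CSV parser sees two records of three fields each
      print(list(csv.reader(io.StringIO(data))))
      
      # a dumb line tokenizer splits the first record in two
      print(data.split("\n"))
    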

~~~
rspeer
Which means an effective thing to do when you control the data pipeline is to
ban tabs and line-breaks from your values, and then use `cut` on tab-separated
files.

~~~
Twirrim
Or just not use CSV as a data transfer format :D

~~~
rspeer
I think you're missing the point. Unix tools like `cut` are really effective
ways to deal with plain-text data; who cares if it's CSV?

------
zeveb
It really makes me sad that CSV even exists: ASCII defines field ('unit') &
record separator characters (also group & file, but those are less useful), as
well as an escape character. With those few characters, _all_ of the mess of
CSV encoding could be solved with these few rules:

    
    
        - all records are separated by an RS character (#x1e)
        - all fields within a file are separated by a US character (#x1f)
        - all instances of RS, US & ESC within a field are prefixed with an ESC (#x1b)
        - there are no more rules
    

It's remarkable to me that ASCII defines a pretty full-featured mechanism for
information interchange (start of header, file transfer &c.) and instead we
continue to build mechanisms atop its alphabetic characters.
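
A minimal sketch of those four rules in Python (hypothetical helper names,
just to show there really are no more rules):

    
    
      RS, US, ESC = "\x1e", "\x1f", "\x1b"
      
      def encode(records):
          # escape any RS/US/ESC inside a field, then join fields with US
          # and records with RS
          esc = lambda f: "".join(ESC + c if c in (RS, US, ESC) else c for c in f)
          return RS.join(US.join(esc(f) for f in rec) for rec in records)
      
      def decode(data):
          records, rec, field = [], [], []
          chars = iter(data)
          for c in chars:
              if c == ESC:
                  field.append(next(chars))  # escaped: take next char literally
              elif c == US:
                  rec.append("".join(field)); field = []
              elif c == RS:
                  rec.append("".join(field)); field = []
                  records.append(rec); rec = []
              else:
                  field.append(c)
          rec.append("".join(field))
          records.append(rec)
          return records
    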

It's like the original sin of computing is Not Invented Here (no doubt someone
will pipe up with a story of how ASCII itself was the product of NIH!).

~~~
arjie
I used to religiously believe in this, but in practice it isn't useful. The
whole point is to be roughly human-readable and using non-printing characters
defeats that. You can't even easily enter these things via the command line.

If we're abandoning human-readability, why even bother with ASCII? Just use a
binary format. Has anyone actually used ASCII unit and record separator
delimiters successfully? I'd be curious about what advantages they had over a
binary format, even just a protobuf or Thrift serialized form. If we want to
preserve schemalessness, there's stuff like Sereal.

~~~
zeveb
> The whole point is to be roughly human-readable and using non-printing
> characters defeats that.

You're assuming that separator characters are not human-readable. If ASCII had
been used as originally intended, they'd be just as readable as line breaks.
Err, carriage returns.

Computing depresses me some days …

~~~
nostrademons
Most people build systems to work in the world we have, not the world we wish
we had.

There are all sorts of examples like this. UNIX was initially designed to work
with pipes of line-oriented streams, so why did we get scripting languages
where every UNIX command is reinvented as a function call, or RPC frameworks
where the pipe is replaced by a binary message? PHP was initially a templating
language, so why did we get Smarty, WordPress, and PEAR templates? The web was
supposed to come with full support for editing & creating pages via a WYSIWYG
editor and have a built-in mechanism (hyperlinks) for associating pages with
other people, so why did we need Facebook to introduce the idea of "sharing
content with other people"?

In each case, there were real, pragmatic reasons that people invented new
systems instead of doing what they were "supposed" to do. Line-oriented files
are clumsy for representing hierarchical data or conditionals. PHP is too hard
to use for most end users, despite being built for pragmatic "just toss a
webpage up" use. The editing features of the web disappeared early on, with
Netscape, and people needed the critical mass of college students that Facebook
provided before they felt they had an audience for anything they did.

The moral for system designers is that you can't just throw a feature out
there and say "Use this." You have to adapt it to how people actually _do_ use
it, even if that usage seems brain-dead to you.

------
amelius
The main advantage of CSV is that it is human-readable. But if you have this
much data, then why not use some binary format that you can just read in as a
blob, and don't even need to parse?

Also, the approach mentioned here is only useful if reading+parsing the data
is the bottleneck (or close to it). For example, if reading+parsing takes only
10% of the total processing time, then optimizing this stage will give at most
an 11% increase in performance.
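
To spell out that arithmetic (Amdahl's law):

    
    
      p = 0.10             # fraction of total time spent reading+parsing
      bound = 1 / (1 - p)  # speedup bound if parsing cost drops to zero
      print(bound)         # ~1.11, i.e. at most an 11% overall improvement
    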

~~~
bladecatcher
CSV is a ubiquitous data format, especially when you're receiving data from
external sources you don't have control over, which often happens to be the
case in banking and finance (streaming CSV dumps are quite common).

~~~
Chyzwar
CSV is a stupid format. XML, JSON, and YAML are better to read and easier to
parse/validate. After compression, these formats take a similar amount of
space, or even less when cross-referencing.

CSV, tab-delimited files, and flat files in general are a terrible idea in
banking. I know because I work for a bank; these are the source of all misery.

------
dantiberian
When looking at the graphs, remember they are on a log scale. The results are
a lot more impressive than they look at first glance.

~~~
vanderZwan
Woah, thanks for that warning. For a second I was wondering when NumPy got so
fast!

------
bsg75
> Despite extensive use of distributed databases and filesystems in data-
> driven workflows, there remains a persistent need to rapidly read text files
> on single machines.

This comment makes me wonder how often a distributed approach is used out of
some odd sense of convenience or interest, instead of taking the time to
create an optimized single-node approach.

In other words, how often is a Hadoop-like system used when it isn't
necessary, adding needless complexity?

------
uudecode
Did the authors benchmark against kdb?

e.g., run a single k interpreter on each CPU, then divide and conquer.

Isn't it faster to do parsing in memory and avoid I/O wherever possible?

~~~
geocar
Best of three, on an i7 with 8 GB RAM:

Loading a 156 MB CSV file in the kdb 32-bit free version, single-threaded:

    
    
        \t trade:`sym`time`ex`cond`size`price!("STCCXH";",")0:`t.csv
        1850
    

in paratext, 64-bit:

    
    
        >>> timeit.timeit('paratext.load_csv_to_dict("t.csv",num_threads=4)', setup="import paratext", number=1)
        3.1176819801330566
    

No improvement.

Loading a 1.5 GB CSV file in kdb:

    
    
        \t quote:`sym`time`ex`bid`bsize`ask`asize`mode!("STCHXHXC";",")0:`q.csv
        14135
    

in paratext:

    
    
        >>> timeit.timeit('paratext.load_csv_to_dict("q.csv",num_threads=4)', setup="import paratext", number=1)
        12.962939977645874
    

Not too shabby! An almost 10% improvement over KDB by turning my fans on and
burning my lap!
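
(For anyone reproducing this: the "best of three" can be taken with
timeit.repeat instead of eyeballing separate runs; a sketch:)

    
    
      >>> min(timeit.repeat('paratext.load_csv_to_dict("q.csv",num_threads=4)',
      ...                   setup="import paratext", number=1, repeat=3))
    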

However, I think they should probably make their parser faster before they
waste heat trying to make slow code finish sooner.

------
aldanor
The HDF5 benchmark looks bogus; it's not clear what exactly the author meant
by it or how the data was stored/retrieved. With proper filters set up, HDF5
should beat any CSV reader in both (uncompressed) throughput and runtime.

~~~
frankmcsherry
It looks like they are comparing (in the chart) throughput on each of the file
formats in bytes, rather than records. So they are slower than HDF5, but eat
up more resources doing so (which I think is meant to suggest they are leaving
less to waste).

------
crescentfresh
Unrelated, but holy crap is that typeface color (#79888e) difficult to read.

------
falaki
I wish there were more details on the benchmark. For example, I am interested
to know whether schema inference was turned on for spark-csv.

~~~
gcommer
The benchmarking scripts are in their repo; this line seems to answer your
question ("inferschema" was turned on, probably to make it an apples-to-apples
comparison, since static schemas are still on their TODO):
[https://github.com/wiseio/paratext/blob/ca347a552a53595b680c...](https://github.com/wiseio/paratext/blob/ca347a552a53595b680c5adc3e1373ebc267864f/bench/run_experiment.py#L113)

------
ben_jones
Not MIT licensed

~~~
wesm
Spurious argument. See
[http://www.apache.org/legal/resolved.html#category-a](http://www.apache.org/legal/resolved.html#category-a)

~~~
ben_jones
If I'm thinking about adding a third-party library to my code base, I don't
want to have to ask my lawyer to go through the license, because then I get a
bill in the mail. It's that simple.

~~~
scrollaway
The APL (Apache License) is a common license. It would serve you well to learn
about it instead of talking in hypotheticals and expecting everything to be
MIT.

