
Golang – encoding/csv: Reading is slow - chinmaymk
https://github.com/golang/go/issues/16791
======
djur
It seems pretty common for languages to start out with a relatively
unoptimized CSV parser (if they have one at all) and then get a faster one
contributed by the community once there's enough interest. Ruby had that
happen with FasterCSV.

The Java comparison here seems inapt, because it doesn't do as much as the
other two. It's just a naive "split on commas" implementation that wouldn't
handle quoted cells. Really, if Go's CSV reader is only 200% slower than that
and 50% slower than Python's optimized C implementation, that's pretty good
already.

~~~
masklinn
ignore this, I somehow missed that the poster specifically mentioned Python 3,
which does have an encoding-aware CSV module.

~~It's also unclear which version of Python is being used, the Python 2 csv
module is byte-based and encoding-unaware which can lead to unexpected
behaviours.~~

Go's CSV package apparently only does UTF-8, and one suggestion for
speeding it up in the tracker is to just remove that and work on raw
bytes (FFS)

~~~
geofft
> suggestions for speeding it up in the tracker is to just remove that and
> work on raw bytes (FFS)

This is valid, because UTF-8 was designed to make it valid. The UTF-8
encoding of a comma, 0x2C (also the ASCII encoding of a comma), never
appears as part of any other UTF-8 sequence: every byte of a multi-byte
sequence has its high bit set, so it can't collide with an ASCII byte.
Same with the UTF-8 encoding of the double quote, 0x22. So scanning for
0x22 and 0x2C bytes, without stopping to decode other UTF-8 sequences
along the way, will produce the correct result for any valid UTF-8 input.
Then you fully decode UTF-8 for the individual fields only when needed
(and if you're doing a string compare against some target value that's
already UTF-8, you never need to decode that field at all).
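
A minimal sketch of that kind of byte-wise scan (my code, not the
stdlib's or the tracker's; quote-aware comma splitting, with quotes left
in place and doubled-quote escapes omitted for brevity):

    // Bytes >= 0x80 always belong to multi-byte runes, so they can never
    // be mistaken for ',' (0x2C) or '"' (0x22); no decoding is needed.
    package main

    import "fmt"

    // scanFields splits a line on unquoted commas.
    func scanFields(line []byte) [][]byte {
        var fields [][]byte
        start, inQuote := 0, false
        for i := 0; i < len(line); i++ {
            switch line[i] {
            case '"':
                inQuote = !inQuote
            case ',':
                if !inQuote {
                    fields = append(fields, line[start:i])
                    start = i + 1
                }
            }
        }
        return append(fields, line[start:])
    }

    func main() {
        for _, f := range scanFields([]byte(`héllo,"a,b",日本語`)) {
            fmt.Printf("%s\n", f)
        }
    }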

~~~
snissn
That's cool about utf8 - what downsides are there to treating utf-8 as
raw bytes?

~~~
geofft
The big things are related to string length not matching byte count. strlen()
is O(n) because you have to see how many sequences are actually in the string.
More than that, splitting/slicing/indexing a string based on byte offsets
doesn't work. For a 100-byte ASCII string, you're guaranteed that you can
split it into two 50-byte strings and things will still work: you can output
them separately, you can get the total length by adding strlen() on each half,
you can find a character by doing strchr() on each half, etc. For a 100-byte
valid UTF-8 string, splitting it into two 50-byte strings will possibly get
you an invalid string, because a character could be split in half. So strlen()
(even a UTF-8-correct strlen()) and strchr() don't compose. Outputting a
string in two halves works properly as long as the receiver buffers its input,
and is willing to wait to reconstruct a partial character.

A related problem is that in older UNIX terminals, pressing backspace
would delete one _byte_, not one _character_. Newer UNIX kernels have
code in the terminal implementation to decode UTF-8 enough to backspace
an entire character.
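
A quick Go illustration of that splitting hazard:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "naïve" // 6 bytes: the 'ï' is encoded as 0xC3 0xAF
        left, right := s[:3], s[3:]
        fmt.Println(utf8.ValidString(s))     // true
        fmt.Println(utf8.ValidString(left))  // false: ends mid-rune
        fmt.Println(utf8.ValidString(right)) // false: starts mid-rune
    }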

~~~
weberc2
To clarify, getting the length of a UTF-8 string in Go is O(1); it's
computed and stored on the string header at creation.

~~~
burntsushi
To clarify even more: that length is the number of _bytes_ (or UTF-8 code
units) in the string. It doesn't correspond to the number of characters
(which one may consider to be either Unicode codepoints or, more
technically correct, Unicode grapheme clusters).

If you want to count the number of codepoints in a string (called "rune" in
Go), then you need to do so explicitly:
[https://golang.org/pkg/unicode/utf8/#RuneCountInString](https://golang.org/pkg/unicode/utf8/#RuneCountInString)
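
For example (the combining-accent case is why grapheme clusters and
codepoints can differ):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "héllo" // the 'é' takes 2 bytes in UTF-8
        fmt.Println(len(s))                    // 6 (bytes)
        fmt.Println(utf8.RuneCountInString(s)) // 5 (runes/codepoints)

        // One user-perceived character can even span several runes:
        c := "e\u0301" // 'e' plus a combining acute accent
        fmt.Println(utf8.RuneCountInString(c)) // 2 runes, 1 grapheme
    }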

~~~
weberc2
Touché

------
zephyrfalcon
Python's csv module uses an internal module _csv which is written in C. So I'm
not sure it's all that surprising that a Go implementation is a bit slower.

~~~
0xFFC
How about Java? It is quite funny that the Java version is much faster
than the Python version even though the Python version uses C. Something
fishy is going on.

~~~
nradov
The Java code is defective. It's not checking for double quotes. The CSV
format allows commas inside column values by surrounding them with
double quotes, and you can then put double quotes within such values by
escaping them as doubled double quotes. Fix those defects and the Java
code will be a little slower.
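
For reference, a small self-contained example of both cases, which Go's
encoding/csv does handle:

    package main

    import (
        "encoding/csv"
        "fmt"
        "strings"
    )

    func main() {
        // One record, three fields: a comma inside quotes, and a
        // literal quote escaped as a doubled double quote.
        in := `plain,"has, a comma","she said ""hi"""` + "\n"
        r := csv.NewReader(strings.NewReader(in))
        record, err := r.Read()
        if err != nil {
            panic(err)
        }
        fmt.Printf("%q\n", record)
        // ["plain" "has, a comma" "she said \"hi\""]
    }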

With modern JVMs, Java can occasionally actually be faster than native
compiled languages due to dynamic optimization at runtime.

~~~
dalke
A few minutes ago (and after your comment) one of the commenters of that issue
tested against Apache Commons CSV and found that Java was 1.9x faster than Go,
rather than the original 3x:
[https://github.com/golang/go/issues/16791#issuecomment-24456...](https://github.com/golang/go/issues/16791#issuecomment-244562984)

~~~
merb
actually the java one is still amazing since it's a cold jvm. when it would be
a big file I would think that java is far ahead of both. with an aggressive
jit. maybe pypy is faster than all 3 :D

------
paulddraper
The Java code is _not_ a CSV parser.

I added the results of using Apache Commons CSV to the GitHub thread.

After using that, Python was actually by far the fastest. Hooray for
performance sensitive code in C :)

~~~
paulddraper
FYI, I later went back and ran PyPy, which implements csv in pure Python. Even
including start-up time, it was nearly at CPython speed.

------
endymi0n
On a related note, the Go stdlib regexp package is also pretty naive and
slow compared to a full-blown, modern backtracking PCRE implementation
(though at 1/10 the LOC and complexity) - the same goes for the
reflection-based JSON package (which is still kinda "fast enough").

The focus wasn't so much on performance as on initial completeness, a
good interface, versatility, clarity and simplicity - with faster or
more specialized implementations left to the community.

There might be different opinions about that, but I personally like the
approach of having a solid, well-ordered programming pocket knife - even
if it doesn't replace a katana for cutting.

~~~
dsymonds
The standard regexp package, unlike PCRE, is actually a proper regular
expression parser/matcher. Anything doing backtracking is at risk of
exponential blowup and isn't safe.

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)
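
The classic pathological case is a pattern like `(a+)+$` run against a
long string of 'a's that ends in a non-match; a backtracking engine can
take exponential time on it, while Go's regexp stays roughly linear. A
quick illustration (exact timings are machine-dependent, of course):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
        "time"
    )

    func main() {
        re := regexp.MustCompile(`(a+)+$`)
        input := strings.Repeat("a", 100000) + "!"
        start := time.Now()
        fmt.Println(re.MatchString(input), time.Since(start))
        // Prints "false" in milliseconds; a naive backtracking engine
        // can blow up exponentially on inputs like this.
    }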

~~~
endymi0n
which is exactly what I meant by less specialized but safe :)

~~~
jonlawlor
It is also often much faster than PCRE.

------
tmaly
I wrote my own in Go that is blazing fast using bytes. I know the data
is ASCII, so I was able to use that to my advantage.

~~~
piinbinary
I did the same [0]. It runs almost as fast as the Java implementation.

I'd be interested to see how yours works if you are willing to share it.

Edit: Plus one that is ~2x faster than Java by avoiding allocations [1].

[0]
[https://gist.github.com/jmikkola/6ac96ad6d6f66e772c33ec41ed2...](https://gist.github.com/jmikkola/6ac96ad6d6f66e772c33ec41ed27f057)

[1]
[https://gist.github.com/jmikkola/7ded8392226b7659c881f5540be...](https://gist.github.com/jmikkola/7ded8392226b7659c881f5540be539f5)

~~~
sorokod
Note that internally Java represents strings as utf-16

~~~
grandinj
Depends. The API for strings in Java is mostly UTF-16, but the latest
JVMs can magically use a more compact internal representation (Latin-1,
via compact strings) when the contents allow it.

~~~
pjmlp
Only the Oracle one, it doesn't apply to other vendors.

------
SeanDav
Effectively one is comparing library performance here and not language
performance. Granted, that line can get very blurry indeed, but in this
case it says very little about golang the language and far more about
the current implementation of one of the golang libraries.

~~~
jonlawlor
It is kind of both; Go disallows some approaches in native Go code,
which can make it slower than other languages. (I love Go, but that is
my experience.)

In this case the choice to use UTF-8 everywhere, including for the CSV
delimiters, is making it slower.
~~~
weberc2
This isn't what makes CSV parsing slow; nothing about Go requires you to
deal in UTF-8.
------
lcarlson
This may be a bit off topic, but I've found sqlite to be quite a
powerful CSV parser. Once the data is loaded, you can manipulate it in
lots of ways. When you're working with reports that need to get back
into some sort of table format, it's very intuitive and easy for SQL
people.

------
WestCoastJustin
Related to this was a Reddit thread from a few days ago in /r/golang
about improving a streaming CSV reader. See:
[https://www.reddit.com/r/golang/comments/50ncer/implementing...](https://www.reddit.com/r/golang/comments/50ncer/implementing_a_streaming_csv_reader_reduced/)

------
deno
Better than node.js

    
    
        import * as csv from 'csv-parse';
        import * as fs from 'fs';
        
        type Line = [string,string,string,string,string,string];
        
        const parser = new csv.Parser({});
        
        parser.on('data', (line: Line) => {
            if (line[0] === '42') {
                console.dir(line);
            }
        });
        
        fs.createReadStream('mock_data.csv').pipe(parser);
        
        $ /usr/bin/time node parse_csv.js
        43.61user 0.85system 0:45.61elapsed 97%CPU (0avgtext+0avgdata 60076maxresident)k
        
        $ node --version
        v6.4.0
    

Edit: Using fast-csv

    
    
        24.28user 0.20system 0:24.58elapsed 99%CPU (0avgtext+0avgdata 91780maxresident)k

~~~
masklinn
csv-parse is hardly the only CSV parser for node, and it is by far the
slowest: [https://github.com/phihag/csv-speedtest](https://github.com/phihag/csv-speedtest)
(csv2json depends on csv-parse, so it's unsurprising that it's even
slower)

~~~
deno
I chose the most popular one on npm because Go and Python are using stdlib.

~~~
elmigranto
But you still wrote "faster than node.js" and not "faster than the most
popular npm module" (and popular npm modules aren't always high quality
or performance-oriented).

~~~
deno
Yup[1]. Sorry!

[1]
[https://meta.wikimedia.org/wiki/Cunningham%27s_Law](https://meta.wikimedia.org/wiki/Cunningham%27s_Law)

------
twotwotwo
As someone notes on the bug, if you were rolling your own, there are some
other things you could do--return a [][]byte that points into the
reader's internal buffer, only usable until the next row is read.

Making a version of encoding/csv that retains most of its features (custom
delimiters, handling backslashes and quoting and \r) but streams like that
would be a fun open source project for someone who likes Making Things Go
Fast.
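
A minimal sketch of that shape of API (names are mine, quoting omitted):
each call returns field slices that alias an internal buffer and are
only valid until the next call, so steady-state parsing allocates almost
nothing.

    package main

    import (
        "bufio"
        "bytes"
        "fmt"
        "io"
        "strings"
    )

    type RowReader struct {
        s      *bufio.Scanner
        fields [][]byte
    }

    func NewRowReader(r io.Reader) *RowReader {
        return &RowReader{s: bufio.NewScanner(r)}
    }

    // Read returns the next row as [][]byte. The slices point into the
    // scanner's buffer and are invalidated by the next Read.
    func (r *RowReader) Read() ([][]byte, error) {
        if !r.s.Scan() {
            if err := r.s.Err(); err != nil {
                return nil, err
            }
            return nil, io.EOF
        }
        r.fields = r.fields[:0]
        line := r.s.Bytes()
        for {
            i := bytes.IndexByte(line, ',')
            if i < 0 {
                r.fields = append(r.fields, line)
                return r.fields, nil
            }
            r.fields = append(r.fields, line[:i])
            line = line[i+1:]
        }
    }

    func main() {
        rr := NewRowReader(strings.NewReader("a,b,c\nd,e,f\n"))
        for {
            row, err := rr.Read()
            if err == io.EOF {
                break
            } else if err != nil {
                panic(err)
            }
            fmt.Printf("%s\n", row) // e.g. [a b c]
        }
    }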

~~~
peterwaller
I did this a couple of months ago and got a >5x speedup. It's at the expense
of dropping quoting though, so no commas or newlines can be in the input data.

[https://github.com/pwaller/usv](https://github.com/pwaller/usv)

------
petters
Any CSV reader should be limited only by disk access, right? I whipped
together a C++ program solving this problem and got 0.124 seconds. But
that does not do quoting etc.

~~~
Roboprog
Good point. But assume that if you run a series of tests on the same
file, it's in RAM. (Discard the time of the first run or two, and assume
the data source - network, DB, file - makes access time moot.)

------
burntsushi
On the CSV Game benchmark, the Go csv reader is around 10x slower than
the fastest:
[https://bitbucket.org/ewanhiggs/csv-game](https://bitbucket.org/ewanhiggs/csv-game)

------
Roboprog
So, where's the Perl (+ CPAN...) version for comparison?

I guess "Nobody cares about your dead religion" :-)

Still, it would probably be faster, at the expense of being unreadable.

