
XSV – A fast CSV toolkit in Rust - mseri
https://github.com/BurntSushi/xsv
======
burntsushi
Author here. I was really hoping to get binaries for Windows/Mac/Linux
available before sharing it with others, but clearly I snoozed. I do have them
available for Linux though, so you don't have to install Rust in order to try
xsv:
[https://github.com/BurntSushi/xsv/releases](https://github.com/BurntSushi/xsv/releases)

Otherwise, you could try using rustle[1], which should install `xsv` in one
command (but it downloads Rust and compiles everything for you).

While I have your attention, if I had to pick one of the cooler features of
xsv, I'd tell you about `xsv index`. It's a command that creates a very simple
index that permits random access to your CSV data. This makes a lot of
operations pretty fast. For example:

    
    
        xsv index worldcitiespop.csv  # ~1.5s for 145MB
        xsv slice -i 500000 worldcitiespop.csv | xsv table  # instant, plus elastic tab stops for good measure
    

That second command doesn't have to chug through the first 499,999 records to
get the 500,000th record.

This can make other commands faster too, like random sampling and statistics
gathering. (Parallelism is used when possible!)

Finally, have you ever seen a CLI app QuickCheck'd? Yes. It's awesome! :-)
[https://github.com/BurntSushi/xsv/blob/master/tests/test_sor...](https://github.com/BurntSushi/xsv/blob/master/tests/test_sort.rs)

[1] - [https://github.com/brson/rustle](https://github.com/brson/rustle)

~~~
simi_
I'm looking forward to playing with _cool_ languages like Rust, Nim, and Elm.
But when I read stuff like this I remember why I love using Go every day.
Generating binaries for multiple platforms is braindead easy, as is building
from source on any system with Go installed.

That aside, really great work OP! I quite like the CSV format and had 2 ideas
based on my experience with it that I'd love to get an opinion on:

1. markdown compiler plugin to expand ![title](filename.csv)

2. barebones, imgur-like website for quick CSV file[s] upload, maybe also a
public gallery to showcase interesting data (obviously all uploads marked
public/unlisted/private)

~~~
burntsushi
> But when I read stuff like this I remember why I love using Go every day.

Me too! I wrote a window manager in Go[1] that I've been using for years now.
I love that it takes <30 seconds to download and compile the whole thing. No C
dependencies (compile or runtime) at all.

With that said, doing it with Rust should be almost as easy. There's no `cargo
install` command, but I think it's only a matter of time. :-)

Your ideas seem cool, by the way! Sharing CSV data would be especially nice.

[1] -
[https://github.com/BurntSushi/wingo](https://github.com/BurntSushi/wingo)

~~~
simi_
Thanks for your support, I'll start working on the idea then. I just checked
your GitHub repos; your productivity is astonishing. Chapeau! Also, cute handle.

I suspect the problem with CSV will be working around the myriad of broken
implementations and making sense of malformed data.

~~~
burntsushi
> I suspect the problem with CSV will be working around the myriad of broken
> implementations and making sense of malformed data.

Indeed. My CSV parser (and Python's) is pretty interesting in that regard.
There are very few things that actually cause a parse error. You can see
here[1] that the only two errors occur if there are unequal length records
(which can be disabled by enabling the "flexible" option) and invalid UTF-8
data (which can be avoided by reading everything into plain byte strings).
That means that _any_ arbitrary data gets parsed into _something_. There are
various mechanisms in the CSV parser's state machine that make decisions for
you. Mostly, I used the same types of decisions that Python makes. For
example:

    
    
        >>> import csv
        >>> from StringIO import StringIO
        >>> list(csv.reader(StringIO('a, "b,c')))
        [['a', ' "b', 'c']]
        >>> list(csv.reader(StringIO('a,"b,c')))
        [['a', 'b,c']]
    

Whaaaa? Yeah, if our CSV parsers were conformant with the spec, then both of
these examples would fail. But they succeed, and they result in slightly
different interpretations based on whether a space character precedes the quote.
Therefore, "good" CSV parsers tend to implement a superset of RFC 4180 when
parsing, but usually implement it strictly when writing.

(My CSV parser ends up with the same parse as Python here, because it seemed
like a good decision to follow its lead since it is used _ubiquitously_.)
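
For the curious, here's a rough sketch of what opting into those lenient
behaviors looks like from Rust. (This uses the current `csv` crate's builder
API, so the method names may not match the exact version described here.)

    use csv::ReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Malformed input with an unterminated quote, much like the Python example above.
        let data = "a,\"b,c\nx,y\n";

        // `flexible(true)` disables the "records must have equal lengths" error,
        // and iterating over ByteRecords sidesteps the UTF-8 validity error.
        let mut rdr = ReaderBuilder::new()
            .has_headers(false)
            .flexible(true)
            .from_reader(data.as_bytes());

        for result in rdr.byte_records() {
            // With both escape hatches in play, essentially any input parses
            // into *something* rather than returning an error.
            println!("{:?}", result?);
        }
        Ok(())
    }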

[1] -
[http://burntsushi.net/rustdoc/csv/enum.ParseErrorKind.html](http://burntsushi.net/rustdoc/csv/enum.ParseErrorKind.html)

------
dbro
Here's another suggestion for the criticism section (which is a good idea for
any open-minded project to include):

Instead of using a separate set of tools to work with CSV data, use an adapter
to allow existing tools to work around CSV's quirky quoting methods.

csvquote
([https://github.com/dbro/csvquote](https://github.com/dbro/csvquote)) enables
the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work
properly with CSV data.

~~~
burntsushi
That's a wicked cool tool! Thank you for sharing.

I do think there is room for both tools though. One of the cooler things I did
with `xsv` was implement a very basic form of indexing. It's just a sequence
of byte offsets where records start in some CSV data. Once you have that, you
can do things like process the data in parallel or slice records instantly
regardless of where they occur in the file.

It helps when the CSV parser has support for this:
[http://burntsushi.net/rustdoc/csv/struct.Reader.html#method....](http://burntsushi.net/rustdoc/csv/struct.Reader.html#method.byte_offset)
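
To make that concrete, here's a rough sketch of the idea (not xsv's actual
implementation; it keeps the offsets in memory rather than in a separate index
file, and uses the current `csv` crate API):

    use std::error::Error;
    use std::fs::File;

    fn main() -> Result<(), Box<dyn Error>> {
        let mut rdr = csv::ReaderBuilder::new()
            .has_headers(false)
            .from_reader(File::open("worldcitiespop.csv")?);

        // Pass 1: remember the position (byte offset) where every record starts.
        let mut positions = Vec::new();
        let mut record = csv::ByteRecord::new();
        loop {
            positions.push(rdr.position().clone());
            if !rdr.read_byte_record(&mut record)? {
                break;
            }
        }

        // "Slice" the 500,000th record by seeking straight to its offset,
        // instead of re-parsing the 499,999 records before it.
        if let Some(pos) = positions.get(500_000) {
            rdr.seek(pos.clone())?;
            rdr.read_byte_record(&mut record)?;
            println!("{:?}", record);
        }
        Ok(())
    }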

------
tbrownaw
From the "criticisms" section: _You shouldn 't be working with CSV data
because CSV is a terrible format._

Er, what's wrong with it? Or is this a case of, people using it for things
other than what it's meant for? Is there a better format for sending data
between different companies using different enterprisey database systems?

My complaint about CSV is that people frequently generate it manually and
don't understand how to quote text fields, so they don't double any quote
characters that are part of the data. Which means I have to spend time
cleaning up malformed files.
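(For reference, RFC 4180 wants a literal double quote inside a quoted field
written as two double quotes, e.g. `"she said ""hi"""`; hand-rolled exporters
routinely skip that step.)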

~~~
steveklabnik
"CSV" is bad because it's not well-formed. There are tons of "CSV" parsers in
the wild, and they all make reasonable, but different, choices when it comes
to some behaviors. INI is the same way.

~~~
valevk
CSV RFC:
[http://tools.ietf.org/html/rfc4180](http://tools.ietf.org/html/rfc4180)

~~~
tomjakubowski
Existence of an RFC doesn't mean that people or tooling conform to it.

The CSV RFC doesn't specify an IETF standard, by the way.

------
101914
Did you try benchmarking against kdb+?

Seems like there are always HN commenters lambasting CSV. I am sure they have
very good reasons.

But, as for me, CSV is one of my favorite formats. (Sort of like how people
like XML or JSON I guess.) I like the limitations of CSV because I like
simple, raw data.

I wish the de facto format that www servers delivered was CSV instead of HTML
(for reason why, see below). Or at least I wish there was an option to receive
pages in CSV in addition to HTML.

Users could create their own markup, client side. Users could effectively use
their "spreadsheet software" to read the information on the www. Or they could
create infinitely creative presentations of data for themselves or others
using HTML5 or some other tool of personal expression.

It is easy to create HTML from CSV but I find it is a nuisance creating CSV
from HTML.

Because I have a need for CSV I write scanners with flex to convert HTML to
CSV.

I often wonder why I cannot access all the data I need from the www in CSV
format. Many have agreed over the years that the www needs more structure to
be more valuable as a data source. If data is first created in CSV, then you
have some inherent structure to build on; you can _use it_ to create markup
and add infinite creativity without destroying the underlying structure.

If data (cf. art or forms of personal expression) cannot be presented in CSV
then is it really raw data or is it something else, more subjective and
unwieldy?

Whatever. Back to reality. Pay no mind.

~~~
burntsushi
> Did you try benchmarking against kdb+?

xsv is never ever never going to compete with a real database. Full stop.

It's just a command line tool that tries to make some things faster when
slicing and dicing CSV data.

------
btown
If you need to do an indexing step anyway, why not simply import the data
into a SQL database, or build this as a wrapper that introspects the CSV file,
builds a database schema, and does the import for you? Is the issue limited
scratch space?

~~~
burntsushi
See the "Motivation" section:
[https://github.com/BurntSushi/xsv#motivation](https://github.com/BurntSushi/xsv#motivation)

There's a line somewhere between "conveniently play with large CSV data on the
CLI" and "the full power of a RDBMS." It's blurry and we won't all agree on
where it lays, but it certainly exists for me. (And based on feedback, it
exists for lots of others too.)

Also, there are already tools that look at a CSV file and figure out a schema.
No need to replicate that.

Finally, the indexing step is blindingly fast and only uses `N * 8` bytes,
where `N` is the number of records.
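For example, a CSV file with 10 million records needs only about 80 MB of
index, assuming one 64-bit byte offset per record.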

------
userbinator
Looks like it's based on this CSV parser:

[https://github.com/BurntSushi/rust-csv](https://github.com/BurntSushi/rust-csv)

and it claims to be RFC 4180-compliant, which is a good thing.

------
brazzledazzle
This is one of the things I really love about PowerShell. Importing,
manipulating, and exporting formatted raw data like CSV is dead simple.

