
XSV: A fast CSV command-line toolkit written in Rust - tosh
https://github.com/BurntSushi/xsv
======
cube2222
I like how all those rust/go implemented, fast, cross-platform tools are
starting to get mainstream. (ripgrep's another great one).

Really useful if you're a programmer who prefers Windows but mainly uses Unix
tools and develops for Unix OSes.

By the way, I've been using xsv when analysing 8 GB CSVs (the Amazon review
dataset) and have been nothing but happy with it.

~~~
vram22
>I like how all those rust/go implemented, fast, cross-platform tools are
starting to get mainstream. (ripgrep's another great one).

Both xsv and ripgrep are by BurntSushi, and ripgrep was mentioned twice on HN
recently, once in a release note thread, IIRC, and another time in the CLI:
Improved thread.

Both were mentioned before too.

~~~
coldtea
Crazy, I know, but some people don't read HN 24/7.

~~~
vram22
I don't either :) I just happened to have read those earlier threads because
of interest in the topics, so thought of sharing the info that I did.

------
domoritz
This is great. For now, my go-to tool is csvkit (Python), which has all kinds
of neat tools. In particular, loading data into databases (csvsql) is just
plain awesome. Check it out at
[https://csvkit.readthedocs.io/en/1.0.3/](https://csvkit.readthedocs.io/en/1.0.3/).

------
Ecco
Seems great at first, but how is that better than piping the whole CSV file
into SQLite and then doing the processing there? I think CSV is great for data
exchange but not so great for data processing.

By using SQLite (or any other DB actually) you can decide which data to index,
and write arbitrarily complex queries in a rather understandable language. I
think XSV is kind of reinventing the wheel here.
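The SQLite route is easy to sketch with Python's standard library; the CSV contents, table name, and columns below are made-up examples for illustration, not anything from the thread:

```python
import csv
import io
import sqlite3

# A stand-in for a CSV file on disk (made-up data).
csv_data = io.StringIO("city,population\nBoston,694583\nAustin,964254\n")

reader = csv.reader(csv_data)
header = next(reader)  # skip the header row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)", reader)

# Once loaded, you get indexes and arbitrarily complex SQL.
conn.execute("CREATE INDEX idx_pop ON cities (population)")
(biggest,) = conn.execute(
    "SELECT city FROM cities ORDER BY population DESC LIMIT 1"
).fetchone()
print(biggest)  # Austin
```

The catch, as the reply below points out, is that you need a schema before you can load anything.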

~~~
burntsushi
You could. But then you need to write a schema. How do you know what schema to
write? You could ask csvkit to do it, but it might not be fast enough on, say,
a 40GB CSV file. Or maybe it isn't quite accurate enough. xsv might be the
tool you use to figure out what the schema should be.
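That schema-discovery step can be approximated by sampling rows and guessing column types. A toy sketch of the idea (a hypothetical approach, not how xsv or csvkit actually infer types):

```python
import csv
import io

def infer_type(values):
    """Guess a SQL type from sampled column values (simplified)."""
    for cast, sql in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in values:
                cast(v)
            return sql
        except ValueError:
            continue
    return "TEXT"

# Made-up sample standing in for the first N rows of a big CSV.
sample = io.StringIO("id,price,name\n1,9.99,foo\n2,12.50,bar\n")
reader = csv.reader(sample)
header = next(reader)
columns = list(zip(*reader))  # transpose rows into columns

schema = {h: infer_type(col) for h, col in zip(header, columns)}
print(schema)  # {'id': 'INTEGER', 'price': 'REAL', 'name': 'TEXT'}
```

On a 40GB file you would sample rather than scan every row, which is where the speed/accuracy trade-off mentioned above comes in.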

SQLite (or any SQL database) does not cover all use cases. For example, if you
need to _produce_ CSV data, and you can fit your transformation into xsv
commands, then it might be hard to beat the performance in exchange for the
effort you put in.

This is probably an expression problem. Tools don't always neatly fit into
orthogonal buckets. If you think in terms of shell pipelines and want to
attack the data as it is, then xsv might be good for you. SQLite is, IMO, a
pretty massive hammer to apply every single time you want to look at CSV data.

~~~
jgord
Have to pipe up here and say an enthusiastic / visceral / emotional 'thankyou'
to BurntSushi for all your great contributions - xsv, rust-csv, ripgrep etc -
not to mention your superb blog articles.

I'm using xsv to wrangle 500GB of data, and it is phenomenally useful and
fast. The main datastore is PostgreSQL and it is great, but there are some
things pg just isn't fast enough for [hash joins anyone]. I have to do a lot
of pre-processing, and xsv is incredible for that.

Perhaps the best compliment of all - I am seriously looking at moving from
node.js to Rust as my daily data/systems programming language _because_ of how
performant and elegant xsv / rust-csv / ripgrep are, and how readable the Rust
code is.

I also have renewed respect for those staple olde unix tools - sort, sed,
grep, wc, etc.

xsv is a brilliant addition to that canon of unix lore.

~~~
burntsushi
Thanks for your kind words, I really appreciate them! :-) Please reach out if
you'd like any Rust advice!

~~~
shaklee3
Which of the two available rust books would you recommend for someone who
knows c/c++?

~~~
burntsushi
Hard to say. I own them both but haven't thoroughly read through either. I
would probably go with The Rust Programming Language, though, I suspect you
can't go wrong with either.

------
miguelmota
BurntSushi is a badass; always cranking out awesome tools. I use ripgrep [1]
on a daily basis as a grep replacement

[1]
[https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)

~~~
bpicolo
Their FST library is also awesome

[https://github.com/BurntSushi/fst](https://github.com/BurntSushi/fst)

------
damageboy
Would be interesting to see how xsv compares to miller
([https://johnkerl.org/miller/doc/index.html](https://johnkerl.org/miller/doc/index.html))
in terms of perf. This tool arrives just as I am about to munge 1TB of gzipped
CSV files.

Unfortunately, the main operation I need is not supported by xsv...

~~~
notimetorelax
What is the operation that you need? Can you send a PR?

------
Dowwie
I like the pager UI of the VisiData utility, written in Python, for exploring
data. Since XSV is written in Rust, it could theoretically be imported and
used in VisiData.

[https://github.com/saulpw/visidata](https://github.com/saulpw/visidata)

------
radarsat1
I'm not too familiar with rust tools, but I wanted to check this out. Can
someone explain to me why this worked,

    
    
        sudo apt-get install cargo rustc
        cargo install xsv
    

but this did not:

    
    
        git clone <xsv github>
        cd xsv
        cargo build --release
    

The latter gave me a ton of compilation errors on the package crossbeam-
channel. The former installed the program and compiled a bunch of crates, but
did it install a pre-built binary for xsv? "downloading" then "installing" of
xsv was the _first_ step, _then_ all the crates were downloaded. ldd seems to
report that the installed xsv binary has no dependencies so I don't see why it
would need to download and compile a bunch of crates in that case. On the
other hand if it was compiling xsv then I don't understand why I don't get the
same errors as in the latter case.

It seems Ubuntu has the following version:

    
    
        $ rustc --version
        rustc 1.25.0
    

I realize it's probably not optimal to use the Debian-packaged rustc but I
didn't feel like figuring out the whole ecosystem just to install one program
to test it out.

~~~
burntsushi
Because current master requires a newer version of the Rust compiler than the
current release on crates.io. If current master were a release, then `cargo
install xsv` would have failed with the same errors.

If you like, you can download static binaries from github:
[https://github.com/BurntSushi/xsv/releases](https://github.com/BurntSushi/xsv/releases)

~~~
radarsat1
Ah, ok thanks. Anyways I got it installed, was just wondering about certain
actions that cargo took. I found I could copy the resulting binary and delete
my entire .cargo directory and it continues to work, so it seems cargo did
some unnecessary things for `install`, but no worries.

~~~
burntsushi
To add to this: Cargo is more of a development tool for Rust programmers
rather than a package manager for end users. Both types of tools have a lot in
common, which is why things like `cargo install` exist and are very useful. In
the case of `xsv`, it's a useful escape hatch since your distro doesn't
package `xsv`, but `cargo install` is not like `apt install`. That is, it
downloads all of the source code dependencies of `xsv` and builds them. This
also requires sync'ing with crates.io's registry list. None of this is
unnecessary in the standard Cargo workflow in order to build xsv, but the
build artifacts are certainly unnecessary in order to _run_ xsv.

------
go_prodev
This looks great, and I'm very keen to try it out.

I have a malformed 115GB CSV which took some work to process with SSIS, but
I'm really interested to try again using xsv to split off the bad rows and see
if that would have been an easier option.

Very cool!

------
rookwood102
Burntsushi mentions in the readme that it is often the case that people
receive large csv files needing analysis but he also mentions that valid
criticisms of the tool are that you could use an SQL database. Why can't you
use a database in some scenarios? And when blazing in memory speed is useful,
why can't you use something like this:
[https://github.com/jtablesaw/tablesaw](https://github.com/jtablesaw/tablesaw)

Is XSV faster? Is it the command-line convenience (although surely it was not
that convenient to write a new tool to do it)?

Genuinely curious and I do not mean to belittle the project as it looks well
implemented and useful nonetheless.

------
kunashe
BurntSushi, dude you're awesome.

------
colanderman
OK, dumb question, because there have been a LOT of these types of stories
lately:

Why does it matter (beside to the implementor, or for pedagogy) what language
a CLI tool is implemented in?

~~~
cube2222
Rust and Go are languages with ecosystems, in which writing everything in a
cross-platform way is the default. They also compile to a single statically
linked binary.

You don't need an additional runtime + load of dependencies like node or
python.

They are lightweight and fast to start up as opposed to the jvm.

~~~
shawn
Python is installed literally everywhere. If it’s a CLI tool, and it’s written
in python, the chance that it won’t work is slim.

Go has a massive number of problems. For example, I tried to run Keybase’s
standard “go get” build instructions. It failed with 200 import errors. That
was the end of my attempt to install keybase on my raspberry pi. Others had
said that it works.

Rust requires a massive amount of hard drive space and takes a long time to
build. You also have to build it. That’s antithetical to rapid development.

I can't wait until the pendulum swings back away from static typing and the
next generation of programmers discover the benefits of literally ignoring
everybody and doing your own thing. It’ll be painful, but at least it’ll be
effective. And you won’t have to compile anything.

~~~
h1d
> Python is installed literally everywhere.

I found that Ubuntu 18.04 didn't have Python preinstalled. Besides, installing
all the dependencies with pip is another step, and it gets annoying when
deploying to many servers.

For something that gets distributed, a single static binary is very welcomed.

~~~
ptman
I think it has python 3, but not python 2

------
mastrsushi
I started to write a similar application in C++. Gave up because I didn't
think people would find it useful. After seeing this on the front page, I feel
it might be a good idea to get back into it.
[https://github.com/tfili001/line](https://github.com/tfili001/line) If I get
back into this, CSV parsing will be my next step.

~~~
burntsushi
Check out csvmonkey for a really fast C++ csv parser:
[https://github.com/dw/csvmonkey](https://github.com/dw/csvmonkey)

------
adamnemecek
Burntsushi is like the most productive programmer.

~~~
person_of_color
BurntSushi will soon be BurntOut ;)

------
safgasCVS
Thank you for this! I've been looking for a Windows equivalent to csvkit and
this is just the ticket.

------
convivialdingo
Really nice!

I think I’ve written at least a couple of these in some basic form myself. At
some point in a business someone is going to hand you CSV or excel dataset and
you end up having to deal with it.

------
sunpazed
Have been using xsv (in combination with any-json, and jq) for a few months
wrangling big csv files at work. Found it to be a better / faster process than
rolling my own code.

------
fithisux
I have used the simple and beautiful tool

[https://github.com/Clever/csvlint](https://github.com/Clever/csvlint)

It is in golang.

------
adiusmus
Good to have another tool for csv debris management. Especially those
multigigabyte gifts that need to be in the database “yesterday”. This has
happened several days or weeks in a row more times than I can count. And no,
people won’t provide such things in sqlite or something sane.

I feel like I should also recommend this:
[https://digital-preservation.github.io/csv-validator/](https://digital-preservation.github.io/csv-validator/)

I think something like this rewritten in go would be great.

~~~
burntsushi
> I think something like this rewritten in go would be great.

Why not use it how it is? There are static binaries provided on github.

------
gregw2
no CSV diff capability? find which columns are different, which rows? ability
to ignore spaces or leading/trailing 0s or quote marks?

~~~
burntsushi
I imagine you could get pretty far with

    
    
        diff <(xsv input old.csv) <(xsv input new.csv)
    

In particular, xsv prints valid CSV, and quoting rules will be applied
consistently so that they can be diffed. Ignoring spaces and 0s might not work
though.
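For the ignore-spaces / leading-zeros part, one approach is to normalize fields before comparing. A rough Python sketch, where the normalization rules are assumptions about the desired behavior:

```python
import csv
import io

def normalize(field):
    """Strip whitespace and canonicalize numeric-looking fields."""
    field = field.strip()
    try:
        return str(float(field))  # "007" and "7.0" both become "7.0"
    except ValueError:
        return field

def normalized_rows(text):
    return [tuple(normalize(f) for f in row)
            for row in csv.reader(io.StringIO(text))]

old = 'a, 007 ,"x"\n'
new = "a,7.0,x\n"
print(normalized_rows(old) == normalized_rows(new))  # True
```

The csv parser already takes care of the quote marks; only the whitespace and zero handling needs custom rules.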

If you'd like to file an issue with a specific use case (sample data from a
real world problem would be great), then that would be appreciated!

------
rixed
Also relevant, this forgotten gem:

[http://www.rseventeen.com/](http://www.rseventeen.com/)

------
jancsika
Ok, I'm curious and may have follow-ups:

What exactly is being checksummed in cargo.lock? Is it the source code, a
binary, something else?

~~~
e12e
[https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html](https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html)

~~~
dbaupp
That doesn't seem to talk about the series of "checksum ..." entries in the
[metadata] section at the end of the Cargo.lock.

~~~
e12e
Hm, appears to be an alternate (new? old?) syntax for:

"Cargo will take the latest commit and write that information out into our
Cargo.lock when we build for the first time. That file will look like this:

    
    
        [[package]]
        name = "hello_world"
        version = "0.1.0"
        dependencies = [
            "rand 0.1.0 (git+https://github.com/rust-lang-nursery/rand.git#9f35b8e439eeedd60b9414c58f389bdc6a3284f9)",
        ]
        
        [[package]]
        name = "rand"
        version = "0.1.0"
        source = "git+https://github.com/rust-lang-nursery/rand.git#9f35b8e439eeedd60b9414c58f389bdc6a3284f9"
    

You can see that there's a lot more information here, including the exact
revision we used to build."

~~~
dbaupp
No, the git hashes are still there in the same syntax as described in that
document. In fact, the checksum entry of a git dependency seems to be
"<none>".

------
lerax
Awesome tool. Pretty useful to manage big-scary csv files.

------
tigrezno
What about using awk?

~~~
burntsushi
awk doesn't support csv.
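Concretely, awk's default field splitting breaks on quoted fields that contain the delimiter, which is why a CSV-aware parser is needed. A small Python illustration of the failure mode:

```python
import csv
import io

line = 'name,quote\nalice,"hello, world"\n'

# Naive splitting (what awk -F, effectively does) breaks on embedded commas:
naive = line.splitlines()[1].split(",")
print(naive)  # ['alice', '"hello', ' world"'] -- three fields, not two

# A real CSV parser respects the quoting:
rows = list(csv.reader(io.StringIO(line)))
print(rows[1])  # ['alice', 'hello, world']
```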

------
tajen
XML-separated values? Fortunately no: command line program for indexing,
slicing, analyzing, splitting and joining CSV files.

~~~
ubernostrum
This comment shouldn't have been downvoted/killed; at the time it was made,
the title of the post was completely uninformative, leaving people to guess
what on earth "XSV" might be.

