
Doing a database join with CSV files - chmaynard
https://www.johndcook.com/blog/2019/12/31/sql-join-csv-files/
======
bumbledraven

      sqlite> .mode csv
      sqlite> .header on
      sqlite> .import weight.csv weight
      sqlite> .import person.csv person
      sqlite> select * from person, weight where person.ID = weight.ID;
      ID,sex,ID,weight
      123,M,123,200
      789,F,789,155
      sqlite>

~~~
f311a
I was trying to join two 300M-line CSV files using SQLite and Postgres, and it
took days to merge them on a c4.xlarge.

Later, I tried Miller and it took less than 4-6 hours.

It also takes some time to tune Postgres to make it faster for this particular
task.

~~~
faizshah
When I've had to do something similar, the easiest and fastest way has always
been: install Spark -> create an external Spark SQL table from the CSVs -> use
Spark SQL to join and aggregate/write. If you haven't used Spark before it
sounds complicated, but it's actually really quick and easy.

In general the operation can reach up to 200 MB/s on my laptop; it's basically
I/O-limited by your machine's SSD.
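
A minimal PySpark sketch of that workflow, under illustrative assumptions (the
file and column names, and temporary views standing in for an external table,
are not from the comment above):

    # Minimal PySpark sketch: register each CSV as a table, then join with Spark SQL.
    # File and column names (person.csv, weight.csv, ID) are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-join").getOrCreate()

    # Register each CSV as a temporary view (header row, inferred column types).
    spark.read.csv("person.csv", header=True, inferSchema=True).createOrReplaceTempView("person")
    spark.read.csv("weight.csv", header=True, inferSchema=True).createOrReplaceTempView("weight")

    # Join with Spark SQL and write the result back out as CSV.
    joined = spark.sql("""
        SELECT p.ID, p.sex, w.weight
        FROM person p
        JOIN weight w ON p.ID = w.ID
    """)
    joined.write.csv("joined", header=True)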

~~~
lolive
Would be nice if you could write a small article about that.

------
teh_klev
Not that I want to be the "well actually" guy... but MS has a much-underrated
tool called Log Parser that's been around for years and lets you do this kind
of thing and more.

If you're a Windows admin and haven't bumped into this utility then it's well
worth a look.

edit: forgot to mention that it can also be scripted from VBScript and
PowerShell for added fun.

[https://www.microsoft.com/en-gb/download/details.aspx?id=24659](https://www.microsoft.com/en-gb/download/details.aspx?id=24659)

~~~
lapnitnelav
Seconding Log Parser.

There's even a UI for it: [https://techcommunity.microsoft.com/t5/exchange-team-blog/introducing-log-parser-studio/ba-p/601131](https://techcommunity.microsoft.com/t5/exchange-team-blog/introducing-log-parser-studio/ba-p/601131)

------
NateEag
The `join` command does a similar job and is included by default on at least
Mac OS X (10.14) and Ubuntu 16.04.

I think it's much more common than that implies, but I have not looked beyond
the two machines I had immediate access to.

The interface is clumsier in some ways, but it's already there, which is often
a win when writing scripts.

~~~
burntsushi
xsv author here.

There is very little overlap between what xsv does and what standard Unix
tools like `join` do. Chances are, if you're using xsv for something like
this, then you probably _can't_ correctly use `join` to do it because `join`
does not understand the CSV format.

If your CSV data happen to fall into the subset of the CSV format that does
not include escaped field separators (or record separators), then a tool like
`join` could work. Notably, this might include TSV (tab separated files) or
files that use the ASCII field/record separators (although I have literally
never seen such a file in the wild). But if it's plain old comma separated
values, then using a tool like `join` is perilous.

I didn't write xsv out of ignorance of standard line oriented tools. I wrote
it specifically to target problems that cannot be solved by standard line
oriented tools.

You might also argue that data should not be formatted in such a way, and
philosophically, I don't necessarily disagree with you. But xsv is not a
philosophical tool. It is a practical tool to deal with the actual data you
have.
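
To illustrate the failure mode with a hypothetical record containing an
embedded comma (Python, using the standard csv module):

    # Naive splitting on commas miscounts the fields; a real CSV parser does not.
    import csv
    import io

    line = '123,"Doe, Jane",200\n'   # hypothetical record with an embedded comma

    naive = line.rstrip("\n").split(",")
    print(naive)   # ['123', '"Doe', ' Jane"', '200']  -> 4 fields, wrong

    parsed = next(csv.reader(io.StringIO(line)))
    print(parsed)  # ['123', 'Doe, Jane', '200']       -> 3 fields, correct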

~~~
wenc
In case folks don't know, burntsushi is also the author of ripgrep (rg),
written in Rust, and possibly the fastest grep in the world at the moment.

rg has all but replaced grep for me.

~~~
jhayward
> _burntsushi is also the author of ripgrep_

They are also the source of much highly practical domain-specific knowledge
and advice on how to write Rust code that does, e.g., efficient text-file
reading, streaming, state machines, etc.

Many thanks for these contributions.

~~~
girvo
And that advice applies beyond just Rust, too. I highly recommend checking out
their blog and work!

------
georgewfraser
If your goal is to join large CSV files using a local program, the ideal tool
is not SQLite but DuckDB:

[https://www.duckdb.org/docs/current/sql/copy.html](https://www.duckdb.org/docs/current/sql/copy.html)

The reason DuckDB is a better fit for this job is that DuckDB is a column
store and has a block-oriented vectorized execution engine. This approach is
orders of magnitude faster when you’re doing batch operations with millions of
rows at a time.

In contrast, SQLite would be orders of magnitude faster than DuckDB when
you’re operating on one row at a time.

~~~
bane
I've used SQLite quite a few times, but never heard of DuckDB. Can anybody
provide some more information about it?

~~~
mytherin
Main author of DuckDB here, I did not expect to see this mentioned here.
DuckDB is a relational DBMS geared towards efficiently handling large
analytical-style workloads locally. It is similar to SQLite in the sense that
it operates locally on your machine, is easy to run and install and has zero
dependencies. However, DuckDB uses modern processing paradigms (vectorized
processing, columnar storage) that make it much faster when processing large
amounts of data.

It's still at an early stage currently; however, most of the functionality is
there (full SQL support, persistent storage, ACID properties). Feel free to
give it a try if you are interested. DuckDB has Python and R bindings, and a
shell based on the sqlite3 shell. You can find installation instructions
here:
here:
[https://www.duckdb.org/docs/current/tutorials/installation.h...](https://www.duckdb.org/docs/current/tutorials/installation.html)
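
A rough sketch of a CSV join through the Python bindings, assuming the
read_csv_auto table function is available in the version you install (the file
and column names mirror the article's hypothetical person/weight example):

    # Rough sketch using DuckDB's Python bindings; read_csv_auto and the
    # person/weight files with an ID column are assumptions, not a fixed example.
    import duckdb

    con = duckdb.connect()  # in-memory database

    rows = con.execute("""
        SELECT p.ID, p.sex, w.weight
        FROM read_csv_auto('person.csv') AS p
        JOIN read_csv_auto('weight.csv') AS w ON p.ID = w.ID
    """).fetchall()

    for row in rows:
        print(row)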

------
fyp
I wonder what niche these tools fill. Maybe when you want to do some ad-hoc
analysis and want every step in your shell history?

Otherwise it seems more flexible to just fire up a Python interpreter and do
it in like 3 lines of pandas (or with SQLite and .import, like another
commenter mentioned).
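
For reference, the pandas version really is about three lines (reusing the
article's hypothetical person/weight files):

    # Roughly the "3 lines of pandas" mentioned above.
    import pandas as pd

    person = pd.read_csv("person.csv")
    weight = pd.read_csv("weight.csv")
    joined = person.merge(weight, on="ID")  # inner join on the shared ID column

    joined.to_csv("joined.csv", index=False)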

~~~
DasIch
For me it's ad-hoc analysis on large CSV files. Large meaning well beyond what
Excel would be capable of, often larger than fits into memory on my local
machine (10s of GiB).

Sometimes I also use xsv to just do a step of the analysis and dive deeper on
some subset using pandas.

In my experience, neither SQLite nor pandas is very fast for large files, so
they are not really good options.

Pandas is especially bad because it uses a column-oriented data structure
internally, so reading from or writing to CSV is incredibly slow in pandas. If
you can use Parquet that's not a problem, but unfortunately Parquet is not
nearly as ubiquitous as CSV :(

~~~
nooorofe
If pandas is slow, then you can use Spark. For such big files a laptop is not
an option anyway. SQLite can be fast if you index your data (but I've only
worked with files < 10G). Nowadays I just upload the CSV to some cloud
database and work with the data there.

~~~
makapuf
> For such big files a laptop is not an option anyway

Too big for Excel is not big data, and my laptop can load this 10G into RAM
(not that it necessarily needs all of it), so why not, if the data is right
here and the laptop is on your lap?

------
nradov
For Windows, Microsoft offers a CSV ODBC driver that can make the files look
like any other relational database.

[https://support.microsoft.com/en-us/help/850320/creating-an-odbc-data-source-with-the-microsoft-text-driver](https://support.microsoft.com/en-us/help/850320/creating-an-odbc-data-source-with-the-microsoft-text-driver)

~~~
jmulho
Oracle has a feature called external tables that lets you map regular files
into the database and run SQL against them. It works with CSV or any format
that can be loaded into the database with the normal loading utilities. This
is very useful for getting transformation and loading done in one step!

~~~
mike22
MySQL has a barebones CSV storage engine:
[https://dev.mysql.com/doc/refman/8.0/en/se-csv-limitations.html](https://dev.mysql.com/doc/refman/8.0/en/se-csv-limitations.html)

More usefully, MySQL has LOAD DATA INFILE facilities for bulk loading of flat
files.

------
DasIch
`xsv` can do far more than just join and select, such as searching, filtering,
partitioning, ...

For me it's the first choice when I find myself with a CSV file to work with.
I've also encountered situations where just getting some raw data from a
(slow) database and "querying" it with xsv ended up being the fastest option
to get the results I wanted.

------
inimino
Also along these lines is tsv-utils:

[https://github.com/eBay/tsv-utils/blob/master/README.md](https://github.com/eBay/tsv-utils/blob/master/README.md)

There are benchmark results comparing tsv-utils to a variety of similar tools
including xsv.

------
kragen
It's interesting that there doesn't seem to be a command-line tool out there
that lets you do this with LevelDB. LevelDB is about an order of magnitude
faster at importing a lot of data than SQLite or Postgres is (although I don't
know if it's faster than the Postgres CSV foreign data wrapper for the first
query, it sure will be for the second) and it can handle files a lot bigger
than your RAM.

It's interesting to note that around 1990 The Mark Williams Company wrote a
fairly full-fledged database system using, basically, CSV files. It was called
"/rdb". You wrote your queries as shell scripts, but using a set of /rdb
utilities that handled CSV files with labeled columns, and they had a screen-
based data-entry UI based on vi. I wouldn't want to use it instead of a
database --- you had to write your query plan in the shell script, rather than
a high-level SQL-like query, because the authors didn't really understand SQL
--- but it's interesting that the approach is still useful in 2019.

------
ctoth
I was expecting the answer to involve SQLite and .import and am happy to have
learned about this cool new tool.

~~~
SahAssar
This is using SQLite behind the scenes, so it's pretty much what you expected.

~~~
tomjakubowski
On the contrary, xsv doesn't use SQLite at all

~~~
SahAssar
Ah, sorry, I mixed it up with
[http://harelba.github.io/q/](http://harelba.github.io/q/)

------
orev
Shout out to Perl DBD::CSV. Gives you full SQL query functionality.

------
cube2222
Hey, just wanted to chime in with another tool, OctoSQL, which I'm a co-author
of.

It allows you to query and join data from multiple datasources simultaneously,
which may include CSV files.

Currently available datasources are SQL databases, Redis, JSON and CSV (more
coming...).

[1]:[https://github.com/cube2222/octosql](https://github.com/cube2222/octosql)

------
jillesvangurp
There are many ways to do this. A valid reason for wanting to do this is that
doing database imports takes time and sometimes you just want to get to the
joined data set without having to wait for that.

A few years ago, I was trying to turn OpenStreetMap dumps into JSON
documents. OSM ships as a huge bzipped XML dump of what is essentially 3
tables that need to be joined to do anything productive. One of those tables
contains a few billion nodes. Importing that into PostgreSQL takes many hours
and takes up disk space too. Bzipped, this stuff was only around 35GB, but it
unpacks to >1TB. So, I wrote a few simple tools that processed the XML with
regular expressions into gzipped files, sorted those on id, and then joined
the files on id by simply scanning through several of them at once. Probably
far from optimal, but it got me results quickly enough, without taking ages or
consuming lots of disk.
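
A simplified Python sketch of that sort-then-scan idea: once both gzipped
files are sorted on id, one pass with two cursors joins them without unpacking
anything to disk. The "id<TAB>payload" line format, the file names, and the
assumption of at most one record per id are all illustrative, not details from
the comment above.

    # Simplified merge-join over two gzipped files already sorted on id.
    # Assumes lines of the form "id<TAB>payload", the same sort order in both
    # files, and at most one record per id; all of this is illustrative.
    import gzip

    def merge_join(path_a, path_b):
        with gzip.open(path_a, "rt") as fa, gzip.open(path_b, "rt") as fb:
            line_a, line_b = fa.readline(), fb.readline()
            while line_a and line_b:
                id_a, _, payload_a = line_a.rstrip("\n").partition("\t")
                id_b, _, payload_b = line_b.rstrip("\n").partition("\t")
                if id_a == id_b:
                    yield id_a, payload_a, payload_b
                    line_a, line_b = fa.readline(), fb.readline()
                elif id_a < id_b:
                    line_a = fa.readline()
                else:
                    line_b = fb.readline()

    for record in merge_join("nodes.gz", "node_tags.gz"):
        print(record)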

------
zmmmmm
Any info on whether it does this in a streaming manner? For small files there
are a plethora of solutions, but it gets more tricky / interesting when you
can't assume both files fit in memory and (ideally) you want to do it in a
single pass because I/O itself is a bottleneck.

~~~
burntsushi
It's almost done in a streaming fashion. `xsv join` uses memory proportional
to the number of records in one of the inputs and the size of the join columns
in one of the inputs. Namely, it creates an in-memory map from the join
column's values to the corresponding record index. Combined with another in-
memory index that maps record index to the corresponding record's byte offset,
it can jump to the corresponding record in constant time while streaming over
the other input.

There are better ways to do this that don't involve using this much memory,
but this struck a good balance between memory usage and implementation
complexity. I may improve on this some day.

So basically, as long as you have enough memory to store the entire join
columns along with `8 * len(one_input)` bytes for the record index, then you
should be good.
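
A rough Python sketch of that scheme (not the actual xsv code; the two
in-memory maps are collapsed into a single key-to-offsets dictionary here, and
embedded newlines inside quoted fields are ignored for brevity):

    # Index one input by its join column, remember byte offsets, then stream
    # the other input and seek to matching records.
    import csv

    def join_csv(left_path, right_path, left_col, right_col):
        # Pass 1: map join-key -> byte offsets of matching records in the left file.
        offsets = {}
        with open(left_path, newline="") as f:
            key_idx = next(csv.reader([f.readline()])).index(left_col)
            pos = f.tell()
            line = f.readline()
            while line:
                row = next(csv.reader([line]))
                offsets.setdefault(row[key_idx], []).append(pos)
                pos = f.tell()
                line = f.readline()

        # Pass 2: stream the right file, seeking into the left file on each match.
        with open(left_path, newline="") as left, open(right_path, newline="") as right:
            rows = csv.reader(right)
            r_idx = next(rows).index(right_col)
            for row in rows:
                for off in offsets.get(row[r_idx], []):
                    left.seek(off)
                    yield next(csv.reader([left.readline()])) + row

    for out in join_csv("person.csv", "weight.csv", "ID", "ID"):
        print(out)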

~~~
zmmmmm
Great! It may address some particular cases that I don't currently have a
good tool for, then. Thanks!

------
trimbo
Pretty old now and unsupported, but "crush tools" was fantastic for this kind
of thing:

[https://github.com/google/crush-tools](https://github.com/google/crush-tools)

~~~
tuesday20
This doesn’t seem to be worked on either?

------
yawgmoth
Haven't tried XSV. In the same space is Q:
[https://harelba.github.io/q/](https://harelba.github.io/q/)

~~~
gcmrtc
Also csvkit; xsv is way faster but less feature-complete (based on my
experience the last time I used both).

------
systems
LINQ can also be used for possibly similar use cases:

[https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/how-to-join-content-from-dissimilar-files-linq](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/how-to-join-content-from-dissimilar-files-linq)

~~~
fumplethumb
I haven't used xsv, but one of the nice features appears to be that it will
work on arbitrarily large datasets. From the README[0]:

> These commands are instantaneous because they run in time and memory
> proportional to the size of the slice (which means they will scale to
> arbitrarily large CSV data).

The example given with LINQ reads the whole files into memory:

      string[] names = System.IO.File.ReadAllLines(@"../../../names.csv");
      string[] scores = System.IO.File.ReadAllLines(@"../../../scores.csv");

As a fellow LINQ user (probably with LINQPad in this case, for a quick and
dirty script), I'd love to know if LINQ can be used to read the files in
"slices" like xsv. I'm sure this could be accomplished with enough code, but
is there a quick / easy way to do it?

[0]: [https://github.com/BurntSushi/xsv#a-whirlwind-tour](https://github.com/BurntSushi/xsv#a-whirlwind-tour)

~~~
Arnavion
>The example given with LINQ reads the whole files into memory:

The examples indeed use `File.ReadAllLines` which returns a `string[]`, but
the same thing can be done with `File.ReadLines` which returns an
`IEnumerable<string>` - a lazy sequence of lines read on demand as the
sequence is enumerated.
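
For readers more at home in Python than C#, the same eager-versus-lazy
distinction looks roughly like this (a loose analogue, not the .NET API):

    # readlines() materializes every line in memory, like File.ReadAllLines;
    # iterating the file object yields lines on demand, like File.ReadLines.
    with open("names.csv") as f:
        all_lines = f.readlines()   # eager: the whole file is read at once

    with open("names.csv") as f:
        for line in f:              # lazy: one line at a time
            print(line, end="")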

------
santa_boy
xsv is extremely fast. I have a high-frequency trading data API that powers
my analytics. The data comes in as JSON, which I write to a RAM disk. Then I
transform the JSON to CSV using jq and use xsv to get the data segments and
strats I need.

Despite pretty basic code, the performance is very good.

------
cpach
Another tool that can do this and other operations on CSV files is CSVkit.
I’ve had very good use for it.

[https://csvkit.readthedocs.io/en/1.0.2/](https://csvkit.readthedocs.io/en/1.0.2/)

------
PretzelFisch
Sigh, this is so much a case of "I have this nice hammer and I will use it
for this screw". Maybe this works OK on simple joins with small CSVs, but
let's just import the CSVs into database temp tables and get the job done.

------
djrobstep
Surely you are better off simply running PostgreSQL locally, loading it into a
couple of tables, and running a simple query?

~~~
macintux
As always, the correct answer depends on what you have installed, what you’re
familiar with, how much data you have, and whether you’re going to be doing it
on a regular basis and want to automate it.

~~~
djrobstep
Sure, but I think there is a widespread hesitation to use a db unless one's
data reaches a certain level of size and complexity, which is misguided.

What I'm saying is that it's good to use a database from the start, even if
the problem seems too trivial for it. That way, if your problem grows into
something that actually needs it, you already have the data in a suitable
database and the code to deal with it.

~~~
burntsushi
xsv can be inserted into a shell pipeline at the drop of a hat, regardless of
how much data you have. Using a DB is typically much more effort. Just look at
the other comments in this thread. Tons of complaints about it being slow and
trying to finagle indices to make it better.

Sometimes functionality isn't the only thing that's important. Expression can
be just as important.

------
cardosof
This is awesome. Would love to see some performance benchmarks on this tool.

~~~
DasIch
Some commands are benchmarked here:
[https://github.com/BurntSushi/xsv/blob/master/BENCHMARKS.md](https://github.com/BurntSushi/xsv/blob/master/BENCHMARKS.md)

------
sergiotapia
Very cool tool!

------
asdf21
Full join is missing the Id on 456 and the Key on 999..

------
goodside
For an advanced challenge: do it for a hundred large bz2 files too big to fit
on disk in an uncompressed state, such as the monthly-compiled hourly dumps of
Wikipedia Pagecounts-EZ ([https://dumps.wikimedia.org/other/pagecounts-ez/](https://dumps.wikimedia.org/other/pagecounts-ez/)).
This is feasible with CLI utils, but not trivial.
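
The streaming-decompression piece of that is at least straightforward in
Python: bz2 can iterate a dump line by line without the uncompressed data ever
touching disk. The file name below is illustrative, and the cross-file
merge/aggregation is the non-trivial part left out here.

    # Stream a bz2 dump line by line without materializing it on disk.
    import bz2

    def stream_lines(path):
        with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                yield line.rstrip("\n")

    for line in stream_lines("pagecounts-2019-12-views-ge-5.bz2"):
        print(line)   # parse the page-view fields as needed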

------
hermitcrab
You can also join CSV or Excel files very easily using our new tool, Easy
Data Transform:
[https://www.easydatatransform.com](https://www.easydatatransform.com)

It is a GUI tool for Windows and Mac. No syntax to remember. Just drag the
two files onto EDT and click the 'Join' button, then choose the columns to
join.

Join is just one of 36 transforms available. There is a 7-day free trial.

~~~
NicoJuicy
You were downvoted because it's off-topic and reads more like an ad. HN
doesn't like ads, though self-promotion is allowed.

The title mentions joins; I wasn't expecting a GUI tool.

~~~
hermitcrab
Thanks for the explanation. I get that you don't want overly promotional
posts. But:

- Easy Data Transform performs SQL-style joins on CSV files. It is one of its
main use cases.

- The OP didn't mention anything about it having to be command line.

~~~
NicoJuicy
I think it's more of an HN thing than anything else.

