
Show HN: Command Line Tool to Sort CSV and TSV Files by Multiple Headings in Go - johnweldon
https://github.com/johnweldon/sortcsv
======
z1mm32m4n
I’m a huge fan of csvkit, which includes a similar utility along with lots
more:

[http://csvkit.readthedocs.io/en/1.0.2/scripts/csvsort.html](http://csvkit.readthedocs.io/en/1.0.2/scripts/csvsort.html)

Some of my favorite tools it includes are csvsql and csvlook.
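
For a rough idea of what they look like in practice (the column names here are
just placeholders; check the docs for the exact flags in your version):

    csvsort -c last_name,first_name contacts.csv | csvlook
    csvsql --query "select city, count(*) from contacts group by city" contacts.csv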

~~~
johnweldon
Cool, it looks like a nicely built set of utilities in Python. Thanks for the
link.

------
sigil
Equivalent sort(1) invocations for your examples:

    
    
        sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
        sort -t$'\t' -k1,1 -k2,2 -k3,3 contacts.tsv
    

This assumes TSV input, but there are plenty of reasons to prefer that to CSV.
If I'm working from CSV sources I usually convert to TSV first thing in my
shell pipeline.
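
For example, one way to do that conversion up front (this one leans on
csvkit's csvformat, mentioned elsewhere in the thread; the filename is just an
example) is roughly:

    csvformat -T contacts.csv > contacts.tsv
    sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv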

~~~
feelin_googley
When sort is used on really large files, it will automatically attempt to use
disk, putting temp files in TMPDIR. This can be really slow.

To work around the disk I/O slowdown, one option is to point TMPDIR at an mfs
or tmpfs mount, something like:

    
    
       mkdir /dir
       mount -t tmpfs tmpfs /dir
       TMPDIR=/dir sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
       TMPDIR=/dir sort -t$'\t' -k1,1 -k2,2 -k3,3 contacts.tsv
    

Personally, I gave up on sort for large files and use k/kdb+. I suspect it is
faster for sorting than sort or the Go libraries, but I could be wrong.

~~~
sigil
For a dataset larger than physical memory, using a memory filesystem like
tmpfs for the merge stage will either swap (|tmpfs| < |ram|) or deadlock
(|tmpfs| >= |ram|).

Instead, your best bet in that case is to give sort as much physical memory as
you can spare:

    
    
        sort -S 95% -k1 huge.tsv
    

Extra disk I/O is inevitable since your dataset doesn't fit in memory. At
least during a merge sort your disk reads will be O(N) and sequentially
ordered.

Note: in the special case that your dataset is slightly larger than physical
memory, splitting it up in advance such that one of the `sort -m` input files
lives on a tmpfs should indeed be faster.
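
A rough sketch of that special case (the chunk size, file names, and tmpfs
mount point are all placeholders):

    # split the input, sort each chunk, keep one sorted run on the tmpfs
    split -l 100000000 huge.tsv chunk_
    sort -k1 chunk_aa > /mnt/tmpfs/sorted_aa
    sort -k1 chunk_ab > sorted_ab
    # -m merges the already-sorted runs
    sort -m -k1 /mnt/tmpfs/sorted_aa sorted_ab > huge.sorted.tsv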

Other things to check out if you need Very Fast Large Sorts:

- Use `sort --parallel=N` to use multiple cores. By default it only uses 1.

- Use `sort --batch-size=NMERGE` to increase the number of files merged at
once. Otherwise you may be doing more mergesort stages than are necessary.
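
Combined with the -S hint above, an illustrative invocation (the numbers are
just placeholders to tune for your machine) might be:

    sort --parallel=8 --batch-size=64 -S 80% -k1 huge.tsv > huge.sorted.tsv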

------
feelin_googley
Can anyone provide sample input and output for the example? I find it
difficult to evaluate text processing software quickly against existing
solutions when no example is given along the lines of "here is some sample
input, and here is the desired output," as is done at, e.g., unix.com.

~~~
johnweldon
I updated the README.md with some example usage and output. Thanks for the
feedback.

------
bfrog
I've been using
[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv) which
is quite nice and has a few other very handy csv tools.
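
For a quick taste (the column names are made up; `xsv --help` lists the real
commands and flags):

    xsv sort -s city,last_name contacts.csv | xsv table
    xsv stats contacts.csv | xsv table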

~~~
johnweldon
I hadn't seen that tool, thanks for pointing it out.

~~~
bfrog
Indeed, perhaps it will give you some fun ideas!

------
mdaniel
While not "in Go", Homebrew showed me this tool a while back and I like it
bunches:

> Miller is like awk, sed, cut, join, and sort for name-indexed data such as
> CSV, TSV, and tabular JSON.

[https://github.com/johnkerl/miller#readme](https://github.com/johnkerl/miller#readme)
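
For example, its sort verb looks roughly like this (the field names are
invented for illustration):

    mlr --icsv --opprint sort -f city -nr age contacts.csv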

------
johnweldon
Works great with the previously shared Go command line tool jw4.us/to8 when
input files are not UTF8.

Use to8 to convert from UTF(32|16)(LE)? etc. to UTF8 first, then sort with
this tool.

~~~
donatj
Is there an advantage to to8 over iconv?

I've used iconv for years and it's never let me down.

~~~
johnweldon
I wrote this tool because I don't want to have to know the original encoding
explicitly; I just want _any_ encoding to be converted to UTF8. AFAIK, iconv
requires the source encoding to be specified on the command line.
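
For comparison, an iconv invocation has to name the source encoding
explicitly, something like:

    iconv -f UTF-16LE -t UTF-8 contacts.csv > contacts.utf8.csv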

