Show HN: Command Line Tool to Sort CSV and TSV Files by Multiple Headings in Go (github.com)
29 points by johnweldon on Jan 5, 2018 | 19 comments

I’m a huge fan of csvkit, which includes a similar utility along with lots more:


Some of my favorite tools it includes are csvsql and csvlook.

Cool, looks like a nicely built set of utilities in Python. Thanks for the link.

Thanks! csvkit looks awesome!

Equivalent sort(1) invocations for your examples:

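    sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
    sort -t$'\t' -k1,1 -k2,2 -k3,3 contacts.tsv

Each key is limited to a single field (`-k2,2` rather than `-k2`, which would run from field 2 to the end of the line), and `-t$'\t'` makes tab the field separator. This assumes TSV input, but there are plenty of reasons to prefer that to CSV. If I'm working from CSV sources I usually convert to TSV first thing in my shell pipeline.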
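For the CSV-to-TSV step, here is a minimal sketch in Python using only the standard `csv` module. The function name `csv_to_tsv` is made up for illustration, and replacing embedded tabs and newlines with spaces is an assumption about what is acceptable for your data, not a standard.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy CSV rows from src to dst as tab-separated lines.

    Tabs and line breaks inside a field are replaced with spaces so that
    every record stays on one line, which is what sort(1) expects.
    """
    for row in csv.reader(src):
        cleaned = [
            field.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for field in row
        ]
        dst.write("\t".join(cleaned) + "\n")


if __name__ == "__main__":
    # Quoted commas and a quoted line break are both handled by csv.reader.
    sample = io.StringIO('name,city\n"Doe, Jane","New\nYork"\n')
    out = io.StringIO()
    csv_to_tsv(sample, out)
    print(out.getvalue(), end="")
```

The point of doing this first is that quoting disappears: once no field contains a tab or a line break, line-oriented tools can split on tabs safely.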

When sort is used on really large files, it will automatically attempt to use disk, putting temp files in TMPDIR. This can be really slow.

To overcome the slowdown of disk I/O, perhaps a workaround could be to use mfs or tmpfs, maybe something like:

    mkdir /dir
    mount -t tmpfs tmpfs /dir
    TMPDIR=/dir sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
    TMPDIR=/dir sort -t$'\t' -k1,1 -k2,2 -k3,3 contacts.tsv

Personally, I gave up on sort for large files and use k/kdb+. I suspect it is faster for sorting than sort or the Go libraries, but I could be wrong.

For a dataset larger than physical memory, using a memory filesystem like tmpfs for the merge stage will either swap (|tmpfs| < |ram|) or deadlock (|tmpfs| >= |ram|).

Instead, your best bet in that case is to give sort as much physical memory as you can spare:

    sort -S 95% -k1 huge.tsv

Extra disk I/O is inevitable since your dataset doesn't fit in memory. At least during a merge sort your disk reads will be O(N) and sequentially ordered.
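To make the merge-sort point concrete, here is a rough Python sketch of an external merge sort: sort fixed-size chunks in memory, spill each one to a temporary file, then stream a merge over all of them. This is an illustration of the general technique, not how sort(1) is implemented. The name `external_sort` and the `chunk_size` parameter are made up, and comparison is by Unicode code point rather than locale collation.

```python
import heapq
import itertools
import tempfile


def external_sort(lines, chunk_size=100_000):
    """Yield newline-terminated lines in sorted order.

    At most chunk_size lines are held in memory during the first phase;
    the merge phase keeps roughly one line per temporary file.
    """
    runs = []
    it = iter(lines)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        # TemporaryFile honors TMPDIR, like sort(1) does for its spill files.
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    try:
        # Each run is read once, front to back, so disk reads are sequential.
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    data = ["pear\n", "apple\n", "fig\n", "banana\n", "cherry\n"]
    print("".join(external_sort(data, chunk_size=2)), end="")
```

This sketch merges every run in a single pass, so it keeps one open file per chunk; a real tool has to cap that number and merge in several stages when there are too many runs.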

Note: in the special case that your dataset is slightly larger than physical memory, splitting it up in advance such that one of the `sort -m` input files lives on a tmpfs should indeed be faster.

Other things to check out if you need Very Fast Large Sorts:

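- Use `sort --parallel=N` to control how many cores are used. Recent GNU sort defaults to the number of available processors, capped at 8, so check what your version does before assuming it is single-threaded.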

- Use `sort --batch-size=NMERGE` to increase the number of files merged at once. Otherwise you may be doing more mergesort stages than are necessary.

Thanks - I've used sort quite a bit, and I like it. I wrote this partly to just fulfil my desire to sort by named fields rather than column indexes.

I know the annoyance you're talking about, but I think you're better off wrapping sort(1) with something that translates from column names to indexes. Among the reasons: sortcsv buffers all input into memory [1], while sort(1) uses a divide-and-conquer merge sort to avoid this.

[1] https://github.com/johnweldon/sortcsv/blob/55818bd8e5f9feecc...
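A minimal sketch of that kind of wrapper, in Python: read the header line, look up each requested name, and emit the matching `-kN,N` options for sort(1). The function name `sort_args` and the column names in the usage example are hypothetical, and this is not taken from sortcsv.

```python
def sort_args(header, names, delimiter="\t"):
    """Build a sort(1) argument list that orders by the named columns.

    header is the first line of the file; names are column names in
    priority order. Raises ValueError if a name is not in the header.
    """
    columns = header.rstrip("\r\n").split(delimiter)
    args = ["sort", "-t", delimiter]
    for name in names:
        position = columns.index(name) + 1  # sort(1) fields are 1-based
        args.append(f"-k{position},{position}")
    return args


if __name__ == "__main__":
    # Hypothetical header for a contacts file.
    print(sort_args("first\tlast\temail\n", ["last", "first"]))
```

The returned list can be passed to `subprocess.run` with the file name appended. The header row still has to be kept out of the sorted data, for example by writing it out first and feeding only the remaining lines to sort.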

That's a good point. I may consider a standalone utility to wrap this[1] code that converts names to indexes.

[1] https://github.com/johnweldon/sortcsv/blob/master/main.go#L1...

Can anyone provide sample input and output for the example? I find it difficult to evaluate text processing software quickly against existing solutions when there is no example given, such as: here is some sample input and here is the desired output, as is done at, e.g., unix.com.

I updated the README.md with some example usage and output. Thanks for the feedback.

I've been using https://github.com/BurntSushi/xsv which is quite nice and has a few other very handy csv tools.

xsv is fantastic. I'm a longtime user and fan of csvkit but I've slowly switched some of my habitual usage to xsv. Note that csvkit -- not being a single program like xsv but rather a collection of utilities -- contains a few branches of functionality that xsv doesn't aim to replicate, namely csvsql (convert CSV into SQL create and insert statements) and in2csv (convert XLS and JSON to CSV).

I hadn't seen that tool, thanks for pointing it out.

Indeed, perhaps it will give you some fun ideas!

While not "in Go", Homebrew showed me this tool a while back and I like it bunches:

> Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.


Works great with the previously shared Go command line tool jw4.us/to8 when input files are not UTF8.

Use to8 to convert from UTF(32|16)(LE)? etc. to UTF8 first, then sort with this tool.

Is there an advantage to to8 over iconv?

I've used iconv for years and it's never let me down.

I wrote this tool because I don't want to explicitly know the original encoding, I just want _any_ encoding to be converted to UTF8. AFAIK, iconv requires the source encoding to be specified on the command line.
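One common way to avoid naming the source encoding is to sniff the byte order mark. The Python sketch below shows that idea only; it is not a description of how to8 works, the name `to_utf8` is made up, and input without a BOM is simply assumed to be UTF-8 already.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as UTF-8 without a BOM.

    The source encoding is guessed from the byte order mark. Input with
    no BOM is assumed to be UTF-8 and raises UnicodeDecodeError if not.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM and pick the byte order from it.
            return data.decode(encoding).encode("utf-8")
    return data.decode("utf-8").encode("utf-8")


if __name__ == "__main__":
    print(to_utf8("héllo".encode("utf-16")))
```

Files in legacy 8-bit encodings carry no BOM, so a sniffing approach like this cannot identify them; that case needs either a heuristic detector or an explicit source encoding.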
