
Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV - john_kerl
http://johnkerl.org/miller/doc/
======
dajobe
There is also the (impossible to google)
[https://github.com/harelba/q](https://github.com/harelba/q) that lets you SQL
over CSV and is unix pipeline friendly.
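For instance, a minimal sketch of the kind of pipeline it enables (flags from
memory, so double-check against q's help: -H treats the first line as a
header, -d sets the delimiter, and "-" means stdin; example.csv is the file
from the article):

    
        cat example.csv | q -H -d , "SELECT year, COUNT(*) FROM - GROUP BY year"
    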

~~~
lambda
There's also
[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv) which
works fairly similarly to Miller, though it can also index a CSV file so that
future operations are sped up by using the index.
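Roughly, as a sketch (subcommand names as I remember them; data.csv is just a
placeholder):

    
        xsv index data.csv     # writes data.csv.idx next to the file
        xsv count data.csv     # later commands can answer from the index
    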

~~~
dmix
xsv is awesome, just used it for a mini-project. Thanks for the heads up.

------
michaelmior
I love csvkit[0] for this purpose

[0] [https://github.com/onyxfish/csvkit](https://github.com/onyxfish/csvkit)

~~~
loevborg
csvkit is excellent. I like `csvlook` for displaying csv files in the terminal.
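
For example (using the example.csv from the article):

    
        csvcut -c year,price example.csv | csvlook
    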

Can anyone recommend an easy way to convert csv to xls on Linux? I know that
libreoffice and gnumeric can do it in headless mode, but that seems like
terrible overkill.

~~~
technofiend
It is mega overkill, but Pentaho converts between a long list of file and
database formats. I mention it mostly so you can check it out in case you
ever hit a more complex requirement, or want to learn the tool. Once I
learned it, the one-off stuff became trivial. It takes me 5-10 minutes to
import an xls, let the tool examine the file for field names and types, hook
that to a table-load step, generate the DDL, edit the DDL to remove spaces
from field names, and then execute the DDL followed by the loader job.

Since you're going flat file to flat file, it would be faster for you, as no
field-name edits are required. You just create a csv input step and an xls
output step, save and run the transformation, and you're done.

------
burntsushi
> And initial experiments with Rust are intriguing. Yet with none of them
> could I obtain the throughput I get in C: see the Performance page.

I'd be curious to see how `xsv` (written in Rust) compares performance-wise to
Miller. (I can't compile Miller---I commented on the issue tracker.)

> Thus the absence of in-process multiprocessing is only a slight penalty in
> this particular application domain — parallelism here is more easily
> achieved by running multiple single-threaded processes, each handling its
> own input files, either on a single host or split across multiple hosts.

I disagree. CSV files can be indexed[1] very simply, providing random access.
This makes it trivial to split a large CSV file into chunks which can be
operated on in parallel. This lifts the burden off the user to do the
splitting and merging manually, which can become quite tedious when you
want to do frequency analysis on the fields.

This type of indexing is not usually seen because it requires support from the
underlying CSV parser to produce correct byte offsets. But you've written your
own parser here in Miller, so you should be able to do something similar to
`xsv`. Moreover, random access gets you slicing whose cost is proportional to
the size of the slice: e.g., extracting the last 10 records of a 4M-row CSV
file is instantaneous. This
opens up other doors too, like potentially faster joining and fast random
sampling.

P.S. I love the name. :-)

[1] -
[http://burntsushi.net/rustdoc/csv/index/index.html](http://burntsushi.net/rustdoc/csv/index/index.html)
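
To make that concrete, something along these lines with `xsv` (flags from
memory, so treat as approximate; big.csv is a hypothetical ~4M-row file):

    
        xsv index big.csv              # one-time pass, writes big.csv.idx
        xsv slice -s 3999990 big.csv   # last 10 records, seeking via the index
        xsv frequency big.csv          # per-field frequency tables, parallelizable when an index is present
    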

------
ketralnis
Not exactly the same thing, but I often use a tiny tool I wrote called
shellql[1] that lets you just type in SQL like

    
    
        cat foo.csv | shellql "select f1, count(*) from tbl where f2='thingiwant' group by f1 order by count(*) desc limit 5"
    

It has some limitations on the formats of CSV that it can take, namely that
string fields have to be quoted, but you can see from the code that there's
not much to it if you wanted to change that. It'd also be easy enough to take
in multiple CSV files, join them together, etc.

sqlite's pretty badass.

[1]
[https://gist.github.com/ketralnis/8124b3ead9d94e00dcb2](https://gist.github.com/ketralnis/8124b3ead9d94e00dcb2)

------
rwmj
Or csvtool (written 2005, available in all good Linux distros):

[https://github.com/Chris00/ocaml-csv/blob/master/examples/cs...](https://github.com/Chris00/ocaml-csv/blob/master/examples/csvtool.ml#L510)

------
justhjkust
There's also xsv:

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

------
trimbo
Also check out Google's Crush-tools

[https://github.com/google/crush-tools](https://github.com/google/crush-tools)

------
jakobegger
It looks like this tool doesn't support semicolon-separated CSV files, quotes,
or numbers using a comma as decimal separator. Too bad.

It's a pity that the CSV file format is so fragmented. It's all too common
that a CSV file written by tool A can't be read by tool B.

Only a very few tools do it right. For example, I think the ruby CSV module has
a heuristic to automatically detect the "style" of CSV files, which is pretty
neat.

~~~
iampims
I found the pandas tests to be a great source for implementing robust CSV
parsers:

[https://github.com/pydata/pandas/blob/master/pandas/io/tests...](https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_parsers.py)

------
kazinator
You _can_ name fields in awk so that you don't have to refer to them
numerically as $8, $13 or whatever. Example:

    
    
       BEGIN { first=1; second=2; third=3 }
    
       $first > 42 { print $third }
    

Here, $first means $ applied to the variable first. Since first evaluates to
1, it means $1.

Quote-unaware ("broken") CSV handling is easily done in awk by using comma as
the field separator.
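
For instance (fine until a quoted field contains a comma; file.csv is just a
placeholder):

    
        awk -F, '$1 > 42 { print $3 }' file.csv
    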

~~~
cben
What about non-broken CSV? Is there an easy way to tell awk to skip escaped or
quoted delimiters?

------
mhd
I sometimes dumb down CSV to what sqlite understands and then just import it
there. Gonna have a look at this, but whether it's SQL or Ruby/Python/Perl
structures, I generally prefer to spend as little time as possible in actual
CSV. Even more so if it contains hierarchical data.
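
For reference, the import itself is just a couple of dot-commands in the
sqlite3 shell (database and table names here are made up; recent sqlite3 takes
column names from the first row when the table doesn't already exist):

    
        sqlite3 data.db
        .mode csv
        .import flins.csv flins
        SELECT county, COUNT(*) FROM flins GROUP BY county ORDER BY COUNT(*) DESC;
    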

------
sordina
I made a little tool called SQLBong that does something similar by embedding
Sqlite3 under the hood:
[https://github.com/sordina/SQLBong](https://github.com/sordina/SQLBong)

------
john_kerl
Several pieces of feedback incorporated at

[https://github.com/johnkerl/miller/commit/a1d117d3b299bbf273...](https://github.com/johnkerl/miller/commit/a1d117d3b299bbf27309f157b0a8c90fea7ac066)

[https://github.com/johnkerl/miller/commit/7aead02c71a76fb281...](https://github.com/johnkerl/miller/commit/7aead02c71a76fb281273563e7e65d3be018bf52)

Also, there is one issue closed (thanks epilanthanomai!) and two open.
Remaining feedback items are on the miller todo.txt.

Thanks all for the initial feedback -- exactly what I was looking for by
posting here!

------
pessimizer
Whenever somebody posts a csv manipulation tool, it's obligatory for everyone
else to post their favorite one too:
[http://neilb.bitbucket.org/csvfix/](http://neilb.bitbucket.org/csvfix/) is
mine; manual at
[http://neilb.bitbucket.org/csvfix/manual/csvfix16/csvfix.htm...](http://neilb.bitbucket.org/csvfix/manual/csvfix16/csvfix.html)

------
pentestercrab
Does anyone happen to know of similar or generally useful tools for dealing
with YAML files? Would ideally like to be able to run SQL-like queries.

------
acveilleux
I have my own little toolkit of csv parsing tools that do most of these (using
python's csv module.)

I suspect most people who deal with bulk CSV files all the time do as well.
I'd be curious about how gracefully it handles large CSV files.

The lack of quoting support kills it for my use case however. I wrote my own
tools, starting with csv_cut _because_ cut(1) didn't do quoting.

~~~
ggrothendieck
Regarding quoting support, check out
[https://github.com/dbro/csvquote](https://github.com/dbro/csvquote)
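
The usual pattern, if I remember its README right, is to sandwich ordinary
text tools between csvquote and csvquote -u, which swaps the troublesome
characters out and then back:

    
        csvquote data.csv | cut -d , -f 2 | sort | uniq -c | csvquote -u
    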

~~~
acveilleux
Not sufficient, but close. It substitutes the quoted character for another
one.

~~~
dbro
Hi- I'm curious to know what your use case is. Can you explain why
substitution is not sufficient?

~~~
acveilleux
(1) I filter on column content using regex and dealing with a sub character
adds complexity.

(2) Many of my columns are free-form text containing commas, carriage returns,
newlines, tabs, vertical tabs, and file separators (0x1c). Occasionally, text
is in UCS-2/UTF-16 or uses UTF-8 and foreign characters (a non-trivial
quantity of the text I process is in French, for example).

(If you read between the lines here, some columns can contain MLLP-encoded HL7
messages, others contain free-form text and I'm in the medical field.)

------
natch
This is lovely. Portable C, awesome.

~~~
ghshephard
Not totally portable. The makefile has some ctags option dependencies that I'm
going to have to futz with in order to get it to run.

    
    
        12:58 shephard:c shephard$ make all
        ctags -R .
        ctags: illegal option -- R
        usage: ctags [-BFadtuwvx] [-f tagsfile] file ...
        make: *** [tags] Error 1
    

Edit - No "-R" option for ctags in OS X 10.8.5, or in OpenBSD Current -
[http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man1/...](http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man1/ctags.1?query=ctags)

Edit2 - Now it seems to be breaking because lemon isn't found in the path
during the make:

    
    
      lemon filter_dsl_parse.y
      make[1]: lemon: No such file or directory
      make[1]: *** [filter_dsl_parse.c] Error 1
      make: *** [dsls] Error 2
      13:07 shephard:c shephard$
    

~~~
john_kerl
ctags etc. I'll split out; effectively there are make targets to make dev nice,
and a smaller set just to make the binary. ctags is definitely in the former
category. Thanks for the feedback!

~~~
john_kerl
fixed in master

------
jongraehl
To build: install flex and ctags, and compile with -Wno-error=unused-but-set-variable.

------
chromano
I see a lot of value in this tool, because I've been in the same situation as
you. I can't thank you enough for this; it's a tool I've been wanting to
implement for so long.

~~~
john_kerl
Thanks! I use it regularly, but there are surely many corner cases & bugs I
haven't encountered yet. I'd be happy to get your feedback.

------
jazzido
The little-known csvfix
([https://bitbucket.org/neilb/csvfix](https://bitbucket.org/neilb/csvfix)) is
pretty good too.

------
nileshtrivedi
This is nice! It'd be great to be able to download a .deb package for it.

~~~
john_kerl
Good idea. I also have a request for a homebrew formula, which looks simple
enough to do.

~~~
avn2109
Request seconded, Homebrew would be great.

------
john_kerl
RFC4180 CSV support:
[https://news.ycombinator.com/item?id=10132831](https://news.ycombinator.com/item?id=10132831)

~~~
john_kerl
[https://github.com/johnkerl/miller/releases/tag/v2.0.0](https://github.com/johnkerl/miller/releases/tag/v2.0.0)

------
jevgeni
This thing is convenient, but I don't understand why it needs a DSL. Why do I
need to use `mlr --csv sort ...` instead of `mlr --csv ... | sort`?

~~~
asdfaoeu
Because otherwise it will sort the header line. It looks like the reason for
the --csv is that csv isn't the default format.
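
Compare (mlr verb syntax roughly as in the docs, on the article's example.csv):

    
        sort -t, -k1 example.csv            # header line gets sorted in with the data
        mlr --csv sort -f year example.csv  # header stays first; records sort on the named field
    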

------
san_dimitri
I am surprised no one's looking at Apache Drill.
[https://drill.apache.org](https://drill.apache.org)

~~~
technofiend
I've used it out of curiosity and it works, but honestly I just fell back to
my primary weapon of choice: Pentaho. This isn't because I'm not comfortable
searching data at the command line, it's more because of momentum in that all
my other data is already in a database and I usually need to join datasets,
not just search them.

It's trivial to create a two hop transformation in Pentaho. The first step
reads your data out of a CSV, and the second loads it into a table. Pentaho
will generate the DDL needed to create the table from the field names and
types it discovers by introspecting the data in the import step.

Having said that, I've often wished for a tool exactly like this that's
lighter-weight than Drill (and doesn't expose my data on a web page like Drill
does), so I will definitely bookmark this for the next time it's needed.
Hopefully it makes it into the Red Hat EPEL.

------
djrobstep
Nothing wrong with this approach, I suppose, but wouldn't most people be
better off just using a database for this sort of thing?

~~~
dmix
Depends on your familiarity with the unix toolset and on the use case. An
amusing (if somewhat contrived) example of this:

[http://aadrake.com/command-line-tools-can-be-235x-faster-tha...](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

Most CSV analysis is static, not something ongoing with inserts/updates/etc,
so using a combination of unix tools to handle the data processing seems like
a sufficient solution for many scenarios.

~~~
john_kerl
+1 and nice link :)

------
Pamar
Can anyone suggest something similar for Windows?

~~~
Cyykratahk
PowerShell can handle CSV files natively.

Here are the equivalent commands from the article:

    
    
      Import-Csv example.csv | Select-Object year, price   | Export-Csv -NoTypeInformation example-cut.csv
      Import-Csv example.csv | Sort-Object   year          | Export-Csv -NoTypeInformation example-sorted.csv
      Import-Csv example.csv | Where-Object  year -eq 1999 | Export-Csv -NoTypeInformation example-filtered.csv
    

PowerShell can also deal with JSON and XML in much the same way.

~~~
useerup
This neatly illustrates _why objects matter_.

As cool as miller is, _jq_ is also decidedly cool, but for json. Both claim
that they are "like" awk, sed, cut, paste etc, but for $Format, where $Format
is csv/tsv and json, respectively.

They both deal with much the same tasks, and there is _considerable_ overlap.

What your PowerShell examples do not show is how, after Import-Csv, you are no
longer constrained to tools written for csv (or json or ...). At that point
the records of the csv file have become _objects_ in the PowerShell type
system (which is an extension of the .NET type system).

Another point about the corresponding functionality in PowerShell is that you
_do not_ need to convert back to csv, json or whatever. The _Export-Csv_
commands in your examples are only needed if you want to finally save the
results as csv. Otherwise you would just keep the objects in a variable or
flowing through the pipeline/script.

So TFA's 2nd example could be written as

    
    
        ipcsv flins.csv | group county -n
    

If the data had been in json, one would write

    
    
        cat flins.json | convertfrom-json | group county -n
    

Which also means that data can be exported through any cmdlet that exports or
converts a stream of objects: Export-Csv, ConvertTo-Csv, ConvertTo-Html,
ConvertTo-Json, ConvertTo-Xml, ...

------
chatman
License is unclear.

~~~
john_kerl
BSD, but I'll emphasize it.

------
smegel
If you work with this sort of data a lot, you might want to look at R.

~~~
brudgers
Your comment brings up an interesting point. Sed, Awk, etc. are single-purpose
tools, and that's one of the reasons they're so long-lived, according to the
common wisdom of *nix. The other side of that wisdom is that once you build a
swiss army knife, it gets harder to argue against just using a full-blown
programming language like R, TCL, Perl, Python, etc.

~~~
uxcn
I use R, and I still use ad-hoc commands from time to time too. I guess _mlr_
and _xsv_ lie somewhere in between, which is a space I've personally never
really considered.

R is specifically designed for manipulating datasets though, and has a _lot_
more functionality/dependencies. If you're already familiar with the language,
and okay with the dependencies, it is vastly more powerful, more succinct, and
arguably easier to use. Otherwise, something in between probably still makes
sense.

