Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV (johnkerl.org)
222 points by john_kerl on Aug 15, 2015 | 76 comments



There is also the (impossible to google) https://github.com/harelba/q that lets you SQL over CSV and is unix pipeline friendly.
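A typical invocation looks something like this (the -H/-d flags are from q's README; the file and column names here are just illustrative):

    # -H: first line is a header, -d ,: comma-delimited
    q -H -d , "SELECT name, age FROM data.csv WHERE age > 30"

    # it also reads stdin (table name "-"), so it composes with pipes
    cat data.csv | q -H -d , "SELECT COUNT(*) FROM -"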


There's also https://github.com/BurntSushi/xsv which is fairly similar to Miller in how it works, though it can also index a CSV file so that future operations are sped up by using the index.


xsv is awesome, just used it for a mini-project. Thanks for the heads up.


Hi. q's developer here. Thanks for the mention and kind words everyone.

I've considered the searchability issue when deciding on a name for it, but eventually favored the day-to-day minimum-typing short name over better searchability.

Anyway, you can search for "harelba q" in order to find it if needed.

Harel @harelba


Miller looks very useful, and so does q. Thanks to the op and to you for the introductions.


indeed, I hadn't found q. thanks for the intro!


This is great. Wonder why they didn't go with "cq" or something, though.


does it bring any advantages over https://metacpan.org/pod/DBD::CSV? (not being perl is not an advantage).


Well, DBD::CSV is just a driver for DBI, not a command that is "unix pipeline friendly"... So there's that.


Perl is pipeline friendly though ...


DBD::CSV is a library, which you could use to build a command, but doesn't seem to come with one out of the box. q is just a command, but not a library.

One could use DBI + DBD::CSV + Perl to build something similar to q, but that's a batteries not included solution.


I love csvkit[0] for this purpose

[0] https://github.com/onyxfish/csvkit


simple csvkit tutorial:

"

- Convert Excel to CSV: in2csv data.xls > data.csv

- Convert JSON to CSV: in2csv data.json > data.csv

- Print column names: csvcut -n data.csv

- Select a subset of columns: csvcut -c column_a,column_c data.csv > new.csv

- Reorder columns: csvcut -c column_c,column_a data.csv > new.csv

- Find rows with matching cells: csvgrep -c phone_number -r "555-555-\d{4}" data.csv > matching.csv

- Convert to JSON: csvjson data.csv > data.json

- Generate summary statistics: csvstat data.csv

- Query with SQL: csvsql --query "select name from data where age > 30" data.csv > old_folks.csv

- Import into PostgreSQL: csvsql --db postgresql:///database --insert data.csv

- Extract data from PostgreSQL: sql2csv --db postgresql:///database --query "select * from data" > extract.csv

"

see more: http://csvkit.readthedocs.org/en/0.9.1/#


csvkit is excellent. I like `csvlook` for displaying csv files in the terminal.

Can anyone recommend an easy way to convert csv to xls on Linux? I know that libreoffice and gnumeric can do it in headless mode, but it seems like terrible overkill.


It is mega overkill, but Pentaho does conversion between a long list of file and database formats. I mention it mainly so you can check it out in case you ever see a more complex requirement, or want to learn the tool. Once I learned it, the one-off stuff became trivial. It takes me 5-10 mins to import an xls, let the tool examine the file for field names and types, hook that to a table load step, generate the DDL, edit the DDL to remove spaces from field names, and then execute the DDL followed by the loader job.

Since you're going flat file to flat file, it would be faster for you as no field name edits are required. You just create a csv input step and an xls output step, and that's it: save and run the transformation and you're done.


A good way to find those kinds of conversion tools is to search using the x2y pattern. For example, searching "csv2xls" returns:

https://github.com/ferkulat/csv2xls

https://github.com/alexprengere/csv2xls

http://www.bitools.eu/csv2xls/


As others have mentioned, you can find variations of this around online, but if you wish to roll your own (perhaps to add other features or what have you), you could knock it out in about 20 lines of Python using the built-in CSV library and Openpyxl [1].

[1] https://openpyxl.readthedocs.org/en/latest/


> And initial experiments with Rust are intriguing. Yet with none of them could I obtain the throughput I get in C: see the Performance page.

I'd be curious to see how `xsv` (written in Rust) compares performance wise to Miller. (I can't compile Miller---I commented on the issue tracker.)

> Thus the absence of in-process multiprocessing is only a slight penalty in this particular application domain — parallelism here is more easily achieved by running multiple single-threaded processes, each handling its own input files, either on a single host or split across multiple hosts.

I disagree. CSV files can be indexed[1] very simply, providing random access. This makes it trivial to split a large CSV file into chunks which can be operated on in parallel. This lifts the burden off the user of doing the splitting and merging manually, which can become quite tedious when you want to do frequency analysis on the fields.

This type of indexing is not usually seen because it requires support from the underlying CSV parser to produce correct byte offsets. But you've written your own parser here in Miller, so you should be able to do something similar to `xsv`. Moreover, random access gets you slicing proportional to the slice; i.e., extracting the last 10 records of a 4M-row CSV file is instantaneous. This opens up other doors too, like potentially faster joining and fast random sampling.
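For a rough idea of what that looks like with xsv (file names here are placeholders):

    # build a byte-offset index next to the file (big.csv.idx)
    xsv index big.csv

    # with the index present, counting and slicing no longer scan the whole file
    xsv count big.csv
    xsv slice --start 3999990 big.csv

    # split into fixed-size chunks that can be processed in parallel
    xsv split -s 100000 chunks/ big.csv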

P.S. I love the name. :-)

[1] - http://burntsushi.net/rustdoc/csv/index/index.html


Not exactly the same thing, but I often use a tiny tool I wrote called shellql[1] that lets you just type in SQL like

    cat foo.csv | shellql "select f1, count(*) from tbl where f2='thingiwant' group by f1 order by count(*) desc limit 5"
It has some limitations on the formats of CSV that it can take, namely that string fields have to be quoted, but you can see from the code that there's not much to it if you wanted to change that. It'd also be easy enough to take in multiple CSV files, join them together, etc.

sqlite's pretty badass.

[1] https://gist.github.com/ketralnis/8124b3ead9d94e00dcb2


Or csvtool (written 2005, available in all good Linux distros):

https://github.com/Chris00/ocaml-csv/blob/master/examples/cs...



Also check out Google's Crush-tools

https://github.com/google/crush-tools


It looks like this tool doesn't support semicolon-separated CSV files, quotes, or numbers using a comma as decimal separator. Too bad.

It's a pity that the CSV file format is so fragmented. It's all too common that a CSV file written by tool A can't be read by tool B.

Very few tools get it right. For example, I think the ruby CSV module has a heuristic to automatically detect the "style" of CSV files, which is pretty neat.


I found the pandas tests to be a great source for implementing robust CSV parsers:

https://github.com/pydata/pandas/blob/master/pandas/io/tests...


Auto-detecting format sounds like a recipe for injection attacks.


semicolons etc you get with FS RS OS etc. number commas I can probably implement with locale; i'll look. and quotes are definitely to do. :)
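For example, something along these lines should already work for the semicolon case (exact flag spellings per mlr --help; this is a sketch):

    # read semicolon-separated input, write comma-separated output
    mlr --csv --ifs semicolon --ofs comma cat data.csv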


CSV stands for 'Comma-Separated Values', not 'Semicolon-Separated Values'.

Seeing as this was downvoted, I'm just pointing out that bemoaning how the CSV format is fragmented and then using it incorrectly is the precise reason it is fragmented. It wasn't originally meant to be extended in so many ways, so there really isn't room to complain when a library doesn't magically figure out your personal delimiter or qualifier.


You can name fields in awk so that you don't have to refer to them numerically as $8, $13 or whatever. Example:

   BEGIN { first=1; second=2; third=3 }

   $first > 42 { print $third }
Here, $first means $ applied to the variable first. Since first evaluates to 1, it means $1.

Broken CSV handling that doesn't respect quotes is easily done in awk using comma as the field separator.
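A minimal sketch combining that with the named-field trick above (the field positions are made up, and this will mis-split any quoted field that contains a comma):

    # naive comma splitting; NR > 1 skips the header line
    awk -F, 'BEGIN { year=1; price=2 }
             NR > 1 && $year == 1999 { print $price }' example.csv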


What about non-broken CSV? Is there an easy way to tell awk to skip escaped or quoted delimiters?


I sometimes dumb down CSV to what sqlite understands and then just import it there. Gonna have a look at this, but whether it's SQL or Ruby/Python/Perl structures, I generally prefer to spend as little time as possible working with actual CSV. Even more so if it contains hierarchical data.


I made a little tool called SQLBong that does something similar by embedding Sqlite3 under the hood: https://github.com/sordina/SQLBong


Several pieces of feedback incorporated at

https://github.com/johnkerl/miller/commit/a1d117d3b299bbf273...

https://github.com/johnkerl/miller/commit/7aead02c71a76fb281...

Also there is one issue closed (thanks epilanthanomai!) and two open. Remaining feedback items on the miller todo.txt.

Thanks all for the initial feedback -- exactly what I was looking for by posting here!


I see a lot of value in this tool, because I've been in the same situation as you. I can't thank you enough for this; it's a tool I've been wanting to implement for so long.


thanks! i use it regularly. but there are surely many corner cases & bugfixes i haven't encountered yet. i'd be happy to get your feedback.


Whenever somebody posts a csv manipulation tool, it's obligatory for everyone else to post their favorite one too: http://neilb.bitbucket.org/csvfix/ is mine; manual at http://neilb.bitbucket.org/csvfix/manual/csvfix16/csvfix.htm...


This is lovely. Portable C, awesome.


Not totally portable. The makefile depends on some ctags options that I'm going to have to futz with in order to get it to run.

    12:58 shephard:c shephard$ make all
    ctags -R .
    ctags: illegal option -- R
    usage: ctags [-BFadtuwvx] [-f tagsfile] file ...
    make: *** [tags] Error 1
[Edit - No "-R" option for ctags in OS X 10.8.5, or in OpenBSD Current - http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man1/...

Edit2 - Now, it seems to be breaking on lemon not being explicitly called out in the path during a make:

  lemon filter_dsl_parse.y
  make[1]: lemon: No such file or directory
  make[1]: *** [filter_dsl_parse.c] Error 1
  make: *** [dsls] Error 2
  13:07 shephard:c shephard$

]


ctags etc I'll split out. effectively there are make targets to make dev nice, and a smaller set just to make the binary. ctags definitely in the former category. thanks for the feedback!


fixed in master


yes, there is an issue open. I have . in my path. will post a diff.


He says he used GNU C99; how is that portable C?

http://johnkerl.org/miller/doc/whyc.html#C_is_old-school


my aim is portability within a 21st century context ... I need to find a succinct way to say that. the word "portable" is the wrong choice; sounds like it would compile on everything ever which I don't aim for.


Does anyone happen to know of similar or generally useful tools for dealing with YAML files? Would ideally like to be able to run SQL-like queries.


I have my own little toolkit of csv parsing tools that do most of these (using python's csv module.)

I suspect most people who deal with bulk CSV files all the time do as well. I'd be curious about how gracefully it handles large CSV files.

The lack of quoting support kills it for my use case however. I wrote my own tools, starting with csv_cut because cut(1) didn't do quoting.


Regarding quoting support, check out https://github.com/dbro/csvquote
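The usual pattern (per its README) is to sandwich ordinary tools between a sanitizing pass and a restoring pass:

    # replace separators inside quoted fields, run normal tools, then restore with -u
    csvquote data.csv | cut -d, -f2,5 | sort | csvquote -u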


Not sufficient, but close. It substitutes the quoted separator characters with placeholder characters.


Hi- I'm curious to know what your use case is. Can you explain why substitution is not sufficient?


(1) I filter on column content using regexes, and dealing with a substitute character adds complexity.

(2) Many of my columns are free-form text containing commas, carriage returns, new lines, tab, vertical tabs and file separator (0x1c). Occasionally, text is in UCS-2/UTF-16 or uses UTF-8 and foreign characters (a non-trivial quantity of the text I process is in French for example.)

(If you read between the lines here, some columns can contain MLLP-encoded HL7 messages, others contain free-form text and I'm in the medical field.)


quoting is key but i've not yet stumbled for lack of it. but it's very much worth doing.

miller handles large files well. it's fast enough, and doesn't hold the entire file contents in memory unless it has to (e.g. sort). so gigabytes of input data are fine.


A similar toolkit, RecordStream, with support for many types of inputs beyond CSV: https://github.com/benbernard/RecordStream


To build: install flex and ctags, and add -Wno-error=unused-but-set-variable to the compiler flags.


the little known csvfix (https://bitbucket.org/neilb/csvfix) is pretty good too.


This is nice! It'd be great to download a .deb package for it.


good idea. I also have a request for a homebrew formula, which looks simple enough to do.


Request seconded, Homebrew would be great.




This thing is convenient, but I don't understand why it needs a DSL. Why do I need to use `mlr --csv sort ...` instead of `mlr --csv ... | sort`?


Because otherwise it will sort the header line. It looks like the reason for the --csv is that csv isn't the default format.
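Roughly (a sketch, using the year field and example.csv from the article's examples):

    # Miller sorts the records while leaving the header row in place
    mlr --csv sort -f year example.csv

    # piping to sort(1) treats the header as just another data line
    mlr --csv cat example.csv | sort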


I am surprised no one's looking at Apache Drill. https://drill.apache.org


I've used it out of curiosity and it works, but honestly I just fell back to my primary weapon of choice: Pentaho. This isn't because I'm not comfortable searching data at the command line; it's more because of momentum, in that all my other data is already in a database and I usually need to join datasets, not just search them.

It's trivial to create a two-hop transformation in Pentaho. The first step reads your data out of a CSV, the second loads it to a table. Pentaho will generate the DDL needed to create the table from the field names and types discovered in the import step via data introspection.

Having said that, I've often wished for a tool exactly like this that's lighter-weight than Drill (and doesn't expose my data on a web page like Drill does), so I will definitely bookmark this for the next time it's needed. Hopefully it makes it into the Red Hat EPEL.


Nothing wrong with this approach I suppose but wouldn't most people be better off to just use a database for this sort of thing?


Depends on your familiarity with the unix toolset vs your use cases. An amusing (if somewhat contrived) example of this:

http://aadrake.com/command-line-tools-can-be-235x-faster-tha...

Most CSV analysis is static, not something ongoing with inserts/updates/etc, so using a combination of unix tools to handle the data processing seems like a sufficient solution for many scenarios.


+1 and nice link :)


I use CSV all the time because I am constantly generating new datasets and it is nice to be able to literally cat|pbcopy them and paste them into DataGraph to see the results. Every major language has some sort of csv reader (usually into a dataframe type object) which makes it very easy to work with. Databases in my experience are laborious to set up and I would need to learn the API rather than just println("a,b,c,d").


Can anyone suggest something similar for Windows?


PowerShell can handle CSV files natively.

Here are the equivalent commands from the article:

  Import-Csv example.csv | Select-Object year, price   | Export-Csv -NoTypeInformation example-cut.csv
  Import-Csv example.csv | Sort-Object   year          | Export-Csv -NoTypeInformation example-sorted.csv
  Import-Csv example.csv | Where-Object  year -eq 1999 | Export-Csv -NoTypeInformation example-filtered.csv
PowerShell can also deal with JSON and XML in much the same way.


This neatly illustrates why objects matter.

As cool as miller is, jq is also decidedly cool, but for json. Both claim that they are "like" awk, sed, cut, paste etc, but for $Format, where $Format is csv/tsv and json, respectively.

They both deal with much the same tasks, and there is considerable overlap.

What your PowerShell examples do not show is how, after Import-Csv, you are no longer constrained to tools written for csv's (or json or ...). At that point the records of the csv file have become objects in the PowerShell type system (which is an extension of the .NET type system).

Another point about the corresponding functionality in PowerShell is how you do not need to convert back to csv, json or whatever. Your Export-Csv commands in your examples are only needed if you want to finally save the results as csv. Otherwise you would just keep the objects in a variable or flowing through the pipeline/script.

So TFA's 2nd example could be written

    ipcsv flins.csv | group county -n
If the data had been in json, one would write

    cat flins.json | convertfrom-json | group county -n
Which also means that data can be export through any cmdlet that exports or converts a stream of objects: Export-Csv, ConvertTo-Csv, ConvertTo-Html, ConvertTo-Json, ConvertTo-Xml, ...


PowerShell has solved this problem incredibly well. It often feels more Unix in its philosophy than some of the newer *nix tools I've encountered.


Take a look at LogParser:

https://technet.microsoft.com/en-us/scriptcenter/dd919274.as...

http://www.microsoft.com/en-us/download/details.aspx?id=2465...

It handles around 20 different input formats and is pretty fast. Has a COM API and is extensible. We use it to capture event log data from our Windows platform. Has an edge over PowerShell when it comes to speed for certain types of tasks.
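A rough example of the kind of invocation it takes (column and file names here are illustrative, mirroring the csvsql example above):

    LogParser -i:CSV -o:CSV "SELECT name INTO old_folks.csv FROM data.csv WHERE age > 30"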


q has a windows installer as well.

http://harelba.github.io/q/

Harel


License is unclear.


bsd but i'll emphasize


If you work with this sort of data a lot, you might want to look at R.


Your comment brings up an interesting point. Sed, awk, etc. are single-purpose tools, and that's one of the reasons they're so long-lived, according to the common wisdom of *nix. The other side of that wisdom is that once you build a swiss-army knife, it gets harder to argue against just using a full-blown programming language, like R, TCL, Perl, Python, etc.


I use R, and I still use ad-hoc commands from time to time too. I guess mlr and xsv lie somewhere between, which I've personally never really considered.

R is specifically designed for manipulating datasets though, and has a lot more functionality/dependencies. If you're already familiar with the language, and okay with the dependencies, it is vastly more powerful, more succinct, and arguably easier to use. Otherwise, something in between probably still makes sense.



