There's also https://github.com/BurntSushi/xsv, which works fairly similarly to Miller, though it can also index a CSV file so that future operations are sped up by using the index.
Hi. q's developer here. Thanks for the mention and kind words everyone.
I've considered the searchability issue when deciding on a name for it, but eventually favored the day-to-day minimum-typing short name over better searchability.
Anyway, you can search for "harelba q" in order to find it if needed.
DBD::CSV is a library which you could use to build a command, but it doesn't seem to come with one out of the box. q, on the other hand, is just a command, not a library.
One could use DBI + DBD::CSV + Perl to build something similar to q, but that's a batteries-not-included solution.
csvkit is excellent. I like `csvlook` for displaying CSV files in the terminal.
Can anyone recommend an easy way to convert CSV to XLS on Linux? I know that LibreOffice and Gnumeric can do it in headless mode, but that seems like terrible overkill.
It is mega overkill, but Pentaho does conversion between a long list of file and database formats. I'm mentioning it mostly so you can check it out in case you ever see a more complex requirement, or want to learn the tool. Once I learned it, the one-off stuff became trivial.
It takes me 5-10 minutes to import an XLS: let the tool examine the file for field names and types, hook that to a table-load step, generate the DDL, edit the DDL to remove spaces from field names, and then execute the DDL followed by the loader job.
Since you're going flat file to flat file, it would be faster for you, as no field-name edits are required. You just create a CSV input step and an XLS output step, save the transformation, run it, and you're done.
As others have mentioned, you can find variations of this around online, but if you wish to roll your own (perhaps to add other features or what have you), you could knock it out in about 20 lines of Python using the built-in csv library and openpyxl [1].
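If it helps, here's a rough sketch of that approach (openpyxl assumed installed; the filenames are placeholders):

    # Minimal sketch: CSV -> XLSX with the stdlib csv module and openpyxl.
    # Filenames are placeholders; adjust delimiter/encoding as needed.
    import csv
    from openpyxl import Workbook

    def csv_to_xlsx(csv_path, xlsx_path):
        wb = Workbook()
        ws = wb.active
        with open(csv_path, newline='', encoding='utf-8') as f:
            for row in csv.reader(f):
                ws.append(row)
        wb.save(xlsx_path)

    csv_to_xlsx('input.csv', 'output.xlsx')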
> And initial experiments with Rust are intriguing. Yet with none of them could I obtain the throughput I get in C: see the Performance page.
I'd be curious to see how `xsv` (written in Rust) compares performance-wise to Miller. (I can't compile Miller; I commented on the issue tracker.)
> Thus the absence of in-process multiprocessing is only a slight penalty in this particular application domain — parallelism here is more easily achieved by running multiple single-threaded processes, each handling its own input files, either on a single host or split across multiple hosts.
I disagree. CSV files can be indexed[1] very simply, providing random access. This makes it trivial to split a large CSV file into chunks which can be operated on in parallel. That lifts the burden of doing the splitting and merging manually off the user, which can become quite tedious when you want to do frequency analysis on the fields.
This type of indexing is not usually seen because it requires support from the underlying CSV parser to produce correct byte offsets. But you've written your own parser here in Miller, so you should be able to do something similar to `xsv`. Moreover, random access gets you slicing proportional to the size of the slice: extracting the last 10 records of a 4M-row CSV file is instantaneous. This opens up other doors too, like potentially faster joins and fast random sampling.
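To make that concrete, here's a rough sketch of the idea in Python (not xsv's actual implementation; it assumes no quoted embedded newlines, which is exactly the case where parser support for byte offsets becomes necessary):

    # Rough sketch of a record-offset index (not xsv's actual code). Assumes
    # records don't contain quoted embedded newlines; handling those correctly
    # is why the CSV parser itself has to report the byte offsets.
    import csv

    def build_index(path):
        offsets = []
        with open(path, 'rb') as f:
            pos = f.tell()
            line = f.readline()
            while line:
                offsets.append(pos)
                pos = f.tell()
                line = f.readline()
        return offsets  # offsets[i] = byte position of record i

    def read_record(path, offsets, i):
        with open(path, 'rb') as f:
            f.seek(offsets[i])
            return next(csv.reader([f.readline().decode('utf-8')]))

    # The last 10 records of a 4M-row file are just offsets[-10:], and parallel
    # chunks are contiguous slices of the offset list.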
Not exactly the same thing, but I often use a tiny tool I wrote called shellql[1] that lets you just type in SQL like
cat foo.csv | shellql "select f1, count(*) from tbl where f2='thingiwant' group by f1 order by count(*) desc limit 5"
It has some limitations on the formats of CSV that it can take, namely that string fields have to be quoted, but you can see from the code that there's not much to it if you wanted to change that. It'd also be easy enough to take in multiple CSV files, join them together, etc.
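The general shape of such a tool (a hypothetical sketch, not shellql's actual code) is small: load the headerless CSV from stdin into an in-memory SQLite table named tbl with columns f1..fN, then run whatever query you're handed:

    # Hypothetical sketch of "SQL over CSV from stdin" -- not shellql itself.
    # Input is treated as headerless; columns are named f1..fN to match the
    # query style above, and the query is taken as the first argument.
    import csv, sqlite3, sys

    rows = list(csv.reader(sys.stdin))
    ncols = len(rows[0])

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE tbl (%s)' % ', '.join('f%d' % (i + 1) for i in range(ncols)))
    con.executemany('INSERT INTO tbl VALUES (%s)' % ', '.join('?' * ncols), rows)

    for row in con.execute(sys.argv[1]):
        print(','.join(str(v) for v in row))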
It looks like this tool doesn't support semicolon-separated CSV files, quotes, or numbers using a comma as decimal separator. Too bad.
It's a pity that the CSV file format is so fragmented. It's all too common that a CSV file written by tool A can't be read by tool B.
Only a few tools get it right. For example, I think the Ruby CSV module has a heuristic to automatically detect the "style" of a CSV file, which is pretty neat.
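(Python's standard library has something similar: csv.Sniffer guesses the dialect, delimiter and quoting included, from a sample of the file. A quick sketch, with the filename as a placeholder:)

    # Sniff the CSV dialect from the first few KB, then parse with it.
    import csv

    with open('data.csv', newline='') as f:
        dialect = csv.Sniffer().sniff(f.read(4096))
        f.seek(0)
        for row in csv.reader(f, dialect):
            print(row)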
CSV stands for 'Comma-Separated Values', not 'Semicolon-Separated Values'.
Seeing as this was downvoted, I'm just pointing out that bemoaning how the CSV format is fragmented and then using it incorrectly is the precise reason it is fragmented. It wasn't originally meant to be extended in so many ways, so there really isn't room to complain when a library doesn't magically figure out your personal delimiter or qualifier.
I sometimes dumb down CSV to what SQLite understands and then just import it there. Gonna have a look at this, but whether it's SQL or Ruby/Python/Perl structures, I generally prefer to spend as little time as possible in actual CSV. Even more so if it contains hierarchical data.
I see a lot of value in this tool, because I've been in the same situation as you. I can't thank you enough for this; it's a tool I've been wanting to implement for so long.
ctags etc. I'll split out. effectively there are make targets to make dev nice, and a smaller set just to build the binary; ctags is definitely in the former category. thanks for the feedback!
my aim is portability within a 21st-century context ... I need to find a succinct way to say that. the word "portable" is the wrong choice; it sounds like it would compile on everything ever, which I don't aim for.
(1) I filter on column content using regex and dealing with a sub character adds complexity.
(2) Many of my columns are free-form text containing commas, carriage returns, newlines, tabs, vertical tabs and the file separator character (0x1c). Occasionally the text is in UCS-2/UTF-16, or uses UTF-8 with foreign characters (a non-trivial quantity of the text I process is in French, for example).
(If you read between the lines here, some columns can contain MLLP-encoded HL7 messages, others contain free-form text and I'm in the medical field.)
quoting is key but i've not yet stumbled for lack of it. but it's very much worth doing.
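(For illustration, here's what standard quoting buys you; this is Python's csv module, not Miller: fields with embedded commas and newlines round-trip cleanly.)

    # Quick illustration of what RFC-4180-style quoting covers: fields with
    # embedded commas, quotes and newlines survive a write/read round trip.
    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(['free-form text, with a comma', 'and an\nembedded newline'])
    buf.seek(0)
    print(next(csv.reader(buf)))   # the original two fields come back unchanged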
miller handles large files well. it's fast enough, and doesn't hold entire file contents in memory unless it has to (e.g. for sort), so gigabytes of input data are fine.
I've used it out of curiosity and it works, but honestly I just fell back to my primary weapon of choice: Pentaho. This isn't because I'm not comfortable searching data at the command line; it's more because of momentum, in that all my other data is already in a database and I usually need to join datasets, not just search them.
It's trivial to create a two-hop transformation in Pentaho: the first step reads your data out of a CSV, the second loads it into a table. Pentaho will generate the DDL needed to create the table from the field names and types discovered by introspecting the data in the import step.
Having said that, I've often wished for a tool exactly like this that's lighter weight than Drill (and doesn't expose my data on a web page like Drill does), so I will definitely bookmark this for the next time it's needed. Hopefully it makes it into the Red Hat EPEL repository.
Most CSV analysis is static, not something ongoing with inserts/updates/etc., so using a combination of Unix tools to handle the data processing seems like a sufficient solution for many scenarios.
I use CSV all the time because I am constantly generating new datasets, and it is nice to be able to literally cat|pbcopy them and paste them into DataGraph to see the results. Every major language has some sort of CSV reader (usually into a dataframe-type object), which makes it very easy to work with. Databases, in my experience, are laborious to set up, and I would need to learn the API rather than just println("a,b,c,d").
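For example, in Python (pandas assumed installed; the filename and the a,b,c,d header are placeholders matching the println above):

    # Quick illustration: read a CSV into a dataframe and poke at it.
    import pandas as pd

    df = pd.read_csv('data.csv')        # columns a,b,c,d assumed
    print(df.describe())                # summary stats per numeric column
    print(df.groupby('a')['b'].mean())  # mean of b per value of a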
This neatly illustrates why objects matter.
As cool as Miller is, jq is also decidedly cool, but for JSON. Both claim that they are "like" awk, sed, cut, paste etc., but for $Format, where $Format is CSV/TSV and JSON, respectively.
They both deal with much the same tasks, and there is considerable overlap.
What your PowerShell examples do not show is that, after Import-Csv, you are no longer constrained to tools written for CSV (or JSON or ...). At that point the records of the CSV file have become objects in the PowerShell type system (which is an extension of the .NET type system).
Another point about the corresponding functionality in PowerShell is that you do not need to convert back to CSV, JSON or whatever. The Export-Csv commands in your examples are only needed if you want to finally save the results as CSV; otherwise you would just keep the objects in a variable or flowing through the pipeline/script.
So TFA's 2nd example could be written:
ipcsv flins.csv | group county -n
If the data had been in JSON, one would write:
cat flins.json | convertfrom-json | group county -n
Which also means that data can be exported through any cmdlet that exports or converts a stream of objects: Export-Csv, ConvertTo-Csv, ConvertTo-Html, ConvertTo-Json, ConvertTo-Xml, ...
It handles around 20 different input formats and is pretty fast. It has a COM API and is extensible. We use it to capture event log data from our Windows platform. It has an edge over PowerShell when it comes to speed for certain types of tasks.
Your comment brings up an interesting point. Sed, awk, etc. are single-purpose tools, and according to the common wisdom of *nix that's one of the reasons they're so long-lived. The other side of that wisdom is that once you build a swiss-army knife, it gets harder to argue against just using a full-blown programming language like R, Tcl, Perl, Python, etc.
I use R, and I still use ad-hoc commands from time to time too. I guess mlr and xsv lie somewhere in between, which I'd personally never really considered.
R is specifically designed for manipulating datasets though, and has a lot more functionality/dependencies. If you're already familiar with the language, and okay with the dependencies, it is vastly more powerful, more succinct, and arguably easier to use. Otherwise, something in between probably still makes sense.