
A list of command line tools for manipulating structured text data
https://github.com/dbohdan/structured-text-tools
======
chishaku
Also csvkit.

      # convert Excel or JSON input to CSV
      in2csv data.xls > data.csv
      in2csv data.json > data.csv

      # extract and/or reorder columns
      csvcut -c column_a,column_c data.csv > new.csv
      csvcut -c column_c,column_a data.csv > new.csv

      # filter rows by a regex match against one column
      csvgrep -c phone_number -r "555-555-\d{4}" data.csv > matching.csv

      # convert CSV to JSON
      csvjson data.csv > data.json

      # query a CSV file with SQL, or move data in and out of a real database
      csvsql --query "select name from data where age > 30" data.csv > old_folks.csv
      csvsql --db postgresql:///database --insert data.csv
      sql2csv --db postgresql:///database --query "select * from data" > extract.csv

[https://github.com/wireservice/csvkit](https://github.com/wireservice/csvkit)

(Submitted a pull request.)

~~~
Apofis
Oh man, where were you a few months ago when I had to write simple apps just
to do some of this stuff... also this will be so damn handy!:

      csvsql --query "select name from data where age > 30" data.csv > old_folks.csv

~~~
draegtun
Also see DBD::CSV, which has been around since 1998! -
[https://metacpan.org/pod/DBD::CSV](https://metacpan.org/pod/DBD::CSV)

------
benou
awk is really, really powerful. It is fast, and you can do a lot very
efficiently just by playing with the 'FS' variable. And you can find it on all
*nix boxes. And it works nicely with other CLI tools such as cut, paste, or
datamash
([http://www.gnu.org/software/datamash/](http://www.gnu.org/software/datamash/)).
As soon as it becomes too complex, though, it is better to resort to a real
language (be it Python, Perl, or whatever - my favourite is Python + scipy).
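
For example, a minimal sketch of the FS idiom (the file names and delimiters
are just illustrations):

      # third comma-separated field, setting FS from the command line
      awk -F, '{ print $3 }' data.csv

      # same idea with FS set in a BEGIN block: user name and login shell
      awk 'BEGIN { FS = ":" } { print $1, $7 }' /etc/passwd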

~~~
saturncoleus
Awk is great at what it does, but I find myself unable to keep it cached in my
brain long enough to reuse it. Using awk usually means a Google search for how
to use it, which defeats the point of working quickly at a terminal.

~~~
joepvd
awk is a really productive constraint to opt into. If a problem looks
like a text processing problem, use awk, and let the idioms sink in. It is an
awesome tool to have on one's belt.

It is, in a sense, liberating to not have any library support for common
problems: no need to learn a library (hey!), and, by the way, what you need in
any given problem is an easy subset of what that library would do anyway.

Awk has made me focus more on the data there is to analyze, rather than the
framework to analyze it with.

As the idiomatic use of awk is also very succinct, I can hardly imagine
working efficiently on the command line without it.

~~~
kbenson
I take a different stance. I never use awk, because I can use Perl. In using
Perl, I also have access to the many modules that extend its usefulness.
One of the core reasons Perl was created was to fill in where awk wasn't as
useful as it could be.
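
For instance, a near-equivalent pair (a sketch; the file name is hypothetical,
and Perl's -a switch autosplits each line into @F):

      # second whitespace-separated field, awk vs. Perl
      awk '{ print $2 }' access.log
      perl -lane 'print $F[1]' access.log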

~~~
drauh
I use awk only if it's a trivial one-liner. The only awk I can remember is
selecting specific columns in whitespace-separated text. If it's just a single
column, I'll try to use cut(1).
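
(A sketch of both, with hypothetical file names; note that cut wants a
single-character delimiter and, unlike awk, won't collapse runs of whitespace:)

      awk '{ print $2 }' data.txt   # second whitespace-separated column
      cut -f2 data.tsv              # second tab-delimited field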

For anything more complicated, I'll use Python. If it's something that is
going to have a somewhat long life, and that I will need to feed into
plotting, I'd rather use Python; I remember the bad old days of plot(1).

------
baldfat
HIGHLY Recommended Book:

Data Science on the Command Line

[http://datascienceatthecommandline.com/](http://datascienceatthecommandline.com/)

------
kazinator
[http://www.nongnu.org/txr](http://www.nongnu.org/txr)

~~~
vojvod
I've used some very basic TXR for refactorings that were a bit beyond my IDE's
capabilities, which gave me a taste of how powerful it could be. One thing
that's slowed me down in experimenting with it is having to save the script,
rerun TXR and refresh the output file each time I make a change. Do you have
any tips for quickly and interactively building complex scripts?

~~~
kazinator
Not in the TXR pattern language, I'm afraid! TXR Lisp has those interactive
properties of just re-loading a module on-the-fly to redefine some functions
without redefining the entire program, but the pattern language isn't
interactive in that way.

This is pretty much baked in from the ground up, stemming from the way a
script is parsed in its entirety before being executed.

You can develop the logic in small pieces and test them in isolation on some
sub-segment of the data, then integrate the pieces into a larger script with
@(load ...).

Speaking of code refactorings, I also use it for that myself. When using it to
interactively rewrite code, I invoke it in a pipeline out of vim. E.g. select
some range of code visually with V, then "!txr script.txr -[Enter]" to filter
it through. Vim deletes the original text and replaces it with the output. If
the output is wrong, I hit u for undo to restore the original text. Then of
course I have to fix script.txr and save it, and recall and repeat that
filtering step just with ":[up arrow][Enter]".

To avoid refreshing the output file during development, don't have one; let it
dump to standard output.

Something else that's useful: you can monitor the first N lines of an output
file that fit on the screen using the watch utility found in many GNU/Linux
distros. watch repeatedly executes the command it is given (once every 2.0
seconds by default) and repaints a cleared screen with its output. If we
"watch cat file", we can monitor the changing contents of file.

~~~
vojvod
I didn't think of calling TXR from within vim, thanks I'll try that.

Watch could get me closer to the workflow I'm used to from web development
where the browser auto-reloads whenever a change is made. Your mentioning vim
(of which I'm not exactly a power user) actually prompted me to check if it
had the ability to run a command on writing a buffer or to automatically
reload a file when it changes. Turns out it does indeed have ways to do both
these things, so that might be another way to achieve what I'm after.

------
GhotiFish
Check out pup for parsing HTML.
[https://github.com/ericchiang/pup](https://github.com/ericchiang/pup)

pup uses CSS selectors to select elements from HTML documents. Used in
conjunction with curl, it gives you a very simple, low-friction way to
scrape data in scripts.
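
For example (a sketch; the URL and selector are only illustrative):

      # list every link target on a page
      curl -s https://example.com/ | pup 'a attr{href}'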

------
crb002
I would add to that list Nokogiri, "The Chainsaw". xsltproc is ubiquitous, but
writing XSLT is akin to having a pack of wild monkeys compose a mural with
their excrement.
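
(For what it's worth, the basic xsltproc invocation, at least, is short; the
file names here are hypothetical:)

      # apply an XSLT stylesheet to an XML document
      xsltproc stylesheet.xsl input.xml > output.html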

------
fiatjaf
It isn't easy (perhaps not even possible) to get the name of the fruit Bob
grows on his farm using any of these tools and the following data:

      {
        "models": [{
          "title": "fruits",
          "fields": [
            {"name": "Name", "key": "3746"},
            {"name": "Colour", "key": "4867"}
          ],
          "entities": [{
            "_id": "372612",
            "3746": "Orange",
            "4867": "orange"
          }]
        }, {
          "title": "farmers",
          "fields": [
            {"name": "Full name", "key": "8367"},
            {"name": "Address", "key": "3947"},
            {"name": "Fruits", "key": "5243"}
          ],
          "entities": [{
            "_id": "747463",
            "8367": "Bob, the farmer",
            "3947": "Farmland",
            "5243": ["372612"]
          }]
        }]
      }
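
For what it's worth, one hand-rolled jq attempt against this exact shape (a
sketch, assuming a reasonably recent jq; it finds the fruit, but it is hardly
easy):

      jq -r '
        (.models[] | select(.title == "farmers")) as $farmers
        | (.models[] | select(.title == "fruits")) as $fruits
        | ($farmers.fields[] | select(.name == "Full name") | .key) as $name
        | ($farmers.fields[] | select(.name == "Fruits") | .key) as $grows
        | ($fruits.fields[] | select(.name == "Name") | .key) as $fruit
        | ($farmers.entities[] | select(.[$name] == "Bob, the farmer") | .[$grows][]) as $id
        | ($fruits.entities[] | select(._id == $id) | .[$fruit])
      ' data.json
      # prints: Orange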

~~~
sdegutis
That's true even in Clojure, arguably the simplest and cleanest language ever
invented for complex data transformation and extraction.

The Clojure solution to this still ends up requiring temporary variables and
some sort of model transformation functionality. (Will try to post my Clojure
solution in 5 hours after my next noprocrast timer is up.)

If the data could first be transformed so that it doesn't require temporary
variables or ad-hoc transformation function definitions, instead making use of
"paths", then it would be easier with command line tools. Such a
transformation could even be a command line tool of its own.

~~~
tekacs
Yup, lots of temporaries (I haven't written much Clojure recently, so I'm sure
I'm missing lots of simplifying core fns).

      (let [; input
            js (cheshire.core/parse-string (slurp (clojure.java.io/resource "awful.json")))
            ; utils
            get-match (fn [k v coll] (first (filter #(= (get % k) v) coll)))
            model-for (fn [title toplevel] (get-match "title" title (toplevel "models")))
            key-for (fn [name* model] ((get-match "name" name* (model "fields")) "key"))
            ; models
            fruit-model (model-for "fruits" js)
            farmer-model (model-for "farmers" js)
            ; keys
            fruit-name-key (key-for "Name" fruit-model)
            farmer-name-key (key-for "Full name" farmer-model)
            farmer-fruits-key (key-for "Fruits" farmer-model)
            ; values
            bob-entity (get-match farmer-name-key "Bob, the farmer" (farmer-model "entities"))
            bob-fruit-keys (bob-entity farmer-fruits-key)
            bob-fruit-entities (map #(get-match "_id" % (fruit-model "entities")) bob-fruit-keys)
            bob-fruit-names (map #(% fruit-name-key) bob-fruit-entities)]
        bob-fruit-names)
    
      => ("Orange")

------
kbenson
I've found fsql[1] to be extremely useful in the past.

1: [https://metacpan.org/pod/distribution/App-fsql/bin/fsql](https://metacpan.org/pod/distribution/App-fsql/bin/fsql)

------
Gratsby
Don't forget cut and sed.
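
(Trivial sketches with a hypothetical data.csv; both are naive about quoted
fields:)

      # swap the first two comma-separated fields
      sed -E 's/^([^,]*),([^,]*)/\2,\1/' data.csv

      # keep only the first two fields
      cut -d, -f1,2 data.csv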

------
junke
pgloader: load from CSV (and other formats) into PostgreSQL.

See
[http://pgloader.io/howto/quickstart.html](http://pgloader.io/howto/quickstart.html)
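
A minimal sketch of the command-file style of invocation (the LOAD CSV syntax
here is only summarized from memory of the docs; check the howto above):

      # csv.load contains a block like:
      #   LOAD CSV FROM 'data.csv' INTO postgresql:///database ...
      pgloader csv.load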

------
fiatjaf
Does anyone know of a tool like ranger[1] for visualizing JSON on the
terminal? There is a Chrome Extension[2], but nothing useful to browse JSON on
the terminal (it doesn't have to be like ranger, I'm looking for any tool that
makes it easier to take a look at a JSON file).

[1]: https://github.com/hut/ranger

[2]: https://chrome.google.com/webstore/detail/json-finder/flhdcaebggmmpnnaljiajhihdfconkbj

~~~
Tiksi
If you use vim, there's [https://github.com/elzr/vim-json](https://github.com/elzr/vim-json),
which gives you folding, highlighting, etc.

------
anonfunction
For converting arrays of objects between formats like CSV, JSON, YAML, XML
(WIP), etc... I built aoot[1] which stands for "Array of objects to". It's
written in Node.js and uses upstream packages whenever possible.

1: [https://github.com/montanaflynn/aoot](https://github.com/montanaflynn/aoot)

------
known
[http://www.commandlinefu.com/commands/browse/sort-by-votes](http://www.commandlinefu.com/commands/browse/sort-by-votes)
has some cool tips.

------
pessimizer
[http://neilb.bitbucket.org/csvfix/](http://neilb.bitbucket.org/csvfix/)

------
nemoniac
And not a single tool for s-expressions?

~~~
chriswarbo
I was after something like jq for s-expressions a while ago, but didn't find
anything other than full-blown Lisps, Schemes, or libraries written in other
languages: [http://stackoverflow.com/questions/31232843/jq-or-xsltproc-alternative-for-s-expressions](http://stackoverflow.com/questions/31232843/jq-or-xsltproc-alternative-for-s-expressions)

------
zokier
xsv for doing queries against CSV files probably belongs to the list too:
[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)
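
A sketch from memory of its subcommands (check xsv --help; the file and column
names are hypothetical):

      # pick two columns, filter rows on one of them, then count what's left
      xsv select city,state data.csv | xsv search -s state 'CA' | xsv count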

------
ktRolster
yacc, antlr, lemon, bison.....

~~~
crb002
Definitely. Most software vulnerabilities come from failure to write formal
parsers for all inputs. Is there a command-line YACC for compiling simple
stuff?

~~~
ktRolster
"Most software vulnerabilities are from failure to write formal parsers on on
all inputs."

That's a good quote

~~~
lsh
This might also interest you? [http://langsec.org/](http://langsec.org/)

~~~
ktRolster
Thanks. Have you attended the conference?

