"Everyone I know prefers to work with JSON over XML, but sadly there is a sore lack of utilities of the quality or depth of html-xml-utils and XMLStarlet for actually processing JSON data in an automated fashion, short of writing an ad hoc processor in your favourite programming language."
In my own work, I've built up a suite of stream-based nested record processing tools that accept and produce JSON, protocol buffers, and a unix tab-delimited format. For the unix format it's been more useful to stick to the standard one-record-per-line convention and let the user specify which fields to extract and in what order.
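As a sketch of the kind of thing I mean (hypothetical code, not my actual tools; the field names are just illustrative): read one JSON record per line, and print the requested fields, in the requested order, tab-delimited:

    import json, sys

    def extract(record, path):
        # "a.b.c" digs into nested objects; missing fields become "".
        for part in path.split("."):
            if not isinstance(record, dict) or part not in record:
                return ""
            record = record[part]
        return record

    fields = sys.argv[1:]  # e.g. user.screen_name created_at text
    for line in sys.stdin:
        record = json.loads(line)
        print("\t".join(str(extract(record, f)) for f in fields))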
To give you some idea: it's depressing how much fun you can have combining new media with old unix tools.
Renaming imports to single letters bugs me. I went to see where it's used, and it isn't. Binding pyflakes to Cmd+S in TextMate was the best thing I ever did; it would have caught this. pyflakes is seriously awesome in that role, and pylint is worth running before committing.
I'm also surprised that simplejson is used, instead of the built-in json in Python 2.6 and up. A good solution is:
    try:
        import json
    except ImportError:
        import simplejson as json
2.6's json lacks simplejson's C accelerators [0], which makes it roughly 20 times slower than simplejson. (2.7's json is also ~50% slower than simplejson, since new performance optimizations were added in simplejson 2.1.0, along with memoization.) simplejson is also updated more often, which leads to it having fewer bugs and gaining interesting new features (the ability to natively serialize Decimals was added in 2.1, for instance).
simplejson should be the preferred import, with the standard library's json as the fallback where it's available (2.6+).
[0] 2.6's json is simplejson 1.9; 2.7's is 2.0.9. The latest release of simplejson is 2.1.3, and 2.1.4 is in preparation.
I elect to stick with the standard library in all cases, so I know how my software will perform everywhere without a redundant external dependency. Sticking with the standard library means you'll eventually get those improvements, too.
Is pip install that hard? Surely any deployed project is already managing a requirements.txt/buildout/setup.py/etc.
> means you'll eventually get those improvements
The C speedup extension for simplejson existed well before it was merged into Python stdlib as "json". Are you so sure you'll eventually get said improvement?
Best of both worlds would be to try to import simplejson, and if that fails, fall back to the standard library json. As they have the same interface, that's trivial to do.
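Concretely, that's just:

    try:
        import simplejson as json
    except ImportError:
        import json  # standard library, 2.6+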
Line-based processing is still important! This work reminds me of an article I wrote 11 years ago covering Sean McGrath's work on PYX, a line-based format for XML; see http://www.xml.com/pub/a/2000/03/15/feature/index.html.
That work derived from Charles Goldfarb's work on SGML (ISO 8879) and its ESIS output, dating from 1989.
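For anyone who hasn't seen it, PYX puts each parse event on its own line with a one-character prefix: "(" start-tag, ")" end-tag, "A" attribute, "-" character data. A rough Python sketch of a PYX-style emitter (illustrative, not McGrath's actual tools):

    import xml.sax

    class PYXHandler(xml.sax.ContentHandler):
        def startElement(self, name, attrs):
            print("(" + name)
            for key in attrs.getNames():
                print("A" + key + " " + attrs.getValue(key))
        def endElement(self, name):
            print(")" + name)
        def characters(self, content):
            if content.strip():
                print("-" + content)

    xml.sax.parseString(b'<a href="x.html">hi</a>', PYXHandler())
    # (a
    # Ahref x.html
    # -hi
    # )a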
We'll always be downsampling to something we can use with sed, grep and awk. They're too handy not to use.
Does line-based processing still hold up? I tend to use it less and less these days, in favor of tools that process records instead of lines. There's only so much you can meaningfully store in a line of text, there's no standardized parsing, and you run into all kinds of escaping issues when fields contain embedded newlines or separators.
(FYI, even syslog is moving from strictly line-based to a more structured format; see RFC 5424/5425.)
> Because the path components are separated by / characters, an object key like "abc/def" would result in ambiguous output. jsonpipe will throw an error if this occurs in your input, so that you can recognize and handle the issue. To mitigate the problem, you can choose a different path separator:
Ugh. I should never ever have to pick details of the format to work around content in the format. The only real solution is escaping, though it does add complexity.
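For instance, escaping the separator inside keys could look like this (a sketch of the approach; a hypothetical helper, not something jsonpipe actually does):

    def escape_key(key, sep="/", esc="\\"):
        # Escape the escape character first, then the separator.
        return key.replace(esc, esc + esc).replace(sep, esc + sep)

    def join_path(components, sep="/"):
        return sep.join(escape_key(c, sep) for c in components)

    print(join_path(["root", "abc/def", "x"]))   # root/abc\/def/x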
The use case is a bit different, but I wrote a little converter to TSV that adds a header. I mostly use it for input into R, but I use it a lot. It only works for fairly flat JSON objects. https://github.com/brendano/tsvutils/blob/master/json2tsv
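The core of the idea fits in a few lines (a sketch of the approach, not the actual json2tsv). Unlike the explicit field-list approach above, the header is inferred from the first record's keys, which is what R wants on input:

    import json, sys

    header = None
    for line in sys.stdin:
        obj = json.loads(line)
        if header is None:
            header = sorted(obj)
            print("\t".join(header))
        print("\t".join(str(obj.get(k, "")) for k in header))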
Interesting. For faster flattening of nested structures into paths -- and the inverse operation, unflattening -- you could use this Python C extension:
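In pure Python, the flattening that such an extension accelerates looks roughly like this (a slow, illustrative sketch, not the extension itself):

    def flatten(obj, prefix="", sep="/"):
        # Recursively turn nested dicts/lists into {path: leaf-value}.
        if isinstance(obj, dict):
            items = obj.items()
        elif isinstance(obj, list):
            items = enumerate(obj)
        else:
            return {prefix: obj}
        out = {}
        for key, value in items:
            path = prefix + sep + str(key) if prefix else str(key)
            out.update(flatten(value, path, sep))
        return out

    print(flatten({"a": {"b": [1, 2]}}))
    # {'a/b/0': 1, 'a/b/1': 2}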
I've written a "json" command line tool with Node.js that transforms JSON using modern JavaScript expressions (with support for Array.map, Array.reduce, etc.).
Actually, there is just such a suite of utilities! See https://github.com/benbernard/RecordStream