Convert JSON to a Unix-friendly line-based format (github.com/dvxhouse)
55 points by adulau on April 10, 2011 | 21 comments



"Everyone I know prefers to work with JSON over XML, but sadly there is a sore lack of utilities of the quality or depth of html-xml-utils and XMLStarlet for actually processing JSON data in an automated fashion, short of writing an ad hoc processor in your favourite programming language."

Actually there is just such a suite of utilities! See https://github.com/benbernard/RecordStream


RecordStream seems more complete but pulls in many more dependencies. I really like the simplicity of jsonpipe and how easy it is to use:

  curl -s "http://feeds.delicious.com/v2/json/adulau" | jsonpipe -s "#" | grep "#u" | cut -f2 | sed -e 's/"//g' | xargs -d "\n" wget -r -l 1 -p --convert-links
A simple example that makes a local mirror of my latest del.icio.us bookmarks...


"simple"


Maybe you're solving a different problem, but I'm not sure emitting one key-value pair per line is ultimately the way to go:

  $ echo '[{"a": [{"b": {"c": ["foo"]}}]}]' | jsonpipe
  /   []
  /0  {}
  /0/a        []
  /0/a/0      {}
  /0/a/0/b    {}
  /0/a/0/b/c  []
  /0/a/0/b/c/0        "foo"
In my own work, I've built up a suite of stream-based nested-record processing tools that accept and produce JSON, protocol buffers, and a Unix tab-delimited format. For the Unix format it's been more useful to stick with the standard one-record-per-line convention and let the user specify which fields to extract and in what order (sketched after the example below).

Here's a depressing example of the fun you can have with new media and old Unix tools, to give you some idea:

  $ mill io -r json -w texty -W fields=in_reply_to_screen_name < tweets.json | \
    grep -v -E '^(None|)$' | sort | uniq -c | sed -e 's#^ *##' | \
    sort -k1,1 -nr | head -5
  258 justinbieber
  248 ddlovato
  184 gypsyhearttour
  164 Logindaband
  145 Louis_Tomlinson
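
To give you an idea of the extraction step: mill just pulls the named field out of each JSON record, one record per line, and the rest is plain Unix. A rough pure-Python sketch of that stage (hypothetical; mill itself isn't released):

  import json
  import sys

  # Read newline-delimited JSON records and print the one field we care
  # about, ready for the grep | sort | uniq -c stages above.
  for line in sys.stdin:
      try:
          record = json.loads(line)
      except ValueError:
          continue  # skip malformed records
      print(record.get("in_reply_to_screen_name") or "None")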


    import os.path as p
Renaming imports to single letters bugs me. I went to see where it's used, and it isn't. Binding pyflakes to Cmd+S in TextMate was the best thing I ever did, and it would have caught this. pyflakes is seriously awesome in that role; pylint is worth running before committing.

I'm also surprised that simplejson is used, instead of the built-in json in Python 2.6 and up. A good solution is:

    try:
        import json
    except ImportError:
        import simplejson as json


> I'm also surprised that simplejson is used

2.6's json lacks simplejson's C accelerators [0], which makes it roughly 20 times slower than simplejson (2.7's json is also ~50% slower than current simplejson, as new performance optimizations and memoizations were added in simplejson 2.1.0). simplejson is also updated more often, which leads to it having fewer bugs and gaining interesting new features (the ability to natively serialize decimals was added in 2.1, for instance).

simplejson should be the preferred import, with the stdlib json as a fallback when simplejson isn't available.

[0] 2.6's json is simplejson 1.9; 2.7's is 2.0.9; the latest release of simplejson is 2.1.3 and 2.1.4 is in preparation


I elect to stick with the standard library in all cases, so I know how my software will perform everywhere without a redundant external dependency. Sticking with the standard library means you'll eventually get those improvements, too.


> a redundant external dependency

Is pip install that hard? Surely any deployed project is already managing a requirements.txt/buildout/setup.py/etc.

> means you'll eventually get those improvements

The C speedup extension for simplejson existed well before it was merged into Python stdlib as "json". Are you so sure you'll eventually get said improvement?


> Sticking with the standard library means you'll eventually get those improvements, too.

Not in 2.x, you won't.


Best of both worlds would be to try to import simplejson, and if that fails, fall back to the standard library json. As they have the same interface, that's trivial to do.
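
Something like:

  try:
      import simplejson as json  # prefer the faster, more current simplejson
  except ImportError:
      import json  # fall back to the stdlib module (Python 2.6+)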


Here is my contribution to Unix-friendly JSON command-line tools (requires Node.js and NPM): https://github.com/zpoley/json-command. A couple of other good ones: http://kmkeen.com/jshon/ and https://github.com/micha/jsawk


Line-based processing is still important! This work reminds me of an article I wrote 11 years ago covering Sean McGrath's work on PYX, a line-based format for XML: http://www.xml.com/pub/a/2000/03/15/feature/index.html
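
PYX has the same one-event-per-line flavor as jsonpipe's output. From memory (so treat this as a sketch), <a href="x">hi</a> comes out as:

  (a
  Ahref x
  -hi
  )a

i.e. one line per parse event: ( for start tags, A for attributes, - for character data, ) for end tags.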

That work derived in turn from Charles Goldfarb's work on SGML (ISO 8879) and its ESIS output format, dating from 1989.

We'll always be downsampling to something we can use with sed, grep and awk. They're too handy not to.


Does line-based processing still hold up? I tend to use it less and less these days, in favor of tools that process records instead of lines. There's only so much you can meaningfully store in a line of text, there's no standardized parsing, and you hit all kinds of escaping issues once fields contain embedded newlines or separators.

(FYI, even syslog is moving from strictly line-based to a more structured format; see RFC 5424/5425.)


Why not just use Rhino or any other standalone JavaScript interpreter?

http://www.mozilla.org/rhino/

A simple example:

  echo 'x=[1,2,3]; x[1]' | js


> Because the path components are separated by / characters, an object key like "abc/def" would result in ambiguous output. jsonpipe will throw an error if this occurs in your input, so that you can recognize and handle the issue. To mitigate the problem, you can choose a different path separator:

Ugh. I should never ever have to pick details of the format to work around content in the format. The only real solution is escaping, though it does add complexity.
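
For instance, something like percent-style escaping of the separator inside key components would keep every key representable (a hypothetical scheme, not what jsonpipe actually does):

  def escape_component(key, sep="/"):
      # Escape the escape character first, then the separator, so the
      # key "abc/def" becomes "abc%2Fdef" and paths stay unambiguous.
      return key.replace("%", "%25").replace(sep, "%2F")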


Hahahah, he uses a Unicode snowman as his example delimiter!

  $ echo '{"abc/def": 123}' | jsonpipe -s '☃'
  ☃   {}
  ☃abc/def    123

I'm ready to face the day with a smile on my face.


cat foo.json | python -m json.tool
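
(json.tool just pretty-prints; roughly:)

  $ echo '{"a": [1, 2]}' | python -m json.tool
  {
      "a": [
          1,
          2
      ]
  }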


  $ cat foo.json 
  { foo:"bar" }
  $ js -e "var doc=`cat foo.json`; print(doc.foo);"
  bar


The use case is a bit different, but I wrote a little converter to TSV, adding in a header. I mostly use it for input into R, but I use it a lot. It only works for fairly flat JSON objects. https://github.com/brendano/tsvutils/blob/master/json2tsv
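
The gist of it for flat objects, as a sketch (not the actual json2tsv code):

  import json, sys

  # One flat JSON object per input line; the union of keys becomes the
  # header row, which is what R's read.delim wants.
  rows = [json.loads(line) for line in sys.stdin if line.strip()]
  cols = sorted(set().union(*(row.keys() for row in rows)))
  print("\t".join(cols))
  for row in rows:
      print("\t".join(str(row.get(col, "")) for col in cols))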


Interesting. For faster flattening of nested structures into paths -- and the inverse operation, unflattening -- you could use this Python C extension:

https://github.com/acg/python-flattery

Full disclosure, I'm the author. ;) It uses "." as the path separator, but would be easy to allow "/".
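
The core idea in plain Python (a sketch of the concept only, not flattery's actual API):

  def flatten(obj, prefix=""):
      # Yield (path, value) pairs for nested dicts/lists, "."-joined.
      if isinstance(obj, dict):
          items = ((str(k), v) for k, v in obj.items())
      elif isinstance(obj, list):
          items = ((str(i), v) for i, v in enumerate(obj))
      else:
          yield prefix, obj
          return
      for key, value in items:
          path = prefix + "." + key if prefix else key
          for pair in flatten(value, path):
              yield pair

  # dict(flatten({"a": [{"b": 1}]})) == {"a.0.b": 1}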


I've written a "json" command-line tool with Node.js that offers transformation of JSON with modern JavaScript expressions (with support for Array.map, Array.reduce, etc.).

http://fforw.de/post/scripting-json/



