"Everyone I know prefers to work with JSON over XML, but sadly there is a sore lack of utilities of the quality or depth of html-xml-utils and XMLStarlet for actually processing JSON data in an automated fashion, short of writing an ad hoc processor in your favourite programming language."
In my own work, I've built up a suite of stream-based nested record processing tools that accept and produce JSON, protocol buffers, and a unix tab-delimited format. For the unix format it's been more useful to stick to the standard one-record-per-line convention and let the user specify which fields to extract and in what order.
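As a sketch of the kind of thing I mean (hypothetical code, not my actual tools; the field names are just illustrative): read one JSON record per line, and print the requested fields, in the requested order, tab-delimited:

    import json, sys

    def extract(record, path):
        # "a.b.c" digs into nested objects; missing fields become "".
        for part in path.split("."):
            if not isinstance(record, dict) or part not in record:
                return ""
            record = record[part]
        return record

    fields = sys.argv[1:]  # e.g. user.screen_name created_at text
    for line in sys.stdin:
        record = json.loads(line)
        print("\t".join(str(extract(record, f)) for f in fields))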
To give you some idea: it's depressing how much fun you can have combining new media with old unix tools.
Renaming imports to single letters bugs me. I went to see where it's used, and it isn't. Binding pyflakes to Cmd+S in TextMate was the best thing I ever did; it would have caught this. pyflakes is seriously awesome in that role, and pylint is worth running before committing.
I'm also surprised that simplejson is used, instead of the built-in json in Python 2.6 and up. A good solution is:
    try:
        import json
    except ImportError:
        import simplejson as json
2.6's json lacks simplejson's C accelerators [0], which makes it roughly 20 times slower than simplejson. (2.7's json is also ~50% slower than simplejson, since new performance optimizations were added in simplejson 2.1.0, along with memoization.) simplejson is also updated more often, which leads to it having fewer bugs and gaining interesting new features (the ability to natively serialize Decimals was added in 2.1, for instance).
simplejson should be the preferred import, with the standard library's json as the fallback where it's available (2.6+).
[0] 2.6's json is simplejson 1.9; 2.7's is 2.0.9. The latest release of simplejson is 2.1.3, and 2.1.4 is in preparation.
I elect to stick with the standard library in all cases, so I know how my software will perform everywhere without a redundant external dependency. Sticking with the standard library means you'll eventually get those improvements, too.
Is pip install that hard? Surely any deployed project is already managing a requirements.txt/buildout/setup.py/etc.
> means you'll eventually get those improvements
The C speedup extension for simplejson existed well before it was merged into Python stdlib as "json". Are you so sure you'll eventually get said improvement?
Best of both worlds would be to try to import simplejson, and if that fails, fall back to the standard library json. As they have the same interface, that's trivial to do.
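Concretely, that's just:

    try:
        import simplejson as json
    except ImportError:
        import json  # standard library, 2.6+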
Line-based processing is still important! This work reminds me of an article I wrote 11 years ago covering Sean McGrath's work on PYX, a line-based format for XML; see http://www.xml.com/pub/a/2000/03/15/feature/index.html.
That work derived from Charles Goldfarb's work on SGML (ISO 8879) and its ESIS output, dating from 1989.
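For anyone who hasn't seen it, PYX puts each parse event on its own line with a one-character prefix: "(" start-tag, ")" end-tag, "A" attribute, "-" character data. A rough Python sketch of a PYX-style emitter (illustrative, not McGrath's actual tools):

    import xml.sax

    class PYXHandler(xml.sax.ContentHandler):
        def startElement(self, name, attrs):
            print("(" + name)
            for key in attrs.getNames():
                print("A" + key + " " + attrs.getValue(key))
        def endElement(self, name):
            print(")" + name)
        def characters(self, content):
            if content.strip():
                print("-" + content)

    xml.sax.parseString(b'<a href="x.html">hi</a>', PYXHandler())
    # (a
    # Ahref x.html
    # -hi
    # )a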
We'll always be downsampling to something we can use with sed, grep and awk. They're too handy not to use.
Does line-based processing still hold up? I tend to use it less and less these days, in favor of tools that process records instead of lines. There's only so much you can meaningfully store in a line of text, there's no standardized parsing, and you run into all kinds of escaping issues when fields contain embedded newlines or separators.
(FYI, even syslog is moving from strictly line-based to a more structured format; see RFC 5424/5425.)
> Because the path components are separated by / characters, an object key like "abc/def" would result in ambiguous output. jsonpipe will throw an error if this occurs in your input, so that you can recognize and handle the issue. To mitigate the problem, you can choose a different path separator:
Ugh. I should never ever have to pick details of the format to work around content in the format. The only real solution is escaping, though it does add complexity.
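For instance, escaping the separator inside keys could look like this (a sketch of the approach; a hypothetical helper, not something jsonpipe actually does):

    def escape_key(key, sep="/", esc="\\"):
        # Escape the escape character first, then the separator.
        return key.replace(esc, esc + esc).replace(sep, esc + sep)

    def join_path(components, sep="/"):
        return sep.join(escape_key(c, sep) for c in components)

    print(join_path(["root", "abc/def", "x"]))   # root/abc\/def/x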
The use case is a bit different, but I wrote a little converter to TSV that adds a header. I mostly use it for input into R, but I use it a lot. It only works for fairly flat JSON objects. https://github.com/brendano/tsvutils/blob/master/json2tsv
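The core of the idea fits in a few lines (a sketch of the approach, not the actual json2tsv). Unlike the explicit field-list approach above, the header is inferred from the first record's keys, which is what R wants on input:

    import json, sys

    header = None
    for line in sys.stdin:
        obj = json.loads(line)
        if header is None:
            header = sorted(obj)
            print("\t".join(header))
        print("\t".join(str(obj.get(k, "")) for k in header))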
Interesting. For faster flattening of nested structures into paths -- and the inverse operation, unflattening -- you could use this Python C extension:
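In pure Python, the flattening that such an extension accelerates looks roughly like this (a slow, illustrative sketch, not the extension itself):

    def flatten(obj, prefix="", sep="/"):
        # Recursively turn nested dicts/lists into {path: leaf-value}.
        if isinstance(obj, dict):
            items = obj.items()
        elif isinstance(obj, list):
            items = enumerate(obj)
        else:
            return {prefix: obj}
        out = {}
        for key, value in items:
            path = prefix + sep + str(key) if prefix else str(key)
            out.update(flatten(value, path, sep))
        return out

    print(flatten({"a": {"b": [1, 2]}}))
    # {'a/b/0': 1, 'a/b/1': 2}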
I've written a "json" command line tool with Node.js that transforms JSON using modern JavaScript expressions (with support for Array.map, Array.reduce, etc.).
Actually, there is just such a suite of utilities! See https://github.com/benbernard/RecordStream