
I really like using line-delimited JSON [0] for stuff like this. If you're looking at a multi-GB JSON file, it's often made of a large number of individual objects (e.g. semi-structured JSON log data or transaction records).

If you can get to a point where each line is a reasonably-sized JSON document, a lot of things get way easier. jq will be streaming by default. You can use traditional Unixy tools (grep, sed, etc.) in the normal way because it's just lines of text. And you can jump to any point in the file, skip forward to the next line boundary, and know that you're not in the middle of a record.
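For example, pulling a field out of a multi-GB log becomes a one-liner (a sketch; the file name logs.jsonl and the status/url fields are made up for illustration):

    # grep narrows by plain text (assuming the serializer emits no space after ':'),
    # then jq sees each surviving line as its own JSON document
    $ grep '"status":500' logs.jsonl | jq -r '.url' | sort | uniq -c | sort -rn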

The company I work for added line-delimited JSON output to lots of our internal tools, and working with anything else feels painful now. It scales up really well -- I've been able to do things like process full days of OPRA reporting data in a bash script.

[0]: https://jsonlines.org/




+1. Yes, you can have a giant JSON object and hack your way around the obvious memory issues, but it’s still a bad idea, imo. Even if you solve it for one use case in one language, you’ll have a bad time as soon as you use different tooling. JSON really is a universal message format, which is useful precisely because it’s so interoperable. And it’s only interoperable as long as messages are reasonably sized.

The only thing I miss from JSON Lines is a type specifier, so you can mix different kinds of messages. It’s easy enough to work around by wrapping messages or rolling a custom format, but still, it would be great to have a little bit of metadata for those use cases.


An out-of-band type specifier would be cool, though you'd still have to know the schema implied by each type.

In the system I work with, we standardized on objects with a top-level "type" key containing a string that identifies the type. Of course, that only works because we have lots of different tools that all output the same 30 or so data types. It definitely wouldn't scale to interoperability in general. But that's also one of the great things about JSON: it's flexible enough that you can work out a system that works at your scale, no more and no less.
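For illustration, it looks something like this (the type names and fields here are invented, not our real ones):

    $ cat events.jsonl
    {"type":"trade","symbol":"XYZ","qty":100}
    {"type":"quote","symbol":"XYZ","bid":9.99}
    $ jq -c 'select(.type == "trade")' events.jsonl
    {"type":"trade","symbol":"XYZ","qty":100}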


Confusingly, jq also has a streaming mode (https://stedolan.github.io/jq/manual/#Streaming) that streams JSON values as [<path>, <value>] pairs. This can be combined with null input (-n) to reduce, foreach, etc. in a memory-efficient way, e.g. summing all .a in an array without loading the whole array into memory:

    $ echo '[{"a":1},{"b":2},{"a":3}]' | jq -n --stream 'reduce (inputs | select(.[0][1:] == ["a"])[1]) as $v (0; .+$v)'
    4
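For comparison, the same sum over line-delimited input needs no path bookkeeping (a sketch; the inline printf stands in for a real file with one object per line):

    $ printf '{"a":1}\n{"b":2}\n{"a":3}\n' | jq -n 'reduce (inputs | .a) as $v (0; . + $v)'
    4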


Isn't this pretty much what JSON streaming does?


Yep, it’s a subset of JSON streaming (using Wikipedia’s definition [0], it’s the second major heading on that page). I like it because it preserves existing Unix tools like grep, but the other methods of streaming JSON have their own advantages.

[0]: https://en.m.wikipedia.org/wiki/JSON_streaming



