
Memory use and speed of JSON parsers - kawera
http://blog.ionelmc.ro/2015/11/22/memory-use-and-speed-of-json-parsers/
======
scj
I did an experiment[1] last year that demonstrated JSON parsing only requires
1 bit on the stack per array or associative array encountered (as they are the
two non-terminals that lead to different rules).

In my implementation, I let a 64-bit integer represent the non-terminal stack
(then used recursion if it went over that). So you'd need a JSON structure
that has "a hash of a hash of a hash ... pointing to an array" with a depth in
the 60s before you'd get a recursion call. And the real program stack
shouldn't take that much per call (I didn't optimize this part, so I'm not
sure what the minimum is).
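
Roughly, the bit-stack works like this (a fresh sketch of the idea in C, not the code from [1]; the names are made up):

    #include <stdint.h>
    
    /* One bit of parse context per nesting level:
       0 = inside an array, 1 = inside an object. */
    static uint64_t ctx_stack = 0;   /* the 64-bit non-terminal "stack" */
    static int ctx_depth = 0;        /* current nesting depth */
    
    /* Called on '{' or '['. Returns 0 once the 64 bits are used up,
       at which point the caller falls back to ordinary recursion. */
    static int push_ctx(int is_object)
    {
        if (ctx_depth >= 64)
            return 0;
        ctx_stack = (ctx_stack << 1) | (uint64_t)is_object;
        ctx_depth++;
        return 1;
    }
    
    /* Called on '}' or ']'. Returns the context we popped back into,
       or -1 when we are back at the top level. */
    static int pop_ctx(void)
    {
        ctx_stack >>= 1;
        ctx_depth--;
        return ctx_depth > 0 ? (int)(ctx_stack & 1) : -1;
    }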

I only wrote a JSON validator, but my method should apply to cases where JSON
elements are being placed into a data structure.

I'm pretty sure that a JSON parser could be written in C/C++ that requires
less than 1kb overhead, where most of that is the string buffer.

I make no claims about speed though, as my concern was proving the bit part (I
suspect speed using my method would be worse than the naive per-character jump
table).

[1]
[http://web.cs.dal.ca/~sjackson/bitJson/JSON.html](http://web.cs.dal.ca/~sjackson/bitJson/JSON.html)

------
pilif
Once you have more than 50MB of JSON, maybe it's time to switch to a streaming
parser. While that's not going to give you more raw parsing speed, it will
solve all your memory issues, and if you are getting that data from a slow
source (like the network) you can even parse as you read it, which would give
you a considerable end-to-end speed advantage.
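
The core of such a streaming pass is small. A rough C sketch, assuming the
document is one big top-level array of objects: it scans chunk by chunk,
tracks string/escape state and nesting depth, and notes where each top-level
element ends, so each element could be handed to a real parser and dropped
before the next arrives:

    #include <stdio.h>
    
    int main(void)
    {
        char buf[4096];
        size_t n;
        int depth = 0, in_str = 0, esc = 0;
        long elements = 0;
    
        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            for (size_t i = 0; i < n; i++) {
                char c = buf[i];
                if (in_str) {            /* brackets inside strings don't count */
                    if (esc)            esc = 0;
                    else if (c == '\\') esc = 1;
                    else if (c == '"')  in_str = 0;
                } else if (c == '"') {
                    in_str = 1;
                } else if (c == '{' || c == '[') {
                    depth++;
                } else if (c == '}' || c == ']') {
                    depth--;
                    if (depth == 1)      /* a top-level element just closed */
                        elements++;
                }
            }
        }
        printf("top-level elements: %ld\n", elements);
        return 0;
    }

Run as `./scan < big.json`; memory use stays constant no matter how large the
input is.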

~~~
habitue
Seems like a streaming parser wouldn't help too much in this case because all
of the data was inside one terminal.

------
aabaker99
I had a very similar problem with Ruby and some of our workloads. I ended up
using this evented parser [1] to process smaller subsets of the JSON document
and then clear them from memory. The result is memory usage that is at most
the size of one of these subsets (plus some overhead).

I have been wanting to generalize this approach with some open source software
for some time, but I haven't gotten around to it. I think an attractive
solution is for the user to specify a JSONPath [2] expression up front
describing the subsets they are interested in. In the case of Ruby these
chunks could then be yielded by an iterator and cleared, but I wonder whether
a more general-purpose CLI program that prints the subsets to stdout would
also work, if the subsets are small enough. That would add another
serialization/deserialization step, though.

[1] [https://github.com/dgraham/json-stream](https://github.com/dgraham/json-stream)

[2] [http://goessner.net/articles/JsonPath/](http://goessner.net/articles/JsonPath/)

~~~
lobster_johnson
I did something very similar with XML in Ruby. All the libraries are either
event-based (SAX, which is awful for actual parsing code) or snarf the whole
document into RAM.

I wrote a simple SAX wrapper [1] that applies a (so far, extremely simplified)
subset of XPath to the stream, then parses each match into a DOM using
Nokogiri. Works exceedingly well, though it's not _faster_ than reading
everything into RAM.

[1] [https://github.com/t11e/nokogiri-simple-streaming](https://github.com/t11e/nokogiri-simple-streaming)

------
edsiper2
In the project I am involved in[0], I have been doing stress tests with very
large JSON messages. After evaluating a few options, I concluded that the
'jsmn' project was good enough for my purposes.

    
    
        https://github.com/zserge/jsmn
    

I am not sure if there are Python or other language bindings available, but
the library is very clean and simple to use. It does not perform memory
allocations, as you (the caller) provide the buffers, so it gives you good
flexibility.
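
To show what the caller-provided-buffer style looks like, a minimal sketch
(the input string and buffer sizes here are made up):

    #include <stdio.h>
    #include <string.h>
    #include "jsmn.h"
    
    int main(void)
    {
        const char *js = "{\"name\": \"fluentbit\", \"port\": 24224}";
        jsmn_parser parser;
        jsmntok_t tokens[32];       /* caller-owned; could live on the stack */
    
        jsmn_init(&parser);
        int n = jsmn_parse(&parser, js, strlen(js), tokens, 32);
        if (n < 0)                  /* JSMN_ERROR_NOMEM / _INVAL / _PART */
            return 1;
    
        /* Tokens are just (type, start, end) offsets into the input;
           nothing is copied and nothing is allocated. */
        for (int i = 0; i < n; i++)
            printf("token %d: %.*s\n", i,
                   tokens[i].end - tokens[i].start, js + tokens[i].start);
        return 0;
    }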

[0] [http://fluentbit.io](http://fluentbit.io)

~~~
pjungwir
I've worked with jsmn, and it was really a pleasure. I loved that you had one
big memory allocation up front (which could even be on the stack) and then no
more. In my case I was filling in a data structure where I was comfortable
capping the length of all list-like elements, so I also had just one big
allocation for that.

Using it reminded me a lot of my days writing XML SAX parsers. I understand
why some people don't like that pattern, but if you've ever written a Finite
State Machine it will feel very familiar. Personally I like how the event-
based model lets me break things up into little chunks.

------
stared
I've made a plot from this data:
[https://plot.ly/~PiotrMigdal/86/](https://plot.ly/~PiotrMigdal/86/) (I was
having trouble comprehending all the data at once).

------
mkesper
The title is missing "...and alternatives". That was the best part!

------
any1
I made a "JSON parser generator" a few months ago. There is no memory overhead
because it just puts the json data straight into a C struct while it's
parsing.

[https://github.com/marel-keytech/jsoncc](https://github.com/marel-keytech/jsoncc)

------
vvanders
I'd be curious to see how FlatBuffers lines up against MsgPack, seeing as it
requires just a single allocation and read.

Heck, if you want to get fancy (and have access to mmap) you can just mmap()
your FlatBuffer data and let the kernel handle paging it in/out for you, for
very low memory overhead.
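
A sketch of the mmap route on a POSIX system (the FlatBuffers accessor step is
left as a comment, since the generated names depend on your schema):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    
    /* Map a FlatBuffer file read-only; the kernel pages it in on demand. */
    void *map_buffer(const char *path, size_t *size_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
    
        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }
    
        void *buf = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                  /* the mapping survives the close */
        if (buf == MAP_FAILED)
            return NULL;
    
        *size_out = (size_t)st.st_size;
        /* Hand `buf` to your generated FlatBuffers root accessor; fields
           are read in place, so nothing is deserialized up front. */
        return buf;
    }

Because the mapping is read-only, untouched pages never even need to be read
from disk, and the kernel can evict clean pages under memory pressure.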

------
dorfsmay
Sometimes you don't have a choice, say you receive your data as JSON. In those
cases, my experience is that ujson is much faster than the alternatives.
Interesting that the author reaches the same conclusion for small objects but
sees a different picture with bigger objects.

------
stefantalpalaru
The word "Python" should be inserted somewhere in the title.

