PyPy's New JSON Parser (morepypy.blogspot.com)
192 points by gok 14 days ago | 58 comments



> In the last year or two I have worked on and off on making PyPy's JSON faster, particularly when parsing large JSON files

I think this is the key takeaway. If you are parsing huge JSON documents, this will make a difference.

Otherwise, it's super hard to beat Python3's native implementation. I've tried them all - hyperjson, orjson, ujson... none of them provide any significant gains on parse speed of smallish JSON documents, even if you are parsing thousands of small(ish) JSON documents in a row.


Something else to note here is that the Python interpreter is slow compared to native code, so if you're parsing a lot of small JSON files, whatever difference in speed the JSON parser makes is masked somewhat by the interpreter overhead (assuming you're doing something with that JSON you just parsed). When you're using a JIT like PyPy, the overhead of the rest of your code becomes less, and so then the bottleneck could become something like JSON parsing, I/O, etc.

This is the main bottleneck in pysimdjson[1]. 95-99%[2] of the time is spent creating Python objects rather than parsing the documents.

There was no trick I could find (although treating all strings as bytes instead of using PyUnicode_DecodeUTF8/unicode_decode_utf8 was a fairly large gain on string-heavy documents) to reduce the cost of bringing the document into Python land, so I added a quick hack to push the filtering down into C, since most of my use cases only need a small part of the full document.

[1]: https://github.com/TkTech/pysimdjson [2]: https://github.com/TkTech/pysimdjson/issues/22


(one of the original authors of simdjson here)

I thought a bit about this - I think you could go a lot faster if you had a 'bulk' method to create lots of Python objects in one hit. I don't know if that's feasible, as I don't know much about the innards of Python, but there's no reason one couldn't keep goofing around with SIMD code and branch-free stuff all the way until you've created objects.

This may be ferociously unmaintainable (i.e. depend on things that Python implementations don't advertise as a part of the API) but I don't know. I doubt you're the only one to have this problem.


Like you said, it would unfortunately be unmaintainable, requiring a fork of CPython. Aggressive optimizations like this could go into pypy, but IMO CPython is still trying to be a "reasonably readable" reference implementation.

I think the time is currently best spent trying to push more of the work back into native code, extending the simple filtering that's in there now to also handle mutation (similar to jq), reducing the amount of back-and-forth with the interpreter.


This whole discussion makes me think that perhaps there would be a point in parsing JSON into a "JSON database" object instead of a tree of individual hashmaps. In theory it would allow for a more memory-efficient representation and faster search or transformation operations.

> so if you're parsing a lot of small JSON files, whatever difference in speed the JSON parser makes is masked somewhat by the interpreter overhead

Does this vary based on the size/quantity of the input JSON files? I would expect this to hold regardless of size/quantity of the input files? Further, isn't CPython's implementation already native code? If so, I would expect the bottleneck to be allocating the output objects or pressure on the GC or something?


Of course, the smaller the JSON file, the shorter the amount of time spent in native code, and proportionally more time will be spent in the interpreter around whatever looping construct is used to feed data into json.loads (and presumably process that data).

    # interpreter time spent here looping
    for file_name in stuff:
        # interpreter: do stuff with file name, load file into string, etc.
        str_data = ...
        # interpreter: dispatch call to native function
        # native: actually perform json.loads
        data = json.loads(str_data)
        # interpreter: time spent processing the data
        process_data(data)

    # what happens as native time goes to zero (json parse time decreases)?
    # answer: interpreter time dominates

So "interpreter time" is defined as "time not spent parsing JSON" and thus "the less time you spend parsing JSON the less time you spend parsing JSON"? Of course that's correct, but it's so obvious as to be unsubstantial. Surely the OP is making a more interesting point...

No, that's it - for most json documents, the time the interpreter spends in all the stuff surrounding (and executing) json.loads dominates the time spent in native code parsing. That's why replacing json.loads with a faster native implementation doesn't result in noticeable improvements - it's speeding up the fastest bit, and if it's already 0.1% of the wall clock time, then you can't get much more than about 0.1% faster.
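
To put numbers on that: this is just Amdahl's law. The 0.1% share is the figure from the comment above; the parser speedup factors are made up for illustration.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the
    wall-clock time is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# json.loads takes 0.1% of wall-clock time, as in the comment above.
# The speedup factors below are hypothetical.
for s in (2, 10, float("inf")):
    gain = (overall_speedup(0.001, s) - 1.0) * 100
    print(f"parser {s}x faster -> program {gain:.3f}% faster")
```

Even an infinitely fast parser only removes the 0.1% it was costing in the first place.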

PyPy is a JITed language though, so the interpreted bits are much faster. OP is making the point that Python 3/CPython is so slow that speeding up json.loads is a premature (and unnecessary) optimization.


I would think that's because the overhead tends to dominate when calling on small JSON, whether it's function call overhead or python's interpreter, etc.

Really? I just tried with a sample of one of my files, using python 3.6. With text in memory, 1M smallish strings decoding is 12s with ujson and 25s with the built in json. orjson and rapidjson seemed about the same for this test.
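
A rough way to reproduce this kind of measurement with only the standard library. The document shape and count are made up, and ujson/orjson are left out so the snippet has no third-party dependencies; pass their `loads` to `time_loads` (a name invented for this sketch) to compare.

```python
import json
import time

def time_loads(loads, docs):
    """Return the seconds taken to decode every string in `docs` with `loads`."""
    start = time.perf_counter()
    for doc in docs:
        loads(doc)
    return time.perf_counter() - start

# Hypothetical small documents, each well under 1 KB, already in memory.
docs = [
    json.dumps({"id": i, "name": "user%d" % i, "tags": ["a", "b"], "score": i / 7})
    for i in range(10_000)
]

elapsed = time_loads(json.loads, docs)
print(f"{len(docs)} docs in {elapsed:.3f}s ({elapsed / len(docs) * 1e6:.1f} us/doc)")
```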

Hmm, I'm starting to think my definition of "small" is different from everybody else's. I consider "smallish" JSON to be at most 100KB (and on average 1-10 KB). 1 MB I consider to be "large"... basically anything that doesn't parse near instantaneously I consider "large".

I think GP might've been saying 1M small JSON documents, rather than a single document containing 1M strings.

If you use a language like Rust (which is really easy for this kind of job), then 1 MB will parse almost instantly.

I've been working with a 90gb JSON file recently. I'd probably start counting a file as large at around 100mb.


I have to ask - why? I understand if you're dealing with an external resource that you can't change, but I'd still like to ask them the same question. JSON's strength, it seems to me, is that it's great for human consumption. It's a very readable data structure when serialized! But who is ever going to read 90GB of JSON?

Not out of choice! It was a backup of a firebase database (also not my choice). I totally agree that JSON is not a good format for files this large, but it was the only format the data was available in so what can you do.

Not OP, but I deal with a variety of proprietary binary/encoded files in the 10s of MB to the 10s of GB range at work, and we have started decoding to JSON as the first step in our normalization, indexing and analytics pipeline (To be honest we essentially just throw everything in to MongoDB). Daily we're in the terabyte range.

It's the perfect solution because you can use the same decoder + something like jq for debugging and manipulation on the command line.


A question - do you know of any good resources for jq? I haven't quite been "angry enough" to use jq much beyond using it for formatting json - but that's also partly because I can't seem to quite grasp the syntax.

It isn't quite "awk-ish", not quite cut/join/sort/uniq. But also not quite graphql? Which I suppose would be the "natural" language for a json-query tool?

I feel like I'm missing a piece - and missing out...


The manual is pretty dang good, tbh.

https://stedolan.github.io/jq/manual/


I agree that it is clear, but I still struggle to figure out (by myself) how to do (in an awk sense) simple things like:

Filter out empty or null values:

https://stackoverflow.com/questions/39500608/remove-all-null...

I seem to either end up with unwieldy, long/verbose solutions - or shorter ones that do not work :)
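
For what it's worth, here is the same operation spelled out in Python rather than jq. It is only a sketch of one reading of "empty" (null, empty strings, empty arrays and objects), and `strip_empty` is a name made up for the example.

```python
import json

def strip_empty(value):
    """Recursively drop None, "", [] and {} from parsed JSON.
    Containers that become empty after cleaning are dropped too."""
    if isinstance(value, dict):
        cleaned = {k: strip_empty(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    if isinstance(value, list):
        cleaned = [strip_empty(v) for v in value]
        return [v for v in cleaned if v not in (None, "", [], {})]
    return value

doc = json.loads('{"a": null, "b": {"c": "", "d": 0}, "e": [null, 1, {}], "f": {"g": null}}')
print(json.dumps(strip_empty(doc)))  # {"b": {"d": 0}, "e": [1]}
```

Note that 0 and false survive, which is often where the short jq one-liners go wrong.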


Oh wow, that's terrible!

I tend to get 90% of the way with jq and then awk/sed/grep my way to the desired result (jq -r).


It's not necessarily that it's great for human consumption, more that it's super interchangeable and ridiculously easy to get up and running, and for the most part 90Gb of space is so cheap that it's not worth the overhead of using a binary format.

I'm guessing that OP wasn't parsing a single 90Gb object, but a stream of many newline delimited JSON objects.

Plus the human element helps.
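
If the data really is newline-delimited, a minimal sketch of streaming it looks like this. `iter_ndjson` is a made-up helper name, and the `StringIO` stands in for a real file handle.

```python
import io
import json

def iter_ndjson(lines):
    """Yield one parsed object per non-blank line of newline-delimited JSON.
    Only one line is held in memory at a time."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for a large file opened with open(path, encoding="utf-8").
stream = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
total = sum(obj["id"] for obj in iter_ndjson(stream))
print(total)  # 6
```

This only works because each line is a complete document; a single huge object needs an incremental parser instead.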


> I'm guessing that OP wasn't parsing a single 90Gb object, but a stream of many newline delimited JSON objects.

I wish that were true. Alas it really was a 90gb object. Actually, the first thing I did was split it by top-level key before processing further.


Especially considering that JSON compresses well.

Isn't that tautological? Clearly, no JSON parser provides a significant gain on parse speed of small JSON documents, when you define large as those where parse time is significant.

>Isn't that tautological?

No, it's not.

>Clearly, no JSON parser provides a significant gain on parse speed of small JSON documents, when you define large as those where parse time is significant.

A rather bizarre deduction.

E.g. "define large as those where parse time is significant".

So let's say a large file is one whose parse time is > 10 seconds.

So, while looking at small, ie. < 10 second parse time files, if a JSON parser takes files that others load in 500ms and load them in 20ms, then that's "a significant gain on parse speed of small JSON documents"...

We didn't have to change any of the definitions...


Well, Python2 -> Python3 introduced a dramatic performance increase, even for small JSON documents.

These were small though, I just made a small sample of a million of them and still saw a 2x improvement.

Oh I think you misunderstood, they are all small, there's just a million of them. I think they were all about 1k.

Note the qualifiers about document size — anything measured in millions probably isn't “smallish”, and unless your app does nothing other than decode those strings the total runtime change is unlikely to be anywhere near 2:1.

Each document was small, there was just a lot of them.

They can be huge changes. With slightly larger docs we had a setup that had to return batches of results back. The bottleneck was absolutely the single thread part of the API encoding and decoding JSON.

Even if the docs were just the size of my example that's 25s for 1m docs. With a few threads I can easily run 10k+/s of the useful part so that'd be 100s + 25s + encode time. With a few more threads hitting 50k/s that's 20s useful time + 25s + more.
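
Working that arithmetic through, under the assumption that decoding is serial and not overlapped with the useful work. The 25s and 12s figures are the ones quoted earlier in the thread for 1M docs; encode time is ignored.

```python
DOCS = 1_000_000
# Decode times for 1M docs quoted earlier in the thread.
DECODE_SECONDS = {"json": 25.0, "ujson": 12.0}

def decode_share(docs_per_second, decode_seconds):
    """Fraction of total time spent decoding, ignoring encode time."""
    useful_seconds = DOCS / docs_per_second
    return decode_seconds / (useful_seconds + decode_seconds)

for rate in (10_000, 50_000):
    for name, seconds in DECODE_SECONDS.items():
        share = decode_share(rate, seconds)
        print(f"{rate}/s useful work, {name}: {share:.1%} of time in decode")
```

The faster the useful part gets, the larger the share the decoder takes.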

With larger docs it becomes far more of a problem. More importantly the fix is a few seconds to implement.


Again, note the qualifiers: nobody is arguing that faster JSON parsing isn't a good thing, only that anything measured in the millions is probably not considered “small”. A significant amount of code is never profiled since it's fast enough and if someone does find the stdlib json library to be using enough time to care about it is certainly easily replaced.

The small qualifier seemed to be on the size of the documents, and on small documents I see a 2x or often higher improvement. At a scale of 10k docs that's still 100ms difference, and in my case can be a huge proportion of the total time spent (more with larger docs, and more again if you include encoding time).

Plenty of things measured in millions are small. A million bytes is a small amount of memory. There are plenty of domains where a working definition of "small" is roughly "can be slurped into memory on a single node without jumping through hoops for provisioning".

A million iterations that total to order of ten seconds seems quite reasonable for a quick performance comparison. It's just underspecified for the rest of us to take away much beyond that.


I have had pretty large speedups in the past when switching JSON implementations. About 5x. It doesn't matter much when the times are 0.2s vs 1s, but when you are working with a couple of MB of JSON the difference will definitely be noticeable.

Python's STDlib is nice, but rarely spectacular. An old naive CSV parser I wrote for Chicken scheme (using a simple state machine) beat python's C-based one.


> Python's STDlib is nice, but rarely spectacular. An old naive CSV parser I wrote for Chicken scheme (using a simple state machine) beat python's C-based one.

The stdlib is also conservative: I've run into quite a few cases where someone had reported huge gains until they realized they weren't properly handling escaping, Unicode, etc. and once updated the gap narrowed considerably.
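
A small illustration of the kind of case I mean, using a made-up two-line CSV. The naive splitter is a stand-in for a "fast" parser that skips quoting rules, not anyone's actual code.

```python
import csv
import io

text = 'id,comment\n1,"fast, but wrong"\n'

# Naive parser: split on commas, no quote handling.
naive = [row.split(",") for row in text.splitlines()]

# Standard library parser: understands quoted fields.
proper = list(csv.reader(io.StringIO(text)))

print(naive[1])   # ['1', '"fast', ' but wrong"']
print(proper[1])  # ['1', 'fast, but wrong']
```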


You're basically agreeing with the above, and I have experienced the same. The only place where this makes a difference is when the JSON file is in MBs. For smaller files, it's not really worth the extra library.

Depends on how many of the small docs you deal with, I see useful differences on small documents because I process reasonable amounts of them. A 10s change on a million docs for just decoding for a single library is easily worth it for me.

From the Perl side, there's been an interesting talk by Graham TerMarsch[1] where he tried quite a few of the allegedly fastest JSON parsers and compared them to the standard solution (JSON::XS). Apparently a lot got lost "in translation", when converting the C/C++ data.

[1]: https://www.youtube.com/watch?v=BToZ_E3vU0Y


Interesting comparison of most fast json parsers. Just https://github.com/lemire/simdjson is missing. Purple beating RapidJSON.

The standard, btw, is my Cpanel::JSON::XS, and he didn't compare with cperl, which would be much faster and more secure. He also misinterpreted why glibc made a big difference: -march=native got better, not malloc.


I would like a json-game like csv-game[0] that benchmarks all of these together. Here is an example with a /lot/ of C++ parsers:

https://github.com/miloyip/nativejson-benchmark

[0] https://github.com/miloyip/nativejson-benchmark


Python 3.9 would also include some optimizations for parsing large strings : https://bugs.python.org/issue37587


While reading that bug, I am reminded how amazed I am that the Roundup issue tracker [1] has continued to serve Python development after all this time and through transitions in hosting and workflow. Roundup is neat, and I'm glad it exists.

[1] http://roundup.sourceforge.net/


Ka-Ping Yee, the original author of Roundup, has done some pretty awesome things. I remember one time I was BSing with a recruiter in the hallway at PyCon, and Ping walked by behind the recruiter. The recruiter stopped talking to me mid-sentence and gave chase: "Ping! Ping! Can I talk to you for a minute ..."

There is an accepted PEP to transition to GitHub issues : https://www.python.org/dev/peps/pep-0581/ . But there is another PEP to keep using roundup to address concerns raised.

>But there is another PEP to keep using roundup to address concerns raised.

Seems to be this one: www.python.org/dev/peps/pep-0595/

Please correct if I'm wrong.


Yes. During this year's core dev sprints, the maintainer of bpo also showed some demos of a REST API addition for building other features, and of a GSoC student's work on a few features like Markdown support. There is a separate PEP for the actual migration: https://www.python.org/dev/peps/pep-0588/ . I am not sure when it would take effect, given how deeply bpo is embedded in the core dev workflow. There is also an open issue to track workflows that are missing in GitHub but present in bpo: https://github.com/python/core-workflow/issues/359

Python RapidJSON: https://github.com/python-rapidjson/python-rapidjson

Is pretty amazing. It's the first library we add to most of our projects. I wonder if it would make sense to port it to using cffi so that pypy can benefit?


This post touches briefly on ujson. I found it a long time ago, and while I can't vouch for the quality of its implementation, it's very popular and seems to work just fine. Any time I've encountered a bottleneck due to parsing, dropping ujson.loads() in has made it completely vanish from profiling output.


Please note that according to the benchmark in this blogpost, CPython's builtin JSON decoder is faster than ujson.


That result is actually why I am healthily skeptical of the performance comparisons specifically in this post. Like the previous comment I too have been in a situation where JSON encode/decode on large payloads were significant enough that we decided we needed to figure out something more optimal. Our solution was to use ujson instead (of the built in) and that was good enough that we moved on.

When I say skeptical, what I mean is that I have no doubt the measurements are accurate (as in not fabricated); however, I wonder what the full context of the benchmarks is and whether or not they're representative of situations like mine, where the built-in parser started to fall over so we replaced it. It may also be the case that the PyPy parser would beat ujson in this case too, but if we are limiting our test cases to just those where the CPython implementation is marginally faster than ujson then I'm not sure it's that useful a benchmark.

I too cannot comment on the quality of the implementation or the relative performance against other parsers...but it solved a performance problem for at least one project of mine in the past.


Agreed, I've gotten huge speedups from ujson in the past and the numbers in the post are markedly different from my experience, making me think that they are measuring something different / there is something significant about this use case that is different than the ones I've had in the past.

The post is profiling large payloads? Maybe ujson is optimized for smaller payloads?

It's not strictly the case (`reddit` case).

But the difference is pretty insignificant compared to Node and PyPyFull.


I think there are even faster JSON parsers out there nowadays. I have recently used orjson and found it to be extremely fast for our use case (serializing a dict with about 100 MB).


>This is completely transparent to the Python programmer, the dictionary will look completely normal to the Python program but its internal representation is different.

Shouldn't it be opaque?


The optimization is transparent because the representation is opaque, I guess?


