
PyPy's New JSON Parser - gok
https://morepypy.blogspot.com/2019/10/pypys-new-json-parser.html
======
umvi
> In the last year or two I have worked on and off on making PyPy's JSON
> faster, particularly when parsing large JSON files

I think this is the key takeaway. If you are parsing huge JSON documents, this
will make a difference.

Otherwise, it's super hard to beat Python 3's native implementation. I've
tried them all - hyperjson, orjson, ujson... none of them provide any
significant gains in parse speed on smallish JSON documents, even if you are
parsing thousands of small(ish) documents in a row.
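
For context, here's a minimal sketch of the kind of micro-benchmark I mean
(the sample document is made up, and ujson is a third-party install):

```python
# Compare stdlib json and ujson on a small document, parsed many times.
import json
import timeit

import ujson  # third-party: pip install ujson

doc = '{"id": 123, "name": "example", "tags": ["a", "b"], "active": true}'

for name, loads in [("json", json.loads), ("ujson", ujson.loads)]:
    t = timeit.timeit(lambda: loads(doc), number=100_000)
    print(f"{name}: {t:.3f}s for 100k parses")
```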

~~~
tachyonbeam
Something else to note here is that the Python interpreter is slow compared to
native code, so if you're parsing a lot of small JSON files, whatever speed
difference the JSON parser makes is somewhat masked by interpreter overhead
(assuming you're doing something with the JSON you just parsed). When you're
using a JIT like PyPy, the overhead of the rest of your code shrinks, and the
bottleneck can then become something like JSON parsing, I/O, etc.
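
A quick way to see this in practice is to profile the whole loop rather than
just the parse call. A sketch (the workload here is invented):

```python
# Profile parsing vs. the Python-side work done with the parsed result.
import cProfile
import json

docs = ['{"value": %d}' % i for i in range(100_000)]

def work():
    total = 0
    for raw in docs:
        obj = json.loads(raw)       # parsing
        total += obj["value"] * 2   # interpreter-side work on the result
    return total

cProfile.run("work()", sort="cumulative")
```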

~~~
TkTech
This is the main bottleneck in pysimdjson[1]. 95-99%[2] of the time is spent
creating Python objects rather than parsing the documents.

I couldn't find any trick to reduce the cost of bringing the document into
Python land (although treating all strings as bytes instead of using
PyUnicode_DecodeUTF8/unicode_decode_utf8 was a fairly large gain on
string-heavy documents), so I added a quick hack to push the filtering down
into C, since most of my use cases only need a small part of the full
document.

[1]: [https://github.com/TkTech/pysimdjson](https://github.com/TkTech/pysimdjson)
[2]: [https://github.com/TkTech/pysimdjson/issues/22](https://github.com/TkTech/pysimdjson/issues/22)
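
To illustrate the filtering idea, here is a sketch; the method name follows
later pysimdjson releases and may differ from the version discussed here:

```python
# Let the C/SIMD parser locate the value so that only the requested
# slice is materialized as Python objects.
import simdjson  # pip install pysimdjson

parser = simdjson.Parser()
raw = b'{"results": [{"name": "a"}, {"name": "b"}], "meta": {"count": 2}}'
doc = parser.parse(raw)

# Only this small part of the document crosses into Python land.
# (at_pointer is a JSON Pointer lookup in recent pysimdjson releases.)
name = doc.at_pointer("/results/1/name")
print(name)  # "b"
```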

~~~
glangdale
(one of the original authors of simdjson here)

I thought a bit about this - I think you could go a _lot_ faster if you had a
'bulk' method to create lots of Python objects in one hit. I don't know if
that's feasible, as I don't know much about the innards of Python, but there's
no reason one couldn't keep goofing around with SIMD code and branch-free
stuff all the way until you've created the objects.

This may be ferociously unmaintainable (i.e. depend on things that Python
implementations don't advertise as part of the API), but I don't know. I
doubt you're the only one to have this problem.

~~~
TkTech
Like you said, it would unfortunately be unmaintainable, requiring a fork of
CPython. Aggressive optimizations like this could go into pypy, but IMO
CPython is still trying to be a "reasonably readable" reference
implementation.

I think the time is currently best spent trying to push more of the work back
into native code, extending the simple filtering that's in there now to also
handle mutation (similar to jq), reducing the amount of back-and-forth with
the interpreter.

------
mhd
From the Perl side, there's been an interesting talk by Graham TerMarsch[1]
in which he tried quite a few of the allegedly fastest JSON parsers and
compared them to the standard solution (JSON::XS). Apparently a lot got lost
"in translation" when converting the C/C++ data.

[1]:
[https://www.youtube.com/watch?v=BToZ_E3vU0Y](https://www.youtube.com/watch?v=BToZ_E3vU0Y)

~~~
rurban
An interesting comparison of most of the fast JSON parsers; only
[https://github.com/lemire/simdjson](https://github.com/lemire/simdjson) is
missing. Purple beating RapidJSON.

The standard, by the way, is my Cpanel::JSON::XS, and he didn't compare with
cperl, which would be much faster and more secure. And he misinterpreted why
glibc made a big difference: -march=native got better, not malloc.

------
xtreak29
Python 3.9 will also include some optimizations for parsing large strings:
[https://bugs.python.org/issue37587](https://bugs.python.org/issue37587)

~~~
jfkw
While reading that bug, I was reminded how amazed I am that the Roundup issue
tracker [1] has continued to serve Python development after all this time,
through transitions in hosting and workflow. Roundup is neat, and I'm glad it
exists.

[1] [http://roundup.sourceforge.net/](http://roundup.sourceforge.net/)

~~~
xtreak29
There is an accepted PEP to transition to GitHub issues:
[https://www.python.org/dev/peps/pep-0581/](https://www.python.org/dev/peps/pep-0581/).
But there is another PEP proposing to keep using Roundup, to address the
concerns that were raised.

~~~
Nicksil
>But there is another PEP to keep using roundup to address concerns raised.

Seems to be this one: [https://www.python.org/dev/peps/pep-0595/](https://www.python.org/dev/peps/pep-0595/)

Please correct if I'm wrong.

~~~
xtreak29
Yes. The maintainer of bpo also showed some demos of a REST API addition for
building other features, and a GSoC student is working on a few features like
Markdown support; both were presented during this year's core dev sprints.
There is a separate PEP for the actual migration:
[https://www.python.org/dev/peps/pep-0588/](https://www.python.org/dev/peps/pep-0588/).
I am not sure when it will take effect, given how deeply bpo is embedded in
the core dev workflow. There is also an open issue to track workflows that are
missing in GitHub but present in bpo:
[https://github.com/python/core-workflow/issues/359](https://github.com/python/core-workflow/issues/359)

------
sontek
Python RapidJSON
([https://github.com/python-rapidjson/python-rapidjson](https://github.com/python-rapidjson/python-rapidjson))
is pretty amazing. It's the first library we add to most of our projects. I
wonder if it would make sense to port it to cffi so that PyPy can benefit?
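
For anyone who hasn't tried it, it's close to a drop-in replacement. A minimal
sketch (install as python-rapidjson, import as rapidjson):

```python
import rapidjson  # pip install python-rapidjson

obj = rapidjson.loads('{"a": [1, 2, 3]}')  # parse, like json.loads
print(rapidjson.dumps(obj))                # serialize, like json.dumps
```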

------
slovenlyrobot
This post touches briefly on ujson. I found it a long time ago, and while I
can't vouch for the quality of its implementation, it's very popular and seems
to work just fine. Any time I've encountered a bottleneck due to parsing,
dropping ujson.loads() in has made it completely vanish from profiling output.
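
The swap is usually just an import alias, which is why it's so easy to try. A
sketch, assuming ujson is installed:

```python
try:
    import ujson as json  # C-accelerated drop-in for the common cases
except ImportError:
    import json  # fall back to the stdlib

data = json.loads('{"hello": "world"}')
```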

~~~
ambivalence
Please note that according to the benchmark in this blog post, CPython's
builtin JSON decoder is faster than ujson.

~~~
gshulegaard
That result is actually why I am healthily skeptical of the performance
comparisons in this post specifically. Like the previous commenter, I too have
been in a situation where JSON encode/decode on large payloads was significant
enough that we decided we needed something more optimal. Our solution was to
use ujson instead of the built-in, and that was good enough that we moved on.

When I say skeptical, what I mean is that I have no doubt the measurements are
accurate (as in, not fabricated); however, I wonder what the full context of
the benchmarks is, and whether it's representative of situations like mine,
where the built-in parser started to fall over so we replaced it. It may also
be the case that the PyPy parser would beat ujson in this scenario, but if we
limit our test cases to just those where the CPython implementation is
marginally faster than ujson, then I'm not sure it's that useful a benchmark.

I too cannot comment on the quality of the implementation or the relative
performance against other parsers...but it solved a performance problem for at
least one project of mine in the past.
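
The experiment that would settle it for a case like mine is the same
comparison on a genuinely large payload, something along these lines (a
sketch; the payload shape is arbitrary):

```python
import json
import time

import ujson  # third-party: pip install ujson

# Build a large (tens of MB) synthetic document.
big = json.dumps([
    {"id": i, "name": "user%d" % i, "scores": [i, i + 1, i + 2]}
    for i in range(500_000)
])

for name, loads in [("json", json.loads), ("ujson", ujson.loads)]:
    start = time.perf_counter()
    loads(big)
    print("%s: %.2fs" % (name, time.perf_counter() - start))
```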

~~~
kmod
Agreed. I've gotten huge speedups from ujson in the past, and the numbers in
the post are markedly different from my experience, which makes me think they
are measuring something different, or that there is something significant
about this use case that differs from the ones I've had in the past.

~~~
weberc2
The post is profiling large payloads? Maybe ujson is optimized for smaller
payloads?

------
gsaga
>This is completely _transparent_ to the Python programmer, the dictionary
will look completely normal to the Python program but its internal
representation is different.

Shouldn't it be opaque?

~~~
ptx
The optimization is transparent because the representation is opaque, I guess?

