I think this is the key takeaway. If you are parsing huge JSON documents, this will make a difference.
Otherwise, it's super hard to beat Python3's native implementation. I've tried them all - hyperjson, orjson, ujson... none of them provide any significant gains on parse speed of smallish JSON documents, even if you are parsing thousands of small(ish) JSON documents in a row.
There was no trick I could find to reduce the cost of bringing the document into Python land (treating all strings as bytes instead of going through PyUnicode_DecodeUTF8/unicode_decode_utf8 was a fairly large gain on string-heavy documents, but that's about it), so I added a quick hack to push the filtering down into C, since most of my use cases only need a small part of the full document.
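The shape of it, from the Python side, is roughly this. A pure-Python sketch only: the real hack does the selection in C before any Python object is created, and the function and argument names here are made up.

import json  # stand-in here for the native parser

def loads_filtered(raw, wanted_keys):
    # Hypothetical API: in the C version, keys outside wanted_keys are
    # skipped during parsing and never become Python objects.
    doc = json.loads(raw)
    return {k: doc[k] for k in wanted_keys if k in doc}

print(loads_filtered('{"id": 1, "name": "x", "payload": "...lots of stuff..."}',
                     ["id", "name"]))
# {'id': 1, 'name': 'x'}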
I thought a bit about this - I think you could go a lot faster if you had a 'bulk' method to create lots of Python objects in one hit. I don't know if that's feasible, as I don't know much about the innards of Python, but there's no reason one couldn't keep goofing around with SIMD code and branch-free stuff all the way until you've created the objects.
This may be ferociously unmaintainable (i.e. depend on things that Python implementations don't advertise as a part of the API) but I don't know. I doubt you're the only one to have this problem.
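A loose, Python-level analogy for why doing things in one hit helps (this only demonstrates the per-call overhead side, not the bulk object creation the parent comment is talking about):

import json
import timeit

records = ['{"id": %d, "name": "x"}' % i for i in range(10_000)]
as_one_array = "[" + ",".join(records) + "]"

# many small native calls vs. one native call that returns everything
many = timeit.timeit(lambda: [json.loads(r) for r in records], number=50)
one = timeit.timeit(lambda: json.loads(as_one_array), number=50)
print(f"per-record calls: {many:.3f}s  single array call: {one:.3f}s")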
I think the time is currently best spent trying to push more of the work back into native code, extending the simple filtering that's in there now to also handle mutation (similar to jq), reducing the amount of back-and-forth with the interpreter.
Does this vary based on the size/quantity of the input JSON files? I would have expected it to hold regardless of size/quantity. Further, isn't CPython's implementation already native code? If so, I would expect the bottleneck to be allocating the output objects, or pressure on the GC, or something like that.
# interpreter time spent here looping
for file_name in stuff:
    # interpreter: do stuff with file name, load file into string, etc.
    str_data = ...
    # interpreter: dispatch call to native function
    # native: actually perform json.loads
    data = json.loads(str_data)
    # interpreter: time spent processing the data

# what happens as native time goes to zero (json parse time decreases)?
# answer: interpreter time dominates
PyPy is a JITed implementation though, so the interpreted bits are much faster. OP is making the point that Python 3/CPython is so slow that speeding up json.loads is a premature (and unnecessary) optimization.
I've been working with a 90 GB JSON file recently. I'd probably start counting a file as large at around 100 MB.
It's the perfect solution because you can use the same decoder + something like jq for debugging and manipulation on the command line.
It isn't quite "awk-ish", not quite cut/join/sort/uniq. But also not quite GraphQL? Which I suppose would be the "natural" language for a JSON-query tool?
I feel like I'm missing a piece - and missing out...
Filter out empty or null values:
I seem to either end up with unwieldy, long/verbose solutions - or shorter ones that don't work :)
I tend to get 90% of the way with jq and then awk/sed/grep my way to the desired result (jq -r).
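In Python terms, the filter I'm after is roughly this (a minimal sketch, assuming "empty" means None, empty string, empty list or empty dict):

def drop_empty(value):
    # Recursively remove None and empty strings/lists/dicts.
    if isinstance(value, dict):
        cleaned = {k: drop_empty(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    if isinstance(value, list):
        cleaned = [drop_empty(v) for v in value]
        return [v for v in cleaned if v not in (None, "", [], {})]
    return value

print(drop_empty({"a": 1, "b": None, "c": "", "d": {"e": None}, "f": [0, None]}))
# {'a': 1, 'f': [0]}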
I'm guessing that OP wasn't parsing a single 90 GB object, but a stream of many newline-delimited JSON objects.
Plus the human element helps.
I wish that were true. Alas it really was a 90 GB object. Actually, the first thing I did was split it by top-level key before processing further.
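Splitting a huge top-level object by key without loading it all at once is the kind of thing a streaming parser handles; a minimal sketch, assuming something like ijson's kvitems API (file names are placeholders):

import json
import ijson  # streaming parser; assuming its kvitems() API here

with open("huge.json", "rb") as f:  # placeholder file name
    # kvitems(f, "") yields the top-level object's (key, value) pairs
    # one at a time, without materializing the whole document.
    for key, value in ijson.kvitems(f, ""):
        with open(f"{key}.json", "w") as out:
            json.dump(value, out)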
No, it's not.
>Clearly, no JSON parser provides a significant gain on parse speed of small JSON documents, when you define large as those where parse time is significant.
A rather bizarre deduction.
E.g. "define large as those where parse time is significant".
So let's say a large file is one whose parse time is > 10 seconds.
So, while looking at small (i.e. < 10 second parse time) files, if a JSON parser takes files that others load in 500ms and loads them in 20ms, then that's "a significant gain on parse speed of small JSON documents"...
We didn't have to change any of the definitions...
They can be huge changes. With slightly larger docs we had a setup that had to return batches of results back. The bottleneck was absolutely the single-threaded part of the API encoding and decoding JSON.
Even if the docs were just the size of my example, that's 25s for 1M docs. With a few threads I can easily run 10k+/s of the useful part, so that'd be 100s + 25s + encode time. With a few more threads hitting 50k/s, that's 20s useful time + 25s + more.
With larger docs it becomes far more of a problem. More importantly the fix is a few seconds to implement.
A million iterations that total to the order of ten seconds seems quite reasonable for a quick performance comparison. It's just too underspecified for the rest of us to take away much beyond that.
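If anyone wants to sanity-check that order of magnitude themselves, a quick sketch (the document here is made up, so the per-doc time will differ from the parent's):

import json
import timeit

doc = '{"id": 12345, "name": "example", "tags": ["a", "b"], "score": 0.5}'
n = 1_000_000
total = timeit.timeit(lambda: json.loads(doc), number=n)
print(f"{total:.1f}s total, {total / n * 1e6:.1f}µs per doc")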
Python's stdlib is nice, but rarely spectacular. An old naive CSV parser I wrote for Chicken Scheme (using a simple state machine) beat Python's C-based one.
The stdlib is also conservative: I've run into quite a few cases where someone had reported huge gains until they realized they weren't properly handling escaping, Unicode, etc., and once updated, the gap narrowed considerably.
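A concrete example of the kind of input that naive "fast" parsers tend to mishandle but the stdlib gets right: quoted fields with embedded quotes, commas and newlines.

import csv
import io

tricky = 'a,"b ""quoted"", with, commas\nand a newline",c\n'
print(list(csv.reader(io.StringIO(tricky))))
# [['a', 'b "quoted", with, commas\nand a newline', 'c']]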
The standard, btw, is my Cpanel::JSON::XS, and he didn't compare with cperl, which would be much faster and more secure. And he misinterpreted why glibc made a big difference: -march=native got better, not malloc.
Seems to be this one: www.python.org/dev/peps/pep-0595/
Please correct if I'm wrong.
Is pretty amazing. It's the first library we add to most of our projects. I wonder if it would make sense to port it to using cffi so that PyPy can benefit?
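A rough sketch of what the cffi route could look like (the shared library name and the parse_json signature are made up; this only shows the shape of an ABI-mode binding):

from cffi import FFI

ffi = FFI()
# Hypothetical C entry point; a real binding would mirror the
# library's actual exported functions.
ffi.cdef("const char *parse_json(const char *input, size_t length);")
lib = ffi.dlopen("./libfastjson_example.so")  # made-up library name

def loads(raw: bytes):
    # A real port would build Python objects from the C result here;
    # returning the raw string is just for illustration.
    return ffi.string(lib.parse_json(raw, len(raw)))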
When I say skeptical, what I mean is that I have no doubt the measurements are accurate (as in not fabricated); however, I wonder what the full context of the benchmarks is and whether it's representative of situations like mine, where the built-in parser started to fall over so we replaced it. It may also be the case that the PyPy parser would beat ujson in this case, but if we limit our test cases to those where the CPython implementation is marginally faster than ujson, then I'm not sure it's that useful a benchmark.
I too cannot comment on the quality of the implementation or the relative performance against other parsers...but it solved a performance problem for at least one project of mine in the past.
But the difference is pretty insignificant compared to Node and PyPyFull.
Shouldn't it be opaque?