
Writing a Fast JSON Parser - jsheard
https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/
======
chadaustin
Hi all, author here.

Several people have raised questions about SIMD. For many workloads, SIMD is
obviously a huge win. But in my experience, for problems like this, the gains
are marginal and the additional complexity often isn't worth it. e.g., do you
have a build-time option? Check at runtime on every call? Use templates to
check once at startup and run a different code path?
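
For what it's worth, the "check once at startup" option looks roughly like
this (a minimal sketch using GCC/Clang's __builtin_cpu_supports; the function
names are made up, this isn't sajson's code):

    
    #include <stddef.h>
    
    // Stubs standing in for the real scalar and AVX2 scanners.
    static size_t scan_string_scalar(const char* p, size_t n) { (void)p; return n; }
    static size_t scan_string_avx2(const char* p, size_t n)   { (void)p; return n; }
    
    typedef size_t (*scan_fn)(const char*, size_t);
    
    // Resolve the implementation on first call; every call after that is
    // a single indirect jump, with no per-call CPUID check.
    static size_t scan_string(const char* p, size_t n) {
        static scan_fn fn;
        if (!fn)
            fn = __builtin_cpu_supports("avx2") ? scan_string_avx2
                                                : scan_string_scalar;
        return fn(p, n);
    }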

That said, it never hurts to try, so I implemented the fast string path in
AVX2 (see
[https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3...](https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3c50cd368d13e645659990fac)).
It actually benchmarks slower than the scalar code:
[https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7a...](https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7aa61484b)

I couldn't tell you why - it's possible something could be tweaked to make it
actually execute faster, or that enabling -mavx2 causes other code to get
slower, but I wanted to illustrate that enabling SIMD is not always an
automatic go-fast button.

I did look at the instructions generated and they seem reasonable:

    
    
        .LBB5_118:
            vmovdqu	ymm5, ymmword ptr [rbx - 32]
            vpcmpgtb	ymm6, ymm5, ymm0
            vpcmpeqb	ymm7, ymm5, ymm1
            vpcmpeqb	ymm8, ymm5, ymm2
            vpcmpeqb	ymm5, ymm5, ymm3
            vpxor	ymm5, ymm5, ymm4
            vpor	ymm6, ymm8, ymm6
            vpor	ymm5, ymm5, ymm6
            vpor	ymm5, ymm5, ymm7
            vpmovmskb	esi, ymm5
            test	esi, esi
            jne	.LBB5_148
            mov	rbp, rax
            sub	rbp, rbx
            mov	rsi, rbx
            lea	rbx, [rbx + 32]
            cmp	rbp, 31
            jg	.LBB5_118
            jmp	.LBB5_120
    
        .LBB5_148:
            add	rbx, -32
            bsr	eax, esi
            xor	eax, 31
            lea	rsi, [rbx + rax]
            mov	bl, byte ptr [rbx + rax]
        .LBB5_125:
            mov	r10, qword ptr [rsp + 24] # 8-byte Reload
            mov	r9b, byte ptr [rsp + 39] # 1-byte Reload
            sub	rcx, r8
            movzx	eax, bl
            cmp	eax, 34
            jne	.LBB5_127
    

I suppose it's also worth pointing out that, while string scanning is an
important part of JSON parsing, the instruction histogram is generally spread
across the entire parser. That is, doubling string parsing performance would
only improve overall parse times by maybe 10-20%.

Edit: Upon thinking about it, it's probably slower because there's a
dependency chain through a pile of high-latency vector instructions, through
the bsr (should be bsf; I fixed it on the branch), and into the next loop
iteration's load. The branch predictor doesn't have much opportunity to help
us out here. You could look at Agner Fog's tables and calculate the critical
path latency of this loop, but I'm guessing it's not pretty.

~~~
haberman
If you're using AVX instructions make sure to insert a vzeroupper after you're
done, otherwise you take a big speed hit transitioning between SSE and AVX.
The compiler will generate these automatically for AVX instructions it is
emitting:

[https://godbolt.org/g/jNXJyA](https://godbolt.org/g/jNXJyA)

Here's an intrinsic for it:

[http://technion.ac.il/doc/intel/compiler_c/main_cls/intref_c...](http://technion.ac.il/doc/intel/compiler_c/main_cls/intref_cls/common/intref_avx_zeroupper.htm)
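
In hand-written intrinsics it ends up looking something like this (a minimal
sketch; recent compilers building with -mavx2 will usually emit the
vzeroupper at function exit for you):

    
    #include <immintrin.h>
    #include <stddef.h>
    
    void or_bytes_avx2(const unsigned char* p, size_t n, unsigned char out[32]) {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i + 32 <= n; i += 32)
            acc = _mm256_or_si256(acc,
                _mm256_loadu_si256((const __m256i*)(p + i)));
        _mm256_storeu_si256((__m256i*)out, acc);
        // Clear the upper 128 bits of the ymm registers so subsequent
        // SSE code doesn't pay the AVX<->SSE transition penalty.
        _mm256_zeroupper();
    }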

~~~
ot
Little-known fact about AVX: every time you use an AVX instruction, Intel
processors reduce the core frequency by a few percent for about a millisecond
to avoid overheating (I guess because they need to power up an otherwise
unused functional unit).

As a result, if you sprinkle some AVX instructions in your code but not enough
to cause a significant efficiency win you might actually end up reducing your
capacity (and affecting latency).

~~~
burntsushi
Can you provide a citation for that or a link to more reading? I'm not asking
because I don't believe you, but because I would like to learn more. :-)

~~~
ot
Sure, see this doc from Intel:
[https://computing.llnl.gov/tutorials/linux_clusters/intelAVX...](https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf)

"Intel AVX instructions require more power to run. When executing these
instructions, the processor may run at less than the marked frequency to
maintain thermal design power (TDP) limits."

...

"When the processor detects Intel AVX instructions, additional voltage is
applied to the core. With the additional voltage applied, the processor could
run hotter, requiring the operating frequency to be reduced to maintain
operations within the TDP limits. The higher voltage is maintained for 1
millisecond after the last Intel AVX instruction completes, and then the
voltage returns to the nominal TDP voltage level."

------
glangdale
I felt moved to comment on the thread. HN user burntsushi was kind enough to
write up a trick we developed here on the Hyperscan team that should be
considerably faster than this system, assuming there's enough data to make
using SIMD pay off (it would be a loss if the sizes are small enough).

It's possible that a SIMD implementation (assuming that you routinely pass
over enough bytes) can blow this away. I'm not sure what the cross-over point
is - it depends on whether you are falling out of this code quickly or not.
Obviously, using a big SIMD sequence to replace a "2 cycles per byte"
implementation is pretty stupid if the SIMD sequence runs at "0.5 cycles per
byte" but its fixed costs are >=16 cycles and the typical case is small.

My comment follows:

It's also not clear that you actually need a SIMD instruction - if you've
squashed your hotspot to the point that adding the complexity of per-platform
SIMD specialization is only a small win, then why bother?

All that being said...

I've been really lazy in writing this up, but someone kindly did a decent
write-up for us in the Rust reimplementation of our "Teddy" literal matcher.
Please see
[https://github.com/rust-lang/regex/blob/master/src/simd_acce...](https://github.com/rust-lang/regex/blob/master/src/simd_accel/teddy128.rs)
if interested or look up
the Hyperscan project at
[https://github.com/01org/hyperscan](https://github.com/01org/hyperscan) for a
few different versions of this trick.

~~~
ot
You don't even have to go into SIMD and lookup tables: on 64-bit archs you can
process 8 bytes at a time using a few bit tricks. Lookup tables work great in
microbenchmarks because the tables are always in cache, but not so great if
there's cache pressure from other code.

See this implementation of JSON escaping for example:
[https://github.com/facebook/folly/commit/2f0cabfb48b8a8df84f...](https://github.com/facebook/folly/commit/2f0cabfb48b8a8df84f6453ef055e03ee1a34d51)

De-escaping is even simpler because you only have to detect two characters, \
and ".

~~~
glangdale
The bit tricks are OK. We used them extensively back when we supported
platforms that lacked SSSE3, but they are way slower than the SIMD techniques
because of all the bit-bashing and far less flexible.

The technique I outline can match most character classes that you run into
(and there's a 3-PSHUFB variant that can handle _all_ predicates on bytes),
while the bit-twiddling version will get linearly slower as you add more
cases. If there's a way to get at what we did with general purpose operations
and bit twiddling hacks I'm too stupid to think of it.

~~~
ot
Well of course, they're different trade-offs. Here we don't care about
flexibility, we're talking about a specific syntax (JSON).

Also SSE will be beneficial only if your strings are long enough. I'd be
curious to benchmark my implementation against an SSE one on a representative
dataset.

EDIT: the microbenchmark indicates that the bit-trick implementation spends
about 1 cycle per byte (for the entire escaping routine, if there's nothing to
escape).

~~~
glangdale
1 cycle per byte is pretty good, and its properties might be very nice for
small strings.

Assuming you can prep everything in advance, the SIMD version is 3 loads and
about 5 SIMD operations (1 shift, 2 PANDs, 2 PSHUFBs) plus a couple of
transfer/scan operations (PMOVMSKB and whatever bit-scan operation you want)
to eat 32 bytes. I'm figuring 10 operations. Depending on your workload you
might care about reciprocal throughput or latency. Generally our work has been
more throughput oriented so those 10 operations can be scheduled along with
lots of others, but this isn't always the case.
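
In rough C, that sequence looks like this (my sketch of the nibble-table
idea at 16 bytes wide, not Hyperscan's actual code; the tables here only
flag '"' and '\'):

    
    #include <tmmintrin.h>  // SSSE3, for PSHUFB
    #include <stdint.h>
    
    // Classify byte b via tbl_lo[b & 15] & tbl_hi[b >> 4]:
    // bit 0 marks '"' (0x22), bit 1 marks '\' (0x5C).
    static uint32_t special_mask16(const char* p) {
        const __m128i tbl_lo = _mm_setr_epi8(
            0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0);
        const __m128i tbl_hi = _mm_setr_epi8(
            0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        __m128i v  = _mm_loadu_si128((const __m128i*)p);
        // PSHUFB ignores bits 4-6 of each index byte and yields 0 when
        // bit 7 is set, so the low-nibble lookup needs no masking and
        // bytes >= 0x80 can never match.
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
        __m128i t  = _mm_and_si128(_mm_shuffle_epi8(tbl_lo, v),
                                   _mm_shuffle_epi8(tbl_hi, hi));
        // Nonzero bytes are matches; compress to a 16-bit mask for bit scan.
        return (uint32_t)_mm_movemask_epi8(
            _mm_cmpgt_epi8(t, _mm_setzero_si128()));
    }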

The other way SIMD (or, for that matter, bit-twiddling) can be helpful is that
you can cover a wider variety of sizes in a branch-free fashion, so if your
sizes are distributed <32 bytes (or <8 for bit-twiddling) the code can be
straight-line with no branch misses. So a SIMD version of the code might be
"slower" than a bit-twiddling version even if both sides always get 9 bytes,
but if it's a mix of sizes and thus unpredictable, it might take fewer branch
misses (unless some clown starts giving you 33-byte inputs half the time, of
course).
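
For example, the sub-32-byte case can share the full-width code path (a
sketch; it assumes the buffer is padded so loading a full 32 bytes past the
logical end is safe, which is a real constraint):

    
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    
    // Branch-free scan of a short chunk (1 <= len <= 32): do one full
    // 32-byte load, then mask off the lanes past the end.
    static uint32_t special_mask(const char* p, size_t len) {
        __m256i v = _mm256_loadu_si256((const __m256i*)p);
        __m256i hits = _mm256_or_si256(
            _mm256_cmpeq_epi8(v, _mm256_set1_epi8('"')),
            _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\\')));
        uint32_t m = (uint32_t)_mm256_movemask_epi8(hits);
        return m & (0xFFFFFFFFu >> (32 - len));  // keep the first len lanes
    }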

Again, benchmark data is always a good cure for folkloric beliefs whether pro-
SIMD ("SIMD is always fast, yay!") or anti-SIMD ("the overheads are never
worth it").

------
jokoon
People will keep laughing at me when I tell them to just produce a binary,
pre-parsed format for their data.

Parsing will always be slow no matter what you do, and you can hardly
parallelize it. Of course it's fast on a non-mobile CPU, but it will always be
slow on a smartphone, hence the need for such a format; it's part of the
reason smartphones use apps instead of HTML+JS.

It's funny: computers are very fast, but batteries are not so good, so maybe
we're realizing that well-designed software formats have an effect on battery
life.

I've seen several articles about how bloated webpages are, but I don't see
any real proposals for how to fix it.

~~~
vanni
[https://capnproto.org/](https://capnproto.org/)

~~~
Tempest1981
I was about to suggest Protobuf, but it looks like it has evolved into this,
and/or FlatBuffers.

[https://developers.google.com/protocol-buffers](https://developers.google.com/protocol-buffers)

------
Cieplak
Looks like he's fuzzed it, too:

[https://github.com/chadaustin/sajson/tree/master/sajson-fuzz](https://github.com/chadaustin/sajson/tree/master/sajson-fuzz)

~~~
chadaustin
My first exposure to American Fuzzy Lop! What an amazingly easy tool to get
running.

------
nly
I've found that if you're in a situation where JSON parsing performance is
becoming a bottleneck, then data/schema validation bottlenecks are probably
not far behind.

I have several hot paths in server projects where parsing IP addresses,
dates, and other such data types that JSON doesn't support eats cycles.
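
e.g., even something this small shows up in profiles (a trivial example, not
my actual code):

    
    #include <arpa/inet.h>
    
    // JSON has no address type, so IPs arrive as strings and have to be
    // parsed and validated on the hot path.
    int is_ipv4(const char* s) {
        struct in_addr a;
        return inet_pton(AF_INET, s, &a) == 1;
    }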

Or maybe nobody else bothers to validate their data.

------
herf
Nobody has mentioned PCMPISTRI (and the other SSE4.2 string extensions), but
they deserve a benchmark here. Some of them appear in my regular glibc now,
and they're hard to beat.

I found this Intel article for XML (but not JSON):
[https://software.intel.com/en-us/articles/xml-parsing-
accele...](https://software.intel.com/en-us/articles/xml-parsing-accelerator-
with-intel-streaming-simd-extensions-4-intel-sse4)

~~~
chadaustin
PCMPISTRI and friends are great, but they're not very friendly to the problem
described in the article, since they can only check membership in a 16-element
set.

~~~
burntsushi
Eh? PCMPISTRI has a few different modes of operation, including full substring
search and character classes. e.g., you can use PCMPISTRI with a needle that
encodes ranges as adjacent pairs. For example, `azAZ09` would check whether
any byte in the search string falls in any of the ranges a-z, A-Z, or 0-9.

Regardless, in the OP, they're specifically looking for one of a small number
of bytes, which is exactly what PCMPISTRI is supposed to be good for.
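
A sketch of that with the ranges mode (two degenerate ranges, one per byte;
note the implicit-length "i" form treats a NUL in either operand as the end):

    
    #include <nmmintrin.h>  // SSE4.2
    
    // Index of the first '"' or '\' in the 16 bytes at p, or 16 if none.
    static int first_special(const char* p) {
        // Needle: two degenerate ranges, ['"','"'] and ['\','\'];
        // the zero padding terminates the implicit-length needle.
        const __m128i ranges = _mm_setr_epi8(
            '"', '"', '\\', '\\',
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        __m128i hay = _mm_loadu_si128((const __m128i*)p);
        return _mm_cmpistri(ranges, hay,
                            _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES);
    }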

With that said, my experience mirrors glangdale's. Every time I've tried to
use PCMPISTRI, it's either been slower than other methods or not enough of an
improvement to justify it.

~~~
chadaustin
Oh, didn't know that! Thanks.

------
brianolson
This is a cool story of micro-optimization. But now I wonder how it compares
to other serializations entirely (protobuf, Cap'n Proto, CBOR) when not
constrained to JSON.

------
lefty2
If you really want to it to be faster get rid of the DOM. Most projects that
use a parser just transfer the data from the DOM into project specific data
structures. You save memory and performance if to transfer directly the data
into project data structure and bypass the DOM.
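
In other words, a SAX-style interface. A hypothetical sketch (not any
particular library's API):

    
    #include <stddef.h>
    
    // The parser walks the input once and fires these callbacks; the
    // application writes values straight into its own structs, so no
    // intermediate tree (DOM) is ever allocated.
    typedef struct {
        void* ctx;
        void (*on_key)(void* ctx, const char* key, size_t len);
        void (*on_string)(void* ctx, const char* val, size_t len);
        void (*on_number)(void* ctx, double val);
        void (*on_bool)(void* ctx, int val);
    } json_events_t;
    
    int json_parse_events(const char* input, size_t len,
                          const json_events_t* ev);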

------
Aissen
Very interesting. I was looking for a comparison between ujson and
sajson/rapidjson and I stumbled on this benchmark:

[https://github.com/miloyip/nativejson-benchmark](https://github.com/miloyip/nativejson-benchmark)

~~~
chadaustin
Milo Yip (author of RapidJSON) has done fantastic work on validating the
correctness of and benchmarking JSON parsers. The results listed there predate
the current round of sajson optimizations and sajson's dynamic allocation
mode, so it would be great to see them rerun.

------
duozerk
For JSON parsing, I really like JSMN
([https://github.com/zserge/jsmn/](https://github.com/zserge/jsmn/)). It
doesn't allocate memory, has a simple and easy-to-use API, and is pretty fast.
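
Typical usage, if memory serves - the caller supplies the token array, so the
parser itself never touches the heap:

    
    #include <string.h>
    #include "jsmn.h"
    
    int count_tokens(const char* js) {
        jsmn_parser p;
        jsmntok_t tokens[64];  // fixed, caller-owned token budget
        jsmn_init(&p);
        // Returns the number of tokens used, or a negative error code
        // (e.g. JSMN_ERROR_NOMEM if 64 wasn't enough).
        return jsmn_parse(&p, js, strlen(js), tokens, 64);
    }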

~~~
m_eiman
I started out using JSMN for parsing JSON in our embedded devices at work, but
the requirement to pre-allocate all the tokens turned out to be a problem - I
was running out of memory when the configuration files got too large…

So I wrote a new parser that trades lower RAM usage for higher CPU usage:
[https://bitbucket.org/tlabwest/json-reader/overview](https://bitbucket.org/tlabwest/json-reader/overview)

Sample usage:

    
    
      enum { bufsize = 256 };  /* size of the decode buffer */
      
      json_token_t json;
      json_validate(json_string, &json);
      
      json_token_t value;
      json_find(&json, "key1/subkey:2/value1", &value);
      
      char s[bufsize];
      json_decode_string(&value, s, bufsize);
      
      printf("%s\n", s);
      
      json_free(&json);

------
treeform
Where did he get that assembler output from? It uses "jump-if-less" and
"load" instead of "jl" and "mov". Is that Swift/LLVM syntax?

~~~
ma2rten
I think it's pseudo code.

------
Cyph0n
Excellent write-up. I had actually never thought about how keeping pointers in
locals instead of e.g. a class member is an optimization. Also, that goto-
based state machine is wonderful: I'll probably be stealing that for my next
project. I'm curious why you went with that instead of the more common switch
statement approach, though.
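
My mental model of the difference, as a toy sketch (not the article's code):
with a central switch, every character goes back through one dispatch point,
while with labels each state jumps straight to whichever state it knows comes
next.

    
    #include <stddef.h>
    
    // Toy goto-based state machine: count words separated by spaces.
    size_t count_words(const char* p) {
        size_t words = 0;
    between:                      // state: between words
        if (*p == '\0') return words;
        if (*p == ' ') { ++p; goto between; }
        ++words;
    inside:                       // state: inside a word
        ++p;
        if (*p == '\0') return words;
        if (*p != ' ') goto inside;
        ++p;
        goto between;
    }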

------
yawniek
we recently wrote our own json parser for etl style operations on huge json
streams. the idea was that you are still able to define your structs/classes
and annotated members are auto-parsed at full speed. some sse4 optimizations
where used and performance should be on par or faster than rapidJSON (could
not benchmark yet).

[http://code.dlang.org/packages/asdf](http://code.dlang.org/packages/asdf)

as a simple tool you could try je, which is similar to a simple version of jq
but extracts json objects as a csv/tsv table. it's about 10x faster than jq.
[https://github.com/tamediadigital/je](https://github.com/tamediadigital/je)

------
js8
I find it mildly amusing that people are competing on the fastest decoder for
a horribly inefficient format. Don't get me wrong, I like e.g. ZX Spectrum
demos, but the practical utility of such an endeavor is questionable.

But that aside, I wonder how this compares to the compressed indexes
approaches (see
[https://www.youtube.com/watch?v=uA0Z7_4J7u8&t=2850s](https://www.youtube.com/watch?v=uA0Z7_4J7u8&t=2850s)
and
[http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf](http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf)),
obviously, once the index is built.

~~~
dzdt
The practical utility is the reason for such work! JSON is an interchange
format. When you work on a project that interfaces with others, the
interchange format is likely outside of your control. So you have to do the
best you can with that format.

~~~
js8
Yeah, I know, but it was more of a lament. Although - if somebody is producing
so much JSON for you that you need to care about the speed of parsing it,
wouldn't it make sense for both of you to agree on a more efficient format? I
mean, they would probably save money on generating it too..

~~~
Klathmon
Unless you are at either a very large scale, or a really small "embedded"
scale, I think the time spent engineering a different interchange format would
dwarf the savings from it, especially when you consider that the "bottleneck"
of development is almost always the developers, and pulling them off to re-
implement another format is time that they don't get to spend on the actual
product.

~~~
krallja
There already exist formats other than JSON for data exchange. This wouldn't
require reengineering; just integration with an existing library that meets
your licensing needs.

~~~
Klathmon
I'd just like to point out that it's still engineering work to integrate a
library into your code (it's a bit of a sore spot with me... I've been told
"just use a library, it shouldn't take any time at all" in the past, and
apparently I'm still not over it!)

But regardless, it's still a non-zero amount of work, and in my experience
choosing a mutual format isn't always a simple meeting...

Not to mention that if JSON is still needed elsewhere in your system (either
internally, or externally with others) all you've done is add complexity to
your application and spend developer time all for some savings on parsing
speed.

~~~
js8
But you already accepted the cost of this by playing this game - this
particular library is in C, which might be an effort to integrate with a
different programming language.

On the other hand, there are CBOR or MsgPack libraries that are fast and
native for many languages.

------
some_guy2234
This seems like a use case where the Mill CPU architecture would work quite
well.

With the Mill's loop pipelining and vector ops, this could be pretty well
optimized.

~~~
Veedrac
Pretty sure you could do a single-instruction vectorized version of the C code
on Gold with 2c latency, yeah.

------
kevindqc
What is a critical execution path when talking about a CPU? And why is it
only the "increment p" part?

