
Simdjson 0.3: Faster JSON parser - ngaut
https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/
======
KenanSulayman
This is a great project -- we use a Rust port of this [1] following the
general principles to process the bunyan logs of hundreds of node containers
with a single machine. I'd love to see AVX512 support (mainly to get more out
of the ridiculously overpriced AWS instances..) [2].

[1] [https://github.com/simd-lite/simd-json](https://github.com/simd-lite/simd-json)

[2] [https://github.com/simdjson/simdjson/issues/10](https://github.com/simdjson/simdjson/issues/10)

~~~
rurban
Better switch to AMD. No AVX512 but much faster. And safer.

~~~
maallooc
>faster and safer

Citation needed.

~~~
Ygg2
The Spectre and Meltdown mitigations reduce Intel's performance by quite a lot.
[https://www.phoronix.com/scan.php?page=article&item=linux-41...](https://www.phoronix.com/scan.php?page=article&item=linux-419-mitigations&num=1)

And seeing how AMD already makes faster and better CPUs in general (I mean the
latest AMD Ryzen 3000 vs the Intel 9000 series, at the time of writing), I'm
not sure what about that needs a citation.

~~~
eyegor
Pretty sure the op was asking for a citation specifically about aws nodes.
Should be simple enough to take a test workload on Intel vs amd nodes and
measure things (besides "safety" which is too nebulous to measure). Consumer
chips aren't entirely relevant.

~~~
Ygg2
I'm pretty sure I linked a server chip comparison. And the same people who
design consumer chips design server chips, so there will be a lot of
"cross-pollination" between architectures. We've definitely seen a lot of
Intel consumer and server-grade parts hit by similar bugs.

Safety can be roughly estimated by the number of critical vulnerabilities.
Most have hit Intel, some being Intel-specific bugs.

~~~
wahern
But the mitigations in AWS' VM monitor might be a fixed cost--i.e. not
disabled on AMD architectures or otherwise not disadvantaging Intel. For
example, they might use techniques like page coloring as an alternative to
cache flushing and prediction suppression, which wouldn't disadvantage Intel,
might not impose any significant overhead at all, and might even improve
performance in shared hosting environments.

Alternatively, maybe their support for AMD is nascent and optimizations for
Intel absent on AMD.

There's really no substitute for benchmarking AWS' actual offerings.

------
SloopJon
Here's the discussion of the initial release:

[https://news.ycombinator.com/item?id=19214387](https://news.ycombinator.com/item?id=19214387)

The inability to handle embedded NULs (issue #40) felt like cheating, but that
has been fixed.

It looks like I can fetch a number as a string, so I can read into a decimal
type. I'll have to try that. Edit: nope, doesn't work. "The JSON element does
not have the requested type."

------
umvi
So it achieves the speed gains by using simd CPU instructions. Do most
architectures support simd (including arm)?

Also, iirc the biggest negative of this library last time was that the API was
super unintuitive (compared to, say, nlohmann). Glad to see progress has been
made here.

Would love to see a quality python wrapper for this library.

~~~
jkeiser
Worth pointing out: I thought it was just the SIMD that made it fast when I
first got involved. It turns out that while it helps, it's just a tool that
helps to achieve the real gain: eliminating branches (if statements) that make
the processor stumble and fall in its attempts to sprint ahead (speculative
execution). simdjson's first stage pushes this capability really far towards
its limit, achieving 4 instructions per cycle by not branching. And yes, 1
cycle is the smallest amount of time a single instruction can take. Turns out
a single thread is running multiple instructions in parallel at any given
time, as long as you don't trip it up!

Parsing is notoriously serial and branchy, which is what makes simdjson so out
of the ordinary. It's using a sort of "microparallel algorithm," running a
small parallel parse on a SIMD-sized chunk of JSON (16-32 bytes depending on
architecture), and then moving to the next.
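The "microparallel" idea can be sketched, minus the actual SIMD intrinsics, as
a branchless classification pass over a fixed-size chunk. This is a
hypothetical illustration of the principle, not simdjson's real code (which
does the same comparisons with vector instructions):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch (not simdjson's actual code): classify a 32-byte chunk
// of JSON into a bitmask of structural characters ({}[]:,) with no branches.
// Real implementations do these comparisons with SIMD instructions, a whole
// chunk at a time.
uint32_t structural_mask(const char* chunk) {
    uint32_t mask = 0;
    for (std::size_t i = 0; i < 32; i++) {
        char c = chunk[i];
        // Each comparison yields 0 or 1; OR-ing them together is a
        // branch-free membership test.
        uint32_t is_structural =
            (c == '{') | (c == '}') | (c == '[') |
            (c == ']') | (c == ':') | (c == ',');
        mask |= is_structural << i;
    }
    return mask;
}
```

The loop body contains no data-dependent branch, so the CPU can keep its
pipeline full regardless of what the input looks like.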

And yeah, you have to go back over a decade to find CPUs that don't have SIMD.
simdjson runs on those too, just obviously doesn't use SIMD instructions :)

~~~
glangdale
This is a good explanation of why it's fast.

An interesting point is that the design of simdjson loses its branchlessness
in "stage 2". I originally had a bunch of very elaborate plans to try to stay
branchless far further into the parsing process. It proved just too hard to
make work. There are some promising things that ultra-modern Intel chips -
meaning Icelake - and future iterations of ARM (SVE/SVE2) are adding to their
SIMD abilities, so it might be worth revisiting this in a few years (there
aren't too many Icelake boxes out there yet, and SVE barely exists).

~~~
jkeiser
Yep. Most of stage 2's branchiness essentially comes from "is the next thing
an array? object? string? number? Handle it differently if so."

Making it so you can handle all the brackets at once, all the strings at once,
all the numbers at once, would make a big difference, and we're thinking about
that. Another thing that could help is making the if statement more
predictable using type information from the user. get<int>() could mean "I
expect this next thing to be an integer, so parse it that way and just yell if
it's not, please."
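That "parse it the way I expect and yell if it's not" idea might look
something like the following. This is a hypothetical sketch of the concept,
not simdjson's real API; `element` and its `raw` field are made up for
illustration:

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical sketch of type-directed parsing (NOT simdjson's real API).
// Instead of branching on the value's type tag, the caller asserts the type
// up front; we parse on that assumption and only the (hopefully rare)
// mismatch path pays for a check.
struct element {
    std::string raw;  // the unparsed JSON text for this value

    template <typename T> T get() const;
};

template <> int element::get<int>() const {
    char* end = nullptr;
    long v = std::strtol(raw.c_str(), &end, 10);
    if (end == raw.c_str() || *end != '\0')
        throw std::runtime_error("The JSON element does not have the requested type.");
    return static_cast<int>(v);
}
```

The point of the design is that the common path is straight-line integer
parsing, with the type check folded into the parse itself.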

It's difficult. But it's why I'm still so fascinated! Solving JSON thoroughly
and completely will give us a _lot_ of information on how to quickly parse
XML, YAML, and other file formats.

We've clearly been collectively doing parsing wrong (including me) if there's
this much of a gap. It's exciting to see something innovative and new in this
domain, and even to be able to contribute to it :) @lemire deserves a ton of
credit for making an actual project out of his work and promoting it; I likely
wouldn't have heard of it otherwise.

~~~
glangdale
That's right. There's a problem which I referred to as the toothpaste problem
- you can squeeze the needed branchiness _around_ but you can't make it go
away entirely (at least, I couldn't).

There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes,
stage 3 was the tape construction and stage 4 was the 'clean everything up and
do all the hard stuff'). It's possible - though awkward - to do tape
construction branchlessly, but the gyrations required were expensive and weird
and it just delayed the reckoning.

I built a prototype of the 'gather all the X at once and handle it in one go'
and the logic to gather that stuff was more expensive than just handling
everything.

In my wacky world of branch-free coding (which I've been doing a while) there
are worse things than missing a branch. The idea that you can accumulate an
array branchlessly (i.e. always put the thing you have in a location somewhere
and bump a pointer when you need to) seems pretty cool, but a branch miss is
not the only hazard. This technique of branchlessly writing a log is an
anti-pattern I've tried over and over again, and a stream of unpredictable
writes is just as big a pain as a bunch of unpredictable branches - it causes
the pipeline to come to a screaming halt. If you can get it going somehow (new
SIMD tricks? a better algorithm?) I'd be impressed.
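For readers unfamiliar with the pattern being described, the "write
unconditionally, advance conditionally" log accumulation looks roughly like
this (a hypothetical illustration, not simdjson code):

```cpp
#include <cstddef>
#include <cstdint>

// Branchless log accumulation: collect the indices of all 'x' characters in
// buf. Every iteration writes to *p, but the output pointer only advances
// when the predicate holds, so there is no data-dependent branch. Note that
// `out` must have room for `len` entries, since any slot may be scribbled on.
std::size_t gather_indices(const char* buf, std::size_t len, std::uint32_t* out) {
    std::uint32_t* p = out;
    for (std::size_t i = 0; i < len; i++) {
        *p = static_cast<std::uint32_t>(i);  // unconditional write
        p += (buf[i] == 'x');                // adds 0 or 1, no branch
    }
    return static_cast<std::size_t>(p - out);
}
```

This removes the branch miss, but — as the comment above points out — it
trades it for a stream of unpredictable writes, which can stall the pipeline
just as badly.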

Get my email from Daniel or hit me up in Twitter DMs if you want to discuss
further.

------
drej
I recommend the author's talk about this project (and his blog is a good
resource).
[https://www.youtube.com/watch?v=wlvKAT7SZIQ](https://www.youtube.com/watch?v=wlvKAT7SZIQ)

------
xakahnx
I like string processing libraries that implement a multi-document feature
like the one mentioned here. There's always some efficiency to be gained-
maybe the public API has a lot of branching or initialization, maybe it
acquires a lock, etc. Batching will amortize that cost, or open up new
opportunities for SIMD processing. Letting the user reduce overhead through
batching isn't something I see supported in other libraries, even ones that
advertise high performance.

~~~
hnlmorg
ndjson isn't really batching though, it's a derivative of JSON (or a
superset?) and there are a few variations of it[1][2][3].

It's also supported by quite a few libraries[4] and tools[5] -- plus many
others that don't document it as being ndjson/jsonl (such as AWS CloudTrail
logs). I've got support for it in the non-POSIX UNIX/Linux shell I've written
too[6].

The issue is really more that, as with comments, JSON doesn't support
streaming or even multiple documents (something formats like YAML already
have built into the spec), so it becomes something you need to advertise if
you're writing a parser, simply because it's not part of the standard JSON
specification.

[1] [https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_...](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON)

[2] [http://ndjson.org/](http://ndjson.org/)

[3] [http://jsonlines.org/](http://jsonlines.org/)

[4] [http://ndjson.org/libraries.html](http://ndjson.org/libraries.html)

[5] [http://jsonlines.org/on_the_web/](http://jsonlines.org/on_the_web/)

[6]
[https://murex.rocks/docs/types/jsonl.html](https://murex.rocks/docs/types/jsonl.html)
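For concreteness, "one complete JSON document per line" means a consumer can
split on newlines and hand each line to any ordinary parser. A minimal sketch
(hypothetical helper, not tied to any particular library):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split an ndjson / JSON Lines payload into individual JSON documents.
// Each newline-terminated line is one standalone document; actually parsing
// each one is left to whatever JSON library you prefer.
std::vector<std::string> split_ndjson(const std::string& input) {
    std::vector<std::string> docs;
    std::istringstream in(input);
    std::string line;
    while (std::getline(in, line))
        if (!line.empty())
            docs.push_back(line);  // one complete JSON document per entry
    return docs;
}
```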

~~~
xakahnx
Good point. I was picturing some gather/scatter over strings which are not in
adjacent memory (maybe a generous interpretation for my use-case).
Concatenating small strings into ndjson may still come out ahead performance-
wise.

------
rurban
What I don't like about that approach is that it still copies into temp
buffers. The normal tape approach is to reference everything via int indices
and keep the input buffer as-is. That also works concurrently and with
streaming. I don't see the need to copy at all. You don't need \0 delimiters
for strings; just use the mem* APIs everywhere, since all sizes are known.

And API-wise, filter hooks are needed to skip unneeded arrays or objects. This
would also save a lot of unneeded buffering.
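The "reference via indices, no \0 delimiters" idea can be sketched like this
(a hypothetical layout for illustration, not any library's actual tape
format):

```cpp
#include <cstdint>
#include <cstring>

// Tape-of-indices sketch: instead of copying each string into a temp buffer,
// record (offset, length) into the original input and compare with memcmp.
// No '\0' terminators are needed because every size is known.
struct string_ref {
    std::uint32_t offset;
    std::uint32_t length;
};

bool ref_equals(const char* input, string_ref r, const char* key) {
    return std::strlen(key) == r.length &&
           std::memcmp(input + r.offset, key, r.length) == 0;
}
```

Since the references are plain integers into an immutable buffer, multiple
threads can consume the same input concurrently without coordination.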

~~~
eska
They're using those \0 to find bit patterns with SIMD instructions. Their
approach would fundamentally not work if they were unable to modify the input.

~~~
jkeiser
simdjson does not modify the input, fyi. Rather than \0, which isn't really a
part of JSON, the SIMD instructions search for whitespace and JSON structure
like commas, colons, etc.

~~~
eska
Yeah, sorry, I didn't feel like giving some guy who hasn't even read the
article a full explanation. However, I saw in a presentation on YouTube that
these characters (e.g. ") are replaced in order to then apply SIMD
instructions. This implies copies and allocations. But these are necessary,
and it would not be possible to achieve this with indices.

------
nojvek
This is really sick. Love it!

Although I would really love to see a more zero-parse binary format. At the
end of the day, SIMD is just making markers in the original file of where an
array begins, an object begins, each of the object key/value ranges, etc. --
i.e. the find marks -> build tape algorithm.

This could be achieved by having length-prefixed values, kinda like
FlatBuffers, which has zero parse time and is much more compact than JSON.
[https://google.github.io/flatbuffers/](https://google.github.io/flatbuffers/)

I just wish json had a more universal alternative binary format which was
insanely fast and compact to work with.

Protobufs are nice but parsing large files is still a pain and consumes a ton
of memory. Ideally a format that could simply be lazily mmap-ed would be
desirable.

---

> The work is supported by the Natural Sciences and Engineering Research
> Council of Canada under grant number RGPIN-2017-03910.

Well, thank you Canada!

------
code_biologist
Haven't actually used simdjson, but the love and care put into the API is
evident. Looks great!

------
dirtydroog
Did the rapidjson profile use insitu parsing?

~~~
jkeiser
Yep. You can see the raw numbers at
[https://simdjson.org/about/](https://simdjson.org/about/).

simdjson UTF8-validation, exact numbers: 2.5 GB/s

RapidJSON insitu, UTF8-validation: 0.41 GB/s

RapidJSON insitu, UTF8-validation, exact numbers: 0.39 GB/s

RapidJSON UTF8-validation: 0.29 GB/s

RapidJSON UTF8-validation, exact numbers: 0.28 GB/s

------
drewm1980
Does orjson belong in their comparison?

~~~
TkTech
orjson generally doesn't even compete in the same order of magnitude. The
round-trip cost of object creation in CPython is pretty much constant given
the number of objects and strings in a JSON document. The overhead is
overwhelming: 95% of the total execution time vs the actual parsing of the
document.

------
tmikaeld
anyone know what the V8 engine uses?

~~~
fulafel
Most JSON parses are small. Does this have bigger spin-up overhead? Also, the
memory-handling requirements of working with GC'd JS data are very particular;
this sounds like it has its own way of managing memory.

~~~
jkeiser
simdjson's spinup overhead is minimal; it runs at a pretty steady rate no
matter the size of the document:
[https://github.com/simdjson/simdjson/blob/master/doc/growing...](https://github.com/simdjson/simdjson/blob/master/doc/growing.png)

It's also still faster than other parsers in our tests, even on the smallest
documents
([https://github.com/simdjson/simdjson/issues/312#issuecomment...](https://github.com/simdjson/simdjson/issues/312#issuecomment-597223884)) --
its advantage is just smaller, like 1.1x and 1.2x instead of 2.5x :) It really
starts to pull ahead somewhere between 512 bytes and 1K.

------
wingi
Are you sure you support the JSON standard rather than just offering the best
speed? [http://seriot.ch/json/parsing.html](http://seriot.ch/json/parsing.html)

~~~
jkeiser
Yep, simdjson is a fully compliant, validating JSON parser, up to and
including full UTF-8 validation. That's part of what makes its speed so eerie.

------
marknadal
Holy Cow! 2.5GB/s that is amazing.

Meanwhile I can barely get Chrome/NodeJS to parse 20MB in less than 100ms :(.

How useful (or useless) would Simdjson as a Native Addon to V8 be? I assume
transferring the object into JS land would kill all the speed gains?

I wrote my own JSON parser just last week, to see if I could improve the
NodeJS situation. Discovered some really interesting factoids:

(A) JSON parse is CPU-blocking, so if you get a large object, your server
cannot handle any other web request until it finishes parsing, this sucks.

(B) At first I fixed this by using setImmediate/shim, but discovered two
annoying issues:

(1) Scheduling too many setImmediates will cause the event loop to block at
the "check" cycle, you actually have to load balance across turns in the event
loop like so
([https://twitter.com/marknadal/status/1242476619752591360](https://twitter.com/marknadal/status/1242476619752591360))

(2) Doing the above will cause your code to be way slow, so a trick instead
is to skip setImmediate and invoke your code 3333 times (some divisor of
NodeJS's ~11K stack depth limit) or for 1ms before doing a real setImmediate.

(C) Now that we can parse without blocking, our parser's while loop
([https://github.com/amark/gun/blob/master/lib/yson.js](https://github.com/amark/gun/blob/master/lib/yson.js))
marches X byte increments at a time (I found 32KB to be a sweet spot, not sure
why).

(D) I'm seeing this pure JS parser be ~2.5X slower than native for big complex
JSON objects (20MB).

(E) Interestingly enough, I'm seeing 10X~20X faster than native for parsing
JSON records that have large values (e.g. an embedded image).

(F) Why? This happened when I switched my parser to skip per-byte checks when
encountering `"`, jumping to the next indexOf instead. So it would seem V8's
built-in JSON parser is still checking every character for a token, which
slows it down?

(G) I hate switch statements, but woah, I got a minor but noticeable speed
boost going from if/else token checks to a switch statement.

Happy to answer any other Qs!

But compared to OP's 2.5GB/s parsing?! Ha, mine is a joke.

~~~
bjoli
I did a small benchmark the last time simdjson was up for discussion, and
back then it was faster than /bin/cat on my machine.

~~~
mianos
This comment was right at the bottom. It was so funny I just spit my coffee.

~~~
bjoli
the thing is, it really was faster than gnu cat. I suspect it is because gnu
cat does other things than just use Linux splice to a file descriptor -- it
has options to count lines and such, and doesn't (didn't?) bother to use SSE.
I just thought cat would give me a practical maximum to compare to when
reading from disk.

