
Simdjson: Parsing gigabytes of JSON per second - lorenzfx
https://github.com/simdjson/simdjson
======
dang
A few months ago:
[https://news.ycombinator.com/item?id=22745351](https://news.ycombinator.com/item?id=22745351)

2019:
[https://news.ycombinator.com/item?id=19214387](https://news.ycombinator.com/item?id=19214387)

------
kasperni
I would consider Daniel Lemire (the main author) quite an authority on the
practical use of vectorization (SIMD). He is a computer science professor at
Université du Québec, and he is also behind the popular Roaring Bitmaps
project [1]. You can check out his publication list here [2].

[1] [https://roaringbitmap.org/](https://roaringbitmap.org/)

[2] [https://lemire.me/en/#publications](https://lemire.me/en/#publications)

~~~
sillysaurusx
Heh, I like the tagline: "Grab one of our research papers". It's very
readable, too!
[https://arxiv.org/pdf/1603.06549.pdf](https://arxiv.org/pdf/1603.06549.pdf)

Looks like the last active discussion of Roaring Bitmaps was 6 years ago:
[https://news.ycombinator.com/item?id=8796997](https://news.ycombinator.com/item?id=8796997)
possibly when it was first introduced. Interesting comments!

> _How do these compare space and performance wise with Judy arrays, which are
> 256-ary trees whose nodes also distinguish between sparse and dense
> subsets?_
> [https://en.wikipedia.org/wiki/Judy_array](https://en.wikipedia.org/wiki/Judy_array)

 _Good question, since the patent on them won't expire until Nov 29, 2020._

Judy arrays seem to be much less well known. Looks like today will be Data
Structure Thursday; lots of neat stuff to dig into.

~~~
c0l0
I used to work at a small-ish (but very profitable) European company whose
"crown jewel", in terms of in-house developed technology, was a distributed
in-memory search/database engine, written in hand-crafted, multi-threaded C by
a true coding wizard and industry veteran. It used Judy arrays at its heart
for the very first filtering/processing stage of queries, and did so to
amazing effect. It's quite spectacular what you could and can do with libjudy
if you use it wisely :)

I always wondered whether relatively recent changes in x86 CPU µarchs broke
some of the assumptions that made libjudy perform so well even on
Pentium-class hardware, and whether the de facto only complete implementation
(that I happen to be aware of) could be improved as a consequence, if those
changes (e.g., hugely increased cache sizes) were taken into account.

~~~
Bootvis
What were you working on?

------
dan-robertson
Gigabytes per second can be a worrying statistic. It suggests that the
benchmarks would be parsing massive JSON files rather than the small ones that
real-world applications deal with.

However, this library maintains roughly constant throughput for both small
(e.g. 300-byte) and large documents, if its benchmarks are accurate.

~~~
userbinator
Indeed, I feel like much of this is self-inflicted and could be avoided
completely if only more developers would learn to use far more compact and
efficient binary formats instead, where parsing (if it could even be called
that) is just a natural consequence of reading the data itself.

One of my favourite examples of this is numerical values, which is notably
called out in the performance results for this library as being one of the
slowest paths.

In a textual format you need to read digits and accumulate them, and JSON
allows floating-point which is even more complex to correctly parse.

In a binary format, you read _n_ bytes and interpret them as an integer or a
floating-point value directly. One machine instruction instead of dozens, if
not hundreds or more.
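
For instance, a minimal sketch of what reading a number from a binary format
looks like (assuming the producer wrote a little-endian IEEE-754 double and
the reader runs on a little-endian machine):

    #include <cstdint>
    #include <cstring>

    // "Parsing" is just a byte copy; on x86 this compiles to a single load.
    double read_f64(const uint8_t* p) {
      double v;
      std::memcpy(&v, p, sizeof v);
      return v;
    }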

I see such inefficiencies so often that I wonder if most developers even know
how to read/write binary formats anymore...

"The fastest way to do something is to not do it at all."

~~~
beached_whale
Fun fact: JSON's number type isn't floating point. It may map to it, but the
limits on the numbers that are representable are defined by the encoder and
decoder.

~~~
cbsmith
Which creates this lovely reality where you effectively have numbers that can
be accurately represented in IEEE float but can't be represented accurately in
decimal notation, and numbers that can be represented accurately in decimal
notation but not in IEEE float. All of those cases are dicey with JSON.

~~~
beached_whale
Curious, what number is representable in IEEE float but not as a decimal?

~~~
nayuki
+Infinity, -Infinity, NaN (quiet, signaling, miscellaneous).

~~~
beached_whale
Well, those are all error conditions and not a number. NaNs explicitly say
they aren't, and ±Inf may or may not be infinity; all we know is that it is
outside the representable range of IEEE floating point.

~~~
derefr
IMHO, code should check for floating-point overflow just like it checks for
integer overflow. If your inputs are finite and your output is infinite,
that's similar to your inputs being positive and your output being negative:
you should check for it and emit an error instead. Then infinity (like
negative numbers) can have an actual semantic meaning, rather than just being
the detritus of an error condition.
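
A minimal sketch of that kind of check, with a hypothetical helper:

    #include <cmath>
    #include <stdexcept>

    // Finite inputs with an infinite result signal overflow, analogous to a
    // signed-integer overflow check; a deliberate Infinity input passes through.
    double checked_add(double a, double b) {
      double r = a + b;
      if (std::isinf(r) && std::isfinite(a) && std::isfinite(b))
        throw std::overflow_error("floating-point overflow");
      return r;
    }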

------
avian
The GitHub page links to a video that explains some of the internals [1]. Can
someone comment on the result that they show at 14:26?

My understanding is that they run code that does 2000 branches based on a
pseudo-random sequence. Over around 10 runs of that code, the CPU supposedly
learns to correctly predict those 2000 branches, and the performance steadily
increases.

Do modern branch predictors really have the capability to remember an exact
sequence of the past 2000 decisions on the same branch instruction? Also, why
would the performance increase incrementally like that? I would imagine that
it would remember the loop history on the first run and achieve maximum
performance on the second run.

I doubt that there's really a neural net in the silicon doing this as the
author speculates.

[1] [https://youtu.be/wlvKAT7SZIQ?t=864](https://youtu.be/wlvKAT7SZIQ?t=864)
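
The experiment is easy to approximate at home. A rough sketch (not the
author's actual benchmark; note the compiler may turn the branch into a
conditional move and hide the effect):

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
      std::mt19937 rng(42);
      std::vector<int> bits(2000);
      for (auto& b : bits) b = rng() & 1;  // one fixed pseudo-random sequence

      volatile long sink = 0;
      for (int pass = 0; pass < 10; pass++) {  // same sequence every pass
        auto t0 = std::chrono::steady_clock::now();
        long acc = 0;
        for (int b : bits) {
          if (b) acc += 3;  // the branch the predictor must learn
          else   acc -= 1;
        }
        auto t1 = std::chrono::steady_clock::now();
        sink += acc;
        std::printf("pass %d: %lld ns\n", pass, (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
      }
      return (int)(sink & 1);
    }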

~~~
WJW
Modern branch predictors absolutely have neural nets these days. Check out,
for example, this 2001 IEEE paper on "dynamic branch prediction using neural
networks":
[https://ieeexplore.ieee.org/document/952279](https://ieeexplore.ieee.org/document/952279)
or this 2007 patent using "conditional bias" (it beats around the bush, only
explicitly naming neural networks once, but it's clear what it is about if you
read between the lines):
[https://patents.google.com/patent/WO2009066063A1/en](https://patents.google.com/patent/WO2009066063A1/en).

There is a wealth of patents and articles if you search for "neural net branch
prediction patent".

~~~
heavenlyblue
Can you clear up whether you mean “there’s a paper that describes using neural
nets” or “there’s an actual modern CPU that uses neural nets for branch
prediction”?

~~~
avian
AMD claimed back in 2016 to have a neural net predictor [1]. I found this
slide by following references in the Wikipedia article that rowanG077 linked
below.

[1] [https://cdn.arstechnica.net/wp-content/uploads/sites/3/2016/...](https://cdn.arstechnica.net/wp-content/uploads/sites/3/2016/12/AMD-Zen-December-2016-Update_Final-For-Distribution-18-980x551.jpeg)

~~~
bXVsbGVy
> Anticipate future decisions, pre-load instruction, chose best path through
> the CPU.

This slide is about speculative execution, not necessarily about branch
prediction.

~~~
brianush1
Pre-loading instructions doesn't mean they're actually getting executed. It
sounds like it's talking about branch prediction.

------
rattray
For other folks interested in using this in Node.js: `simdjson.parse()` is
currently slower than `JSON.parse()` due to the way C++ objects are converted
to JS objects. It seems the same problem affects a Python implementation as
well.

Performance-sensitive JSON-parsing Node users must do this instead:

    require("simdjson").lazyParse(jsonString).valueForKeyPath("foo.bar[1]")

[https://github.com/luizperes/simdjson_nodejs/issues/5](https://github.com/luizperes/simdjson_nodejs/issues/5)
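
For reference, here's roughly what the same lookup looks like against the C++
library directly (a minimal sketch per the simdjson docs; `at_pointer` is the
JSON Pointer accessor in recent releases, and API details vary by version):

    #include <cstdint>
    #include <iostream>
    #include "simdjson.h"
    using namespace simdjson;

    int main() {
      dom::parser parser;
      dom::element doc;
      // parse() wants padded input; the _padded literal supplies the padding
      auto error = parser.parse(R"({"foo":{"bar":[10,20]}})"_padded).get(doc);
      if (error) { std::cerr << error << "\n"; return 1; }
      int64_t v;
      error = doc.at_pointer("/foo/bar/1").get(v);  // like foo.bar[1] above
      if (error) { std::cerr << error << "\n"; return 1; }
      std::cout << v << "\n";                       // prints 20
    }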

~~~
flywheel
Thanks for posting this; I've been down that road. I regularly have to parse
about 20 GB of JSON split into 8 MB JSON files. I tried this library but was
sad that it didn't help. I'm currently using threading in Node.js, and that
has helped quite a bit: parsing up to 8 of those files at a time has given me
quite a performance boost, but I always want to do it faster. Switching to C
just isn't really a viable option, though.

~~~
bufferoverflow
You can write a small program in C just for the parsing part. Then call it
from Node.

------
dmitryminkovsky
SQLite can seemingly parse and process gigabytes of JSON per second. I was
pretty shocked by its performance when I tried it out the other month. I ran
all kinds of queries on JSON structures and it was so fast.
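
For anyone curious, SQLite's JSON1 functions are easy to try from C++ (a
minimal sketch; json_extract pulls a single path out of a JSON string):

    #include <cstdio>
    #include <sqlite3.h>  // needs SQLite built with the JSON1 extension

    static int print_row(void*, int argc, char** argv, char**) {
      for (int i = 0; i < argc; i++)
        std::printf("%s\n", argv[i] ? argv[i] : "NULL");
      return 0;
    }

    int main() {
      sqlite3* db = nullptr;
      sqlite3_open(":memory:", &db);
      // json_extract() parses the JSON text and pulls out one path
      sqlite3_exec(db,
          "SELECT json_extract('{\"foo\":{\"bar\":[10,20]}}', '$.foo.bar[1]');",
          print_row, nullptr, nullptr);
      sqlite3_close(db);
    }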

------
burtonator
An idea I had a few years ago, which someone might be able to run with, is to
develop new charsets based on the underlying data, not just some arbitrary
numerical range.

The idea is that characters that are more common in the underlying language
would be represented as lower integers and then varint-encoded, so that the
data itself is smaller.

I did some experiments here and was able to compress our data by 25-45% in
many situations.

There are multiple issues here, though. If you're compressing the data anyway,
the storage win may not be as big, but it can still pay off when you need to
decode the data back into its original text.
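
If I understand the idea, a minimal sketch looks something like this (working
on bytes for brevity; rank_of is a hypothetical table that would be built
offline from corpus frequency counts):

    #include <cstdint>
    #include <string>

    // Hypothetical: maps each byte to its frequency rank in the corpus
    // (rank 0 = most common character, so it gets the shortest code).
    static uint32_t rank_of[256];

    // LEB128-style varint: low ranks (common characters) fit in one byte.
    static void put_varint(std::string& out, uint32_t v) {
      while (v >= 0x80) { out += char(v | 0x80); v >>= 7; }
      out += char(v);
    }

    std::string encode(const std::string& text) {
      std::string out;
      for (unsigned char c : text) put_varint(out, rank_of[c]);
      return out;
    }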

~~~
beached_whale
This sounds a lot like Huffman coding/Gray codes, the basis of the LZ family
of compression. No need to change the character encoding; just build a table
for text and use that for encoding/decoding. Or, for better results, build it
from the frequency of use in the document and store the table. This is
gzip/pkzip...

~~~
derefr
The parent is suggesting something complementary to compression. An altered
character encoding (even if just internal to the software, like UTF-16 is to
Windows) would mean that strings would be smaller not just on disk/on the
network, but _in memory_, while held in random-access mutable form. This might
be a win for heavy text-manipulation systems like Elasticsearch.

~~~
cheerlessbog
And not so good if you want O(1) indexing.

~~~
derefr
True for the GP's suggestion of a varint encoding, but also true for UTF-8
(which _is_ a varint encoding.) So that's not much of a loss; we're already
biting this bullet.

Still, though, you could have a _fixed-size_ encoding that could _still_ be
more compact than UTF-8, _if_ you limited what it could encode (and then held
either it, or UTF-8 text, in a tagged union, as an ADT wrapped with an API of
string operations that will implicitly "promote" your limited encoding to
UTF-8 if the other arg is UTF-8, the same way integers get "promoted" to
floats when you math them together.)

Then your limited-encoding text could hold and manipulate e.g. ASCII, or
Japanese hiragana and katakana, or APL, or whatever else your system mostly
holds, as a random-access array of single-octet codepoints; until something
outside of that stream comes up, at which point you get UTF-8 text instead and
your random-access operations become shimmed by seq-scans.

(Or you get a _rope_ with both UTF-8 strings and limited-encoding strings as
leaf nodes!)

Of course, if you didn't catch it, I'm talking about going back to having code
pages. :) Just, from a perspective where everything is "canonically" UTF-8 and
code pages are an internal optimization within your string ADT; rather than
everything "canonically" being char[] of the system code page.
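
A minimal sketch of that promotion scheme, assuming the compact encoding is
Latin-1 (all names here are hypothetical):

    #include <string>

    // Hypothetical string ADT: compact 1-byte-per-codepoint text when
    // possible, promoted to UTF-8 when the two representations mix.
    struct Str {
      enum class Enc { Compact, Utf8 } enc;
      std::string bytes;
    };

    // Re-encode Compact (Latin-1 here) as UTF-8.
    Str promote_to_utf8(const Str& s) {
      Str out{Str::Enc::Utf8, {}};
      for (unsigned char c : s.bytes) {
        if (c < 0x80) out.bytes += char(c);
        else {
          out.bytes += char(0xC0 | (c >> 6));
          out.bytes += char(0x80 | (c & 0x3F));
        }
      }
      return out;
    }

    Str concat(Str a, Str b) {
      if (a.enc == b.enc) return {a.enc, a.bytes + b.bytes};  // fast path
      if (a.enc == Str::Enc::Compact) a = promote_to_utf8(a);
      if (b.enc == Str::Enc::Compact) b = promote_to_utf8(b);
      return {Str::Enc::Utf8, a.bytes + b.bytes};
    }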

------
fastball
And if you're looking for a fast JSON lib for CPython, orjson [1] (written in
Rust) is the best I've found.

[1]
[https://github.com/ijl/orjson#performance](https://github.com/ijl/orjson#performance)

~~~
jefurii
There is a Python wrapper for simdjson:
[https://github.com/TkTech/pysimdjson](https://github.com/TkTech/pysimdjson)

It looks like pysimdjson's biggest performance gain compared to e.g. orjson is
when you can cherry-pick single values out of the JSON and avoid deserializing
the whole document.

~~~
welder
I've tested parsing complete documents with rapidjson vs. pysimdjson vs.
simplejson on production data. Surprisingly, the mean and p90 times are
exactly the same for all three libraries. I'm not re-using the simdjson
parser, just doing simdjson.loads().

~~~
TkTech
The time is overwhelmingly the cost of conversion to CPython objects, 95-99%
of the total time. The actual document parsing to tape is always much faster
than rapidjson. No CPython JSON library can get away from this overhead.

You see a real benefit when you don't need or want the entire document, and
can use the JSON pointer or proxy object interface.

------
hellofunk
I never thought I’d write this, but we have officially entered a golden age
for C++ JSON utils. They are everywhere, and springing up right and left. It
is a great time to be alive.

~~~
emerged
I wrote my own little JSON lib years ago which was just a recursive std::map.
It was so disappointingly slow. Bottlenecked the whole app.

~~~
saagarjha
Perhaps std::unordered_map would have helped slightly.

------
grandinj
Just noting that this library requires that you are able to hold your expanded
document in memory.

I needed to parse a very, very large JSON document and pull out a subset of
the data, which didn't work because it exceeded available RAM.

~~~
wenc
Any possibility of using mmap?

[https://en.wikipedia.org/wiki/Mmap](https://en.wikipedia.org/wiki/Mmap)

~~~
grandinj
Not when you exceed the size of the pagefile/swapfile :-)

But it's a good project, otherwise.

~~~
cma
mmap can exceed pagefile/swapfile + RAM if you are mapping a file and not
anonymous pages.

~~~
grandinj
Yeah. In my situation, (a) the file was 10x RAM, and (b) when this library
parses JSON it creates an in-memory tree representing the file that is
approximately 16x the size of the JSON file.

Consequently, it swapped so hard I was never going to finish processing. So I
used a streaming parser instead, and it finished in minutes.
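
For reference, a streaming (SAX-style) parse keeps memory constant no matter
how big the file is. A minimal sketch using RapidJSON's Reader (the parent
doesn't say which streaming parser was actually used):

    #include <cstdio>
    #include "rapidjson/filereadstream.h"
    #include "rapidjson/reader.h"

    // The handler sees one event at a time; the document is never in memory.
    struct CountStrings
        : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, CountStrings> {
      size_t n = 0;
      bool String(const char*, rapidjson::SizeType, bool) { ++n; return true; }
    };

    int main() {
      FILE* f = std::fopen("huge.json", "rb");
      if (!f) return 1;
      char buf[1 << 16];
      rapidjson::FileReadStream is(f, buf, sizeof buf);
      CountStrings handler;
      rapidjson::Reader reader;
      reader.Parse(is, handler);
      std::printf("%zu strings\n", handler.n);
      std::fclose(f);
    }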

------
dheera
So what is the fastest JSON library available? orjson claims to be the
fastest, but doesn't benchmark simdjson. simdjson claims to be the fastest,
but did they forget to benchmark anything?

------
cerberusss
The author gave a talk last month, which can be viewed on YouTube:

[https://www.youtube.com/watch?v=p6X8BGSrR9w](https://www.youtube.com/watch?v=p6X8BGSrR9w)

------
chungus
I use Emacs with lsp-mode (Language Server Protocol) a lot (for Haskell, Rust,
Elm, and even Java), and there was a dramatic speedup from Emacs 27 onwards
when it started using jansson for JSON parsing.

I don't think it's the bottleneck at the moment, but it's good to know there
are faster parsers out there. I had a quick search but couldn't find any plans
to incorporate simdjson, besides a thread from last year on the Emacs China
forums.

------
Const-me
Very impressive. Still, there are a couple of issues.

This comment is incorrect:
[https://github.com/simdjson/simdjson/blob/v0.4.7/src/haswell...](https://github.com/simdjson/simdjson/blob/v0.4.7/src/haswell/simd.h#L111)

The behavior of that instruction is well specified for all inputs. If the high
bit is set, the corresponding output byte will be 0. If the high bit is zero,
only the lower 4 bits will be used for the index. Ability to selectively zero
out some bytes while shuffling is useful sometimes.
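
A quick sketch demonstrating those semantics with the SSSE3 intrinsic:

    #include <cstdio>
    #include <tmmintrin.h>  // _mm_shuffle_epi8 (SSSE3)

    int main() {
      __m128i src = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                  8, 9, 10, 11, 12, 13, 14, 15);
      // High bit set in an index byte -> output byte is zeroed;
      // otherwise the low 4 bits select the source byte.
      __m128i idx = _mm_setr_epi8(3, (char)0x83, 5, (char)0xFF,
                                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
      __m128i out = _mm_shuffle_epi8(src, idx);
      unsigned char b[16];
      _mm_storeu_si128((__m128i*)b, out);
      std::printf("%u %u %u %u\n", b[0], b[1], b[2], b[3]);  // 3 0 5 0
    }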

I’m not sure about this part:
[https://github.com/simdjson/simdjson/blob/v0.4.7/src/simdpru...](https://github.com/simdjson/simdjson/blob/v0.4.7/src/simdprune_tables.h#L9-L11)
The popcnt instruction is very fast: the latency is 3 cycles on Skylake and
only 1 cycle on Zen 2. It produces the same result without RAM loads, and
therefore without taking precious L1D space. The code uses popcnt sometimes,
but apparently the lookup table is still used in other places.
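
For illustration, the general table-versus-instruction trade-off looks like
this (a sketch of the trade-off being discussed, not simdjson's actual table):

    #include <cstdint>
    #include <cstdio>

    // 256-entry lookup table: costs L1D space and one load per byte.
    static uint8_t popcnt_table[256];

    int popcount_table(uint32_t v) {
      return popcnt_table[v & 0xFF] + popcnt_table[(v >> 8) & 0xFF] +
             popcnt_table[(v >> 16) & 0xFF] + popcnt_table[v >> 24];
    }

    int main() {
      for (int i = 0; i < 256; i++)
        popcnt_table[i] = uint8_t(__builtin_popcount(i));
      // With -mpopcnt (or -march=haswell), GCC/Clang compile
      // __builtin_popcount down to the single popcnt instruction.
      uint32_t x = 0xF00F;
      std::printf("%d %d\n", popcount_table(x), __builtin_popcount(x));  // 8 8
    }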

~~~
nkurz
> Perform a lookup assuming the value is between 0 and 16 (undefined behavior
> for out of range values)

I think you are misinterpreting the way that "undefined" is being used here.
It's not a claim that one will get unpredictable results for this particular
implementation; rather, it's about the specification of the function. It's
telling the user that the behavior of this function for out-of-range values is
not guaranteed to remain the same over time as the code is changed, or across
different architectures.

> I’m not sure about this part: ... popcnt instruction is very fast

I haven't worked on this particular code, but I've coauthored a paper with
Daniel on beating popcnt using AVX2 instructions:
[https://lemire.me/en/publication/arxiv1611.07612/](https://lemire.me/en/publication/arxiv1611.07612/).
While you are right that at times saving L1 space is a greater priority, I'd
bet that the approach used here was tested and found to be faster on Haswell.
I'm not sure if you noticed that the page you linked is Haswell specific?

~~~
Const-me
> rather it's about the specification of the function

That “function” compiles into a single CPU instruction. The OP is perfectly
aware of that; that’s why really_inline is there.

> on beating popcnt using AVX2 instructions

It’s easy to do with pshufb when you have many values on input. I wrote about
it years before that article; see here:
[https://github.com/Const-me/LookupTables#test-results](https://github.com/Const-me/LookupTables#test-results)

> I'd bet that the approach used here was tested and found to be faster on
> Haswell

I'd bet it’s an error.

> if you noticed that the page you linked is Haswell specific

I did. I was disappointed, though; I expected to find something newer than
Haswell from 2013, like Zen 2 or Skylake. When doing micro-optimizations like
that, the exact micro-architecture matters.

~~~
nkurz
> I expected to find something newer than Haswell from 2013, like Zen 2 or
> Skylake.

I'm sure optimizations for more recent architectures would be appreciated, and
Daniel is wonderfully accepting of patches. Be careful though, or you might
inadvertently end up as the maintainer of the whole project!

~~~
Const-me
When I have free time, I’m generally more willing to contribute to my own open
source projects no one cares about. Like this one:
[https://github.com/Const-me/Vrmac](https://github.com/Const-me/Vrmac). BTW, I
did a substantial amount of SIMD stuff there, for both the 3D and 2D parts.

------
yalok
Didn’t find any mention of plans for NEON (ARM’s SIMD) support. Has anyone
heard of such plans?

~~~
mbreese
I don’t know about full support, as I can barely understand this code.
However, NEON is one of the code paths shown in the #if blocks, so I’d assume
it supports NEON, or at least has plans to support it.

[https://github.com/simdjson/simdjson/blob/master/singleheade...](https://github.com/simdjson/simdjson/blob/master/singleheader/simdjson.cpp)

    enum instruction_set {
      DEFAULT = 0x0,
      NEON = 0x1,
      AVX2 = 0x4,
      SSE42 = 0x8,
      PCLMULQDQ = 0x10,
      BMI1 = 0x20,
      BMI2 = 0x40
    };

    #if defined(__arm__) || defined(__aarch64__) // incl. armel, armhf, arm64

    #if defined(__ARM_NEON)

------
mattbk1
There's an R (#rstats) wrapper as well:
[https://github.com/eddelbuettel/rcppsimdjson](https://github.com/eddelbuettel/rcppsimdjson)

------
Koshkin
There's something wrong with having gigabyte-sized _text_ files.

~~~
sroussey
I mentioned this elsewhere, but if you Google Takeout your location history,
you get... gigabyte-sized text files.

------
asadlionpk
It seems this is for parsing multiple JSONs, each a few MBs at most. What does
one do if they have a _single_ 100GB JSON file? :)

i.e.

    {
      // 100GB of data
    }

------
pier25
This is fantastic.

Does anyone know what library V8 uses, or how it compares?

~~~
jayflux
Last I checked, V8 doesn’t outsource to a library; their JSON parsing is
built-in. See
[https://news.ycombinator.com/item?id=20724854](https://news.ycombinator.com/item?id=20724854)

------
chii
I wonder if it's better to FFI into this library when using Node.js, rather
than using JSON.parse().

~~~
Legogris
The FFI overhead in Node.js is significant. We have a project where, a while
back, profiling showed that the majority of the CPU time was spent in a couple
of small hotspots doing parsing and object construction in Node.js.

I broke those out into a Rust library that was >100x faster (IIRC) in
synthetic benchmarks of the same complexity. But plugging it into the Node.js
app with FFI, it actually performed slightly worse, due to FFI overhead and
translation.

So for large documents, it could be worth it; for lots of small objects,
probably not. You'd have to try it on real-world data for your use case to
know.

Also
[https://github.com/luizperes/simdjson_nodejs/issues/5](https://github.com/luizperes/simdjson_nodejs/issues/5)

~~~
dgb23
> I broke those out into a Rust library that was >100x faster (IIRC) in
> synthetic benchmarks with the same complexity. Plugging it in with FFI into
> the Nodejs app and it actually performed slightly worse due to FFI overhead
> and translation.

This is interesting. Have you tried optimizing to reduce the overhead? I can
imagine that this can be hard/complex and require refactoring the consumer in
some scenarios.

------
MariuszGalus
Glad Lemire is getting his shine-time on HN.

------
deathnoto
missing comparison with jansson
([https://jansson.readthedocs.io/en/2.10/](https://jansson.readthedocs.io/en/2.10/))

~~~
chrisan
Looks like jansson is slower than RapidJSON (which is compared):
[https://github.com/miloyip/nativejson-benchmark](https://github.com/miloyip/nativejson-benchmark)

------
rgovostes
Your computer scientists were so preoccupied with whether or not they could,
they didn't stop to think if they should.

