
Simdjson – Parsing Gigabytes of JSON per Second - cmsimike
https://github.com/lemire/simdjson
======
raphlinus
This is very cool. Meanwhile, in the xi-editor project, we're struggling with
the fact that Swift JSON parsing is very slow. My benchmarking clocked in at
0.00089GB/s for Swift 4, and things don't seem to have improved much with
Swift 5. I'm encouraging people on that issue[1] to do a blog post.

[1]: [https://github.com/xi-editor/xi-mac/issues/102](https://github.com/xi-editor/xi-mac/issues/102)

~~~
marton78
Why does Xi use JSON in the first place? It would be easier and faster to use
a binary format, e.g. Protobufs or Flatbuffers, or, if the semantics of JSON
are needed, CBOR.

~~~
aratno
From “Design Decisions”[1]:

> JSON. The protocol for front-end / back-end communication, as well as
> between the back-end and plug-ins, is based on simple JSON messages. I
> considered binary formats, but the actual improvement in performance would
> be completely in the noise. Using JSON considerably lowers friction for
> developing plug-ins, as it’s available out of the box for most modern
> languages, and there are plenty of the libraries available for the other
> ones.

1: [https://github.com/xi-editor/xi-editor/blob/master/README.md#design-decisions](https://github.com/xi-editor/xi-editor/blob/master/README.md#design-decisions)

~~~
nothrabannosir
So is it too slow or not?

~~~
raphlinus
We actually do get 60fps, but JSON parsing on the Swift side takes more than
its share of total CPU load, affecting power consumption among other things.
So (partly to address the trolls elsewhere in the thread), the choice of JSON
does not preclude fast implementation (as the existence of simdjson proves),
but it does make it dependent on the language having a performant JSON
implementation. I made the assumption that this would be the case, and for
Swift it isn't.

~~~
dilap
At some point though, isn't it maybe easier just to use an inherently more
efficient format than trying to rely on clever implementations to save you?

I totally get json for public internet services where you want to have lots of
consumers and using a more efficient format would be significant friction, but
writing an editor frontend is a very large endeavor -- it seems like the extra
work of adopting something more efficient than json (like flatbuffers or
whatever) would really be in the noise.

~~~
raphlinus
It's a complicated tradeoff. It's not just performance; the main thing is
clear code. Another factor was support across a wide variety of languages,
which was thinner for things like flatbuffers at the time we adopted JSON.
Also, "clever implementations" like simdjson don't have a high cost, if
they're nice open source libraries.

~~~
aseipp
The problem with clever implementations isn't that they can't be reused or
that they have abnormally high cost for end-users (though this is sometimes
the case). It's that they inherently require more work to maintain, author,
and debug over time. When you're talking about a cross language protocol that
will have myriads of available implementations (each with different
constraints), it's not unreasonable to take a look at how much work a third
party must engage in to get such a "clever" implementation (or, in other
words, "how many people could reimplement simdjson?"). And if those existing
clever implementations aren't available (or viable) for some use case, then
you're out of luck and start at square one. This happens more often than you
think.

In this case there's a lot of work already put into fast JSON parsers, but in
general JSON is not a very friendly format to work with or write efficient,
generalized implementations of. Maybe it's not worth switching to something
else. I'm not saying you should; it seems like a fine choice to me. But clever
implementations don't come free and representation choice has a big impact on
how "clever" you need to be.

------
glangdale
One of the two authors here. Happy to answer questions.

The intent was to open things up but not publicize them at this stage, but
Hacker News seems to find stuff. Wouldn't surprise me if plenty of folks
follow Daniel Lemire on Github, as his stuff is always interesting.

~~~
Jerry2
I'm writing an IoT library for devices with tiny microprocessors and have been
sending data as JSON or BSON (binary JSON). On the backend, I've been storing
reports from IoT devices into a database (MariaDB on AWS). How crazy would it
be to just store all the data as JSON files on disk (or S3 bucket) and then
batch process them when I need to perform data analysis on them? If a million
devices send dozens of status reports per day, that's going to be a crapton
of files... but that might be faster to process than querying the database.

If you or anyone else has some opinions on this, please let me know! I'd
really like to learn how people do this type of analysis at scale.

~~~
tobilg
We're using AWS Kinesis delivery streams to batch incoming JSON messages from
IoT devices to Parquet files in S3. Those can directly be read by different
AWS services like Redshift, EMR or Athena...

~~~
DanFeldman
We use Athena for all our robotics data, which we ETL into JSON. It's
fantastic for simple time-slice queries, which most of ours are, because
sensor data is inherently time-series. When more complicated joins are
necessary, the performance is still there across terabytes, and the cost is
very low: $5 per terabyte scanned (storage costs are another thing).

------
xfs
If you're working with JSON objects on the larger end, quite often you're not
going to need the entirety of them, just a small part. If that is the
workload, the thing to do is simply parse as little data as possible: skip the
validation, locate the relevant bits, and only then do the full parsing,
validation and all the rest. In that situation, optimizing the JSON
scanner/lexer gives a much greater improvement than optimizing the parser.

This job is trickier than it may look, though. The logic to extract the
"relevant" bits is often dynamic or tied to user input, but for the
scanner/lexer to be ultrafast it has to be tightly compiled. You can try
JITing, but libllvm is probably too heavyweight for parsing JSON.

~~~
chubot
I agree that's a good strategy for big JSON. Do you know of any such "lazy"
parsers?

I think the problem is that to extract arbitrary keys, you really need to
parse the whole thing, although you don't need to materialize nodes for the
whole thing.

But if you have big JSON with a given schema, you may be able to skip things
lexically. You basically need to count {} and [], while taking into account "
and \ within quoted strings.

That doesn't seem too hard. I think a tiny bit of
[http://re2c.org/](http://re2c.org/) could do a good job of it.
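
A minimal hand-rolled version of that brace/bracket counting, just to show the
shape of it (my sketch, assuming well-formed input; re2c or SIMD would be the
fast ways to do the same thing):

    #include <cstddef>
    #include <string_view>

    // Given the index of the '{' or '[' that opens a JSON value, return the
    // index one past its matching close, skipping over quoted strings and
    // backslash escapes. Assumes the input is valid JSON.
    size_t skip_value(std::string_view json, size_t pos) {
        int depth = 0;
        bool in_string = false;
        for (size_t i = pos; i < json.size(); ++i) {
            char c = json[i];
            if (in_string) {
                if (c == '\\') ++i;                  // skip the escaped character
                else if (c == '"') in_string = false;
            } else if (c == '"') {
                in_string = true;
            } else if (c == '{' || c == '[') {
                ++depth;
            } else if (c == '}' || c == ']') {
                if (--depth == 0) return i + 1;
            }
        }
        return json.size();                          // unterminated input
    }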

~~~
kristjansson
It’s not exactly the lazy parser you describe, but Sparser[1] builds filters
to exclude json lines/files that can’t contain what you’re looking for, and
only parses those that might.

The Morning Paper’s writeup[2] from last year provides a good summary.

[1]: [http://www.vldb.org/pvldb/vol11/p1576-palkar.pdf](http://www.vldb.org/pvldb/vol11/p1576-palkar.pdf)
[2]: [https://blog.acolyer.org/2018/08/20/filter-before-you-parse-faster-analytics-on-raw-data-with-sparser/](https://blog.acolyer.org/2018/08/20/filter-before-you-parse-faster-analytics-on-raw-data-with-sparser/)

~~~
glangdale
This work is somewhat orthogonal to ours as it assumes that you can locate
JSON records without doing parsing; if I remember correctly, it groups JSON
records as lines. If your JSON has been formatted to conform to this, I
suppose it would be quite effective.

------
jillesvangurp
Number handling looks like it would be a problem. There are test suites for
JSON parsers, and lots of parsers fail a lot of these tests. Check e.g.
[https://github.com/nst/JSONTestSuite](https://github.com/nst/JSONTestSuite),
which checks compliance against RFC 8259.

Publishing results against this suite could be useful, both for assessing how
good this parser is and for establishing and documenting any known issues. If
correctness is not a goal, that can still be fine, but finding out that your
parser of choice doesn't handle common JSON emitted by other systems can be
annoying.

Regarding the numbers, I've run into a few cases where Jackson being able to
parse BigIntegers and BigDecimals was very useful to me. Silently rounding to
doubles or floats can be lossy, and failing on some documents just because a
value exceeds max long/int can be an issue as well.
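
As a small illustration of the rounding issue (my example, not the parent's):
any integer above 2^53 silently loses precision once it passes through a
double.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // 2^53 + 1 is a perfectly valid JSON number, but a double cannot hold it.
        int64_t original   = 9007199254740993LL;             // 2^53 + 1
        double  as_double  = static_cast<double>(original);  // what a double-only parser keeps
        int64_t round_trip = static_cast<int64_t>(as_double);
        std::printf("%lld -> %lld\n", (long long)original, (long long)round_trip);
        // prints: 9007199254740993 -> 9007199254740992
    }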

------
baybal2
> We store strings as NULL terminated C strings. Thus we implicitly assume
> that you do not include a NULL character within your string, which is
> allowed technically speaking if you escape it (\u0000).

I've lost count of the broken JSON parsers that all fall over on that.

~~~
groestl
Yeah, this is unforgivable, and for me makes the whole speed argument void.

Edit: to be fair, they handle a couple of other things which many similar
libraries ignore. I particularly like the support for full 64-bit integers.
And at least they document their limitation on NULL bytes.

~~~
glangdale
"Unforgivable" is a bit strong. I don't think this is something which
invalidates our entire approach - nothing in the algorithm depends on this
behavior as the \0 chars don't appear until quite late. Even then, we are not
dependent on sighting a \0 in our string normalization, and as such we can
probably just store an offset+length in our 'tape' structure rather than
assuming we have null-terminated strings.
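
For illustration, that offset+length idea might look something like this (a
hypothetical layout, not simdjson's actual tape): string bytes live in one
shared buffer and the tape records where each string starts and how long it
is, so an embedded \0 needs no special casing.

    #include <cstddef>
    #include <cstdint>
    #include <string_view>
    #include <vector>

    struct StringRef {
        uint32_t offset;   // byte offset into the shared string buffer
        uint32_t length;   // length in bytes; may legitimately contain '\0'
    };

    struct ParsedStrings {
        std::vector<char> buffer;        // all unescaped string bytes, back to back
        std::vector<StringRef> strings;  // one entry per JSON string, in document order

        std::string_view get(size_t i) const {
            const StringRef& s = strings[i];
            return {buffer.data() + s.offset, s.length};
        }
    };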

Please add an issue on Github.

Edit: I went ahead and added an issue. Seems like something we should fix.

------
adrianN
I feel like if you need to parse Gigabytes per second of JSON, you should
probably think about using a more efficient serialization format than JSON.
Binary formats are not much harder to generate and can save a lot of bandwidth
and CPU time.

~~~
wongarsu
I have in the past parsed terabytes of JSON. The specific use case was
analysing archived Reddit comments. The Reddit API uses JSON, and somebody [1]
runs a server that just dumps them in a file, one line of JSON per comment,
and offers them for download (compressed, obviously). So now you end up with
Gigabytes of small JSONs per month, and anything you do will be quickly
dominated by JSON parsing time.

You _could_ store them in some binary format, but the API response format
changed over the years with various fields being added and removed, and either
your binary format ends up not much better than JSON or you end up reencoding
old comments because the API changed.

1: [http://files.pushshift.io/reddit/](http://files.pushshift.io/reddit/)
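
A sketch of what that workload looks like (hypothetical handle_comment stub
and file name; in practice each line goes to whatever JSON parser you use,
which is where essentially all of the runtime goes):

    #include <fstream>
    #include <iostream>
    #include <string>

    // Stand-in for real per-comment work; replace with a call into your parser.
    static size_t handle_comment(const std::string& json_line) {
        return json_line.size();
    }

    int main() {
        std::ifstream in("RC_2015-01");   // one JSON object (one comment) per line
        std::string line;
        size_t count = 0, bytes = 0;
        while (std::getline(in, line)) {
            bytes += handle_comment(line);
            ++count;
        }
        std::cerr << "processed " << count << " comments, " << bytes << " bytes\n";
    }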

~~~
nojvek
The parsed format in tape.md is quite close to the flatbuffer format.
Flatbuffer can encode any json file just fine. The parse time is immediate and
requires no extra memory.

It’s a great way to store big json files where you only want to access a
subset of data very quickly and not load the whole file into memory.

[https://google.github.io/flatbuffers/](https://google.github.io/flatbuffers/)

------
kccqzy
I guess the question is, what do you parse it to? I'm guessing definitely not
turning objects into std::unordered_map and arrays into std::vector or some
such. So how easy is it to use the "parsed" data structure? How easy is it to
add an element to some deeply nested array for example?

~~~
Falell
The ParsedJson type is immutable and accessed via mutating iterators (up and
down the tree, forward and backward through members and indices).

My immediate thought is to compare it to rapidjson, which I've used before.
The paradigm of mutating iterators seems awkward at first but should be just
as powerful as rapidjson's Value. For example, both approaches end up doing a
linear scan to find an object member by name.

The fact that rapidjson supports mutation of Values and simdjson does not has
huge implications (as mentioned in the simdjson README scope section); I
suspect this tradeoff explains most of the performance difference, as I know
rapidjson also uses SIMD internally.

~~~
hnaccy
Is there a reason these fast json libraries seem to favor doing linear scan
for object representation?

~~~
yoklov
Faster to build than a hash map, less code (which is also better for icache),
etc.

JSON Objects tend to have few enough values that it doesn't matter a ton
anyway.
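
A toy illustration of that tradeoff (hypothetical types, not the internals of
either library): with a handful of members per object, a flat array plus a
linear scan is less code and usually faster than building a hash map for
every object.

    #include <string_view>
    #include <utility>
    #include <vector>

    struct Value;  // whatever the parsed value type is

    using Members = std::vector<std::pair<std::string_view, const Value*>>;

    // Members are stored in parse order; lookup is a linear scan. For typical
    // JSON objects (a few keys) this beats paying the construction and hashing
    // cost of an unordered_map, and the code stays tiny and icache-friendly.
    const Value* find_member(const Members& members, std::string_view key) {
        for (const auto& [name, value] : members)
            if (name == key) return value;
        return nullptr;
    }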

------
westurner
> _Requirements: […] A processor with AVX2 (i.e., Intel processors starting
> with the Haswell microarchitecture released 2013, and processors from AMD
> starting with the Ryzen)_

~~~
aristidb
Also noteworthy that on Intel at least, using AVX/AVX2 reduces the frequency
of the CPU for a while. It can even go below base clock.

~~~
scottlamb
iirc, it's complicated. Some instructions don't reduce the frequency; some
reduce it a little; some reduce it a lot.

I'm not sure AVX2 is as ubiquitous as the README says: "We assume AVX2 support
which is available in all recent mainstream x86 processors produced by AMD and
Intel."

I guess "mainstream" is somewhat subjective, but some recent Chromebooks have
Celeron processors with no AVX2:

[https://us-store.acer.com/chromebook-14-cb3-431-c5fm](https://us-store.acer.com/chromebook-14-cb3-431-c5fm)

[https://ark.intel.com/products/91831/Intel-Celeron-Processor-N3160-2M-Cache-up-to-2_24-GHz](https://ark.intel.com/products/91831/Intel-Celeron-Processor-N3160-2M-Cache-up-to-2_24-GHz)
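
If you ship code that assumes AVX2, a runtime check is cheap insurance (a
sketch using the GCC/Clang builtin; on MSVC you would query CPUID leaf 7
yourself):

    #include <cstdio>

    int main() {
        // __builtin_cpu_supports is available in GCC and Clang.
        if (__builtin_cpu_supports("avx2"))
            std::puts("AVX2 available: the SIMD fast path can run");
        else
            std::puts("no AVX2: fall back to a scalar parser");
    }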

~~~
Ultimatt
Because someone wanting 2.2GB/s JSON parsing is deploying to a chromebook...

~~~
scottlamb
It doesn't seem that laughable to me to want faster JSON parsing on a
Chromebook, given how heavily JSON is used to communicate between webservers
and client-side Javascript.

"Faster" meaning faster than Chromebooks do now; 2.2 GB/s may simply be
unachievable hardware-wise with these cheap processors. They're kinda slow, so
any speed increase would be welcome.

------
ben-schaaf
I wonder how this compares to fast.json: "Fastest JSON parser in the world is
a D project?"
([https://news.ycombinator.com/item?id=10430951](https://news.ycombinator.com/item?id=10430951)),
both in an implementation/approach sense and in terms of performance.

------
yeldarb
Will this work on JSON files that are larger than the available system memory?

Firebase backups are huge JSON files and we haven’t found a good way to deal
with them.

There are some “streaming JSON parsers” that we have wrestled with but they
are buggy.

~~~
glangdale
Sadly it will not. Arguably we could 'stream' things, but we don't have an API
or a use case for it. If you could capture your requirements and put them on
an issue on Github, it would be helpful. We're not against the streaming use
case, we just don't understand it very well.

------
xnormal
Any chance of something similar for CSV? (full RFC-4180 including quotes,
escaping etc).

Terabytes of "big data" get passed around as CSV.

~~~
glangdale
CSV is on our list; this is a simpler task than JSON due to the absence of
arbitrary nesting.
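
For context, the part of RFC 4180 that makes CSV more than split-on-comma is
the quoting. A scalar sketch for a single record (my code, nothing to do with
any planned SIMD version; it ignores quoted newlines that span lines):

    #include <cstddef>
    #include <string>
    #include <string_view>
    #include <vector>

    // Split one RFC-4180 record into fields: commas inside "..." do not split,
    // and "" inside a quoted field is an escaped quote character.
    std::vector<std::string> split_record(std::string_view line) {
        std::vector<std::string> fields(1);
        bool quoted = false;
        for (size_t i = 0; i < line.size(); ++i) {
            char c = line[i];
            if (quoted) {
                if (c != '"') {
                    fields.back() += c;
                } else if (i + 1 < line.size() && line[i + 1] == '"') {
                    fields.back() += '"';      // "" is an escaped quote
                    ++i;
                } else {
                    quoted = false;            // closing quote
                }
            } else if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                fields.emplace_back();         // field boundary
            } else {
                fields.back() += c;
            }
        }
        return fields;
    }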

~~~
imtringued
I doubt someone using CSV for big data is going to follow that rule...

~~~
carlmr
What do you mean? It's not a rule, it's just not possible in the CSV format to
have arbitrary nesting.

------
fooyc
What happens to the parsed data? Do the benchmarks account for the time to
access that data after parsing?

------
ftp-bit
Perhaps I'm misunderstanding or don't have a good enough grasp of this, but
in what circumstance would you need to parse gigabytes? I've only seen it
used in config files, so...

~~~
userbinator
What usually happens is someone creates an API, one which did not initially
have to handle much data, and then it just grew over time. (I guess it's
similar to how a lot of the Internet's early application-layer protocols like
HTTP, SMTP, etc. are text-based --- the text format was initially more
"convenient" for a variety of reasons, but obviously is not very efficient at
scale.)

Or, perhaps a more common scenario today, it was designed by people who simply
had no knowledge of binary protocols or efficiency at all --- not too long ago
I had to deal with an API which returned a binary file, but instead of simply
sending the bytes directly, it decided to send a JSON object containing one
array, whose elements were strings, and each string was... a hex digit.
Instead of sending "Hello world" it would send '{"data":["4","8"," ","6","5","
","6","C"," " ... '

------
maliker
If this kind of work is interesting to you, you might like Daniel Lemire's
blog ([https://lemire.me/blog/](https://lemire.me/blog/)).

He's a professor, but his work is highly applied and immediately usable. He
manages to find and demonstrate a lot of code where we assume the big-O
performance tells the whole story, but the reality of modern processors and
caching (etc.) means very different performance in practice.

------
sbr464
Thanks for posting. I've been working with lidar/robotic data more recently
and it's nice to work with JSON directly, when the performance is good enough.

------
avmich
> All JSON is JavaScript, but not all JavaScript is JSON

Really? I thought their specifications diverged long enough ago (though using
those extras could be discouraged in some cases).

~~~
chubot
The JSON spec [1] never had any updates, so it couldn't have diverged.

Kudos to Douglas Crockford for keeping it simple. I wish more standards
committees would take a cue from him. (Looking at ECMAScript [2] and C++.)

There's been a tremendous amount of growth and value around JSON precisely
because it's so simple and easy to implement.

People complain about the lack of comments and trailing commas, but I think
those are really expanding on the initial use case of JSON, and the benefit
isn't worth the cost of change. JSON does some things super well, other things
marginally well, and some not at all, and that's working as intended.

You can always make something separate to cover those use cases, and that
seems to have happened with TOML and so forth.

(I recall there was an RFC that cleaned up ambiguities in Crockford's web
page, but it just clarified things. No new features were added. So JSON is
still as much of a subset of JavaScript as it ever was. On the other hand,
JavaScript itself has grown wildly out of control.)

[1] [http://json.org/](http://json.org/)

[2]
[https://news.ycombinator.com/item?id=18766361](https://news.ycombinator.com/item?id=18766361)

~~~
eesmith
[https://en.wikipedia.org/wiki/JSON#Data_portability_issues](https://en.wikipedia.org/wiki/JSON#Data_portability_issues)
:

> Although Douglas Crockford originally asserted that JSON is a strict subset
> of JavaScript, his specification actually allows valid JSON documents that
> are invalid JavaScript. Specifically, JSON allows the Unicode line
> terminators U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR to appear
> unescaped in quoted strings, while ECMAScript 2018 and older does not.

~~~
empyrical
That bit of incompatibility will be going away when this proposal is
implemented, however:

[https://github.com/tc39/proposal-json-superset](https://github.com/tc39/proposal-json-superset)

~~~
v413
It is already implemented in the current Firefox, Chrome and Safari 12.

------
fulafel
What's the current state of the art in doing this on GPU?

~~~
glangdale
To my knowledge, it is limited to posting "Towards JSON Parsing on a GPU" type
articles. Writing that sort of article is easy and fun, without the tedious
burden of implementing things.

------
tenken
I'm curious how fast the sqlite json extension is for validation and
manipulation of json data when compared to this library.

------
kitd
OT, but I notice it can be run by _#include_ -ing the simdjson.cpp file. How
common is this in CPP projects?

~~~
Erwin
It seems like there are quite a few single-header C++ libraries:
[https://github.com/nothings/single_file_libs](https://github.com/nothings/single_file_libs)
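
Most of the libraries in that list follow the stb-style pattern: one file acts
as the header everywhere and as the implementation in exactly one translation
unit (a sketch with made-up names, not how simdjson itself is organized):

    // mylib.h -- a hypothetical single-file library
    #ifndef MYLIB_H
    #define MYLIB_H
    int mylib_answer(void);
    #endif

    #ifdef MYLIB_IMPLEMENTATION
    int mylib_answer(void) { return 42; }
    #endif

    // main.cpp -- exactly one source file defines the implementation macro
    #define MYLIB_IMPLEMENTATION
    #include "mylib.h"
    #include <cstdio>
    int main() { std::printf("%d\n", mylib_answer()); }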

The people complaining about dependency management in Python should try doing
it in C++; there seem to be half a dozen competing ones. And three times as
many build systems.

------
vkaku
Honestly, this is a cool hack. But it's not the best way to shuttle that much
data around.

It's a hammer on rocket fuel.

------
hrdwdmrbl
Would it be possible to make a native module out of this for node?

~~~
sbr464
Here are the Node bindings for RapidJSON; I'm assuming it would be similar.

[https://github.com/matthewpalmer/node-rapidjson](https://github.com/matthewpalmer/node-rapidjson)

~~~
hrdwdmrbl
Thank you!

Though from the readme on that module the dev says "it turns out that you’re
better off using the normal Node.js/V8 implementation unless you’re operating
on huge JSON.

... the bridging from V8 to C++ is a bit too costly at this stage."

~~~
sbr464
That was two years ago though, not sure what improvements the N-API has in
newer versions of nodejs.

------
iamleppert
This is faster than the browser’s native parsing speed, I assume?

------
achalkley
Will this work on an Arduino?

~~~
abhorrence
This code in particular won’t, since it relies on a particular extension of
the x86 instruction set. I don’t believe Arduino-compatible chips have SIMD
instructions, but if they do, a similar approach could be taken.

~~~
glangdale
I'm not aware of any SIMD-capable Arduino chips; even when Quark was a thing,
it didn't support SIMD.

It's possible to do SWAR (SIMD Within A Register) tricks to try to substitute,
but on a 32-bit processor (or even a 64-bit processor) I doubt our techniques
would look good. In Hyperscan, my regex project, we used SWAR for simple
things (character scans) but I doubt that simdjson would work well if you
tried to make it into swarjson. :-)
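
For the curious, the classic SWAR trick for that kind of character scan, here
finding a '"' byte eight bytes at a time (a generic sketch, not code from
Hyperscan or simdjson):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // "Does any byte of word equal b?" -- the has-zero-byte trick from
    // Bit Twiddling Hacks, applied after XORing in the byte we look for.
    static bool has_byte(uint64_t word, uint8_t b) {
        uint64_t x = word ^ (0x0101010101010101ULL * b);  // matching bytes become 0
        return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
    }

    bool contains_quote(const char* buf, size_t len) {
        size_t i = 0;
        for (; i + 8 <= len; i += 8) {          // 8 bytes per step, no SIMD required
            uint64_t word;
            std::memcpy(&word, buf + i, sizeof word);
            if (has_byte(word, '"')) return true;
        }
        for (; i < len; ++i)                    // scalar tail
            if (buf[i] == '"') return true;
        return false;
    }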

~~~
fulafel
I wonder if it's possible to do something with bitslicing?

