
Fastest JSON parser in the world is a D project? - micaeloliveira
http://forum.dlang.org/thread/20151014090114.60780ad6@marco-toshiba
======
grayrest
Here's the "how so fast" explanation:

[https://github.com/kostya/benchmarks/pull/46#issuecomment-14...](https://github.com/kostya/benchmarks/pull/46#issuecomment-147932489)

~~~
Tinyyy
I still don’t get it. ELI5?

~~~
grayrest
The JSON looks like this:

    
    
        {"coordinates": [
           {"x": 0.65, "y": 0.23, "z": 0.91, "name": "fwgzd", "opts": {"1": [1, true]}},
           {"x": 0.45, "y": 0.78, "z": 0.22, "name": "alfsj", "opts": {"1": [1, true]}},
           ...
        ],
         "info": "some info"}
    

The benchmark code [1] (very readable) is reading an array of structs
containing x,y,z from 'coordinates'.

[1]
[https://github.com/kostya/benchmarks/blob/master/json/test_f...](https://github.com/kostya/benchmarks/blob/master/json/test_fast.d)

I haven't read the code, but the algorithm processes this roughly like:

    
    
       See `{` process? yes
       See `"coordinates"` process? yes
       See `[` process? yes
       See `{` process? yes
       See `"x"` process? yes => new Coord, Coord.x = 0.65
       See `"y"` process? yes => Coord.y = 0.23
       See `"z"` process? yes => Coord.z = 0.91
       See `"name"` process? no, next token is `"`, skip to `"`
       See `"opts"` process? no, next token is `{`, skip to `}`
    

The tradeoff is that he's completely ignoring the contents of `name`, `opts`,
and `info`. Those values could be invalid JSON, but this processor doesn't
care.

The code is also picking up efficiencies from being a static, C-like language.
The "new Coord" isn't actually allocating anything: the allocation happened
for the array as a whole, so the assignments just write a value of known size
at a known offset from the start of the array. He's also using SIMD
instructions to process multiple bytes at a time, plus some other tricks, but
the skipping is the main difference.
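The skipping idea can be sketched as follows. This is a hypothetical Python illustration, not the actual D code: it pulls `x`, `y`, `z` out of each object under `"coordinates"` and skips every other value by matching brackets and quotes only, never validating the skipped content.

```python
def skip_value(s, i):
    """Advance past one JSON value starting at s[i]; return the index after it."""
    if s[i] == '"':                      # string: jump to the closing quote
        i += 1
        while s[i] != '"':
            i += 2 if s[i] == '\\' else 1
        return i + 1
    if s[i] in '{[':                     # object/array: only track nesting depth
        close = '}' if s[i] == '{' else ']'
        opener, depth, i = s[i], 1, i + 1
        while depth:
            if s[i] == '"':
                i = skip_value(s, i)     # strings may contain brackets
                continue
            if s[i] == opener:
                depth += 1
            elif s[i] == close:
                depth -= 1
            i += 1
        return i
    while s[i] not in ',}]':             # number / true / false / null
        i += 1
    return i

def parse_coords(s):
    # Find the "coordinates" array and walk its objects one key at a time.
    i = s.index('[', s.index('"coordinates"')) + 1
    coords = []
    while True:
        while s[i] in ' \t\r\n,':
            i += 1
        if s[i] == ']':                  # end of the coordinates array
            return coords
        i += 1                           # consume the object's '{'
        cur = {}
        while True:
            while s[i] in ' \t\r\n,':
                i += 1
            if s[i] == '}':
                i += 1
                break
            j = s.index('"', i + 1)      # keys here contain no escapes
            key = s[i + 1:j]
            i = s.index(':', j) + 1
            while s[i] in ' \t\r\n':
                i += 1
            if key in ('x', 'y', 'z'):   # a field we want: parse the number
                j = i
                while s[j] not in ',}':
                    j += 1
                cur[key] = float(s[i:j])
                i = j
            else:                        # anything else: skip, unvalidated
                i = skip_value(s, i)
        coords.append(cur)

doc = ('{"coordinates": ['
       '{"x": 0.65, "y": 0.23, "z": 0.91, "name": "fwgzd", "opts": {"1": [1, true]}},'
       '{"x": 0.45, "y": 0.78, "z": 0.22, "name": "alfsj", "opts": {"1": [1, true]}}'
       '], "info": "some info"}')
print(parse_coords(doc))
```

Note that `skip_value` never checks whether the skipped content is valid JSON, which is exactly the tradeoff described above.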

I think it's also interesting that the Rust code implemented value skipping in
the benchmark file itself. The relative slowness there is likely because the
library used (serde_json) is the JSON plugin for a generic
serialization/deserialization lib, and because Rust doesn't have a way to do
SIMD yet.

~~~
erickt
I wrote serde_json, and wrote the Rust benchmark here a few months ago.
Interestingly, when I wrote this benchmark, my implementation was equivalent
to RapidJSON on my Mac, but for some reason Kostya couldn't replicate it:

[https://github.com/kostya/benchmarks/pull/44](https://github.com/kostya/benchmarks/pull/44)

I'm guessing gcc just has some optimizations llvm doesn't.

Rust does have some experimental SIMD, but I'm not using it yet because I want
the serde libraries to be safe to use on byte streams, and reading 16 bytes
ahead could block if at the end of a socket stream. Hopefully we will get
specialization soon, which would let me use SIMD when I know I have at least X
bytes in a buffer.

~~~
grayrest
I enjoyed your blog series on serde perf.

One thing I noticed in this example was that the D example worked pretty much
exactly like I want serde to work in that it was able to deserialize a subset
of the overall document and the Coord struct didn't need to exhaustively cover
the individual json data objects. If there's a way to do this in serde, an
example in the docs would be really helpful.

~~~
erickt
Thanks! I need to get back into writing on it.

I just pushed up a rust pull parser version here:
[https://github.com/kostya/benchmarks/pull/54](https://github.com/kostya/benchmarks/pull/54).
Is that what you were thinking of?

~~~
grayrest
My wishes are much more prosaic. It's not clear to me just from reading your
docs how I can extract data from a JSON file using the pattern this benchmark
shows (a top-level key containing the data, other keys containing metadata
about the request) without having to create an otherwise useless struct to
cover the outer wrapper object.

I see you have a reply to Gankro about a non-exhaustive flag, and that'd work.
As for the default, the current behavior is what I'd expect from a Rust lib
given the community's correctness-first mindset, but I will always opt for
non-exhaustive, because I think most people providing JSON APIs consider
additional keys backwards compatible (they are in dynamic languages), and I'd
prefer my apps not break in production for no apparent reason.
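For contrast, the dynamic-language version of the pattern described above is a one-liner: index into one top-level key and ignore the rest of the wrapper. A Python sketch (the document and field names are illustrative):

```python
import json

doc = ('{"coordinates": [{"x": 0.65, "y": 0.23, "z": 0.91, "name": "fwgzd"}],'
       ' "info": "some info"}')

# No wrapper struct needed: parse, index into the key that holds the data,
# and keep only the fields of interest; extra keys like "name" and "info"
# are silently ignored.
coords = [{k: obj[k] for k in ("x", "y", "z")}
          for obj in json.loads(doc)["coordinates"]]
print(coords)  # [{'x': 0.65, 'y': 0.23, 'z': 0.91}]
```

This ignore-extra-keys behavior is exactly what "additional keys are backwards compatible" means in dynamic languages.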

------
_Codemonkeyism
Sounds to me like VW or Nvidia:

(Why so fast)

"On the downside I did not validate the unused side-structures. I think it is
not necessary to validate data you are not using. So basically I only scan
them so much as to find where they end. Granted it is a bit of optimization
for a benchmark, but is actually handy in real-life as well."

~~~
nailer
What was the nvidia thing? I recall [http://techreport.com/review/3089/how-
ati-drivers-optimize-q...](http://techreport.com/review/3089/how-ati-drivers-
optimize-quake-iii) , but that was ATI.

~~~
icebraining
[http://www.geek.com/games/futuremark-confirms-nvidia-is-
chea...](http://www.geek.com/games/futuremark-confirms-nvidia-is-cheating-in-
benchmark-553361/)

------
simgidacav
So, how is it going with D? Last time I gave it a look I really liked it, but
I sadly gave up when I hit the multiple-standard-libraries problem.

Anyone using D in real life, among hackers here?

~~~
zamalek
Facebook does [1][2][3].

[1]: [http://www.drdobbs.com/mobile/facebook-adopts-d-
language/240...](http://www.drdobbs.com/mobile/facebook-adopts-d-
language/240162694) [2]:
[https://github.com/facebook/warp](https://github.com/facebook/warp) [3]:
[https://github.com/facebook/flint](https://github.com/facebook/flint)

~~~
TwoBit
Actually Facebook is backing away from D. Not because they dislike it, but
simply to converge on C++.

~~~
kal31dic
You work for Facebook? They are rewriting warp in C++?

------
unwind
I don't read D, but here's the code anyway:
[https://github.com/mleise/fast/blob/master/source/fast/json....](https://github.com/mleise/fast/blob/master/source/fast/json.d).

------
sgt
On another note, I see Scala is doing really poorly in the JSON benchmark
test:
[https://github.com/kostya/benchmarks#json](https://github.com/kostya/benchmarks#json)

~~~
waxjar
Those benchmarks aren't very fair to the dynamic languages and JIT-compiled
languages (this includes Scala).

* For the dynamic languages, execution time includes the time it takes to lex, parse, and interpret the source code.

* For language implementations with a JIT, execution time includes the time the JIT takes to properly optimise hot code paths. Generally you start benchmarking after a warm-up period in such cases.

The only fair comparisons are those between ahead-of-time-compiled languages.
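A minimal warm-up harness of the kind described above might look like this (a hypothetical sketch, not kostya's actual harness):

```python
import time

def bench(fn, warmup=100, runs=1000):
    """Run fn untimed a few times so a JIT can optimise the hot path,
    then time the real runs and return the average seconds per run."""
    for _ in range(warmup):              # warm-up: results discarded
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

avg = bench(lambda: sum(range(1000)))
print(f"{avg * 1e6:.1f} us per run")
```

Without the warm-up loop, a JIT-compiled implementation is charged for its own compilation time, which is the unfairness being pointed out.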

------
dbcfd
Although D may have a fast JSON parser, this benchmark is a horrible
comparison. We really need a better way to do cross-language comparisons.

~~~
EvenThisAcronym
The repo owner is accepting pull requests for the benchmarks so at least it's
fixable.

~~~
dbcfd
It's a lot of work to fix this benchmark, which is fairly contrived to begin
with. To start, you have to adjust every sample to accept a warmup time and
then a run time (which likely means multiple samples), measuring both speed
and memory over that run time. You also have to be careful that the compiler
doesn't optimize out the repetitions, while still allowing the optimizations
that would produce the best performance.

------
tacone
> Yep, that's right. stdx.data.json's pull parser finally beats the dynamic
> languages with native efficiency. (I used the default options here that
> provide you with an Exception and line number on errors.)

Nice to see that it only took a couple of years for the D community to beat
the scripting languages in speed.

~~~
nkozyra
It's literally the next paragraph that extends this:

"A few days ago I decided to get some practical use out of my pet project
'fast' by implementing a JSON parser myself, that could rival even the by then
fastest JSON parser, RapidJSON. The result can be seen in the benchmark
results right now:

[https://github.com/kostya/benchmarks#json](https://github.com/kostya/benchmarks#json)

fast: 0.34s, 226.7Mb (GDC)
RapidJSON: 0.79s, 687.1Mb (GCC)"

~~~
tacone
I read it, so? What's wrong with the grandparent?

~~~
nkozyra
Replace "it only took a couple of years to beat scripting languages" with "it
only took a couple of years to become the fastest JSON library, inclusive of
scripting and compiled languages" and I imagine you'll understand.

------
z3t4
The D people laugh at the dynamically typed languages, yet they have "auto"
in front of all their variables ;)

~~~
geofft
To give a little more detail on the difference between type inference and
dynamic typing: the following code is valid in a dynamic language, but a
static language rejects it:

    
    
        auto x = 15;
        if (some_condition) {
            x = "Hello world!";
        }
        print(x);
    

And this catches real errors. I believe there's even a function somewhere in
the Python _standard library_ that returns a string if it finds one result
and a list of strings if it finds multiple. Of course a string is itself
iterable, so duck typing goes horribly wrong.
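The string-vs-list pitfall looks like this in Python (a generic illustration with a hypothetical `find` function, not the actual stdlib one):

```python
def find(items, prefix):
    """Hypothetical API with the flaw described above: one hit returns the
    string itself, several hits return a list of strings."""
    hits = [s for s in items if s.startswith(prefix)]
    return hits[0] if len(hits) == 1 else hits

# The caller iterates over "the results" -- fine for a list...
for m in find(["foo", "fob"], "fo"):
    print(m)                    # foo, fob

# ...but a single hit is a string, which is itself iterable, so duck
# typing silently iterates over its characters instead of failing.
for m in find(["foo", "bar"], "fo"):
    print(m)                    # f, o, o
```

A static type system would reject `find` outright, since its two branches return different types.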

~~~
scardine
I'm OK with dynamic types; they buy you something, they cost you something. My
big complaint is weakly typed languages (I'm looking at you, JavaScript). I
just hit Shift+Ctrl+I and typed the line below in the JavaScript console:

    
    
        > (1 + "1") * 2
        22
    

At least it raises a TypeError in Python (unsupported operand type(s) for +:
'int' and 'str'). OK, it's probably better to fail at compile time, but I
would rather fail at runtime than not fail at all.
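The same expression in Python, for contrast, refuses to coerce and raises at runtime:

```python
# JavaScript coerces: (1 + "1") * 2 evaluates to 22.
# Python refuses to mix the types and raises instead:
try:
    (1 + "1") * 2
except TypeError as e:
    print(e)   # unsupported operand type(s) for +: 'int' and 'str'
```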

~~~
z3t4
In the early days of my JavaScripting I always used minus instead of plus
when doing addition. But as the language and tools have matured, I hardly see
any type errors any more. I do have a habit of parseInt() that won't easily
go away, though :P

Besides the concatenation of numbers to strings and vice versa, most other
type errors in JavaScript will give you a syntax error or something like NaN
(Not a Number).

So don't mix strings and numbers. Everything else is just "objects". =)

One thing that causes lots of bugs though is undefined object properties. But
I'm not sure if inferred types would help in those cases.

------
wfunction
> Yep, that's right. stdx.data.json's pull parser finally beats the dynamic
> languages with native efficiency.

'dynamic' is the key word here. The title is misleading.

~~~
vamega
This post is about the fast project, which is claimed to beat RapidJSON, a
very fast C++ JSON parser.

~~~
easytiger
If it were implemented in C++ using identical algorithms, it would be no slower.

~~~
fauigerzigerk
That's hard to know. It's always possible that one compiler generates faster
code than another for the same algorithm.

I wouldn't be surprised if most of the speed difference was down to
correctness checks though.

~~~
easytiger
There are full threads on why this impl is faster. It's cheating.

------
merb
I don't know, but the benchmarks are awful, since Scala and Python are using
their built-in JSON parsers, not the fastest parsers available, while he
compares other super-fast libraries against D and says "hoho, it's so fast."

Benchmarks are for the poor people who can't live in the real world.

~~~
dbcfd
Benchmarks would be fine if they could somehow actually compare performance
between languages. I have yet to see any cross-language benchmarks that come
close to the results you'd actually see in practice.

