
A Journey building a fast JSON parser and full JSONPath, Oj for Go - dcu
https://github.com/ohler55/ojg/blob/master/design.md
======
quelsolaar
I wrote a fast C JSON parser a few years ago (600+ MB/s), and it was an
interesting experience. Validating the JSON roughly halved the performance.
The most interesting performance gain came from switching from aligned structs
to packed byte offsets accessed using memcpy. It added 20-30%. The overhead
of the unaligned access was nothing compared to the gain from fewer cache
misses. In the end I found that making a truly fast JSON parser mostly depends
on what you parse it into. For example, is the structure read-only, and how
fast is it to access?
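
A minimal Go sketch of the packed-offset idea, for illustration only. The record layout, `recordSize`, `appendRecord`, and `readRecord` are hypothetical names, not taken from the parser described above, and `encoding/binary` stands in for C's memcpy. With typical 64-bit alignment rules the equivalent struct would occupy 16 bytes; the packed form uses 13.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Hypothetical packed record layout, with no padding between fields:
//   offset 0: 1-byte type tag
//   offset 1: 4-byte little-endian length
//   offset 5: 8-byte little-endian float64 bits
const recordSize = 1 + 4 + 8

// appendRecord packs one record onto buf without alignment padding.
func appendRecord(buf []byte, tag byte, length uint32, value float64) []byte {
	var rec [recordSize]byte
	rec[0] = tag
	binary.LittleEndian.PutUint32(rec[1:5], length)
	binary.LittleEndian.PutUint64(rec[5:13], math.Float64bits(value))
	return append(buf, rec[:]...)
}

// readRecord decodes the i-th record by byte offset instead of by struct field.
func readRecord(buf []byte, i int) (tag byte, length uint32, value float64) {
	rec := buf[i*recordSize : (i+1)*recordSize]
	tag = rec[0]
	length = binary.LittleEndian.Uint32(rec[1:5])
	value = math.Float64frombits(binary.LittleEndian.Uint64(rec[5:13]))
	return tag, length, value
}

func main() {
	var buf []byte
	buf = appendRecord(buf, 'n', 3, 1.5)
	buf = appendRecord(buf, 's', 4, 2.25)
	tag, length, value := readRecord(buf, 1)
	fmt.Printf("%c %d %g (total %d bytes)\n", tag, length, value, len(buf))
}
```

Whether the denser layout wins in practice depends on the access pattern and the hardware, so it is worth measuring rather than assuming.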

~~~
mister_hn
Without validation, all the speed and performance is worthless if malformed
JSON can break your application.

~~~
beached_whale
There are many circumstances where the JSON is always perfect. Having the
option to not validate is beneficial.

~~~
fnord123
If you have so much control over the JSON and performance is a big deal, then
there's a big chance you can get rid of the JSON in favor of a more performant
format.

~~~
beached_whale
That isn't true though. It's the lingua franca of data transfer. Also, why are
we giving away money because of this view? We are stuck with JSON for better
or worse.

I've seen parsers differ widely on the same real task, say parsing, mapping,
and reducing 100MB of doubles encoded as JSON. A decent library takes much
less than half a second and very little memory, roughly the size of the result
plus the memory mapping of the JSON document. Many common libraries take
multiple seconds while using hundreds of megabytes or even gigabytes of
memory. That means one has to pay for bigger machines with more uptime per
task. That is giving money away.
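
To illustrate the memory side of that argument, here is a small Go sketch that reduces a JSON array of doubles as a stream, so only the running sum is kept rather than the whole decoded array. It uses the standard library's `encoding/json` decoder, which is not one of the fast parsers discussed in this thread. `sumArray` and `sumString` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// sumArray streams a JSON array of numbers from r and returns their sum
// without materializing the whole array in memory.
func sumArray(r io.Reader) (float64, error) {
	dec := json.NewDecoder(r)

	// Consume the opening '['.
	tok, err := dec.Token()
	if err != nil {
		return 0, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return 0, fmt.Errorf("expected array, got %v", tok)
	}

	var sum float64
	for dec.More() {
		var v float64
		if err := dec.Decode(&v); err != nil {
			return 0, err
		}
		sum += v // the "reduce" step; a "map" step would go here too
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return 0, err
	}
	return sum, nil
}

// sumString is a convenience wrapper for in-memory input.
func sumString(s string) (float64, error) {
	return sumArray(strings.NewReader(s))
}

func main() {
	sum, err := sumString(`[1.5, 2.25, 3.0]`)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```

This shows the shape of a low-memory reduce, not its speed; a faster parser would keep the same structure with a cheaper tokenizer underneath.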

~~~
fnord123
If you want performance and you control both ends to the point that you don't
want to validate the format, why are you using a lingua franca and not a
specific solution? You could swap it out for bson or msgpack.

A little ETL goes a long way.

~~~
beached_whale
For sure, if you can control both ends, use something more efficient. But, at
least in my cases, I am consuming data from others.

~~~
dharmon
So you are consuming data from others yet assuming it to be perfect and not in
need of validation?

------
jeffbee
There is a nugget buried at the bottom of this: the compiler generates
different code for different kinds of for ... range or for i ... loops. You'd
think these are equivalent:

    
    
    for i := range arr {
            sum += arr[i]
    }
    for _, c := range arr {
            sum += c
    }
    

... but the latter is 4 bytes shorter in x86. The compiler takes things
literally.
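
For anyone who wants to check this locally, the two loop forms can be put in separate functions and the generated assembly inspected, for example with `go build -gcflags=-S`. The sketch below is only a harness: `sumByIndex` and `sumByValue` are hypothetical names, and the exact size difference will depend on the Go version and target architecture.

```go
package main

import "fmt"

// sumByIndex iterates with the index and loads each element explicitly.
func sumByIndex(arr []int) int {
	sum := 0
	for i := range arr {
		sum += arr[i]
	}
	return sum
}

// sumByValue lets range hand over a copy of each element.
func sumByValue(arr []int) int {
	sum := 0
	for _, c := range arr {
		sum += c
	}
	return sum
}

func main() {
	arr := []int{1, 2, 3, 4}
	// Both forms compute the same result; only the generated code may differ.
	fmt.Println(sumByIndex(arr), sumByValue(arr))
}
```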

~~~
peterohler
Nice explanation, thanks.

------
svnpenn
> Both the Ruby Oj and the C parser OjC are the best performers in their
> respective languages.

Um, no, they aren't:

[https://github.com/simdjson/simdjson](https://github.com/simdjson/simdjson)

~~~
coder543
According to the OjC README:

> No official benchmarks are available but in informal tests Oj is 30% faster
> that [sic] simdjson.

Source:
[https://github.com/ohler55/ojc/blob/master/README.md#benchma...](https://github.com/ohler55/ojc/blob/master/README.md#benchmarks)

If you have benchmarks that show otherwise, that would be great for the
discussion here, but your point appears to have already been addressed?

~~~
peterohler
I'll get the code up on the OjC repo in the next day or two. You are right,
there should be reproducible benchmarks to back up the statement.

~~~
peterohler
Benchmarks added at
[https://github.com/ohler55/ojc/blob/master/test/simdbench/RE...](https://github.com/ohler55/ojc/blob/master/test/simdbench/README.md)

~~~
abaines
These are my results on an (admittedly old) Intel i3-2120 CPU @ 3.30GHz.
Compiling both programs with -O3:

    
    
        ojc_parse_str    1000000 entries in 3607.615 msecs. (  277 iterations/msec)
        simdjson_parse   1000000 entries in  418.997 msecs. ( 2386 iterations/msec)
    

and might as well throw in my own parser...

    
    
        uj_parse   1000000 entries in 1959.731 msecs. (  510 iterations/msec)
    

The -O3 seems to make a large difference for simdjson.

~~~
peterohler
Shockingly different. The -O3 option made hardly any difference with OjC but
more than a 10x difference with simdjson. I'll be removing the claim from the
OjC readme.

Thank you for being civil with your reply. Much appreciated.

~~~
jkeiser
What I've learned from this (as a simdjson author) is that we need to update
the quick start in the README to have -O3. I was so psyched about the fact
that we now compiled warning-free without any parameters ... that I didn't
stop to think that some people would go "huh I did what they told me and
simdjson is slow, wtf." Because we evidently told you to compile it in debug
mode in the quick start :)

simdjson relies deeply on inlining to let us write performant code that is
also readable.

Sorry to have sent you down a blind alley!

One thing to note: if you want to get good numbers to chew on, we have a bunch
of really good real world examples, of ALL sizes (simdjson is about big and
small), in the jsonexamples/ directory of simdjson. And if you want to check
ojc's validation, there are a number of pass*.json and fail*.json files in
the jsonchecker/ directory.

------
throwaway77384
How does this compare to something like
[https://github.com/valyala/fastjson](https://github.com/valyala/fastjson) ?

~~~
peterohler
The fastjson package is fast, I can't disagree. Unfortunately it accepts
invalid JSON, and what you get from the parser is a Value that you read
values from using a Get() function, not a simple Go type. To get a simple
Go type like []interface{} or map[string]interface{} you have to know what the
paths are. You can't actually iterate over a map as far as I can tell.

Basically the two packages are different in what they provide. OjG returns
simple Go types, and instead of a simple Get() it provides a full JSONPath
implementation. Different, but both have their uses, so depending on a user's
needs one might be better than the other.
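
As a rough illustration of why simple Go types matter here, the sketch below decodes into `interface{}` and iterates over the resulting `map[string]interface{}` without knowing any paths in advance. It uses the standard library's `encoding/json` as a stand-in, not OjG or fastjson, and `keysOf` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// keysOf decodes a JSON object into plain Go types and returns its keys,
// sorted, discovered by iterating over the map rather than by known paths.
func keysOf(doc string) ([]string, error) {
	var v interface{}
	if err := json.Unmarshal([]byte(doc), &v); err != nil {
		return nil, err
	}
	obj, ok := v.(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("top-level value is %T, not an object", v)
	}
	keys := make([]string, 0, len(obj))
	for k := range obj {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for stable output
	return keys, nil
}

func main() {
	keys, err := keysOf(`{"b": 1, "a": [true, null]}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(keys)
}
```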

------
nautilus12
I wonder how people who make super specific things like this their whole
career make money. I wonder if they are independently wealthy and just do
things like this for fun.

~~~
jkeiser
It's kind of a "continued learning" requirement, a little. Work doesn't always
have the problems (or funding for them) that grow you in the right directions.

Open source is also a sort of portfolio for many people, I think. Resumes can
only tell you so much; code speaks volumes. It pays, just not directly or
immediately.

Though my first job in Silicon Valley-sized tech was at Netscape, and I got
that specifically after writing some big open source patches for Mozilla. So
it can sometimes pay more directly.

------
remorses
Go is becoming the language to implement JSON parsers.

