This is a great project -- we use a Rust port of this [1] following the general principles to process the bunyan logs of hundreds of node containers with a single machine. I'd love to see AVX512 support (mainly to get more out of the ridiculously overpriced AWS instances..) [2].
And seeing how AMD is already a faster and better CPU in general (I mean the latest AMD Ryzen 3000 vs the Intel 9th-gen series, at the time of writing), I'm not sure what about that needs a citation?
Been seeing Ryzen tear into Intel CPUs in terms of output per watt... it's ridiculous. AMD drew a straight on the river; truly awe-inspiring and refreshing.
If you're going to comment on an article about a performance focused SIMD JSON parser and say a chip without AVX512 is faster than one with it, you're going to need to provide some proof.
Pretty sure the OP was asking for a citation specifically about AWS nodes. Should be simple enough to take a test workload on Intel vs AMD nodes and measure things (besides "safety", which is too nebulous to measure). Consumer chips aren't entirely relevant.
I'm pretty sure I linked a server chip comparison. And the same people who design consumer chips design server chips, so there will be a lot of "cross-pollination" between architectures. We definitely see a lot of that: Intel consumer and server-grade parts are hit with similar bugs.
Safety can be roughly estimated by the number of critical vulnerabilities. Most have hit Intel, some being Intel-specific bugs.
But the mitigations in AWS' VM monitor might be a fixed cost--i.e. not disabled on AMD architectures or otherwise not disadvantaging Intel. For example, they might use techniques like page coloring as an alternative to cache flushing and prediction suppression, which wouldn't disadvantage Intel, might not impose any significant overhead at all, and might even improve performance in shared hosting environments.
Alternatively, maybe their support for AMD is nascent and optimizations for Intel absent on AMD.
There's really no substitute for benchmarking AWS' actual offerings.
The inability to handle embedded NULs (issue #40) felt like cheating, but that has been fixed.
It looks like I can fetch a number as a string, so I can read it into a decimal type. I'll have to try that. Edit: nope, doesn't work. "The JSON element does not have the requested type."
So it achieves the speed gains by using SIMD CPU instructions. Do most architectures support SIMD (including ARM)?
Also, iirc the biggest negative of this library last time was that the API was super unintuitive (compared to, say, nlohmann). Glad to see progress has been made here.
Would love to see a quality python wrapper for this library.
We have ARM support; I wrote the original ARM code. I'm also, sadly, responsible for some of the not-too-ergonomic API code from the original release - it was always intended as quick research code at first, so the API was a bit of an afterthought. Glad to see that it's been improved by other folks.
Worth pointing out: I thought it was just the SIMD that made it fast when I first got involved. It turns out that while it helps, it's just a tool that helps to achieve the real gain: eliminating branches (if statements) that make the processor stumble and fall in its attempts to sprint ahead (speculative execution). simdjson's first stage pushes this capability really far towards its limit, achieving 4 instructions per cycle by not branching. And yes, 1 cycle is the smallest amount of time a single instruction can take. Turns out a single thread is running multiple instructions in parallel at any given time, as long as you don't trip it up!
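To make that concrete, here's a toy C++ sketch of the principle (not simdjson's code): both functions count quote characters, but the second turns the data-dependent branch into 0/1 arithmetic the CPU can stream through.

    #include <cstddef>
    #include <cstdint>

    // Data-dependent branch: mispredicts often on unpredictable input.
    size_t count_quotes_branchy(const uint8_t* buf, size_t len) {
      size_t n = 0;
      for (size_t i = 0; i < len; i++) {
        if (buf[i] == '"') { n++; }
      }
      return n;
    }

    // Branchless: the comparison becomes a 0/1 value added unconditionally,
    // so the only remaining branch is the highly predictable loop test.
    size_t count_quotes_branchless(const uint8_t* buf, size_t len) {
      size_t n = 0;
      for (size_t i = 0; i < len; i++) {
        n += (buf[i] == '"');
      }
      return n;
    }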
Parsing is notoriously serial and branchy, which is what makes simdjson so out of the ordinary. It's using a sort of "microparallel algorithm," running a small parallel parse on a SIMD-sized chunk of JSON (16-32 bytes depending on architecture), and then moving to the next.
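As a rough illustration of one such micro-step on x86 (illustrative only, not simdjson's actual code; assumes SSE2): load 16 bytes, compare them all against '"' in a single instruction, and get back a bitmask of quote positions.

    #include <immintrin.h>  // SSE2 intrinsics
    #include <cstdint>

    // Bit i of the result is set iff chunk[i] == '"'.
    uint32_t quote_mask_16(const uint8_t* chunk) {
      __m128i bytes  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(chunk));
      __m128i quotes = _mm_cmpeq_epi8(bytes, _mm_set1_epi8('"'));
      return static_cast<uint32_t>(_mm_movemask_epi8(quotes));
    }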
And yeah, you have to go back over a decade to find CPUs that don't have SIMD. simdjson runs on those too, just obviously doesn't use SIMD instructions :)
An interesting point: the design of simdjson loses its branchlessness in "stage 2". I originally had a bunch of very elaborate plans to try to stay branchless far further into the parsing process. It proved just too hard to make work. There are some promising things that ultra-modern Intel chips - meaning Icelake - and future iterations of ARM (SVE/SVE2) are adding to their SIMD abilities, so it might be worth revisiting this in a few years (there aren't too many Icelake boxes out there and SVE barely exists).
Yep. Most of stage 2's branchiness essentially comes from "is the next thing an array? object? string? number? Handle it differently if so."
Making it so you can handle all the brackets at once, all the strings at once, all the numbers at once, would make a big difference, and we're thinking about that. Another thing that could help is making the if statement more predictable using type information from the user. get<int>() could mean "I expect this next thing to be an integer, so parse it that way and just yell if it's not, please."
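A purely hypothetical sketch of that type-hinted idea (expect_int and the position-cursor convention are made up for illustration; this is not simdjson's API): the caller's expectation picks one predictable parse path up front instead of branching on "array? object? string? number?".

    #include <charconv>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <string_view>

    // "I expect an integer next": parse it that way, yell if it's not one.
    int64_t expect_int(std::string_view json, std::size_t& pos) {
      int64_t v = 0;
      auto [ptr, ec] =
          std::from_chars(json.data() + pos, json.data() + json.size(), v);
      if (ec != std::errc()) throw std::runtime_error("expected integer");
      pos = static_cast<std::size_t>(ptr - json.data());
      return v;
    }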
It's difficult. But it's why I'm still so fascinated! Solving JSON thoroughly and completely will give us a lot of information on how to quickly parse XML, YAML, and other file formats.
We've clearly been collectively doing parsing wrong (including me) if there's this much of a gap. It's exciting to see something innovative and new in this domain and even being able to contribute to it :) @lemire deserves a ton of credit for making an actual project out of his work and promoting it; I likely wouldn't have heard of it otherwise.
That's right. There's a problem which I referred to as the toothpaste problem - you can squeeze the needed branchiness around but you can't make it go away entirely (at least, I couldn't).
There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes, stage 3 was the tape construction and stage 4 was the 'clean everything up and do all the hard stuff'). It's possible - though awkward - to do tape construction branchlessly, but the gyrations required were expensive and weird and it just delayed the reckoning.
I built a prototype of the 'gather all the X at once and handle it in one go' and the logic to gather that stuff was more expensive than just handling everything.
In my wacky world of branch-free coding (which I've been doing a while), there are worse things than missing a branch. The idea that you can accumulate an array branchlessly (i.e. always put the thing you have in a location somewhere and bump a pointer when you need to) seems pretty cool, but a branch miss is not the only hazard. This technique of branchlessly writing a log is an anti-pattern I've tried over and over again, and a stream of unpredictable writes is just as big a pain as a bunch of unpredictable branches - it causes the pipeline to come to a screaming halt. If you can get it going somehow (new SIMD tricks? better algorithm?) I'd be impressed.
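For anyone who hasn't seen it, this is the branchless-append pattern being described (a generic sketch, not code from simdjson):

    #include <cstddef>
    #include <cstdint>

    // Always write, then conditionally advance the cursor. No branch miss,
    // but the resulting stream of data-dependent writes can stall the
    // pipeline just as badly - the hazard described above.
    // (out must have room for len entries.)
    size_t gather_quote_positions(const uint8_t* buf, size_t len,
                                  uint32_t* out) {
      size_t count = 0;
      for (size_t i = 0; i < len; i++) {
        out[count] = static_cast<uint32_t>(i);  // unconditional write
        count += (buf[i] == '"');               // keep it only if it matched
      }
      return count;
    }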
Get my email from Daniel or hit me up in Twitter DMs if you want to discuss further.
... and yes, agreed that Daniel has done a great job making this into a real thing. This would have stayed a bunch of code sketches if I had been the only one working on it. The project is now well on the way to being commercial-quality code rather than the research prototype that I did. I understand I have you to thank for that as well!
The tables of time taken, at the bottom of the README, do not include units as far as I could tell. Would personally prefer if they did. For example, if it's in seconds, you could change the column titles in each of the two tables to add the word "seconds" in parentheses for the columns that list the times taken.
I like string processing libraries that implement a multi-document feature like the one mentioned here. There's always some efficiency to be gained- maybe the public API has a lot of branching or initialization, maybe it acquires a lock, etc. Batching will amortize that cost, or open up new opportunities for SIMD processing. Letting the user reduce overhead through batching isn't something I see supported in other libraries, even ones that advertise high performance.
ndjson isn't really batching though, it's a derivative of JSON (or a superset?) and there are a few variations of it[1][2][3].
It's also supported by quite a few libraries[4] and tools[5] too -- plus many others that don't document it as being ndjson/jsonl (such as AWS CloudTrail logs). I've got support for it in the non-POSIX UNIX/Linux shell I'd written too[6]
The issue is really more that, like with comments, JSON doesn't support streaming or even multiple documents (something formats like YAML already have built into the spec), so it becomes something you need to advertise if you're writing a parser, simply because it's not part of the standard JSON specification.
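For what it's worth, simdjson's multi-document API handles ndjson-style input directly. A minimal sketch, assuming the DOM parse_many API as documented:

    #include <cstdint>
    #include <iostream>
    #include "simdjson.h"

    int main() {
      using namespace simdjson;  // for the _padded literal
      auto ndjson = "{\"a\":1}\n{\"a\":2}\n{\"a\":3}"_padded;
      dom::parser parser;
      // parse_many streams over newline-delimited/concatenated documents,
      // amortizing parser setup across the whole batch.
      dom::document_stream docs = parser.parse_many(ndjson);
      for (dom::element doc : docs) {
        int64_t a = doc["a"];
        std::cout << a << "\n";
      }
    }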
Good point. I was picturing some gather/scatter over strings which are not in adjacent memory (maybe a generous interpretation for my use-case). Concatenating small strings into ndjson may still come out ahead performance-wise.
What I don't like about that approach is that it still copies into temporary buffers. The normal tape approach is to reference everything via int indices and keep the input buffer as-is. This also works concurrently and with streaming. I don't see the need to copy at all. You don't need \0 delimiters for strings; just use the mem* APIs everywhere, since all sizes are known.
And API-wise, filter hooks are needed to skip unneeded arrays or objects. That would also save a lot of unneeded buffering.
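A generic sketch of that index-based tape idea (illustrative names and layout, not simdjson's actual tape format):

    #include <cstdint>
    #include <vector>

    enum class NodeType : uint8_t { Object, Array, String, Number,
                                    True, False, Null };

    // One entry per token: the input buffer is never copied or modified;
    // consumers slice it later with (offset, length), so no \0 delimiters
    // are needed.
    struct TapeEntry {
      NodeType type;
      uint32_t offset;  // byte offset of the token in the input
      uint32_t length;  // token length in bytes
    };

    using Tape = std::vector<TapeEntry>;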
Streaming and selective parsing are good things, and something we're looking into for the next set of features.
Note that there are real speed gains to be had by not being selective. The CPU is normally sprinting ahead, executing future instructions before you even get to them, and every branch can make it trip and land flat on its face.
We've found we have to be very careful with every branch we introduce. I've tried to introduce branches that skip tons of code and ought to make things a ton faster, but which instead slow it down.
That's of course true. Branching is costly and malloc is costly.
But there is a need to filter objects and arrays in the second step, behind the parsing, in the conversion to your language's data. With a slower, separate function, of course.
Parsing is the negligible, fast part; converting it into your business objects is the slow part. This is usually done in the second, tape step. It involves a lot of mallocs, and you really want to skip the unneeded parts and filter them upfront.
Still thinking how to design my new JSON parser around simdjson.
Agreed, we're thinking it through too. Most parsing of JSON could be done almost without any second stage at all, or at least with a direct-to-user-struct/array, if you had the right API.
They're using those \0 to find bit patterns with SIMD instructions. Their approach would fundamentally not work if they were unable to modify the input.
I absolutely love that the first comment I see coming into the discussion is someone arrogantly telling (us, not the author) how to do their job, when evidently we're looking at one of the most thought-out, best-engineered pieces of software there is in its area (JSON parsing).
What's more, the authors of simdjson show a lot of benchmarks, explain how they got there, and provide their work under a liberal license, and yet... this peep over here comes to tell us how he thinks that's wrong...
Sorry, just got a bit ranty. And sorry, eska, for taking over a bit of your comment for it. You tried to talk some sense into this guy.
Edit: at first I thought this was the top comment because it was newer... Nope, it's the top comment altogether. Followed by another comment, of about the same age, from someone saying how great this library is and how they're using it in production with a Rust wrapper.
simdjson does not modify the input, fyi. Rather than \0, which isn't really a part of JSON, the SIMD instructions search for whitespace and JSON structure like commas, colons, etc.
Yeah, sorry, I didn't feel like giving some guy who hasn't even read the article a full explanation. However, I saw in a presentation on YouTube that these characters (e.g. ") are replaced in order to then apply SIMD instructions. This implies copies and allocations. But these are necessary, and it would not be possible to achieve this with indices.
Although I would really love to see a more zero-parse binary format. At the end of the day, simdjson is just making markers in the original file for where an array begins, where an object begins, each of the object's key-value ranges, etc., i.e. the find-marks -> build-tape algorithm.
This could be achieved by having length-prefixed values, kinda like FlatBuffers, which has zero parse time and is much more compact than JSON. https://google.github.io/flatbuffers/
I just wish JSON had a more universal alternative binary format which was insanely fast and compact to work with.
Protobufs are nice, but parsing large files is still a pain and consumes a ton of memory. Ideally, a format that could simply be lazily mmap-ed would be desirable.
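To illustrate the length-prefix idea (a generic sketch, not FlatBuffers' actual encoding): a value stored as [u32 length][bytes] can be read straight out of an mmap-ed buffer in O(1), with no delimiter scanning and no copying.

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    // Returns a view into the buffer itself; nothing is parsed or copied.
    std::string_view read_length_prefixed(const uint8_t* buf) {
      uint32_t len;
      std::memcpy(&len, buf, sizeof(len));  // memcpy avoids unaligned-access UB
      return { reinterpret_cast<const char*>(buf + sizeof(len)), len };
    }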
---
> The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.
orjson generally doesn't even compete in the same order of magnitude. The round-trip cost of object creation in CPython is pretty much constant given the number of objects and strings in a JSON document. The overhead is overwhelming: 95% of the total execution time, versus the actual parsing of the document.
V8 has a hand rolled JSON parser which feeds directly into our object model without any intermediate data structures. This allows us to take advantage of structural knowledge, e.g. we assume that objects in an array have the same shape (and just verify this before falling back to pointer chasing through transition trees).
I keep meaning to look into SIMD for both this and ordinary tokenisation, last time I tried it the bookkeeping overheads and alignment requirements weren't worth the speedup but it's always worth re-evaluating.
Most JSON parses are small. Does this have bigger spin-up overhead? Also, the memory-handling requirements of working with GC'd JS data are very particular; this sounds like it has its own way of managing memory.
It's also still faster than other parsers in our tests, even on the smallest documents (https://github.com/simdjson/simdjson/issues/312#issuecomment...); the advantage is just smaller, like 1.1x and 1.2x instead of 2.5x :) It really starts to pull ahead somewhere between 512 bytes and 1K.
Meanwhile I can barely get Chrome/NodeJS to parse 20MB in less than 100ms :(.
How useful (or useless) would Simdjson as a Native Addon to V8 be? I assume transferring the object into JS land would kill all the speed gains?
I wrote my own JSON parser just last week, to see if I could improve the NodeJS situation. Discovered some really interesting factoids:
(A) JSON parse is CPU-blocking, so if you get a large object, your server cannot handle any other web request until it finishes parsing, this sucks.
(B) At first I fixed this by using setImmediate/shim, but discovered two annoying issues:
(1) Scheduling too many setImmediates will cause the event loop to block at the "check" phase; you actually have to load-balance across turns of the event loop like so (https://twitter.com/marknadal/status/1242476619752591360)
(2) Doing the above will make your code way slower, so a trick instead is to skip setImmediate and invoke your code 3333 times (some divisor of NodeJS's ~11K stack-depth limit) or for 1ms before doing a real setImmediate.
(D) I'm seeing this pure JS parser be ~2.5X slower than native for big complex JSON objects (20MB).
(E) Interestingly enough, I'm seeing 10X~20X faster than native, for parsing JSON records that have large values (ex, embedded image, etc.).
(F) Why? This happened when I switched my parser, upon encountering `"`, to skip per-byte checks and jump straight to the next indexOf (see the sketch after this list). So it would seem V8's built-in JSON parser is still checking every character for a token, which slows it down?
(G) I hate switch statements, but woah, I got a minor but noticeable speed boost going from if/else token checks to a switch statement.
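Here's a minimal C++ analogue of the indexOf trick from (F); escape handling is omitted for brevity, and memchr typically scans with wide loads under the hood, which is exactly the per-byte check being skipped:

    #include <cstddef>
    #include <cstring>
    #include <string_view>

    // Jump straight to the closing quote instead of inspecting every byte.
    // Real parsers must also handle backslash escapes; omitted here.
    std::string_view scan_string(const char* p, const char* end) {
      const void* close =
          std::memchr(p, '"', static_cast<std::size_t>(end - p));
      if (close == nullptr) return {};  // unterminated string
      return { p,
               static_cast<std::size_t>(static_cast<const char*>(close) - p) };
    }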
Happy to answer any other Qs!
But compared to OP's 2.5GB/s parsing?! Ha, mine is a joke.
The thing is, it really was faster than GNU cat. I suspect it is because GNU cat does more than just use Linux splice to a file descriptor: it has options to count lines and such, and doesn't (didn't?) bother to use SSE. I just thought cat would give me a practical maximum to compare to when reading from disk.
I've also written and tried to optimize a hand-rolled JSON parser for exchange messages, just to see how fast pure JS could go. I tried many different things, but I only ever got near the native implementation once I started assuming certain offsets in the buffer or optimistically parsing whole keys, which was highly unsafe. My verdict was that you will never really get close to native, let alone close to hand-optimized C/C++.
> JSON parse is CPU-blocking, so if you get a large object, your server cannot handle any other web request until it finishes parsing
Well, your CPU core is busy on one request or another, so I don't understand why this is an issue as long as you're guarding against maliciously large bodies. Blocking I/O is different because your core is partially idle while other hardware is doing async work. Using Node.js' cluster module lets you keep more cores busy. Chunking CPU-limited work increases total CPU time and memory required. (This is a pet peeve of mine and a hill I'm willing to die on :-) .)
I think that is a good hill to die on, tho I would rather prioritize UX (browser not freezing) and server responsiveness. Ideally we'd have no CPU chunking & good UX, but if we have to choose one, which would you sacrifice?
There are third party bindings for nodejs https://github.com/luizperes/simdjson_nodejs. As you suspected, converting the entire document to a JS object is not recommended. [0] There is an additional API that allows you to query keys without conversion.
Yes, that is correct. I spent a lot of time on issue #5 to make it as user-friendly as I could, but the only way I found to avoid all the C++/JS conversion overhead was to keep the pointer to the external C++-parsed object. There might be other options that I haven't thought of, so if anyone knows of a better approach, let me know.
[1] https://github.com/simd-lite/simd-json [2] https://github.com/simdjson/simdjson/issues/10