Simdjson: Parsing gigabytes of JSON per second (github.com/simdjson)
503 points by lorenzfx 84 days ago | 151 comments




I would consider Daniel Lemire (the main author) quite an authority on the practical use of vectorization (SIMD). He is a computer science professor at Université du Québec, and he is also behind the popular Roaring Bitmaps project [1]. You can check out his publication list here [2].

[1] https://roaringbitmap.org/

[2] https://lemire.me/en/#publications


Heh, I like the tagline: "Grab one of our research papers". It's very readable, too! https://arxiv.org/pdf/1603.06549.pdf

Looks like the last active discussion of Roaring Bitmaps was 6 years ago: https://news.ycombinator.com/item?id=8796997 possibly when it was first introduced. Interesting comments!

> How do these compare space and performance wise with Judy arrays, which are 256-ary trees whose nodes also distinguish between sparse and dense subsets? https://en.wikipedia.org/wiki/Judy_array

Good question, since the patent on them won't expire until Nov 29, 2020.

Judy arrays seem to be much less known. Looks like today will be Datastructure Thursday; lots of neat stuff to dig into.


I used to work at a small-ish (but very profitable) European company whose "crown jewel", in terms of in-house developed technology, was a distributed in-memory search/database engine, written in hand-crafted, multi-threaded C by a true coding wizard and industry veteran. It used Judy arrays at its heart for the very first filtering/processing stage of queries, and did so to amazing effect. It's quite spectacular what you could and can do with libjudy if you use it wisely :)

I always wondered if relatively recent changes in x86 CPU µarchs broke some of the assumptions that made libjudy perform so well even on Pentium-class hardware, and whether the de facto only complete implementation (that I happen to be aware of) could be improved as a consequence, if those changes (e.g., hugely increased cache sizes) were taken into account.


What were you working on?


The showstopper with Judy arrays, for me, was that they require a lot of hairy fine control over memory layout to implement (it's been years since I looked at this, so please excuse the lack of detail). You can do that in C, but not in no-sharp-edges languages like Java, which I was using at the time. It would even be quite a headache in Rust.

It doesn't help that the documentation is very weird.


There is an implementation of Judy arrays in Rust: https://github.com/adevore/rudy Not sure how complete / usable it is.


The docs say it's largely complete.

As an example of a headache, one node type in Judy has a count, followed by that many index bytes, followed by the same number of pointers. In Rust, that would naturally be a struct with two variable-sized fields, which isn't possible, so it's done by defining a struct generic over a pair of array types:

https://github.com/adevore/rudy/blob/300f2cc7842f6329a76fff0...

It looks to me like this is only ever instantiated for 2, 7, or 31 elements, whereas I think for original Judy, it can be any size up to 31 elements (maybe?). Handling a genuinely variable length in Rust would be rather gnarly, because there would have to be different types for each length.


I would assume that within unsafe code you could simply do the same kinds of shenanigans you would do in C++, e.g. you have a fixed-size struct that is just the "header", it's only created on the heap, and it's immediately followed by the variable amount of data, which you get access to by casting. If you couldn't do this in unsafe Rust, that would be a pretty huge limitation.
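
Something like this, in C++ terms (a simplified sketch, not the actual Judy node layout), which unsafe Rust can mirror with raw pointers:

    #include <cstdint>
    #include <cstdlib>
    #include <new>

    // Hypothetical node: a fixed-size header allocated on the heap, immediately
    // followed by `count` index bytes. The trailing data is reached by pointer
    // arithmetic past the header -- the classic C/C++ trick described above.
    struct NodeHeader {
        uint8_t count;

        uint8_t* indices() {
            // First byte after the header.
            return reinterpret_cast<uint8_t*>(this + 1);
        }
    };

    NodeHeader* make_node(uint8_t count) {
        void* mem = std::malloc(sizeof(NodeHeader) + count);
        if (!mem) throw std::bad_alloc{};
        return new (mem) NodeHeader{count};
    }

    int main() {
        NodeHeader* n = make_node(7);
        n->indices()[0] = 42;  // first of the 7 trailing index bytes
        std::free(n);
    }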


Yes, you could.


He's a great guy to work with. Also, in addition to being very efficient, his code (and papers) are also quite readable.


Roaring Bitmaps look very useful; I need to get around to porting it to C++ some day.



That's just a wrapper over the C interface. When I last looked at using Roaring Bitmaps I was going to use it in shared memory, which would mean removing explicit calls to malloc etc, and supplying allocators instead.


I literally love everything Lemire does.


He certainly knows how to promote his software, always getting way more attention than "usual" open source authors.

What about his floating point conversion software that is basically David Gay's?


When I read his 'discovery' of fast modulo division [1], it immediately reminded me of the same technique discussed as an aside in this very interesting [2006] IBM SIMD paper [2].

I let him know about this and he added the last footnote in [1], but there was no update of his blog or related GitHub page saying "hah, turns out this was a known method at least at IBM". It left a slightly bad impression in my mind.

So, yeah ..

[1]: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-...

See "2 Outline of Approach" section, last 2 paragraphs:

[2]: https://dominoweb.draco.res.ibm.com/reports/rc24100.pdf


Why did you put "discovery" in quotes, even though Lemire doesn't claim anywhere in that blog post to have discovered or invented the technique?


Well, I read the blog post and I am disappointed in Lemire, actually.

Yes, it does not claim anywhere that it is his invention (it starts the description of the algorithm with "The common solution"), but if you read it, it really looks like the development of the technique as a whole is his; at the end one note says: "The technique described in this blog post is in use within Microsoft Arriba."

If no one had pointed out that it is not his, I would have thought that it was: the blog post title is "A fast alternative to the fast modulo reduction", and not "How X's algorithm for fast modulo reduction works. My implementation".


That's a fair point. Generally I am used to seeing a citation of the original source in an elaboration, proof, or "let's dig into how it works" sort of write-up on an algorithmic approach. (Sometimes this is done even when the algorithm in question and its source are well known.) Especially if the author is an academic.


Do you mean the code in http://www.netlib.org/fp/ under dtoa.c? (for strtod)... It appears from https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac... that he doesn't even claim to have had the idea for the floating point conversion code... What is the connection?


The casual reader of his blogs (like many on HN apparently) is likely to get the impression that he had the original ideas for much of his software.

His software has very few tests (especially for floating point), there's no paper for the method but nevertheless it apparently gets used at Google almost instantly.

So it looks as if there are connections that regular OSS authors don't have.


Actually, I did upvote your comment. After doing some research and looking into his blog, I realized that the way he describes things that appear in previous works gives the impression that they are his ideas. The posts are great and a fantastic source of information, very valuable by themselves. I don't understand why he wouldn't just be clearer about giving credit to others.


Well, it's because it's written and wrapped in a readily usable form, with a very permissive license. If there are concerns about due credit, I think it's good that they are voiced, and he should make sure credit goes where it should. I want to know where the ideas originated from, and I also appreciate the value that Lemire's group adds on top of them.


Gigabytes per second can be a worrying statistic. It suggests that the benchmarks would be parsing massive JSON files rather than the small ones that real-world applications deal with.

However, this library maintains roughly constant throughput for both small (e.g. 300-byte) and large documents, if its benchmarks are accurate.


Gigabytes per second has been a common metric used to compare column store performance (compressed bitmap indexes), a sweet spot for SIMD. Daniel Lemire is involved with both Roaring Bitmaps [1] and EWAH [2].

It is real-world in terms of exposing I/O bottlenecks, if they exist in the architecture (compared to GPU, for example). I suspect it is the default/natural metric these researchers use rather than an exercise in cherry picking results.

The underlying question is a good one: when does it make sense to adopt a SIMD library for JSON parsing?

[1] https://roaringbitmap.org/

[2] https://github.com/lemire/javaewah


A few days ago, at work, I transcoded a 13GB file from a proprietary binary format into a 190GB JSON file... just so I could use jq [0]. This was a small file by our standards.

Performance matters

[0] https://stedolan.github.io/jq/


I've done this same thing (advanced text searches from a proprietary format), but I just inserted it into a local SQLite database instead. Is there a reason JSON/jq made more sense for your purposes than SQLite (or even a local Mongo or Postgres DB, if you wanted to stick with JSON)?


The data structures are often nested, and can change quite often, so spending time normalizing all of those into columns and tables wouldn't make a lot of sense.

I suppose on the human level the more important reason is that I often get requests from non-techies in the company to see snippets from this data. They can read JSON just fine (well, most of them... a few still insist on using Excel, so I have to flatten the JSON to CSV with jq).


> spending time normalizing all of those into columns and tables wouldn't make a lot of sense

I agree with you (depending on the use-case, of course).

I was talking about dumping the data into one of those databases (SQLite, Postgres, or MongoDB) and using their JSON querying functionality.


If you're really talking about streaming a JSON Lines file into a SQL DB as a series of binary JSON documents, that might make a lot of sense.

If, however, you're really talking about a single JSON document, then you're going to be generating that 190GB intermediate JSON file either way. Plugging it into a database to query it just seems like an extra step for little benefit. (It's basically a variant on the question of whether you'll get better whole-pipeline latency from a data warehouse, or a data lake — which has no general answer.)


"JSON Lines" isn't "streamable JSON", it's just JSON that's easier to use with existing tools in a streaming fashion. Just like you can stream an HTML/XML file with a SAX parser, you can do the same with JSON.

I don't think there's any reason it's necessary to have a 190GB JSON file, nor anything stopping one from incrementally dumping it into SQLite. Though it would depend on the format of the proprietary file.

But I will add that there's an obvious benefit to dumping the data into the database: it has indexes and querying capabilities that don't involve full 190GB file scans. The I/O of a 190GB scan alone takes time.


The DB can also provide transparent compression, turning that 190GB file into something hopefully a small fraction of the size.


I got deep into SQLite JSON handling recently and it actually has really powerful JSON support out of the box. I thought I was going to have to move to Postgres for some of my parsing/querying but haven't had the need so far.


Pipes are really nice, and jq lets you use them on JSON; compactness of syntax is another plus.


Sorry if this appears to be a silly question, but just wondering: the "proprietary format" that you're transcoding is a well-known format to you, right?


The explosion in size is terrifying. I'm guessing the JSON mostly contained numerics?


Get your Google location history. On an Android phone it can be gigabytes. Just one user.


JSON is used to replace CSV in many applications such as logging / event streams - http://jsonlines.org/


There's also ndjson [0]. Personally, I feel like I don't see jsonlines/ndjson in the wild enough -- people still aren't convinced of JSON logging (or doing stuff like using Content-Type to deliver ndjson/jsonlines over certain endpoints) in the codebases I run into.

[0]: http://ndjson.org/


JSON encoded logs are really only useful if you’re putting them into something like splunk.

The problem with splunk is that it’s way less expressive than perl, grep, etc, so there are certain useful analyses that simply can’t be run.

To work around that, I see applications duplicate information over and over in their log output.

The end result vs text files is a 2x blow up because of JSON, then a 5-10x blow up because of splunk.

It’s not hard to implement a distributed log processor on top of ssh, and I think open source solutions exist. I don’t know why more companies don’t do that.

As a side benefit, your log files stay human readable. In addition to being more flexible, in practice, the result is something like an order of magnitude cheaper and faster than splunk.


Using json for logging or event streams is fine until your cloud provider hands you your bill for cross zone or region traffic.


Oh, so there are two different formats; what's the difference?


Funnily enough, issue #1 on ndjson[0] is this same question. There's a fantastic response with a table too[1].

[0]: https://github.com/ndjson/ndjson.github.io/issues/1

[1]: https://github.com/ndjson/ndjson.github.io/issues/1#issuecom...


Thanks. One of the comments also points out that there's a third spec, IETF RFC7464.


FHIR Bulk Data Extract is using ndjson as a requirement.

https://build.fhir.org/ig/HL7/bulk-data/


Indeed, I feel like much of this is self-inflicted and could be avoided completely if only more developers would learn to use far more compact and efficient binary formats instead, where parsing (if it could even be called that) is just a natural consequence of reading the data itself.

One of my favourite examples of this is numerical values, which is notably called out in the performance results for this library as being one of the slowest paths.

In a textual format you need to read digits and accumulate them, and JSON allows floating-point which is even more complex to correctly parse.

In a binary format, you read n bytes and interpret it as an integer or a floating-point value directly. One machine instruction instead of dozens if not hundreds or more.
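
To make the contrast concrete, a minimal sketch (assuming a little-endian wire format that matches the host):

    #include <cstdint>
    #include <cstring>

    // Textual: accumulate decimal digits one by one (and real JSON number
    // parsing also has to handle signs, fractions, and exponents).
    uint64_t parse_uint(const char* p, const char* end) {
        uint64_t v = 0;
        while (p != end && *p >= '0' && *p <= '9') {
            v = v * 10 + static_cast<uint64_t>(*p - '0');
            ++p;
        }
        return v;
    }

    // Binary: the value is just the next n bytes; memcpy compiles down to a
    // plain load when host and wire endianness agree.
    double read_f64(const uint8_t* buf) {
        double v;
        std::memcpy(&v, buf, sizeof v);
        return v;
    }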

I see such inefficiencies so often that I wonder if most developers even know how to read/write binary formats anymore...

"The fastest way to do something is to not do it at all."


Fun fact: JSON's number type isn't floating point. It may map to it, but the limits on the numbers that are representable are defined by the encoder and decoder.


Which creates this lovely reality where you effectively have numbers that can be accurately represented in IEEE float but can't be represented accurately in decimal notation, and numbers that can be represented accurately in decimal notation but not in IEEE float. All of those cases are dicey with JSON.


Curious, what number is representable in IEEE float but not as a decimal?


I should have qualified that statement. You can represent them exactly, but only with a very long decimal, which kind of defeats the purpose, right?


+Infinity, -Infinity, NaN (quiet, signaling, miscellaneous).


Well, those are all error conditions and not a number. NaNs explicitly say they aren't, and +-Inf may or may not be infinity; all we know is that it is outside the representable range of IEEE floating point.


IMHO, code should check for floating-point overflow just like it checks for integer overflow. If your inputs are finite and your output is infinite, that's similar to your inputs being positive and your output being negative: you should check for it and emit an error instead. Then infinity (like negative numbers) can have an actual semantic meaning, rather than just being the detritus of an error condition.
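
A minimal sketch of what I mean (treating overflow like an error, analogous to checked integer arithmetic):

    #include <cmath>
    #include <stdexcept>

    // Multiply two finite doubles, treating overflow to infinity as an error,
    // the same way one would check for signed integer overflow.
    double checked_mul(double a, double b) {
        if (!std::isfinite(a) || !std::isfinite(b))
            throw std::invalid_argument("inputs must be finite");
        double r = a * b;
        if (!std::isfinite(r))
            throw std::overflow_error("floating-point overflow");
        return r;
    }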


NaNs have an odd ontology, regardless of what they claim to be. For example, hypot(Inf, NaN) == Inf (C99), which suggests that NaN is "some number but we cannot know what it is".


> In a binary format, you read n bytes and interpret it as an integer or a floating-point value directly.

IF it's a raw binary format. Other formats, such as protocol buffers, use packed representations in order to save space.


And if the source architecture and destination architecture use identical endian-ness.


This always struck me as being dumber than streaming the whole file through lz4 or something similar.

At least by separating the compression from the encoding you have a choice of when decompression vs decoding will happen.


For certain inputs, compression doesn't necessarily work out so well, because of the overhead introduced by the compression format itself. For example, "compressing" a small datagram may actually make it larger.

I also believe (but am not prepared to prove) that packed representations can, under the right circumstances, evade some information theoretic limits that constrain lossless compression, by engaging in a little honest cheating. Since you're using a purpose-specific format instead of creating a general-purpose compression scheme, you don't have to record every single bit of information in the blob itself. You can use a side channel - in the form of the format specification itself - to record some information.

Which is why you see a packed representation (perhaps with optional compression) in things like protocol buffers, whereas a format like Parquet that's meant to store large volumes of self-describing data will just go straight for compression.


Does lz4 really have low enough overhead that it can compress a single integer better than having a predetermined compression scheme for small numbers? Sure, most protos are not a single integer, but I'd doubt that the average proto message is more than 1 kB.


lz4 is kind of amazing, but yes, it does not really help when compressing a single integer.


Compression algos come with their own disadvantages. Varint encoding exploits the normal-distribution/Benford's-law type reality that the vast majority of fixed-width numbers tend to be much closer to zero than to their upper bounds. As fast as lz4 is, encoding and decoding varints is way, way faster. There are other advantages too... lz4 works at the block level, rather than the field level, so you have effects from surrounding fields and the need to decode all the fields/records at once. You also need a certain block size for it to really be useful...
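
For reference, a protobuf-style base-128 varint encoder is roughly this (a sketch; signed values would additionally need ZigZag encoding):

    #include <cstddef>
    #include <cstdint>

    // Encode an unsigned 64-bit integer as a base-128 varint: 7 payload bits
    // per byte, high bit set on every byte except the last. Small values
    // (the common case) take a single byte.
    size_t encode_varint(uint64_t value, uint8_t* out) {
        size_t n = 0;
        while (value >= 0x80) {
            out[n++] = static_cast<uint8_t>(value) | 0x80;
            value >>= 7;
        }
        out[n++] = static_cast<uint8_t>(value);
        return n;  // number of bytes written (1..10)
    }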

There are cases where what you are saying makes a lot of sense, and for that you have things like FlatBuffers & CapnProto.


Varint is not compression. Proper compression looks for redundancy in data and eliminates it.

All a compression algorithm has to do to get similar advantages to varint is notice that 0-valued bytes are common and compress them. A simple Huffman coding will do that nicely.

Is varint actually "way, way faster" than Huffman? I don't think it's entirely clear -- it probably depends on the implementation. Protobuf generated code inlines copies of varint encoding all over the place, which is better than not inlining, but still not very nice to the instruction cache. Huffman coding would process the entire buffer in one go and can be a much tighter loop. A lot of work has gone into optimizing Huffman coding, including with things like SIMD. You can't really leverage SIMD for Varints in Protobuf because the input is not a homogenous array. Varint is known to be a very branchy format which is not so great for performance. Branchless Huffman is a thing.

You can probably do even better with an algorithm tailor-made to look for zeros. I took a crack at this with "packed" format in Cap'n Proto, which I implemented in a branchless way, but not with SIMD (I'm no good at assembly). It's been like 7 years since I did the benchmarks but IIRC capnp+packing turned out to be pretty similar to protobuf in both size and speed... but there are a lot of confounding factors there. Could be interesting to replace Varint with fixed-width encoding in Protobuf itself and then run packing on the output and see how that performs...

But it really doesn't seem obvious to me at all that Varint would be "much faster" than a matching compression algorithm. Do you have some data behind that or is it just a hunch?

Disclaimer: I'm not a compression expert. I am the author of Protobuf v2 though. My recollection is that the original designers of Protobuf were never very happy with varint encoding. It was a casual decision made early on that became impossible to change later.


Hmm, I guess, ironically, lz4 does not do a Huffman pass, unlike most other compression algorithms. It only does dictionary-matching, which I imagine has a tougher time squeezing the zeros out of fixed-width-encoded integers.

I wonder if this implies lz4 pairs well with Protobuf (since it does Varint) but that Cap'n Proto users should look at different algorithms (or maybe apply packing followed by lz4).

Source: https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)


My experience is indeed that lz4 tends to work reasonably well on protobuf encoded data, which generally does not compress terribly well. I hadn't much thought about why, but your reasoning makes sense.


Protobuf varints are particularly annoying because they require searching for a terminating byte. A naive implementation needs to make 20 comparisons to decode an 8-byte int (checking the value of each byte, and for overflow in the input data).

UTF-8-style varints, where the encoded length can be determined entirely from the first byte, are much nicer.
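
For illustration, the "length from the first byte" idea lets the decoder branch once instead of testing a continuation bit per byte (this is the general UTF-8-style scheme, not protobuf's actual wire format):

    #include <cstdint>

    // The number of leading 1-bits in the first byte determines the total
    // encoded length, so the decoder knows up front how many bytes to read.
    int encoded_length(uint8_t first_byte) {
        if (first_byte < 0x80) return 1;           // 0xxxxxxx
        if ((first_byte & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((first_byte & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((first_byte & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;  // continuation byte or invalid prefix
    }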


Again, for small integers, this isn't really much of a disadvantage. You can find the terminating byte very efficiently. The bigger issue with varint tends to be more that when you pack data into bytes you have a lot more unaligned memory access. You can work around this, but it increases the complexity of your decoder.

FlatBuffers, Cap'n Proto, and similar formats can pad everything for alignment, at the cost of easily compressible nulls. For a lot of use cases, that's a good trade-off.


Hey there Kenton! Thanks again for Cap,n Proto!

> Varint is not compression. Proper compression looks for redundancy in data and eliminates it.

I'm not sure why you are choosing to make a semantic argument about an assertion I didn't make, but very well. Compression is any mechanism that allows you to encode information using fewer bits than the original representation. There are compression mechanisms that don't require there to be any redundancy within the data they are compressing (for example, with a fixed dictionary).

> Huffman coding would process the entire buffer in one go and can be a much tighter loop.

That's a very good point, except that part of how lz4 pulls off the wonders it pulls off is by not having an entropy encoding stage... so there's no Huffman coding going on there. You can optimize Huffman coding all you want, but it is still going to be slower than lz4's "not doing it" (hence why lz4 is so fast), which is in turn slower than varint.

You're right, you can't really leverage x86's various SIMD extensions for varints, because varints are usually less than four bytes long. On the other hand, you can use normal processor instructions to operate in parallel on a sequence of bytes packed into a 32-bit or 64-bit register. I agree that protobuf has some surprisingly untuned logic for this (in fairness, the three bits reserved for field identifiers do hamper taking advantage of it, but then those also screw up the use of SIMD instructions in general).

I agree that I found capnp+packing to be very competitive with protobuf for certain applications, as you had promised. However, as you said, there are confounding factors there.

The notion that varint is much faster is intuitive rather than benchmarked. I've looked at the assembly for a varint decoder and compared it to lz4. While lz4 might be able to win out if you are trying to decode a long sequence of bytes, it's all over but the crying within a few instructions for simply decoding a small integer.

> My recollection is that the original designers of Protobuf were never very happy with varint encoding. It was a casual decision made early on that became impossible to change later.

That is my recollection as well (although a lot of the pain was around the ugly workaround for signed integers). Varint isn't the best thing, but it is a thing, and if used properly, it does yield advantages.


No idea why I got Cap'n Proto wrong. Sorry.


I'd love to hear more on that last part, if you care to share more details. I've always been a little curious if varints are worth the effort, at least as a default format. Seems like there's a decent chance that what you gain by shaving bytes, you lose in branch instructions.


This is why I love using serde bincode serialization. I can even deal with serde bincode binary blobs inside rust based wasm on the front end. It can save a ton of bytes on the wire, used appropriately.


Having grown up with 8-bit home computers, everything about the web seems insanely inefficient to me. But text formats do have undeniable upsides. Best would be more portable, efficient formats like protobuf.


It might be worth noting that, in the words of Douglas Crockford, JSON wasn't even specified at first, it was "discovered." [0] As in, a small group of people at a startup wanted an object interchange format they could use in the browser in the early 2000s, found that JavaScript object literals worked well enough, and went with that. It only became a standard after the fact.

If the data you have requires more thought as to how to store it than "JavaScript's object notation is good enough," then perhaps a specialized binary format would work better.

Then again, nearly all APIs only accept and return JSON. JSON is the one object notation that gets all the attention, so most maintainers will focus their time on making optimizations for reading or writing JSON. And if you don't like XML or YAML for some reason, then JSON is tempting to use as a config file format, although it critically lacks comments and heredocs.

[0] http://www.json.org/fatfree.html


Unfortunately, binary formats still have plenty of issues too, like usually requiring schemas (msgpack doesn't, but it's MUCH slower than something with a schema like FlatBuffers). They are also fairly clunky for handling nested structured data (e.g., arrays of maps).


Many “real-world” applications deal with massive json files.


Yeah, although one can argue they are badly designed if they do so from the performance/energy perspective.


Of course, but there are many situations in which one must deal with the input one is given.


I wonder if the lib defers computation and validation to access time.

The graph of MB/s for different JSON string sizes seems completely bogus. Small sizes should have more overhead and be slower, but the line is totally flat.

https://github.com/simdjson/simdjson/blob/master/doc/growing...


Well it claims to do validation.

I think it perhaps doesn’t fully build JSON values into fast-to-access data structures, but I don’t think this is unique amongst fast JSON parsers, or indeed necessary.

It seems reasonably believable that it could maintain high throughput if it has minimal setup and keeps everything linear.


I tested it and it seems to do validation before access.

The lib doesn't handle duplicate keys properly though, and returns the first value instead of the last. If you don't scan to the end on each lookup and don't build an expensive hash-map-backed tree, I can see why it is faster at parsing...


What do you mean by "properly"? The original json.org "spec" is silent on the matter of non-unique keys, and RFC 7159 says:

> When the names within an object are not unique, the behavior of software that receives such an object is unpredictable.


Oh, in that case it is fine then. I vaguely remembered that it had to return the last value according to the specification when keys are duplicated. I meant "properly" as in "according to the JSON specification".

The lib is really nice to use from 5 minutes of testing.
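
For anyone curious, basic usage is roughly the quick-start from the project README at the time (DOM API in the 0.x releases; exact details may differ between versions):

    #include <iostream>
    #include "simdjson.h"

    int main() {
        simdjson::dom::parser parser;
        // load() reads, parses and validates the whole document up front.
        simdjson::dom::element tweets = parser.load("twitter.json");
        std::cout << tweets["search_metadata"]["count"] << " results." << std::endl;
    }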


I'm not sure why you'd think that, but yes, there are no special optimizations for inputs above about 512 bits in size, IIRC.


The GitHub page links to a video that explains some of the internals [1]. Can someone comment on the result that they show at 14:26?

My understanding is that they run code that does 2000 branches based on a pseudo-random sequence. Over around 10 runs of that code, the CPU supposedly learns to correctly predict those 2000 branches and the performance steadily increases.

Do the modern branch predictors really have the capability to remember an exact sequence of past 2000 decisions on the same branch instruction? Also, why would the performance increase incrementally like that? I would imagine that it would remember the loop history on the first run and achieve maximum performance on the second run.

I doubt that there's really a neural net in the silicon doing this as the author speculates.

[1] https://youtu.be/wlvKAT7SZIQ?t=864


Modern branch predictors absolutely have neural nets these days. Check out for example this 2001 IEEE paper on "dynamic branch prediction using neural networks": https://ieeexplore.ieee.org/document/952279 or this 2007 patent using "conditional bias" (it beats around the bush, only explicitly naming neural networks once, but it's clear what it is about if you read between the lines): https://patents.google.com/patent/WO2009066063A1/en.

There is a wealth of patents and articles if you search for "neural net branch prediction patent".


The AMD Zen 2 CPU and the IBM z15 mainframe claim to use a TAGE branch predictor -- but this doesn't say much. The basic TAGE is quite simple (see [1]), but an actual implementation likely includes a loop predictor, a statistical corrector predictor, perceptron hashing functions, etc.

To learn more about the state of the art in branch prediction, look at the work of Andre Seznec (INRIA/IRISA) and Daniel Jimenez (Texas A&M University). They are usually the winners of the Championship Branch Prediction [2,3].

[1] https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-...

[2] https://www.jilp.org/cbp2014/program.html

[3] https://www.jilp.org/cbp2016/program.html


Can you clear up whether you mean “there’s a paper that describes using neural nets” or “there’s an actual modern CPU that uses neural nets for branch prediction”?


I originally meant "people were already doing this in the lab 19 years ago", but you are right that that does not mean CPUs "in the wild" actually do this. I originally learned about it from a blog post where someone uncapped chips to make photographs of the die and made an offhand comment about the branch prediction having neural nets.

I did some more searching and found an article at https://acad.ro/sectii2002/proceedings/doc2019-2/12-Vintan.p... which describes Samsung Galaxy S7 phones having this type of branch predictor as far back as 2016 and AMD CPUs having perceptron-based branch predictors as far back as 2011. It is not officially known whether Intel also uses neural networks in their chips, but the author of that paper "strongly believes" it does, and I concur that it's unlikely they don't, given that Samsung, Oracle, IBM, ARM and AMD all seem to use them.


AMD claimed back in 2016 to have a neural net predictor [1]. I found this slide by following references in the Wikipedia article that rowanG077 linked below.

[1] https://cdn.arstechnica.net/wp-content/uploads/sites/3/2016/...


> Anticipate future decisions, pre-load instruction, chose best path through the CPU.

This slide is about speculative execution not necessarily about branch prediction.


Pre-loading instructions doesn't mean they're actually getting executed. It sounds like it's talking about branch prediction.


Well, that's certainly the most surprising thing I've learned today. Thanks.


Now, the natural next question: are there commercial CPUs available whose branch prediction is Turing complete?


I'm no expert but neural networks are definitely used in some CPUs for branch prediction. See the wikipedia section: https://en.wikipedia.org/wiki/Branch_predictor#Neural_branch....


Seems like the processor may just gradually be storing the explicit branch pattern at that point in the program? 2000 branches isn't that many if you have tens of megabytes of cache on the die.


For other folks interested in using this in Node.js, the performance of `simdjson.parse()` is currently slower than `JSON.parse()` due to the way C++ objects are converted to JS objects. It seems the same problem affects a Python implementation as well.

Performance-sensitive json-parsing Node users must do this instead:

     require("simdjson").lazyParse(jsonString).valueForKeyPath("foo.bar[1]")

https://github.com/luizperes/simdjson_nodejs/issues/5


Thanks for posting this - I've been down that road. I regularly have to parse about 20GB of JSON split up into 8MB JSON files - tried this library but was sad that it didn't help. I'm currently using threading in nodejs and that has helped quite a bit though, parsing up to 8 of those files at a time has given me quite a performance boost - but I always want to do it faster. Switching to using C just isn't really a viable option though.


You can write a small program in C just for the parsing part. Then call it from Node.


SQLite can seemingly parse and process gigabytes of JSON per second. I was pretty shocked by its performance when I tried it out the other month. I ran all kinds of queries on JSON structures and it was so fast.


An idea I had a few years ago which someone might be able to run with is to develop new charsets based on the underlying data, not just some arbitrary numerical range.

The idea being that characters that are more common in the underlying language would be represented as lower integers and then use varint encoding so that the data itself is smaller.

I did some experiments here and was able to compress our data by 25-45% in many situations.

There are multiple issues here though. If you're compressing the data anyway, you might not see as big a win in terms of storage, but you still might benefit if you need to decode the data back into its original text.
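
A rough sketch of the byte-level version of the idea (frequency-ranked remapping; the varint step on the resulting codes is left out):

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <numeric>
    #include <string_view>

    // Build a remapping table that assigns the smallest codes to the most
    // frequent bytes in a sample of the data, so a subsequent varint-style
    // encoding of the codes comes out shorter for typical text.
    std::array<uint8_t, 256> build_remap(std::string_view sample) {
        std::array<uint64_t, 256> freq{};
        for (unsigned char c : sample) freq[c]++;

        std::array<uint8_t, 256> by_freq;
        std::iota(by_freq.begin(), by_freq.end(), 0);
        std::sort(by_freq.begin(), by_freq.end(),
                  [&](uint8_t a, uint8_t b) { return freq[a] > freq[b]; });

        std::array<uint8_t, 256> remap{};
        for (int rank = 0; rank < 256; ++rank) remap[by_freq[rank]] = rank;
        return remap;
    }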


This sounds a lot like Huffman coding/Gray codes, the basis of the LZ family of compression. No need to change the character encoding, just build a table for text and use that for encoding/decoding. Or for better results, build it from the frequency of use in the document and store the table. This is gzip/pkzip...


The parent is suggesting something complementary to compression. An altered character encoding (even if just internal to the software, like UTF-16 is to Windows) would mean that strings would be smaller not just on disk/on the network, but in memory, while held in random-access mutable form. This might be a win for heavy text-manipulation systems like Elasticsearch.


And not so good if you want to do O(1) indexing


True for the GP's suggestion of a varint encoding, but also true for UTF-8 (which is a varint encoding). So that's not much of a loss; we're already biting this bullet.

Still, though, you could have a fixed-size encoding that could still be more compact than UTF-8, if you limited what it could encode (and then held either it, or UTF-8 text, in a tagged union, as an ADT wrapped with an API of string operations that will implicitly "promote" your limited encoding to UTF-8 if the other arg is UTF-8, the same way integers get "promoted" to floats when you math them together.)

Then your limited-encoding text could hold and manipulate e.g. ASCII, or Japanese hiragana and katakana, or APL, or whatever else your system mostly holds, as a random-access array of single-octet codepoints; until something outside of that stream comes up, at which point you get UTF-8 text instead and your random-access operations become shimmed by seq-scans.

(Or you get a rope with both UTF-8 strings and limited-encoding strings as leaf nodes!)

Of course, if you didn't catch it, I'm talking about going back to having code pages. :) Just, from a perspective where everything is "canonically" UTF-8 and code pages are an internal optimization within your string ADT; rather than everything "canonically" being char[] of the system code page.


This is close but incorrect.

Most common Lempel-Ziv compressors (gzip/brotli/zstd) use some form of entropy encoding (Huffman/arithmetic/ANS), but that is in addition to the LZ algorithm (usually LZ77) itself.

You can have a Lempel-Ziv compressor that doesn't use entropy encoding; look at something like lz4.

LZ77 essentially works by copying parts of the recent window of decompressed data, e.g. if a phrase of length 10 was used 100 characters ago, we could encode (100, 10) instead of the phrase itself. Most compressors use entropy encoding on those offset and length streams.


Although he says "able to compress our data", he means that the uncompressed representation is smaller. In addition to memory savings, iteration could theoretically be faster due to fewer branch mispredictions (imagine iterating through mostly similar-width characters vs iterating through characters of varying widths).


It’s called Huffman coding.


And if you're looking for a fast JSON lib for CPython, orjson[1] (written in rust) is the best I've found.

[1] https://github.com/ijl/orjson#performance


There is a Python wrapper for simdjson: https://github.com/TkTech/pysimdjson

It looks like pysimdjson's biggest performance gain compared to e.g. orjson is when you can cherry-pick single values out of the JSON and avoid deserializing the whole document.


I've tested parsing complete documents with rapidjson vs pysimdjson vs simplejson on production data. Surprisingly, mean 90 times are exactly the same for all three libraries. I'm not re-using the simdjson parser, just doing simdjson.loads().


The time is overwhelmingly the cost of conversion to CPython objects, 95-99% of the total time. The actual document parsing to tape is always much faster than rapidjson. No CPython JSON library can get away from this overhead.

You see a real benefit when you don't need or want the entire document, and can use the JSON pointer or proxy object interface.


I never thought I’d write this, but we have officially entered a golden age for C++ JSON utils. They are everywhere, and springing up right and left. It is a great time to be alive.


I wrote my own little JSON lib years ago which was just a recursive std::map. It was so disappointingly slow. Bottlenecked the whole app.


Perhaps std::unordered_map would have helped slightly.


Just noting that this library requires that you are able to hold your expanded document in memory.

I needed to parse a very very large JSON document and pull out a subset of data, which didn't work, because it exceeded available RAM.


Any possibility of using mmap?

https://en.wikipedia.org/wiki/Mmap


not when you exceed the size of the pagefile/swapfile :-)

But it's a good project, otherwise.


mmap can exceed pagefile/swapfile+ram if you are mapping a file and not anonymous pages.


Yeah. In my situation, (a) the file was 10x RAM and (b) when this library parses a JSON document it creates an in-memory tree representing the file which is approximately 16x the size of the JSON file.

Consequently it swapped so hard I was never going to finish processing. So I used a streaming parser, and it finished in minutes.


Document streaming is on the roadmap, and jkeiser is actively working on a SAX-style interface. Once these are done you'll be able to control memory usage.


Yeah, have this problem reading gig+ files on mobile in JS. Have to stream them.


What did you use instead?


Sorry, don't remember; it was a throwaway project.


So what is the fastest JSON library available? orjson claims they are the fastest but they don't benchmark simdjson. simdjson claims they are the fastest but did they forget to benchmark anything?


The author gave a talk last month, which can be viewed on YouTube:

https://www.youtube.com/watch?v=p6X8BGSrR9w


I use Emacs with lsp-mode (Language Server Protocol) a lot (for haskell, rust, elm and even java) and there was a dramatic speedup from Emacs 27 onwards when it started using jansson JSON parsing.

I don't think it's the bottleneck at the moment, but it's good to know there are faster parsers out there. I had a quick search but couldn't find any plans to incorporate simdjson, besides a thread from last year on the Emacs China forums.


Very impressive. Still, there are a couple of issues there.

This comment is incorrect: https://github.com/simdjson/simdjson/blob/v0.4.7/src/haswell...

The behavior of that instruction is well specified for all inputs. If the high bit is set, the corresponding output byte will be 0. If the high bit is zero, only the lower 4 bits will be used for the index. The ability to selectively zero out some bytes while shuffling is sometimes useful.
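
A standalone sketch demonstrating that behavior (not simdjson code; compile with -mssse3 or later):

    #include <cstdint>
    #include <cstdio>
    #include <immintrin.h>

    int main() {
        // Lookup table: byte i holds the value 0xA0 + i.
        __m128i table = _mm_setr_epi8(
            (char)0xA0, (char)0xA1, (char)0xA2, (char)0xA3,
            (char)0xA4, (char)0xA5, (char)0xA6, (char)0xA7,
            (char)0xA8, (char)0xA9, (char)0xAA, (char)0xAB,
            (char)0xAC, (char)0xAD, (char)0xAE, (char)0xAF);
        // 0x03 selects table[3]; 0x83 has the high bit set, so that output
        // byte is forced to zero; 0x13 shows that only the low 4 bits are
        // used as the index (it also selects table[3]).
        __m128i idx = _mm_setr_epi8(
            0x03, (char)0x83, 0x13, 0x00,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        __m128i r = _mm_shuffle_epi8(table, idx);

        uint8_t out[16];
        _mm_storeu_si128((__m128i*)out, r);
        printf("%02x %02x %02x %02x\n", out[0], out[1], out[2], out[3]);
        // Prints: a3 00 a3 a0
        return 0;
    }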

I’m not sure about this part: https://github.com/simdjson/simdjson/blob/v0.4.7/src/simdpru... The popcnt instruction is very fast: the latency is 3 cycles on Skylake, and only 1 cycle on Zen 2. It produces the same result without memory loads and therefore without taking up precious L1D space. The code uses popcnt sometimes, but apparently the lookup table is still used in other places.


> Perform a lookup assuming the value is between 0 and 16 (undefined behavior for out of range values)

I think you are misinterpreting the way that "undefined" is being used here. It's not a claim that you will get unpredictable results for this particular implementation; rather, it's about the specification of the function. It's telling the user that the behavior of this function for out-of-range values is not guaranteed to remain the same across time as the code is changed, or across different architectures.

> I’m not sure about this part: ... popcnt instruction is very fast

I haven't worked on this particular code, but I've coauthored a paper with Daniel on beating popcnt using AVX2 instructions: https://lemire.me/en/publication/arxiv1611.07612/. While you are right that at times saving L1 space is a greater priority, I'd bet that the approach used here was tested and found to be faster on Haswell. I'm not sure if you noticed that the page you linked is Haswell specific?


> rather it's about the specification of the function

That “function” compiles into a single CPU instruction. The OP is perfectly aware of that; that’s why really_inline is there.

> on beating popcnt using AVX2 instructions

It’s easy to do with pshufb when you have many values on input. I wrote about it years before that article; see here: https://github.com/Const-me/LookupTables#test-results

> I'd bet that the approach used here was tested and found to be faster on Haswell

I'd bet it’s an error.

> if you noticed that the page you linked is Haswell specific

I did. I was disappointed, though; I expected to find something newer than Haswell from 2013, like Zen 2 or Skylake. When doing micro-optimizations like that, the exact micro-architecture matters.


> I expected to find something newer than Haswell from 2013, like Zen 2 or Skylake.

I'm sure optimizations for more recent architectures would be appreciated, and Daniel is wonderfully accepting of patches. Be careful though, or you might inadvertently end up as the maintainer of the whole project!


When I have free time, I’m generally more willing to contribute to my own open source projects no one cares about. Like this one: https://github.com/Const-me/Vrmac BTW, I did a substantial amount of SIMD stuff there, for both the 3D and 2D parts.


Didn’t find any mention of plans for NEON (ARM’s SIMD) support - anyone heard of such plans?


I don’t know about full support, as I can barely understand this code. However, NEON is one of the code paths shown in the #if blocks, so I’d assume it supports NEON, or at least has plans to support it.

https://github.com/simdjson/simdjson/blob/master/singleheade...

    enum instruction_set {
      DEFAULT = 0x0,
      NEON = 0x1,
      AVX2 = 0x4,
      SSE42 = 0x8,
      PCLMULQDQ = 0x10,
      BMI1 = 0x20,
      BMI2 = 0x40
    };

    #if defined(__arm__) || defined(__aarch64__) // incl. armel, armhf, arm64

    #if defined(__ARM_NEON)


Neon is fully supported. pysimdjson has pre-built binary wheels for ARM if you want to experiment on aarch64, just do `pip install pysimdjson`.


There's an R (#rstats) wrapper as well: https://github.com/eddelbuettel/rcppsimdjson


There's something wrong with having gigabyte-sized text files.


I mentioned this elsewhere, but if you Google Takeout your location history you get... gigabyte-sized text files.


It seems this is for parsing multiple JSONs, each a few MBs at most. What does one do if they have a single 100GB JSON file? :)

ie.

  {
    // 100GB of data
  }


This is fantastic.

Does anyone know what library V8 uses, or how it compares?


Last I checked, V8 doesn’t outsource to a library; their JSON parsing is built in. See https://news.ycombinator.com/item?id=20724854


I wonder if it's better to FFI into this library when using Node.js, rather than using JSON.parse().


The FFI overhead in Node.js is significant. We have a project where, a while back, profiling showed that the majority of the CPU time was spent in a couple of small hotspots doing parsing and object construction in Node.js.

I broke those out into a Rust library that was >100x faster (IIRC) in synthetic benchmarks with the same complexity. Plugging it in with FFI into the Nodejs app and it actually performed slightly worse due to FFI overhead and translation.

So for large documents; could be worth it. For lots of small objects; probably not. You'd have to try on real-world data for your use-case to know.

Also https://github.com/luizperes/simdjson_nodejs/issues/5


> I broke those out into a Rust library that was >100x faster (IIRC) in synthetic benchmarks with the same complexity. Plugging it in with FFI into the Nodejs app and it actually performed slightly worse due to FFI overhead and translation.

This is interesting. Have you tried to optimize for reducing the overhead? I can imagine that this can be hard/complex and require a refactoring of the consumer in some scenarios.


I guess it depends on your use-case. Looks like this was primarily made for large JSON files and not the typical small JSON payloads you encounter with HTTP bodies and the like. On top of that, JSON.parse() is pretty heavily optimized already. Profiling is key.


JSON.parse() is required to return a concrete JS value [1]. Most, if not all, C/C++ JSON parsers designed for performance don't build the value unless requested. So whether to switch indeed depends on your use case, but for the different reason: if you use the majority of the parsed JSON your performance would be rather bound by object creation speed instead of parsing speed.

[1] It might be the case that modern benchmark-obsessed JS engines defer JSON parsing in this case though, in which case you definitely should not switch to simdjson.


That is the case with this library as well:

https://github.com/luizperes/simdjson_nodejs/issues/5


Glad Lemire is getting his shine-time on HN.


missing comparison with jansson (https://jansson.readthedocs.io/en/2.10/)


Looks like jansson is slower than RapidJSON (which is compared) https://github.com/miloyip/nativejson-benchmark




Your computer scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.



