Several people have raised questions about SIMD. For many workloads, SIMD is obviously a huge win. But in my experience, for problems like this, the gains are marginal and the additional complexity often isn't worth it. For example: do you have a build-time option? Check at runtime on every call? Use templates to check at startup and run a different code path?
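To make the "check at startup" option concrete, here's a minimal sketch using GCC/Clang's __builtin_cpu_supports; the function names and trivial bodies are made up for illustration, not sajson's code:

#include <cstddef>

// Check the CPU once, then dispatch through a function pointer on every call.
static std::size_t scan_scalar(const char *p, std::size_t n) {
    std::size_t i = 0;
    while (i < n && p[i] != '"' && p[i] != '\\') ++i;  // plain byte loop
    return i;
}

static std::size_t scan_avx2(const char *p, std::size_t n) {
    return scan_scalar(p, n);  // stand-in; a real AVX2 body would go here
}

static std::size_t (*scan_impl)(const char *, std::size_t) = scan_scalar;

void init_string_scanner() {
    // One CPUID-based check at startup; later calls go through the pointer.
    if (__builtin_cpu_supports("avx2"))
        scan_impl = scan_avx2;
}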
That said, it never hurts to try, so I implemented the fast string path in AVX2 (see https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3...). It actually benchmarks slower than the scalar code: https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7a...
I couldn't tell you why. It's possible something could be tweaked to make it execute faster, or that enabling -mavx2 causes other code to get slower, but I wanted to illustrate that enabling SIMD is not always an automatic go-fast button.
I did look at the instructions generated and they seem reasonable:
vmovdqu ymm5, ymmword ptr [rbx - 32]
vpcmpgtb ymm6, ymm5, ymm0
vpcmpeqb ymm7, ymm5, ymm1
vpcmpeqb ymm8, ymm5, ymm2
vpcmpeqb ymm5, ymm5, ymm3
vpxor ymm5, ymm5, ymm4
vpor ymm6, ymm8, ymm6
vpor ymm5, ymm5, ymm6
vpor ymm5, ymm5, ymm7
vpmovmskb esi, ymm5
test esi, esi
mov rbp, rax
sub rbp, rbx
mov rsi, rbx
lea rbx, [rbx + 32]
cmp rbp, 31
add rbx, -32
bsr eax, esi
xor eax, 31
lea rsi, [rbx + rax]
mov bl, byte ptr [rbx + rax]
mov r10, qword ptr [rsp + 24] # 8-byte Reload
mov r9b, byte ptr [rsp + 39] # 1-byte Reload
sub rcx, r8
movzx eax, bl
cmp eax, 34
Edit: Upon thinking about it, it's probably slower because there's a dependency chain through a pile of high-latency vector instructions, through the bsr (should be bsf; I fixed it on the branch), to the next loop iteration's load. The branch predictor doesn't have much opportunity to help us out here. You could look at Agner Fog's tables and calculate the critical path latency of this loop, but I'm guessing it's not pretty.
1 shift (4 bits right), unfortunately no byte-granularity shift is available.
2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)
2 PSHUFB operations to look up a table
1 PAND to combine the two table lookups.
... and then rejoins your implementation at the VPMOVMSKB.
I don't think your instructions are high latency unless you are using something surprising - VPOR/VPXOR/VPCMP*B are all latency 1 on most recent Intel architectures.
Whether these techniques pay off depends on context. I'd be surprised if vector operations do that well against short-to-medium writes. We found that a few of our sequences designed to 'find the first match' saw no benefit from moving from SSE to AVX2, due to similar effects.
With glibc there's also the IFUNC option:
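Roughly, the IFUNC mechanism looks like this (GCC/Clang on glibc targets; the function names here are made up): the resolver runs once at dynamic-link time and decides which implementation the exported symbol binds to, so your code just calls scan() and the dynamic linker patches in the right definition.

#include <cstddef>

static std::size_t scan_scalar(const char *p, std::size_t n) {
    std::size_t i = 0;
    while (i < n && p[i] != '"' && p[i] != '\\') ++i;  // portable fallback
    return i;
}

static std::size_t scan_avx2(const char *p, std::size_t n) {
    return scan_scalar(p, n);  // real AVX2 body elided in this sketch
}

// The resolver needs an unmangled name so the ifunc attribute can refer to it.
extern "C" std::size_t (*resolve_scan(void))(const char *, std::size_t) {
    __builtin_cpu_init();  // resolvers can run before normal initializers
    return __builtin_cpu_supports("avx2") ? scan_avx2 : scan_scalar;
}

// Calls to scan() bind to whichever implementation the resolver picked.
std::size_t scan(const char *p, std::size_t n) __attribute__((ifunc("resolve_scan")));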
Here's an intrinsic for it:
As a result, if you sprinkle some AVX instructions in your code, but not enough to produce a significant efficiency win, you might actually end up reducing your capacity (and affecting latency).
"Intel AVX instructions require more power to run. When executing these instructions, the processor may run at less than the marked frequency to maintain thermal design power (TDP) limits."
"When the processor detects Intel AVX instructions, additional voltage is applied to the core. With the additional voltage applied, the processor could run hotter, requiring the operating frequency to be reduced to maintain operations within the TDP limits. The higher voltage is maintained for 1 millisecond after the last Intel AVX instruction completes, and then the voltage returns to the nominal TDP voltage level."
It's possible that a SIMD implementation (assuming that you routinely pass over enough bytes) can blow this away. I'm not sure what the cross-over point is - it depends on whether you are falling out of this code quickly or not. Obviously using a big SIMD sequence to replace a "2 cycles per byte" implementation is pretty stupid if the SIMD sequence is "0.5 cycles a byte" but the fixed cost of using the SIMD means that your costs are >=16 cycles and the typical case is small.
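(Rough arithmetic with those example numbers: at 2 cycles per byte, the scalar loop takes about 8 bytes to spend the ~16 cycles the SIMD sequence costs regardless of length, so inputs much shorter than that lose even though the SIMD path is nominally 4x faster per byte.)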
My comment follows:
It's also not clear that you actually need a SIMD instruction - if you've squashed your hotspot to the point that adding the complexity of per-platform SIMD specialization is only a small win, then why bother?
All that being said...
I've been really lazy in writing this up, but someone kindly did a decent write-up for us in the Rust reimplementation of our "Teddy" literal matcher. Please see https://github.com/rust-lang/regex/blob/master/src/simd_acce... if interested or look up the Hyperscan project at https://github.com/01org/hyperscan for a few different versions of this trick.
Use two 16-byte values in SIMD registers as 16-entry lookup tables with "8 buckets" (really just bits) each.
Look up those two separate tables with the high nibble (bits 4..7) and low nibble (bits 0..3) of your input (you can do this in parallel 16/32/64 ways depending on whether you are using SSSE3/AVX2/AVX512).
AND together the table results (or OR them, depending on the sense of the table).
PMOVMSKB (or VPMOVMSKB) to extract the results back into GPR-land.
So if you want to match, say 0x12 and 0xab at once, assign 0x12 to 'bucket #0' and 0xab to 'bucket #1'.
High nibble table gives you 0x1 -> 0x1 (that is, 0x10 in the high nibble gives you bucket #0) and 0xa -> 0x2
Low nibble table gives you 0x2 -> 0x1 and 0xb -> 0x2.
Tables are otherwise zero.
You can now look up lots of different useful character classes (not all! That takes three shuffle tables) in about 6 instructions. Sadly the instructions are pretty much a long linear chain of dependencies so that latency is a bit worse than throughput, but we've built some fun stuff around them.
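For concreteness, here's a minimal sketch of that sequence with SSSE3 intrinsics, using the 0x12 / 0xab bucket assignment from the example above (illustrative code, not Teddy's or Hyperscan's actual implementation):

#include <tmmintrin.h>  // SSSE3
#include <cstdint>

// Return a 16-bit mask with bit i set if byte i of the block is
// 0x12 (bucket #0) or 0xab (bucket #1).
static std::uint32_t match_block(const std::uint8_t *p) {
    // High-nibble table: index 0x1 -> bucket #0 (0x01), index 0xa -> bucket #1 (0x02).
    const __m128i hi_table = _mm_setr_epi8(
        0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0, 0);
    // Low-nibble table: index 0x2 -> bucket #0, index 0xb -> bucket #1.
    const __m128i lo_table = _mm_setr_epi8(
        0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0);
    const __m128i nibble_mask = _mm_set1_epi8(0x0f);

    __m128i input = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));

    // No byte-granularity shift exists, so shift 16-bit lanes and mask.
    // The AND also clears bit 7, which PSHUFB treats specially.
    __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), nibble_mask);
    __m128i lo = _mm_and_si128(input, nibble_mask);

    __m128i hi_buckets = _mm_shuffle_epi8(hi_table, hi);
    __m128i lo_buckets = _mm_shuffle_epi8(lo_table, lo);

    // A byte is in a bucket only if both nibble lookups agree on that bucket.
    __m128i buckets = _mm_and_si128(hi_buckets, lo_buckets);

    // Turn "any bucket hit" into a per-byte 0xff/0x00 mask, then into a GPR.
    __m128i hit = _mm_cmpgt_epi8(buckets, _mm_setzero_si128());
    return static_cast<std::uint32_t>(_mm_movemask_epi8(hit));
}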
See this implementation of JSON escaping for example: https://github.com/facebook/folly/commit/2f0cabfb48b8a8df84f...
De-escaping is even simpler because you only have to detect two characters, \ and ".
The technique I outline can match most character classes that you run into (and there's a 3-PSHUFB variant that can handle all predicates on bytes), while the bit-twiddling version will get linearly slower as you add more cases. If there's a way to get at what we did with general purpose operations and bit twiddling hacks I'm too stupid to think of it.
Also SSE will be beneficial only if your strings are long enough. I'd be curious to benchmark my implementation against an SSE one on a representative dataset.
EDIT: the microbenchmark indicates that the bit-trick implementation spends about 1 cycle per byte (for the entire escaping routine, if there's nothing to escape).
Assuming you can prep everything in advance, the SIMD version is 3 loads and about 5 SIMD operations (1 shift, 2 PANDs, 2 PSHUFBs) and a couple transfer/scan operations (PMOVMSKB and whatever bit scan operation you want) to eat 32 bytes. I'm figuring 10 operations. Depending on your workload you might care about reciprocal throughput or latency. Generally our work has been more throughput oriented so those 10 operations can be scheduled along with lots of others, but this isn't always the case.
The other way SIMD (or for that matter bit-twiddling) can be helpful is that you can cover a wider variety of sizes in a branch-free fashion, so if your sizes are distributed <32 bytes (or <8 for bit-twiddling) the code can be straight line with no branch misses. So a SIMD version of the code might be "slower" than a bit-twiddling version even if both sides always get 9 bytes, but if it's a mix of sizes and thus unpredictable, it might take fewer branch misses (unless some clown starts giving you 33 byte inputs half the time, of course).
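For reference, the bit-twiddling version referred to above is along these lines (an illustrative sketch using the standard SWAR zero-byte test, not the actual folly routine):

#include <cstdint>
#include <cstring>

// High bit of each byte in the result is set exactly where `word` contains `value`.
static inline std::uint64_t has_byte(std::uint64_t word, std::uint8_t value) {
    std::uint64_t x = word ^ (0x0101010101010101ULL * value);  // zero where equal
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

// Branch-free check of 8 bytes at a time for '"' or '\'.
static inline bool needs_escape(const char *p) {
    std::uint64_t word;
    std::memcpy(&word, p, sizeof word);  // unaligned-safe 8-byte load
    return (has_byte(word, '"') | has_byte(word, '\\')) != 0;
}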
Again, benchmark data is always a good cure for folkloric beliefs whether pro-SIMD ("SIMD is always fast, yay!") or anti-SIMD ("the overheads are never worth it").
Parsing will always be slow no matter what you do, and you can hardly parallelize it. Of course it's fast on a non-mobile CPU, but it will always be slow on a smartphone, hence the need for such a format and it's the reason smartphones use apps instead of HTML+JS.
It's funny: computers are very fast, but batteries are not so good. Are we realizing that well-designed software formats have an effect on battery life?
I've seen several articles about how webpages are bloated, but I don't see any real proposition on how to fix it.
Densely coded formats, such as msgpack, are neat for archival / long-term storage, but their coding means that the unpackers are quite complex and not very fast (perhaps drawing some lessons from parallel-length x86 decoding could help implementations?); they tend to get slower as the data structure becomes more complex. Since the output format of these is cumbersome, applications will copy everything yet another time. If we're not using C/C++/Rust, these will also mean a lot of allocator activity. Plus, data validation is separate from format validation, which means at least another pass over the structure.
So purely from a performance perspective these formats are not that good (despite all their claims). Designing purpose-built formats is obviously superior, but it also takes quite a bit of time and means more parsers to test (and fuzz). Stuff like Capnproto might be an adequate middle ground; I'm not sure, I've never worked with it and don't know much about how it's implemented (I assume custom accessor code that mostly does nothing on x86/ARM).
Flatbuffers is kind of like this. With untrusted data you just have to run the validation routine (that the code generator spits out) to check that all the internal offsets land within your buffer.
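For example, with the C++ API (a sketch assuming a schema whose root_type is Monster and no namespace; the generator emits a Verify<Root>Buffer function for the root type):

#include "flatbuffers/flatbuffers.h"
#include "monster_generated.h"  // produced by flatc from the schema

#include <cstddef>
#include <cstdint>

// Validate an untrusted buffer before calling any accessors on it.
bool safe_to_read(const std::uint8_t *buf, std::size_t len) {
    flatbuffers::Verifier verifier(buf, len);
    // The generated verifier walks every table/vector offset and rejects
    // anything that points outside [buf, buf + len).
    return VerifyMonsterBuffer(verifier);
}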
Amp - Google = Fast webpages anyone can host.
I have several hot paths in server projects where parsing IP addresses, dates, and other such data types that JSON doesn't support eats cycles.
Or maybe nobody else bothers to validate their data.
I found this Intel article for XML (but not JSON):
Regardless, in the OP, they're specifically looking for one of a small number of bytes, which is exactly what PCMPISTRI is supposed to be good for.
With that said, my experience mirrors glangdale's. Every time I've tried to use PCMPISTRI, it's either been slower than other methods or not enough of an improvement to justify it.
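For anyone who wants to try it, this is roughly what the PCMPISTRI approach looks like via the _mm_cmpistri intrinsic in "equal any" mode (a sketch with made-up names, not anyone's production code):

#include <nmmintrin.h>  // SSE4.2, PCMPISTRI
#include <cstddef>

// Find the first '"' or '\' in [p, end). Caveat: PCMPISTRI treats a zero byte
// as an implicit terminator in both operands, so this sketch assumes the
// input contains no embedded NULs.
static const char *find_quote_or_backslash(const char *p, const char *end) {
    const __m128i needles = _mm_setr_epi8('"', '\\', 0, 0, 0, 0, 0, 0,
                                          0, 0, 0, 0, 0, 0, 0, 0);
    while (p + 16 <= end) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
        int idx = _mm_cmpistri(needles, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)
            return p + idx;   // index of the first matching byte in this chunk
        p += 16;
    }
    for (; p < end; ++p)      // scalar tail for the last <16 bytes
        if (*p == '"' || *p == '\\')
            return p;
    return nullptr;
}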
So I wrote a new parser that trades higher CPU usage for lower RAM usage: https://bitbucket.org/tlabwest/json-reader/overview
json_find(&json, "key1/subkey:2/value1", &value);  /* walk the path to a node */
json_decode_string(&value, s, bufsize);  /* decode its string into s, up to bufsize bytes */
It may not be as fast as these other guys, but it's sure as heck a lot more intuitive and convenient. For casual JSON use cases, I've been recommending this library.
As a simple tool you could try je, which is like a simplified jq for extracting JSON objects as a CSV/TSV table. It's about 10x faster than jq.
But that aside, I wonder how this compares to the compressed indexes approaches (see https://www.youtube.com/watch?v=uA0Z7_4J7u8&t=2850s and http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf), obviously, once the index is built.
But regardless, it's still a non-zero amount of work, and in my experience agreeing on a shared format isn't always a simple meeting...
Not to mention that if JSON is still needed elsewhere in your system (either internally, or externally with others), all you've done is add complexity to your application and spend developer time, all for some savings in parsing speed.
On the other hand, there are CBOR or MsgPack libraries that are fast and native for many languages.
With the Mill's loop pipelining and vector ops, this could be optimized pretty well.