
Improving Redis CRC performance
https://matt.sh/redis-crcspeed
======
colanderman
_CRCs are inherently non-parallelizable_

No.

CRC is a bitwise linear function over XOR (+). Digesting a message ABCD,
crc(ABCD), is simply the expression crc(crc(crc(crc(A) + B) + C) + D). To
parallelize, this can be rearranged (via distributivity of CRC of XOR) as
crc(crc(crc(crc(A) + B))) + crc(crc(C) + D). This is simply the CRC of the
second half, XORed with the CRC of the first half passed through the function
crc(crc(_)) (crc^2), which is itself linear and can be precomputed just like
CRC is.

This extends to arbitrary block sizes and/or arbitrary strides through the
message. So e.g. you can have four threads compute the CRCs of separate
64-byte (cache-line-sized) blocks within 256-byte chunks, separating each
block within a thread using crc^192, offset the results as appropriate with
crc^64^k, and then combine them with XOR.

(Protip: these fun properties are exactly what make CRCs a TERRIBLE choice for
cryptography!)
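
(For the curious: zlib ships essentially this trick as crc32_combine(). A minimal sketch, using CRC-32 only because that's what zlib exposes; the same algebra works for any CRC width:)

```c
#include <assert.h>
#include <string.h>
#include <zlib.h>   /* link with -lz */

int main(void) {
    const unsigned char msg[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    size_t len = sizeof(msg) - 1, half = len / 2;

    /* Reference: one sequential pass over the whole message. */
    uLong whole = crc32(crc32(0L, Z_NULL, 0), msg, (uInt)len);

    /* CRC each half independently -- these two calls have no data
     * dependency and could run on separate threads. */
    uLong a = crc32(crc32(0L, Z_NULL, 0), msg, (uInt)half);
    uLong b = crc32(crc32(0L, Z_NULL, 0), msg + half, (uInt)(len - half));

    /* Merge: advance the first CRC by "len - half zero bytes" using the
     * precomputed linear operator, then XOR in the second CRC. */
    uLong merged = crc32_combine(a, b, (z_off_t)(len - half));

    assert(merged == whole);
    return 0;
}
```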

~~~
seiji
_you can have four threads compute_

You could. You could also have a billion threads. But the article is talking
about single-core instruction level parallelism where normal tricks like loop
unrolling won't work.

~~~
colanderman
Regardless of what the post is about, its lede is demonstrably incorrect.
Speeding up CRC is great, but "we have to because we can't parallelize it" is
a false reason.

Anyway, the tricks I outlined would probably help even on a single core, since
they allow you to pipeline fetches to the CRC table. (If the table is sitting
in L2 you've got ~10 cycles of latency between successive CRCs. That's enough
downtime to run a second CRC in parallel.)
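
As a rough sketch of that single-core angle (my own illustration with hypothetical names, plain byte-at-a-time CRC-32): advance two independent CRC streams in one loop so the two table lookups can overlap in the pipeline. Folding two halves of one message back into a single CRC would additionally need the precomputed crc^n combine step described above.

```c
#include <stdint.h>
#include <stddef.h>

/* Byte-at-a-time CRC-32 lookup table (reflected, poly 0xEDB88320). */
static uint32_t table[256];

static void build_table(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        table[i] = c;
    }
}

/* Two independent CRC streams advanced in one loop: the two lookups have
 * no data dependency on each other, so the CPU can overlap their
 * latencies instead of stalling on a single serial chain. */
static void crc32_pair(const uint8_t *a, const uint8_t *b, size_t n,
                       uint32_t *crc_a, uint32_t *crc_b) {
    uint32_t ca = *crc_a, cb = *crc_b;
    for (size_t i = 0; i < n; i++) {
        ca = table[(ca ^ a[i]) & 0xff] ^ (ca >> 8);
        cb = table[(cb ^ b[i]) & 0xff] ^ (cb >> 8);
    }
    *crc_a = ca;
    *crc_b = cb;
}
```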

------
seiji
If you like this, also check out other Redis-related walkthroughs from this
year.

Improving the Redis list data structure:

1. [https://matt.sh/redis-quicklist](https://matt.sh/redis-quicklist)
2. [https://matt.sh/redis-quicklist-adaptive](https://matt.sh/redis-quicklist-adaptive)
3. [https://matt.sh/redis-quicklist-visions](https://matt.sh/redis-quicklist-visions)

Adding geo commands to Redis as a loadable module: [https://matt.sh/redis-geo](https://matt.sh/redis-geo)

Prototype of native JSON support for Redis: [https://matt.sh/redis-json](https://matt.sh/redis-json)

------
cperciva
Two things come to mind here:

1. I don't know why Redis is using CRC-16 and CRC-64, but if you want CRC-32
then using the CRC32 instruction in SSE 4.2 yields pretty impressive
performance (rough sketch below).

2. CRCs _are_ parallelizable; it just takes a bit more work (and math).
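
On point 1, here's a rough sketch of using the hardware instruction via intrinsics (my own illustration, not from the article; needs -msse4.2, and note the instruction computes CRC-32C, the Castagnoli polynomial, rather than the zlib/IEEE CRC-32, so it only applies if you're free to choose the polynomial):

```c
#include <nmmintrin.h>   /* SSE 4.2 intrinsics: _mm_crc32_u8 / _mm_crc32_u64 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* CRC-32C (Castagnoli) using the SSE 4.2 CRC32 instruction,
 * eight bytes per step, with a byte-at-a-time tail. */
static uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    crc = ~crc;                          /* conventional init/final inversion */
    while (len >= 8) {
        uint64_t word;
        memcpy(&word, p, sizeof word);   /* unaligned-safe load */
        crc = (uint32_t)_mm_crc32_u64(crc, word);
        p += 8;
        len -= 8;
    }
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
```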

~~~
SFjulie1
2. The theory of CRC comes from noisy telecom lines, where the noise has a
weird non-linear random distribution that makes it impossible to "vectorize"
(i.e. parallelize) the problem
([http://scienceblogs.com/goodmath/2007/07/27/fractal-dust-and...](http://scienceblogs.com/goodmath/2007/07/27/fractal-dust-and-noise-1/))

A dict/document/transaction may be a fractal-like structure, but the
probability of alteration does not follow the same shape/locality.

After reading this
([http://www.slideshare.net/Dataversity/redis-in-action](http://www.slideshare.net/Dataversity/redis-in-action)),
I am surprised.

Why the heck did they decide to focus on CRC? Premature optimization?

First, CRC16 is in fact used as a "perfect hash function". It is confusing to
focus on the implementation and not the purpose: it is used to distribute data
evenly into buckets. The "perfect hash function" is what they should use and
focus on (CRC being just one special case among totally legitimate candidates
for a perfect hash function, but if tomorrow a super hyper fast perfect hash
function comes out ...).

CRC64 is used for a valid CRC use case, BUT in an invalid domain.

Mathematically, I could prove they are wrong. I know where they have a
probability problem, how they are sharpening the transition between the
functional and dysfunctional states, and how they are making it so
irreversible that they will be FUBAR.

1) They are of course leaving themselves open to a preimage attack by a
malevolent attacker who has access to the storage being checksummed (it is
almost used as a digital signature, so it should be a cryptographic hash
function). 2) The probability of random collisions is not balanced by the fact
that a cause altering the signal normally also has some odds of altering the
CRC, and the signal normally also validates the CRC; that is why, in real
life, the CRC should be a fixed fraction of the signal, sized according to the
probability of both being altered together. So if you have a tolerance for
faulty storage, you will end up using it, and you might also reconstruct data
from a randomly corrupted copy that happens to have the same CRC. It will be a
mess on the very exceptional but predictable day that shit happens, that's all
I am saying. Every safeguard will be green :)

Something seems fishy.

~~~
seiji
_First CRC16 is in fact used as a "perfect hash function"_

It must be implemented by all clients too, so it needs to be simple and
readily available.
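
(For reference, the mapping every client implements is CRC16 of the key, XMODEM variant, taken mod 16384. A bit-at-a-time sketch of my own, with hash-tag handling omitted:)

```c
#include <stdint.h>
#include <stddef.h>

/* CRC16-XMODEM: poly 0x1021, init 0, no reflection, no final XOR. */
static uint16_t crc16_xmodem(const uint8_t *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)buf[i] << 8;
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Key -> cluster slot (ignoring {hash tag} extraction for brevity). */
static unsigned key_slot(const uint8_t *key, size_t len) {
    return crc16_xmodem(key, len) % 16384;
}
```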

 _open to a preimage attack by a malevolent attacker_

That wasn't one of the design goals.

 _if you have a tolerance for faulty storage,_

There's always the problem of storage corrupting only the checksum even if
your data is okay, leaving you in a pretty pickle.

------
Scaevolus
I doubt this is as big a win as claimed. Sure, a benchmark managed to achieve
1.6GB/s CRC-64 throughput, but that requires 16KB (!) of lookup tables. If
those don't reliably stay in at least L2 cache, the throughput could be very
poor in practice.

If speed is important, xxHash is probably a better choice than CRC-64.
[http://fastcompression.blogspot.com/2012/04/selecting-
checks...](http://fastcompression.blogspot.com/2012/04/selecting-checksum-
algorithm.html)

If you want both speed and security, SipHash-2-4 is a cryptographically secure
hash algorithm that runs at ~1.4 cycles/byte for long inputs (faster than even
crc16speed), and yet is still fast for short inputs (2.4 cpb for 64 bytes). It
also doesn't require 16KB of lookup tables.
[https://131002.net/siphash/](https://131002.net/siphash/)
[http://bench.cr.yp.to/results-
auth.html#amd64-titan0](http://bench.cr.yp.to/results-auth.html#amd64-titan0)

~~~
seiji
The constraint was: the output must be CRC-64-Jones, so switching to a
different checksum wasn't an option.

So, given that constraint, any better solutions are welcome. :)

~~~
Scaevolus
Why is that a hard constraint?

~~~
fitshipit
in some cases the checksums are persisted, so future checksums must be
computed in the same way.

------
baruch
I've collected quite a few CRC32 methods and compared them all, it would be
interesting to do the same for CRC64 to get the whole range of options too.

The results are here:
[https://github.com/baruch/crcbench/blob/master/log.i3-2330M-...](https://github.com/baruch/crcbench/blob/master/log.i3-2330M-2.20GHz.txt)

~~~
b3tta
"crc-mark-adler-hw" uses SSE4.2 just like your "crc-sse42". The latter one is
just very poorly written, since it only uses _mm_crc32_u8() once per iteration
(instead of 3 times).

~~~
cperciva
_crc-mark-adler-hw_

You mean crc32c_mark_hw, I assume?

------
MrBuddyCasino
Nice! I've ported it to Java, and got roughly a 2x speed increase over the
version posted at
[http://stackoverflow.com/a/20586626/58962](http://stackoverflow.com/a/20586626/58962)
(same question Mark Adler posted in).

On Java 8, I get 647 MB/s on a 2.3 GHz MBP (the non-optimized version was
320 MB/s). Not bad, but Matt's C version manages 1628.09 MB/s.

 _Update_ : with some optimizations and eliminating I/O in the benchmark, we
now clock in at a respectable 1150 MB/s (up from 390 MB/s unoptimized).

~~~
istvan__
I am not surprised about the Java performance.

URL related: [http://java-is-the-
new-c.blogspot.com/2014/12/a-persistent-k...](http://java-is-the-
new-c.blogspot.com/2014/12/a-persistent-keyvalue-server-in-40.html)

~~~
jerven
I really hope this gets into Java9
[http://objectlayout.org/](http://objectlayout.org/) Gives the same benefits
as C structs while still behaving like Java objects. i.e. has the structs
memory semantics while having java function/method semantics associated with
the memory.

Better explained in this presentation
[http://medianetwork.oracle.com/video/player/3730888956001](http://medianetwork.oracle.com/video/player/3730888956001)
(there are other sources as well).

~~~
MrBuddyCasino
That's a good one, Gil Tene is really a good speaker. For anyone who (like me)
thinks "structs are just classes without methods, why are they faster?": the
answer is that this is about arrays, which are currently slow in Java because
they are:

- mutable: elements can be changed, nulled, etc.

- polymorphic: they can contain T or subtype-of-T

This makes fast streaming of objects impossible, because you need to deref
each object. If this changes, object addresses can be calculated like in C,
which makes things fast: flat memory is fast, and prefetchers work their magic.

------
Isamu
The improvement is to switch from a 256-entry table to an 8x256 table,
apparently unrolling the 8-bit loop to get a full 64 bits in each iteration.

So if speed is the thing, why not a 16-bit table? Same idea, faster. 8 bytes x
64K entries = a 512 KB table.

~~~
seiji
If your lookup table is too big it'll blow out your L1 cache too.

 _apparently unrolling_

It's not just _unrolling_ : it's merging position-dependent pre-computed
values and xor'ing them together. A little more complicated than a quick
loop unrolling (hence, nobody discovered it until 2006).

~~~
Isamu
> blow out your L1 cache

Fair enough.

> more complicated than just a quick loop unrolling

Yeah, or else it would be 8 repetitions of the same line.

But its speed comes from eliminating *byte++ in each of the unrolled lines.
Instead it just fetches one 64-bit word into a register and works on that for
the unrolled lines.

I guess this surprised me - how much of a difference that makes.
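
Roughly, the hot loop looks like the sketch below (my own illustration, not the article's code: it assumes a little-endian target, the reflected shift-right CRC convention, and that `base` holds the project's existing byte-at-a-time table; init/final-XOR handling is left to the caller). One 64-bit load replaces eight dependent byte loads, and the eight lookups hit eight independent 256-entry tables before being folded with XOR:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Assumed: the existing 256-entry byte-at-a-time table for a reflected
 * (shift-right) CRC-64, whatever polynomial it encodes. */
static uint64_t base[256];

/* lanes[0] == base; lanes[k][n] == effect of byte n followed by k zero
 * bytes, which is what lets the eight lookups below run independently. */
static uint64_t lanes[8][256];

static void init_lanes(void) {
    for (int n = 0; n < 256; n++) {
        uint64_t c = base[n];
        lanes[0][n] = c;
        for (int k = 1; k < 8; k++) {
            c = base[c & 0xff] ^ (c >> 8);   /* advance by one zero byte */
            lanes[k][n] = c;
        }
    }
}

/* Hot loop: one 64-bit fetch and eight independent table lookups per
 * iteration, instead of eight dependent byte loads. Little-endian only. */
static uint64_t crc64_slice8(uint64_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    while (len >= 8) {
        uint64_t word;
        memcpy(&word, p, 8);                 /* unaligned-safe fetch */
        crc ^= word;
        crc = lanes[7][crc & 0xff]         ^ lanes[6][(crc >> 8) & 0xff] ^
              lanes[5][(crc >> 16) & 0xff] ^ lanes[4][(crc >> 24) & 0xff] ^
              lanes[3][(crc >> 32) & 0xff] ^ lanes[2][(crc >> 40) & 0xff] ^
              lanes[1][(crc >> 48) & 0xff] ^ lanes[0][crc >> 56];
        p += 8;
        len -= 8;
    }
    while (len--)                            /* byte-at-a-time tail */
        crc = base[(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return crc;
}
```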

~~~
MrBuddyCasino
I mistook it for simple loop unrolling too, and was surprised at the
difference. Take a look at the "cycles per byte" numbers: the difference is
big, and I guess the linear access patterns involved mask the memory latencies
very well.

------
tlipcon
Existing library which does the same thing, and faster, with arbitrary
polynomials:
[http://code.google.com/p/crcutil/](http://code.google.com/p/crcutil/)

~~~
seiji
Not entirely the same. The problem isn't the polynomial, but the entire model.

crcutil also has many fatal flaws for inclusion in other projects: the code is
_horrifically ugly_ , it's hosted on google code so it looks unmaintained and
abandoned, it's C++, it uses unreadable ASM to be "fast" but that's not
portable across all the architectures Redis runs on, and it has a horrible
build system (because of ASM and C++ and other issues).

~~~
tlipcon
True it's a pain to embed. We embedded it in my current project, and it only
took a few hours to add a CMakeLists.txt and make it reasonably embeddable.

I guess if your project doesn't want to include C++ (and you don't want to
write a simple C wrapper), you've got a fair point. But it sounds like their
implementation is significantly faster than the one in the linked post.

------
MichaelGG
Why not use xxHash?
[https://code.google.com/p/xxhash/](https://code.google.com/p/xxhash/)

Docs say it hits over 13 GB/s on a Core i5-3340M @ 2.7 GHz.

Or if this is an older design decision, why not use MurmurHash, which is also
pretty fast and has been around for a while.

~~~
seiji
That's almost like answering "do you want chicken or steak?" with "I'd like a
balloon!"

The article didn't pretend to compare all possible options; it aimed to
improve existing options while maintaining backwards compatibility.

Redis doesn't make design decisions based on what's fast, it prefers to use
what's simple.

------
tedunangst
> This is a pure performance optimization, not a correctness requirement.
> Removing the initial pre-alignment loop doesn't change any results.

Somebody needs to test their code on a CPU that requires alignment before
loading 64-bit values.
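
For what it's worth, the usual portable fix is to funnel the fetch through memcpy, which compilers lower to a single load where unaligned access is legal and to safe byte loads on strict-alignment CPUs (a sketch, not the article's code):

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 64-bit load: memcpy into a local compiles down to a
 * plain load on x86 and to byte loads (or a few shifts) on CPUs that trap
 * on unaligned access -- no SIGBUS either way. */
static inline uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```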

