
The search for a faster CRC32 - alfiedotwtf
https://blog.fastmail.com/2015/12/03/the-search-for-a-faster-crc32/
======
minimax
I read more about the slice-by-N algorithms because they sounded really
interesting. The way they work is by using a set of lookup tables that are 4k
to 16k in size (larger lookup table for larger N). The reason they are fast is
because the lookup tables fit within the L1 cache on modern CPUs. So when you
do 100M rounds of CRC32 it is super fast because the table is always cache
hot, but I don't think this result is informative if you just want to
occasionally do a CRC in between doing other types of work (especially for
small buffer sizes). You will have to wait as the lookup tables are brought up
through the cache hierarchy _and_ you are potentially evicting other useful
data from the cache at the same time. Presumably PCLMULQDQ does not have this
drawback.
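
Roughly, the slice-by-4 inner loop looks like this - a minimal sketch in C
(table construction included so it's self-contained; the names are mine, not
from any particular implementation):

    #include <stdint.h>
    #include <stddef.h>
    
    static uint32_t crc_tab[4][256];  /* 4 x 1 KB; slice-by-8/16 just add tables */
    
    static void crc32_init_tables(void)
    {
        for (int i = 0; i < 256; i++) {
            uint32_t c = (uint32_t)i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? 0xedb88320u ^ (c >> 1) : c >> 1;
            crc_tab[0][i] = c;  /* classic byte-at-a-time table */
        }
        for (int i = 0; i < 256; i++)
            for (int t = 1; t < 4; t++)  /* shifted tables for the other lanes */
                crc_tab[t][i] = crc_tab[0][crc_tab[t - 1][i] & 0xff]
                              ^ (crc_tab[t - 1][i] >> 8);
    }
    
    static uint32_t crc32_slice4(uint32_t crc, const uint8_t *p, size_t len)
    {
        crc = ~crc;
        while (len >= 4) {
            /* fold in 4 input bytes, then replace the whole CRC with
               four independent table lookups, one per byte lane */
            crc ^= (uint32_t)p[0] | ((uint32_t)p[1] << 8)
                 | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
            crc = crc_tab[3][crc & 0xff]
                ^ crc_tab[2][(crc >> 8) & 0xff]
                ^ crc_tab[1][(crc >> 16) & 0xff]
                ^ crc_tab[0][crc >> 24];
            p += 4;
            len -= 4;
        }
        while (len--)  /* byte-at-a-time tail */
            crc = crc_tab[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
        return ~crc;
    }

The four lookups per word are independent, which is where the speedup comes
from - but all the tables have to stay resident in cache for that to hold.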

~~~
cmurphycode
This is a great point and a huge mental problem I have when looking at (and
writing) benchmarks. If anyone knows more about this (i.e. if they can show
why it isn't true) please speak up.

~~~
robn_fastmail
I think that mostly, your benchmarks have to match your workloads. Most of the
CRC32 benchmarks I've seen are looking at larger buffers. The xxhash function
mentioned elsewhere in this thread was claimed to be "an order of magnitude"
faster, but again, large buffers - the gain over CRC32 on the same tests was
rather more modest (though not at all insignificant).

In this case, I think (but am curious, will investigate further at some point)
our Cyrus servers are doing enough checksumming work to keep any tables hot in
the cache. So the tests are hopefully a useful indicator of where improvements
can be made.

~~~
minimax
According to the Stephan Brumme website you linked to, the slice-by-8 lookup
table is 8K and the slice-by-16 table is 16K, so your combo version of crc32
needs 24K of L1 cache to run at full speed. Modern server class CPUs typically
have 32K of L1 dcache so that doesn't leave much room for the rest of your
work. Maybe that's reasonable (I don't really know what Cyrus does), but I
thought it was worth thinking about.

~~~
brongondwana
Most of the time we're iterating through a cyrus.index, where there's 104
bytes per record, and we're doing a CRC32 over 100 of them, or we're reading
through a twoskip database where we're CRCing the header (24+ bytes, average
32) and then doing a couple of mmap lookup and memcmp operations before
jumping to the next header, which is normally only within a hundred bytes
forwards on a big and mostly sorted database. The mmap lookahead will also
have been in that close-by range.

Also, our oldest CPU on production servers seems to be the E5520 right now,
which has 128kb of L1 data cache.

~~~
vardump
I'm fairly sure E5520 has 32 kB L1 data cache, not 128 kB. L1 caches are core
local, not shared like L3.

~~~
vidarh
This is what the datasheet says [1]:

    - Instruction Cache = 32kB, per core
    - Data Cache = 32kB, per core
    - 256kB Mid-Level cache, per core
    - 8MB shared among cores (up to 4)

So I guess the confusion is that Intel moved the L2 cache onto each core (from
Nehalem onwards, I think?) and used that opportunity to substantially lower
latency for it.

[1]
[http://www.intel.com/content/www/us/en/processors/xeon/xeon-...](http://www.intel.com/content/www/us/en/processors/xeon/xeon-5500-vol-1-datasheet.html)

------
mockery
> cloudflare is amazing until the input buffer gets under 80 bytes. That's the
> point where it stops using the optimised implementation and falls back to
> the regular zlib implementation (slice-by-4). I'm not sure why (no
> explanatory comments I could find), but it's a showstopper for our uses.

Why on earth is this a showstopper?

~~~
krallja
> The buffers we checksum are small, the minimum being 24 bytes (a twoskip
> header), average perhaps 32 bytes. This is our target case.

------
revelation
If they have CRC accounting for 10% of CPU, they must be using these checksums
a lot. At some point I'd imagine the false error rate simply due to bit flips
and other random errors on the path from database through CRC function will
outweigh whatever value you are getting from the constant rechecks.

Also, literature suggests a throughput of ∼2.67 bytes per cycle for the CRC32
instruction, a threefold improvement over best-in-class non-HW routines. I'm
quite sure it would be worth it to reconvert previous checksums if you can do
so in a way that minimizes downtime (think TrueCrypt doing a transparent
initial encryption; not encrypting when there's I/O load).
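
For what it's worth, the ∼2.67 number is consistent with the published
instruction timings: the 64-bit form folds in 8 bytes but has a 3-cycle
latency, and a single serial chain is latency-bound, so 8/3 ≈ 2.67 bytes per
cycle. A minimal sketch, assuming SSE4.2 (compile with -msse4.2) and an
8-byte-aligned buffer:

    #include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */
    #include <stdint.h>
    #include <stddef.h>
    
    /* Each _mm_crc32_u64 folds 8 bytes into the running CRC. The chain
       is serial on c, so throughput is bounded by the instruction's
       3-cycle latency: 8 bytes / 3 cycles ~= 2.67 bytes/cycle. Note
       this computes CRC-32C (Castagnoli), not the zlib polynomial. */
    static uint32_t crc32c_hw(uint32_t crc, const uint64_t *p, size_t n_words)
    {
        uint64_t c = crc;
        while (n_words--)
            c = _mm_crc32_u64(c, *p++);
        return (uint32_t)c;
    }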

~~~
amluto
Is the CRC32 instruction much faster than an optimized implementation using
PCLMULQDQ? It's available on a wider range of CPUs, but I thought I remembered
that PCLMULQDQ worked very quickly.

~~~
mcbain
I think when I looked single instr crc32 was ~5 times faster than the
pclmulqdq version.

As an aside, if they have a server in production that doesn't support CLMUL,
they should junk that machine - it appeared in Westmere and Bulldozer -
everything earlier is EOLed. CRC32 was in SSE4.2, so Nehalem.

The crc32 instruction is hard to beat and has been around for years. I would
guess the different poly could be phased in on their systems the way you would
phase in a new password hashing algorithm - or just try the new poly first,
since it is quick, and fall back to the old poly if it fails.
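
As a sketch of that phase-in (all names here are hypothetical, including the
record-update helper):

    #include <stdint.h>
    #include <stddef.h>
    
    extern uint32_t crc32_new(const void *buf, size_t len);  /* HW poly, cheap */
    extern uint32_t crc32_old(const void *buf, size_t len);  /* legacy zlib poly */
    extern void store_crc(void *rec, uint32_t crc);          /* hypothetical */
    
    /* Try the cheap new polynomial first; on mismatch, accept the old
       one and opportunistically upgrade the stored value. */
    static int crc_matches(void *rec, const void *buf, size_t len,
                           uint32_t stored)
    {
        uint32_t c = crc32_new(buf, len);
        if (c == stored)
            return 1;                    /* already migrated */
        if (crc32_old(buf, len) == stored) {
            store_crc(rec, c);           /* upgrade in place */
            return 1;
        }
        return 0;                        /* genuine mismatch */
    }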

------
ambrop7
> kernel, oh, how I wanted to like this. Every CRC32 operation consists of a
> write to and read from a socket...

Why didn't they extract the CRC32 code from the kernel and try it directly?

[https://github.com/torvalds/linux/blob/master/lib/crc32.c](https://github.com/torvalds/linux/blob/master/lib/crc32.c)
[https://github.com/torvalds/linux/blob/master/include/linux/...](https://github.com/torvalds/linux/blob/master/include/linux/crc32.h)

------
melted
Modern Intel CPUs have an instruction specifically to compute CRC. This
instruction is easily consumed through a C++ intrinsic, literally in one line
of code. You can't do any better than that, no matter what you use.
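
For instance (a sketch assuming SSE4.2 and <nmmintrin.h>; as noted below, the
result is CRC-32C, not the zlib polynomial):

    #include <nmmintrin.h>  /* SSE4.2 */
    #include <stdint.h>
    
    /* the "one line" in question: fold a byte into a running CRC-32C */
    static inline uint32_t crc32c_step(uint32_t crc, uint8_t byte)
    {
        return _mm_crc32_u8(crc, byte);
    }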

~~~
jfb
In this particular case, however:

 _[Intel's CRC32 CPU instruction] uses different inputs to the CRC32
algorithm (known as the polynomial) which is apparently more robust, and is
used in networks, filesystems, that sort of thing. It gives different results
to the "standard" polynomial, typically used in compression._

They would have to go back and recompute all their existing stored checksums.

~~~
toolslive
That's why you have to tag the value with an identifier of the algorithm, so
you can change your mind later. The problems they currently face are a direct
consequence of an architectural mistake a few years ago.
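
A minimal sketch of that tagging idea (names illustrative, not from Cyrus):

    #include <stdint.h>
    
    enum crc_algo {
        CRC32_ZLIB = 1,  /* the "standard" compression polynomial */
        CRC32C_HW  = 2,  /* the Castagnoli polynomial the CPU computes */
    };
    
    /* store the algorithm id next to the value so it can change later */
    struct tagged_crc {
        uint8_t  algo;   /* enum crc_algo */
        uint32_t value;
    };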

~~~
jfb
Maybe; maybe not. Regardless, they aren't in a position to alter their
software to the degree required _at this time_.

~~~
robn_fastmail
Indeed. As I note in the post, this wasn't something we _needed_ to do at that
time; we were just curious.

Replacing a function with a faster implementation is trivial; it's just another
deployment and we do several of those each week. Changing the data format adds
operational complexity - two codepaths need to be run and maintained, and we
increase system load and worsen response times for the duration. (The data
format is versioned, so that's easy - no architectural problem there.)

It's totally worth doing if you need to, and we're not afraid of that, but you
don't do it on a whim - it takes proper planning and testing and needs
multiple people involved.

------
hackcasual
> Next, we have to get our optimisation settings right. We compile Cyrus with
> no optimisations and full debugging because it makes it really easy to work
> with crash dumps.

Wait, so are these results without letting the compiler optimize?

~~~
robn_fastmail
The results in the tests are all with optimisations (-O3 -march=sandybridge
-mtune=intel), as mentioned in the post.

The exception is the Debian packaged version of zlib, because we don't control
that. That's the reason I include stock zlib in the tests - if it had been
wildly different from the system zlib, I would have looked into recompiling
the Debian package with more optimisations. There were no major differences
and indeed, inspecting the package further shows that it is compiled with
optimisations.

On the Cyrus side, we used to link to the Debian zlib, so we get those
optimisations. For the new code bundled in Cyrus itself, we have to enable
optimisations ourselves.

~~~
coherentpony
Did you guys ever try splashing out for a small Intel compiler licence and
bootstrapping a test system?

------
vidarh
I love how the linked Twoskip PDF presentation [1] lists as one of the
features:

"Easy to rewrite in other languages (e.g. Perl!)"

[1]
[http://opera.brong.fastmail.fm.user.fm/talks/twoskip/twoskip...](http://opera.brong.fastmail.fm.user.fm/talks/twoskip/twoskip-yapc12.pdf)

------
jszymborski
Why not rip out CRC32 and put in xxhash?

~~~
baudehlo
Last time I needed really fast hashing I used FNV. How does it compare to
xxhash?
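
For context, FNV-1a is about as simple as a hash gets - one xor and one
multiply per byte. A reference sketch (the constants are the standard 32-bit
FNV offset basis and prime):

    #include <stdint.h>
    #include <stddef.h>
    
    static uint32_t fnv1a32(const uint8_t *p, size_t len)
    {
        uint32_t h = 2166136261u;  /* FNV offset basis */
        while (len--) {
            h ^= *p++;             /* xor first: the "1a" variant */
            h *= 16777619u;        /* FNV prime */
        }
        return h;
    }

That byte-serial loop is why it's quick on tiny keys, and also why it falls
behind functions like xxHash, which process wider stripes per iteration.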

~~~
jandrewrogers
There are hash functions that are as fast or faster than FNV and statistically
stronger (and faster) than xxhash. There are multiple hash function families
that should be used before either of the above in modern applications unless
you need backward compatibility (like the CRC case in the article). This is an
active research area and both of the above, while adequate for many casual use
cases, should not be used for checksums for the same reason CRC32 should not
be used for checksums in large systems.

As an example from one of my hash function research projects, MetroHash64 will
outperform xxhash both for speed and statistical robustness (which is quasi-
cryptographic). If you only need 32 bits, truncate larger hashes; if they have
very strong statistical properties, that works well.
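
The truncation is literally a cast (sketch; metrohash64 here is a stand-in
prototype, not the real MetroHash API):

    #include <stdint.h>
    #include <stddef.h>
    
    extern uint64_t metrohash64(const void *buf, size_t len, uint64_t seed);
    
    /* keep the low 32 bits of a statistically strong 64-bit hash */
    static uint32_t hash32(const void *buf, size_t len)
    {
        return (uint32_t)metrohash64(buf, len, 0);
    }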

Lots of people working on this stuff.

~~~
jorangreef
Comparing xxHash with FNV is a strawman; it's apples and oranges in terms of
quality.

You are discounting xxHash considerably. xxHash is one of the best, if not the
best, all factors considered.

SMHasher:

    xxHash,        speed=5.4 GB/s,  quality=10
    MurmurHash 3a, speed=2.7 GB/s,  quality=10
    SBox,          speed=1.4 GB/s,  quality=9
    Lookup3,       speed=1.2 GB/s,  quality=9
    CityHash64,    speed=1.05 GB/s, quality=10
    FNV,           speed=0.55 GB/s, quality=5
    CRC32,         speed=0.43 GB/s, quality=9
    MD5-32,        speed=0.33 GB/s, quality=10
    SHA1-32,       speed=0.28 GB/s, quality=10

~~~
jandrewrogers
With all due respect, xxHash is a decent hash function but you are comparing
it against some relatively weak or slow hash functions. As was pointed out
last time this came up, the Metro64 hash example I offered is simultaneously
faster and higher quality than xxHash. There are hash functions that are
faster than xxHash _and_ stronger than some of the cryptographic hashes, and
xxHash definitely has much more bias than even MD5.

Getting a perfect score with SMHasher is table stakes. The default settings on
SMHasher are much too loose for current research; I know it well, I have
hacked it extensively. A state-of-the-art non-cryptographic hash function
today generally has the following properties:

- statistical quality greater than or equal to MD5, a cryptographic hash.
xxHash is not close to MD5 in quality.

- faster than xxHash on large keys, and for really good hashes, memory
bandwidth bound. (Metro64 is, again, faster than xxHash, though not memory
bandwidth bound.)

- faster than xxHash on small keys (Metro64, to use that example, is almost
2x faster)

There are a lot of good hash functions being developed by researchers. But
empirically, xxHash is nowhere near the state-of-the-art any way you slice it.
It is a decent hash function but there are many functions produced by many
researchers that are both faster and higher quality. It isn't personal;
people can measure it for themselves.

~~~
jorangreef
Yes, I remember reading your post on Metro64 at the time:

[http://www.jandrewrogers.com/2015/05/27/metrohash/](http://www.jandrewrogers.com/2015/05/27/metrohash/)

Interesting that you based your results there on SMHasher. Are there better
benchmarks available?

------
aidenn0
I wonder if some other hash might be faster (e.g. Fletcher-16).

~~~
dalke
If they were going to use a different hash function then they would more
likely use the built-in CRC32 instruction, which computes CRC-32C. That would
certainly be faster.

------
ape4
Hope they contribute the change back to Cyrus.

~~~
robn_fastmail
Same day, in fact:
[https://git.cyrus.foundation/diffusion/I/browse/master/lib/c...](https://git.cyrus.foundation/diffusion/I/browse/master/lib/crc32.c)

