Hacker News new | past | comments | ask | show | jobs | submit login

I expect that's because it was designed very well for the days of serial, in-order, static execution. But, CPUs today are all about pipelining and parallelism. Herb Sutter's "Free Lunch" has been over for longer than most people here have been writing software. http://www.gotw.ca/publications/concurrency-ddj.htm





Sutter's solution was to write more concurrent code, but that's not what helps in this case. Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Traditionally the fastest CRC code has used lookup tables, I wonder if that's creating cache pressure these days? Or maybe that approach was abandoned while I wasn't paying attention.


> Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Instruction-level parallelism is incredibly important. I'd say that any optimizing programmer needs to fully understand ILP, and how it interacts with pipelines and dependency cutting (and register renaming).

Modern CPUs are extremely well parallelized with ILP. Any good, modern hash function will take advantage of this feature of modern CPUs.

Case in point, it seems like xxhash is SCALAR 32-bit / 64-bit code. No vectorization as far as I can tell, its purely using ILP to get its speed.

Intel Assembly has a 64-bit multiplier (but vectorized only has 32-bit multipliers). I've theorized to myself that this 64-bit multiplier could lead to better mixing than the vectorized instructions, and it seems like xxhash goes for that.

The 32-bit version of xxhash can likely be vectorized and optimized even further.



Nope. The fastest crc code now uses the intrinsics and some parallel execution tricks, see crclib from Cloudflare.

Where do I find this library? Googling “crclib from Cloudflare” without quotes returns your comment as the top hit :P


The OP probably finds it funny because Intel processors since Nehalem have had an instruction for computing CRC32.

Though, sadly, the CRC32 instruction computes the wrong polynomial for Zip and most other formats. :(

Where is the disparity between the two coming from?

There are two widely used polynomials, IEEE and Castagnoli. These parameterize the CRC-32 algorithm, and are sometimes known as crc32 and crc32c.

The IEEE polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA, Zip and other formats.

The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI, SCTP and other formats.

For better or worse, the hardware instruction computes CRC-32-Castagnoli, which means it's not relevant for e.g. Zip.


Ahh interesting. Thank you.

Is there any particular advantage of one polynomial over the other?


https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Standa... suggests that Castagnoli has better "error detection capacity". What that means exactly (in terms of information theory) and whether that difference matters in practice is getting beyond my expertise level. In any case, existing formats like Zip are what they are, and as I mentioned in the blog post comments, it is still possible to SIMD-accelerate the IEEE flavor.

It still should be vectorizable without the dedicated instruction. That isn't evident here so the comparison is questionable.



Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: