Traditionally the fastest CRC code has used lookup tables; I wonder whether that's creating cache pressure these days, or whether the approach was abandoned while I wasn't paying attention.
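For reference, a minimal sketch of the classic table-driven approach, assuming the reflected IEEE polynomial 0xEDB88320 (the Gzip/PNG/Zip one). The single 256-entry table here is 1 KiB; the faster "slice-by-8" variants use 8 KiB of tables, which is where the cache-pressure question comes from. Function names are mine, not from any particular library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic one-byte-at-a-time, table-driven CRC-32 (reflected IEEE
 * polynomial 0xEDB88320). The table is 256 * 4 bytes = 1 KiB;
 * slice-by-8 variants process 8 bytes per step but need 8 KiB. */
static uint32_t crc32_table[256];

static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
        crc32_table[i] = crc;
    }
}

static uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu; /* standard initial value */
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc32_table[(crc ^ buf[i]) & 0xFF];
    return crc ^ 0xFFFFFFFFu; /* standard final inversion */
}
```

The standard check value for the ASCII string "123456789" under this polynomial is 0xCBF43926, matching zlib's `crc32()`.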
Instruction-level parallelism is incredibly important. I'd say that any optimizing programmer needs to fully understand ILP: how it interacts with pipelines and register renaming, and techniques like dependency cutting.
Modern CPUs are extremely well parallelized with ILP. Any good, modern hash function will take advantage of this feature of modern CPUs.
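As an illustrative sketch of dependency cutting (names and example mine): splitting one serial accumulator into four independent ones lets the CPU overlap the additions, because each chain only depends on itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition depends on the previous one,
 * so the loop runs at roughly one element per add-latency. */
static uint64_t sum64(const uint64_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four independent accumulators: the four dependency chains don't
 * touch each other, so an out-of-order CPU can run them in
 * parallel on its multiple ALUs -- the ILP win. */
static uint64_t sum64_ilp(const uint64_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) /* leftover tail */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

This is the same structural trick a multi-lane hash uses: several independent accumulators updated per loop iteration, merged only at the end.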
Case in point: it seems like xxhash is SCALAR 32-bit / 64-bit code. No vectorization as far as I can tell; it's purely using ILP to get its speed.
Intel assembly has a 64-bit multiplier (but the vector instructions only have 32-bit multipliers). I've theorized to myself that this 64-bit multiplier could lead to better mixing than the vectorized instructions, and it seems like xxhash goes for that.
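To illustrate why a full 64-bit multiply is attractive for mixing: one multiply by a suitable odd constant lets every input bit influence bits all the way up the 64-bit product, and the xor-shifts fold the high half back down. This is the well-known splitmix64-style finalizer, shown purely as an illustration; it is not xxhash's actual mixing step:

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64-style finalizer: two full 64-bit multiplies plus
 * xor-shifts. A scalar 64-bit multiply spreads each input bit
 * across the whole word in one instruction -- mixing power that a
 * 32-bit vector multiply can't match per operation. */
static uint64_t mix64(uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}
```

Note the trade-off this finalizer exhibits: zero maps to zero, which is why real hashes fold in seeds and lengths before finalizing.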
The 32-bit version of xxhash can likely be vectorized and optimized even further.
The IEEE polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA, Zip and other formats.
The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI, SCTP and other formats.
For better or worse, the hardware instruction computes CRC-32-Castagnoli, which means it's not relevant for e.g. Zip.
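For concreteness, here's a bit-at-a-time software version of what the SSE4.2 `crc32` instruction computes, using the reflected Castagnoli polynomial 0x82F63B78. Its check value for "123456789" is 0xE3069283, not the 0xCBF43926 that the IEEE polynomial produces, which is exactly why the instruction can't be used for Zip:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial
 * 0x82F63B78 -- the same function the x86 SSE4.2 crc32
 * instruction implements, just much slower. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

On hardware with SSE4.2, the inner work collapses to the `_mm_crc32_u64` intrinsic per 8 bytes, but the polynomial is baked into the silicon.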
Is there any particular advantage of one polynomial over the other?