On a side note, the same goes for LZ4 being twice as fast as Speedy (from the same author).
It would be interesting to see how they stack up on other platforms like ARM with and without NEON, PPC with and without Altivec, and MIPS. Without that data, the algorithms look pretty tailored for SSE3, and so I'd hesitate to build them into any application or protocol that needs to be portable or implemented in embedded devices.
Lots of comparisons to MurmurHash3, which is my preferred hashing function for generic (read: non-string) data. I have since switched to CityHash for string hashing.
Case in point: Judy arrays. That is some of the most heavily-optimized code I have ever seen, optimized through-and-through for each and every possible CPU-related factor including the various cache levels, instruction pipeline optimizations, and more. Yet the codebase is positively gargantuan compared to any other implementation of a dynamic array or hash table, mainly as a result of these optimizations.
Don't be so quick to judge by LoC.
On a single core of a 2.67GHz Intel Xeon X5550, CityHashCrc256 peaks at about 5 to 5.5 bytes/cycle.
That is pretty impressive. I don't think there's room for many cache misses in that performance envelope.
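For a sense of scale, here is a rough back-of-the-envelope conversion of that bytes/cycle figure into throughput. It is only a sketch: it assumes the core runs at its nominal 2.67GHz the whole time (no turbo or throttling) and takes 1 GB as 10^9 bytes. The function name is mine, not anything from CityHash.

```python
def bytes_per_cycle_to_gb_per_s(bytes_per_cycle, clock_ghz):
    """Convert a bytes/cycle rate to GB/s, assuming a fixed clock.

    bytes/cycle * (1e9 cycles/s per GHz) / (1e9 bytes per GB)
    simplifies to bytes_per_cycle * clock_ghz.
    """
    return bytes_per_cycle * clock_ghz


# Figures quoted above: 5 to 5.5 bytes/cycle on a 2.67GHz Xeon X5550.
CLOCK_GHZ = 2.67  # assumed constant nominal clock
low = bytes_per_cycle_to_gb_per_s(5.0, CLOCK_GHZ)
high = bytes_per_cycle_to_gb_per_s(5.5, CLOCK_GHZ)

print(f"{low:.1f} to {high:.1f} GB/s on one core")
```

That works out to somewhere around 13 to 15 GB/s on a single core, so the input has to be arriving at that rate for the peak figure to hold.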