Why Can't Hashes Just Agree on Endianness?
8 points by DrFrugal 17 days ago | 2 comments
Endianness is the fourth Programming Horseman of the Apocalypse - right after Timezones, Character Encoding and Concurrency... ... but sometimes it's actually the first one. (shamelessly stolen reply from the Rust Discord because it made me chuckle lol)

I am rewriting an old C++ project in Rust, and one of its features is showing CRC32 (4 bytes) and MD5 (16 bytes) file hashes. During implementation I noticed that the old program shows the CRC32 value as Little Endian, while MD5 is printed in Big Endian.

At first I thought it might be a mistake. Soon I realized that this difference is consistent across other applications. I checked CRC32 in 7-Zip File Manager and the crc32 utility on Linux -> Little Endian. Then MD5 via md5sum on Linux and Get-FileHash in PowerShell on Windows -> Big Endian. Confusion set in - I ran the 64- and 128-bit variants of xxHash3 via xxhsum on Linux and Windows separately -> Big Endian. SHA256 via Get-FileHash in PowerShell on Windows and sha256sum on Linux -> Big Endian.

My programmer gut feeling tells me that the reason for this difference is that some of those hash values fit in integer primitives, while the others have to be stored in arrays. The ones that fit in integer primitives are shown as Little Endian (I could only check on my Little Endian machine); everything that has to use arrays is shown as Big Endian.
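To make that primitive-vs-array distinction concrete, here is a minimal Rust sketch; the CRC value is a made-up placeholder, not a real digest. Formatting an integer in hex is host-independent, while dumping the integer's in-memory bytes follows the machine's native byte order - one way a reversed-looking checksum can sneak into output:

    fn main() {
        // Placeholder value standing in for a CRC32 result held in a u32.
        let crc: u32 = 0x1234ABCD;

        // Hex-formatting the integer is host-independent:
        // this prints "1234abcd" on both little- and big-endian machines.
        println!("formatted as integer: {:08x}", crc);

        // Dumping the integer's in-memory bytes is NOT host-independent:
        // on a little-endian machine this prints "cdab3412".
        let native: String = crc.to_ne_bytes().iter().map(|b| format!("{:02x}", b)).collect();
        println!("native byte dump:     {}", native);
    }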

To be honest, I am actually questioning myself: why don't we treat hash values as arrays regardless of size in the first place? Why not just type the CRC32 as [u8; 4]? Printing the hex representation of that would give you Big Endian, regardless of which architecture you are on. This would be consistent: with MD5 being [u8; 16] and SHA256 being [u8; 32], for example, every hash gets the same treatment when writing out values for humans to read.
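As a rough sketch of that idea (assuming an integer CRC gets converted once with to_be_bytes() at the boundary, and the other digests already arrive as byte arrays), a single hex-printing routine then covers every digest length and never depends on the host:

    // One printing routine for all digest sizes; the byte order is fixed once,
    // at the point where an integer result is turned into bytes.
    fn to_hex(bytes: &[u8]) -> String {
        bytes.iter().map(|b| format!("{:02x}", b)).collect()
    }

    fn main() {
        // Placeholder digests; real ones would come from a hashing crate.
        let crc: u32 = 0x1234ABCD;
        let crc_bytes: [u8; 4] = crc.to_be_bytes(); // most significant byte first
        let md5_bytes: [u8; 16] = [0u8; 16];
        let sha256_bytes: [u8; 32] = [0u8; 32];

        // Output depends only on the array contents, never on the architecture.
        println!("CRC32:  {}", to_hex(&crc_bytes));
        println!("MD5:    {}", to_hex(&md5_bytes));
        println!("SHA256: {}", to_hex(&sha256_bytes));
    }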

My best guess is that this is technical debt from ye olden days. Someone decided to use a primitive for their hash function's return value, everyone afterwards wanted to match that implementation, and then the issue of non-primitive-sized hashes arose. Now we have this mess and nobody wants to be "the one who breaks convention and doesn't return the same string as everyone else".

If you, dear reader, have a Big Endian machine, please do me a favor and check CRC32 or XXH3_64/XXH3_128 hashes. Tell me what your outcome was and whether it differs from a Little Endian machine. If the checksum is printed differently for the same input on a Big Endian machine... better not even think about it.

This is not some low-effort rant... I actually did quite a bit of research, but I didn't find any common convention for how hash values should be printed. Google searches were inconclusive. Maybe I was searching for the wrong terms?

xxhsum: actually talks about endianness in its man page and uses Big Endian as the default, while providing a Little Endian switch... whoever implemented/documented this, I love you <3
md5sum: neither "endian" nor "order" is mentioned in the man page
sha256sum: same as above
Get-FileHash in PowerShell: same as above

If you are up for the challenge, I would be happy to start a discussion, hear different viewpoints, and maybe even gain more insight into how we got to the current predicament.

My opinion is that Big Endian is probably the most readable form for humans when serializing, since it matches what you see in a memory viewer (this might differ on Big Endian machines, I guess). Serialized hash values are for humans, not for machines. It would be a consistent way of writing hash values, regardless of their exact length. And it would probably be best to always document an application's default behavior and provide switches to change it, just as xxhsum does.

You will remember this article every time you look at hashes from now on. Have a nice day ;)




Python's hash method returns an int. I guess changing it to return bytes will be something for Python 4.


fwiw, my hashing library (http://guava.dev/HashFunction) went little-endian. It was probably for some reason.



