Hash functions operate on byte strings. But, sometimes you want to hash data structures. So you serialize the structure and hash the serialization.
You need to be very careful about how you serialize. It's critical that the serialization actually be unique to the particular input. E.g. if you have two different types of data structures that you hash, it's important that no instance of the first type has the same serialization as some instance of the second type. Another common problem is when people hash a structure containing multiple values by simply concatenating the values and hashing the concatenation. If you serialize both `["a", "bc"]` and `["ab", "c"]` as "abc", then they will have the same hash. That's bad!
One way to think about this is to design your serialization such that it can be unambiguously parsed back to the original structure. It doesn't necessarily have to be convenient to parse, just possible. If you aren't experienced with designing serialization schemes, though, it may be best to use a common scheme like JSON or Protobuf. But, don't forget that if you have multiple types of structures, your serialization must specify its own type. For JSON, you could add a `"type": "MyType"` property. For Protobuf, define a single top-level type which is a big "oneof" (union) of all possible types, and always serialize as that top-level type.
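A minimal sketch of the "make it unambiguously parseable" idea, assuming TypeScript on Node with its built-in `crypto` module; the 4-byte big-endian length prefix is just an illustrative choice, not a standard:

```typescript
import { createHash } from "node:crypto";

// Length-prefix every element before concatenating, so that
// ["a", "bc"] and ["ab", "c"] can no longer serialize to the same bytes.
function serializeStrings(items: string[]): Buffer {
  const parts: Buffer[] = [];
  for (const item of items) {
    const bytes = Buffer.from(item, "utf8");
    const len = Buffer.alloc(4);
    len.writeUInt32BE(bytes.length); // illustrative 4-byte big-endian prefix
    parts.push(len, bytes);
  }
  return Buffer.concat(parts);
}

function hashStrings(items: string[]): string {
  return createHash("sha256").update(serializeStrings(items)).digest("hex");
}

console.log(hashStrings(["a", "bc"]) === hashStrings(["ab", "c"])); // false
```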
I'm pretty sure that JSON does not have a well-defined serialization across languages. If you rely on an ECMAScript 6 (ES6) engine, then folks in other languages will have to reproduce the quirks of how serialization is done there in order to verify and reproduce hashes.
Heh, I guess in my comment I was only thinking about how to make sure two different values don't end up with the same serialization. But indeed, if you aren't careful, there is also the opposite problem: two identical values ending up with different serializations, e.g. because a different JSON encoder was used or the field order was inconsistent.
One thing I would add: if you’re serializing with something off the shelf like JSON, you’ll probably also want to (recursively) sort object fields as well as non-ordered collections like Maps or Sets. At which point JSON may still be a good starting point but doesn’t give as much “for free” as it might seem at first.
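A rough sketch of that canonicalization step, again assuming TypeScript on Node; the `canonicalize` helper and the way it flattens Maps and Sets are illustrative choices, not a full canonical-JSON spec (it ignores number formatting edge cases, non-string Map keys, etc.):

```typescript
import { createHash } from "node:crypto";

// Recursively rewrite a value so that structurally equal inputs serialize
// to the same bytes: object keys are sorted, Maps become plain objects,
// Sets become sorted arrays. Array order is kept, since it is meaningful.
function canonicalize(value: unknown): unknown {
  if (value instanceof Map) {
    value = Object.fromEntries(value); // assumes string-ish keys
  }
  if (value instanceof Set) {
    const items = [...value].map(canonicalize);
    return items.sort((a, b) =>
      JSON.stringify(a) < JSON.stringify(b) ? -1 : 1
    );
  }
  if (Array.isArray(value)) {
    return value.map(canonicalize);
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(value).sort()) {
      out[key] = canonicalize((value as Record<string, unknown>)[key]);
    }
    return out;
  }
  return value; // numbers, strings, booleans, null pass through
}

function hashJson(value: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify(canonicalize(value)))
    .digest("hex");
}
```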
I had a similar take on the article. Also, hashing JSON is something which looks somewhat dangerous.
I guess one aspect that was a bit implicit in the article is that if the thing being hashed has only a limited number of possible states, then a preimage for ordering more apples than intended could be found (on top of the lack of authentication data). That's where adding more information would also help, and using DER would not fix that.
EDIT: I realised that I made a mistake. A preimage cannot be found when a strong hash function is used. What can happen, however, is that differently structured data can have a non-unique mapping to a byte vector, which can be exploited.
> Also, hashing JSON is something which looks somewhat dangerous.
Hashing JSON is an idea that gives me the creeps because two bits of JSON representing the identical object can have different hashes, which sounds like a much bigger problem to me than two different bits of JSON having the same hash.
{"k1":40,"k2":25}
and
{"k2":25,"k1":40}
are the same object; no matter what you're using the hash for, they should have the same hash.
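For what it's worth, a quick illustration of this in Node (assuming TypeScript and the built-in `crypto` module; the flat key-sorting helper is only meant for this toy example):

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

const a = '{"k1":40,"k2":25}';
const b = '{"k2":25,"k1":40}';

// Hashing the raw text makes key order matter, even though both strings
// describe the same object.
console.log(sha256(a) === sha256(b)); // false

// Re-serializing with sorted keys (flat objects only, for illustration)
// restores the property we actually want.
const canonical = (s: string) =>
  JSON.stringify(
    Object.fromEntries(
      Object.entries(JSON.parse(s)).sort(([x], [y]) => x.localeCompare(y))
    )
  );

console.log(sha256(canonical(a)) === sha256(canonical(b))); // true
```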