
Open-sourcing homomorphic hashing to secure update propagation - ingve
https://code.fb.com/security/homomorphic-hashing/
======
jawns
The post offers a brief explanation of homomorphic hashing:

> homomorphic hashing answers the question, “Given the hash of an input, along
> with a small update to that input, how can we compute the hash of the new
> input (with the update applied) without having to recompute the entire hash
> from scratch?” We use LtHash, a specific homomorphic hashing algorithm based
> on lattice cryptography, to create an efficiently updatable checksum of a
> database.

Imagine adding and subtracting hashes!

> For any two disjoint sets S and T, LtHash(S) + LtHash(T) = LtHash(S ∪ T).

This is cool because you no longer have to recompute a hash over a large data
set from scratch; you just compute the hashes of the parts you changed and add
and subtract them.
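To make that concrete, here is a toy sketch of the add/subtract mechanics (my
own illustration, not LtHash: just SHA-256 per row, summed modulo 2^256 in the
style of AdHash; as the paper excerpt further down explains, a modulus this
small is not actually collision resistant):

    import hashlib

    MODULUS = 2 ** 256  # toy size; see the paper excerpt below for why a
                        # real additive hash needs a far larger modulus

    def element_hash(row: bytes) -> int:
        """Hash a single row to an integer."""
        return int.from_bytes(hashlib.sha256(row).digest(), "big")

    def set_hash(rows) -> int:
        """Hash a set of rows by summing per-row hashes mod MODULUS."""
        return sum(element_hash(r) for r in rows) % MODULUS

    db = {b"row1", b"row2", b"row3"}
    digest = set_hash(db)

    # Update row2 without rehashing the whole set: subtract the old
    # row's hash, add the new row's hash.
    digest = (digest
              - element_hash(b"row2")
              + element_hash(b"row2-v2")) % MODULUS

    assert digest == set_hash({b"row1", b"row2-v2", b"row3"})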

~~~
zzzcpan
Years ago I was trying to use Merkle trees for checksumming and synchronizing
database replicas, since everyone was doing it that way. It was immediately
obvious how impractical it was. So the first thing that came to mind was to
use a fixed-size Merkle tree (half trees, or whatever they are called). It was
essentially an array of 64k hashes, with each element storing a hash for a set
of database keys. That, however, required rescanning multiple keys on every
update, and the number of keys depended on the database size, so it wasn't
going to work well. Naturally, the same idea of adding and subtracting hashes
came to mind. Adding the hash of a key on update and subtracting it on removal
made it possible to completely eliminate reads from disk. Rescanning a
particular set of keys was only necessary during background synchronization,
and only when a key was missing on a replica, which was pretty rare. Although
I don't use this approach anymore, it did work well and was trivial to
implement.
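Roughly, the scheme looked something like this (a from-memory sketch with toy
parameters and made-up names, not the original code):

    import hashlib

    NUM_SLOTS = 65536   # the fixed-size array of 64k hashes
    MODULUS = 2 ** 256  # toy modulus, not a security recommendation

    # slots[i] is the additive hash of every key/value in bucket i
    slots = [0] * NUM_SLOTS

    def kv_hash(key: bytes, value: bytes) -> int:
        return int.from_bytes(
            hashlib.sha256(key + b"\x00" + value).digest(), "big")

    def bucket(key: bytes) -> int:
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SLOTS

    def on_write(key: bytes, old_value, new_value: bytes) -> None:
        """Apply an update: subtract the old hash, add the new one."""
        i = bucket(key)
        if old_value is not None:
            slots[i] = (slots[i] - kv_hash(key, old_value)) % MODULUS
        slots[i] = (slots[i] + kv_hash(key, new_value)) % MODULUS

    def on_delete(key: bytes, old_value: bytes) -> None:
        i = bucket(key)
        slots[i] = (slots[i] - kv_hash(key, old_value)) % MODULUS

    # Replicas exchange the 64k slot hashes; only buckets that differ
    # need their keys rescanned during background synchronization.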

------
dfox
I somehow fail to see why the example in the article needs the hash to have
some special homomorphism property. When you represent the overall hash of
your dataset as a sum of individual item hashes, it trivially follows that
changing one item means subtracting the hash of the original version and
adding the hash of the new version.

Or am I missing something? (apart from the somewhat obvious security
implications of doing such a thing in the first place)

~~~
ivmaykov
Disclaimer: I’m one of the authors of the paper/blog post/code.

If you want to use signatures over the hash as proof of data set integrity,
you need two things: 1) the hash must be homomorphic, i.e. hash({a}) +
hash({b}) == hash({a, b}), and 2) hash() must be collision resistant; in other
words, it must be computationally infeasible to find distinct sets S != T with
hash(S) == hash(T). We prove in the paper (which is linked from the blog post)
that LtHash with our choice of parameters has this property.

~~~
dfox
My reading of the post is that Hash({a, b}) is in fact computed as Hash'(a) +
Hash'(b), given that a and b are "rows". So my question is why Hash' has to
have any special properties.

~~~
ivmaykov
I can think of hash functions that are homomorphic but are not secure. A
simple example is something like "sha256 each element separately and XOR all
the resulting hashes together." That would not be collision resistant.

We offer a proof that LtHash with our choice of parameters provides over 200
bits of security. You would have to read the paper for the details.
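For instance, the XOR construction falls to simple linear algebra rather than
brute force: any 257 hashes of 256 bits each are linearly dependent over
GF(2), so Gaussian elimination finds a nonempty subset whose XOR-hash equals
that of the empty set. A toy demonstration (my sketch, not from the paper):

    import hashlib

    def h(item: bytes) -> int:
        return int.from_bytes(hashlib.sha256(item).digest(), "big")

    def xor_set_hash(items) -> int:
        acc = 0
        for item in items:
            acc ^= h(item)
        return acc

    # 257 vectors of 256 bits must be linearly dependent over GF(2), so
    # some nonempty subset XORs to zero -- colliding with the empty set.
    items = [b"item-%d" % i for i in range(257)]

    basis = {}  # pivot bit -> (reduced vector, mask of contributing items)
    for i, item in enumerate(items):
        vec, comb = h(item), 1 << i
        while vec:
            pivot = vec.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (vec, comb)
                break
            bvec, bcomb = basis[pivot]
            vec ^= bvec
            comb ^= bcomb
        else:
            # vec reduced to zero: `comb` marks a colliding subset
            subset = [items[j] for j in range(len(items)) if comb >> j & 1]
            assert subset and xor_set_hash(subset) == xor_set_hash([])
            print("found a %d-element subset colliding with the empty set"
                  % len(subset))
            break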

~~~
Scaevolus
I can see how you might lose collision resistance with a normal 256-bit hash
function, but it's not clear why a "stretched" hash with 2048 bytes of output
wouldn't work.

E: Ah, there it is, in the paper:

> However, Wagner [Wag02] later showed an attack on the generalized birthday
> problem which could be used to find collisions for AdHash on an n-bit
> modulus in time O(2^(2√n)), and that the AdHash modulus needs to be greater
> than 1600 bits long to provide 80-bit security. Lyubashevsky [Lyu05] and
> Shallue [Sha08] showed how to solve the Random Modular Subset Sum problem
> (essentially equivalent to finding collisions in AdHash) in time
> O(2^(n^ε)) for any ε < 1, which indicates that AdHash requires several more
> orders of magnitude larger of a modulus just to provide 80-bit security.

