Isn't the "".join also dangerous? get_str_hash( "".join( [ build_request.distro,...

blueflow · 2024-12-09T10:44:11 1733741051

Yes, one should use a hmac for hashing multiple inputs, for the reason you explained.

Edit: s/hmac/incremental hashing/

BoppreH · 2024-12-09T10:54:13 1733741653

Not quite. HMAC helps to prevent length extensions attacks (if the underlying hash was vulnerable in the first place), and the secret prevents attackers from predicting the hash value (like OP did).

But HMAC doesn't help against ambiguously encoded inputs:

  hmac(key, 'aa'+'bb') == hmac(key, 'aab'+'b')

You want a way to unambiguously join the values. Common solutions are:

- prepending the length of each field (in a fixed number of bytes);

- encoding the input as JSON or other structured format;

- padding fields to fixed lengths;

- hashing fields individually, then hashing their concatenation;

- use TupleHash, designed specifically for this case: https://www.nist.gov/publications/sha-3-derived-functions-cs...

__david__ · 2024-12-09T18:20:20 1733768420

Wouldn’t “x”.join(…) be enough?

blueflow · 2024-12-09T18:26:07 1733768767

Possibly not:

  "x".join({'aa'+'bxb'}) == "x".join({'aaxb','b'})

The separator should not be able to show up in the inputs.

deathanatos · 2024-12-09T21:15:54 1733778954

This is why I raised an eyebrow when TFA wrote,

> When I saw this, I wondered why it has several inner hashes instead of using the raw string.

The inner hash constrains the alphabet on that portion of the input to the outer hash, thus easily letting you use a separator like "," or "|" without having to deal with the alphabet of the inner input, since it gets run through a hash. That is, for a very simplistic use case of two inputs a & b:

  sha256(','.join(
    [sha256(a), sha256(b)]
  ))

If one is familiar with a git tree or commit object, this shouldn't be unfamiliar.

Now … whether that's why there was an inner hash at that point in TFA's code is another question, but I don't think one should dismiss inner hashes altogether.

bmicraft · 2024-12-10T04:59:52 1733806792

I could see an attack vector here based on file/directory names or the full path. Different inputs could lead to the same order of enumerated checksums.

blueflow · 2024-12-09T21:47:50 1733780870

I'm not dismissing them, inner hashes returning a hexadecimal string fulfills the "the separator should not be able to show up in the inputs" constraint.

__david__ · 2024-12-09T22:19:52 1733782792

Thanks—that makes sense. I was struggling to come up with an example that would fail but I was just unconsciously assuming the separator wasn’t showing up naturally in the individual parts instead of explicitly considering that as a prerequisite.

Terr_ · 2024-12-09T22:49:25 1733784565

Only if you can guarantee it that possible for someone to sneak in an input that already contains those "x" characters.

blueflow · 2024-12-09T11:00:13 1733742013

Yeah i confused hmac's with incremental hashing, i use both at once.

agwa · 2024-12-09T11:22:55 1733743375

What do you mean by "incremental hashing"? Note that the Init-Update-Finalize API provided by many cryptography libraries doesn't protect against this - calling Update multiple times is equivalent to hashing a concatenated string.

blueflow · 2024-12-09T11:33:28 1733744008

I mean the same what you call Init-Update-Finalize.

link needed about the dysfunctional implementations.

vsl · 2024-12-09T13:08:10 1733749690

No, these APIs are intentionally designed to be equivalent to hashing all data at once - i.e. to make it possible to hash in O(1) space.

There's nothing "disfunctional" about that.

"Incremental hash function" has a very different meaning and doesn't seem to have any relevance to what is discussed here: https://people.eecs.berkeley.edu/~daw/papers/inchash-cs06.pd...

blueflow · 2024-12-09T13:12:31 1733749951

I guess the PHP documentation is wrong then. Look at this: https://www.php.net/manual/en/function.hash-init.php

Thorrez · 2024-12-09T14:12:30 1733753550

That page includes an example that shows PHP's incremental hashing is what you describe as "dysfunctional". It hashes "The quick brown fox jumped over the lazy dog." in 1 part, and in 2 parts, and shows that the resulting hashes are equal.

blueflow · 2024-12-09T14:38:52 1733755132

I did a mistake.

ddtaylor · 2024-12-09T13:24:55 1733750695

For anyone curious PHP ultimately uses this definition in their introduction portion of the hash extension:

> This extension provides functions that can be used for direct or incremental processing of arbitrary length messages using a variety of hashing algorithms, including the generation of HMAC values and key derivations including HKDF and PBKDF2.

agwa · 2024-12-09T12:54:42 1733748882

For example, try running this Go program: https://go.dev/play/p/atvS3j8Dzg-

Or see the Botan documentation that explicitly says "Calling update several times is equivalent to calling it once with all of the arguments concatenated": https://botan.randombit.net/handbook/api_ref/hash.html

I've worked with many cryptography libraries and have never seen an Init-Update-Finalize API that works the way you think it does. It does not protect against canonicalization attacks unless you're using something like TupleHash.