
Why Java's Object.hashCode() is unsafe for use in distributed systems - martinkl
http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
======
ExpiredLink
This article is misguided, at least wrt Java. For "distributed systems" object
identity needs to be created and ensured programmatically (e.g. by an object
id). equals() and hashCode() must be derived from an object's immutable
properties and they are as safe or unsafe as _you_ implement them.

~~~
scott_s
I'm not sure if you understood the author correctly - your points don't
disagree with him. He is not talking about using hashes for identity, but for
routing. And he is advocating something more general than what you said. You
said, basically, "implement your own hash functions for your objects." He
said, basically, "don't use Object.hashCode(); you can easily just hash the
serialization of the object, or implement a hash function yourself."

~~~
cbsmith
Except he cited specific examples of distributed systems (Hadoop, Storm) that
use hashCode() but also make it quite clear that you must provide a
hashCode() implementation derived entirely from the state of the object that
is preserved in serialized form.

Seriously, it's just a bad article.

~~~
scott_s
No, not _need_ to. He said it was convenient if you're already serializing the
object _anyway_. Which, if you're routing an object in a distributed system,
is probably true.

~~~
cbsmith
Yes, it is terribly convenient because if Type.hashCode() has problems with
copies of objects, this will help:

    
    
        int computeHash(byte[] serializedBytes) {
            // arrays inherit Object.hashCode(), which hashes identity:
            // two copies of the same bytes get different hash codes
            return serializedBytes.hashCode();
        }
    

Now you've gone from a hashCode() method that won't work if the programmer
screwed the pooch, to a hashCode() method that won't work period. ;-)

[Yes, it won't be screwed up if you also use your own hash.]

The problem with serialization is that you lose the semantics of the object.
"Same values to the same sequence of bytes" is a tighter constraint than you
might think. The same "values" can have multiple representations which are
logically identical but still differ in ways that matter (for example,
carrying additional information that has nothing to do with identity).
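
For a concrete (hypothetical) illustration of that last case: a cached field
that plays no part in equals() but still lands in the default-serialization
byte stream, so two logically equal objects can serialize differently.

        // Point and cachedLabel are made-up names; the cached field is
        // irrelevant to equality, but default Java serialization writes
        // it out, so equal Points can produce different byte sequences.
        class Point implements java.io.Serializable {
            final int x, y;
            String cachedLabel;   // lazily computed; not part of identity

            Point(int x, int y) { this.x = x; this.y = y; }

            @Override public boolean equals(Object o) {
                return o instanceof Point
                    && ((Point) o).x == x && ((Point) o).y == y;
            }
            @Override public int hashCode() { return 31 * x + y; }
        }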

~~~
scott_s
_[Yes, it won't be screwed up if you also use your own hash.]_

It's strange that you disregard that in an aside, as it's his entire point.

Personally, I prefer to implement hashes on the values inside the object, and
not its serialization. They are unrelated concepts, and it helps to keep the
concepts separate in the implementation. But for the purposes of routing in a
distributed system - which is what the post is about - it's an acceptable
solution.

~~~
cbsmith
> It's strange that you disregard that in an aside, as it's his entire point.

Perhaps I misunderstood as to what his point was.

If it is his entire point, I wish he hadn't added the bit about
serialization. It just confuses the point.

To be honest, what the point of the article _is_ strikes me as confused,
because he runs the gamut of a variety of topics even in his summary at the
end. Maybe you can help me understand my reading comprehension problem.
Looking at the summary I see a variety of points that aren't really related:

1) __Don't use the implementation of Object.hashCode() for your distributed
hash table. __Okay, this should not be news to anyone who is at all proficient
with Java, and it doesn't require a long article. If one must be written, you
can just say: "Object.equals() doesn't provide magical equivalence powers;
it's a placeholder for when you implement equivalence. BTW, as you'd expect,
the Java standard library documents its equivalence semantics whenever it
overrides the method itself." You don't even have to mention hashCode().

2) __Your serialization framework ought to be how you implement equivalence.
__Umm, no. That couples serialization with equivalence. It is essentially
arguing that you ought to implement equals() by first serializing both objects
and then comparing the output! I'd argue the two are related in the same way
the article describes the relationship between Object.hashCode()'s
implementation and good distributed partitioning: serialized equivalence is
correlated with logical equivalence just enough to create a huge mess. Hadoop
wanders about as close to this as one should by providing hooks for comparing
objects without deserializing them. One should really, really NOT go down
this road.

3) __Since Object.hashCode()'s implementation doesn't provide equivalence, you
should not use overrides of that implementation for your partitioning.
__That's just stupid, even beyond the confusion of interface and
implementation. You're going to need equivalence logic for your keys. Are you
really going to create your own "T.equivalent(T)" method whose behaviour is
distinct from "T.equals(T)"? What does it mean when those two don't agree?
This suggests a lack of understanding as to why Object.hashCode() is there in
the first place.

4) __Your distributed system needs its own hash function, but not necessarily
a secure one. __Talk about getting it backwards. Built-in hash mechanisms
generally work fine for short-circuiting equivalence tests and already make a
pretty reasonable effort to deal with poor programming. In the end, using a
different hash function might only help if there is a lot of entropy in the
key that the existing one is losing, which tends to occur only with large
numbers of reasonably large keys (ironically, String.hashCode() makes some
pretty good trade-offs here compared to a generic hash function). This tends
to be an issue you never encounter, and to the extent that you do, it's an
optimization issue rather than the logical one the article described. The
only major hurdle that is (wisely) left on the table by built-in hash tables
is the security one. For tackling that you are best off using an HMAC
(designed to address exactly that space of problem) instead of a bare hash.
MD5 really isn't the right choice for any case where you have a choice about
your hash function, and SHA-1 really isn't anymore either, so mentioning them
in the article is just silly.

5) __Implementations of hashCode() generally work great for in-process hash
tables, but are lousy for distributed systems. __For the vast majority of
hashCode() implementations, if one provides a good in-process partition input,
it provides a good out-of-process partition input. Along with its
corresponding equals(Object) method, the implementation of Object.hashCode()
is, by design, an exceptional case and shouldn't normally get invoked.

6) __Form parameters sent to Java web servers are a great example of this
problem, and they wisely avoid the DoS problem by limiting the number of keys.
__Ummm, no. First, the form parameters sent to a Java web server are no more a
distributed systems problem than any other user-provided data shoved into a
lookup table. It's purely a trusted-systems problem. The only thing
'distributed' about it is that it tends to involve a network somewhere. In
fact, a web server ought to be worried about DoS problems that have nothing to
do with hash tables (like the memory footprint of the form data). Given their
solution to these DoS exposures and the context, they really ought to just
shove the parameters into some kind of efficient tree (a sketch follows this
list).
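
A minimal sketch of that tree suggestion (the variable names are illustrative,
not any framework's API): a TreeMap gives O(log n) worst-case lookups no
matter how adversarial the keys' hash codes are, since String is Comparable.

        // stand-ins for user-supplied form data
        String untrustedName = "whatever the client sent";
        String untrustedValue = "ditto";
        java.util.Map<String, String> params =
            new java.util.TreeMap<String, String>();
        params.put(untrustedName, untrustedValue);
        // lookups walk a balanced tree, so colliding hashCode()s don't matter
        String v = params.get(untrustedName);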

------
jdsullivan
Please see javadoc for String.hashCode:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#hashCode()

It's true that relying on hashCode to remain stable across arbitrary objects
is brittle, but this is not the case for String.

~~~
scott_s
It's also true that the author made the same point.

~~~
jdsullivan
Not quite: "On strings, numbers and collection classes, hashCode() always
returns a consistent value, apparently even across different JVM vendors. It’s
like that despite the documentation for hashCode() explicitly not guaranteeing
consistency across different processes"

String.hashCode explicitly guarantees consistency across processes by
specifying how the hash code is computed; that can be safely relied upon.
(Same for Integer, Long, Float and Double.)
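
Specifically, the javadoc pins the computation down as:

        s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

where s[i] is the i-th character and n is the length of the string.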

------
andrewvc
FWIW, in JDK 8 HashMap will most likely degrade into a lookup on a red-black
tree rather than a linked list for buckets with more than 8 collisions.

~~~
martinkl
Interesting. But wouldn't that require keys to have an ordering (e.g.
implement Comparable)?

~~~
ufo
The implementation is free to use forbidden voodoo powers. In this case it
could be whatever it is also using behind the scenes to implement pointer
equality for objects.

~~~
martinkl
Using pointer equality is fine, but doesn't help in the case where you have a
custom equals() implementation. Maybe the JVM could inspect the implementation
of equals() and spot certain patterns (e.g. it's the conjunction of equals()
of a set of constituent objects), but doing it in general sounds impossible to
me.

~~~
mikeash
For hash tables where the number of buckets is much smaller than the number of
hash codes (i.e. all of them) and where hash codes are well distributed and
collisions are caused by different hash codes mapping onto the same bucket
(not always true, but often), the tree can be implemented by comparing hash
codes, with a fallback to handle the case of multiple identical hash codes.
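
A sketch of the shape of that comparison (the real JDK 8 TreeNode logic also
tries Comparable and class names; treeOrder here is just an illustration of
the idea):

        // order tree nodes by hash code first; identical hash codes fall
        // back to an arbitrary-but-consistent identity tiebreak
        static int treeOrder(Object a, int hashA, Object b, int hashB) {
            if (hashA != hashB) return Integer.compare(hashA, hashB);
            return Integer.compare(System.identityHashCode(a),
                                   System.identityHashCode(b));
        }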

------
politician
So, this is true of the CLR as well. The moral of the story is that you
shouldn't assume that hashes are one-size-fits-all. Instead, design your own
scheme fit for your own requirements (and no, I'm not suggesting that you
write your own hash algorithm).

------
b7kich
Two wrongs don't make a right. The fact that "a".hash == "\0a".hash is an
issue with the hash function, nothing else. Seeding the hash function so it is
unique for each process does not solve the issue.

~~~
martinkl
Recent versions of Ruby also use a much better hash function. Java's hash
function on strings has a slightly less trivial process for generating
collisions, but it's still very easy to generate enough of them to pose a DoS
risk.

Edit: actually, Java's String.hashCode() has exactly the same problem —
prepending null chars doesn't change the hash code. And because the hash
function is actually part of the Java standard library docs, it will probably
never be changed (unlike Ruby's).
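
The prepended-null property is easy to check directly; the leading '\0'
contributes 0 * 31^n to the sum, so the hash is unchanged:

        System.out.println("a".hashCode());    // 97
        System.out.println("\0a".hashCode());  // 97, same bucket every time
        System.out.println("a".equals("\0a")); // false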

~~~
spullara
They have already announced that the hash function will be changed for Java 8
and you can enable the change in the current version of Java 7 (u6).

http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/010238.html

------
jcdavis
Overall an excellent article, but it's worth mentioning that, as I understand
it, String's hashCode() method pre-1.3 or 1.2 (I forget exactly which version
they switched over) used a different function, so it can indeed change (as the
hashCode() contract allows).

~~~
jbooth
It could change, but Object.hashCode(), if not overridden, is operating on the
most basic type of Object, with zero data members. So they pretty much have to
go to the memory address for the hashing functionality, since there's no other
information to hash off of. Unless they wanted to just "return 0" for every
single one, which would be pretty dumb.

------
ddlatham
In fact, in Oracle JVMs Java enum hashCode() values are almost never the same
across JVM instances. This has bitten me in the past for Hadoop MapReduce jobs
specifically.

------
ShabbyDoo
For the system I've been working on for the past year, most domain-ish objects
extend from a class called AbstractImmutableObject which implements both
hashCode() and equals() as final methods. Then, subclasses are forced to
implement a "collectObjects()" method which effectively creates an array of
instance variable references (plus autoboxed primitives) for a deep
implementation of equality and a guaranteed consistent hash code. Immutability
currently is enforced by convention, but I think FindBugs could help out.
Presuming that all the instance variables in the graph of
AbstractImmutableObjects are primitives or JDK objects with consistent hash
functions, the resultant hashcode will be consistent as well.
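
A minimal sketch of that pattern, assuming the obvious implementation (only
the class and method names above come from the description; the bodies are
guesses):

        import java.util.Arrays;

        abstract class AbstractImmutableObject {
            // subclasses return their identity-bearing fields
            protected abstract Object[] collectObjects();

            @Override public final boolean equals(Object o) {
                if (this == o) return true;
                if (o == null || getClass() != o.getClass()) return false;
                return Arrays.deepEquals(collectObjects(),
                        ((AbstractImmutableObject) o).collectObjects());
            }

            // consistent as long as the constituents' hash codes are
            @Override public final int hashCode() {
                return Arrays.deepHashCode(collectObjects());
            }
        }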

[As an aside, I dislike this sort of utility base class, but it's a small
price to pay for avoiding awful bugs related to hashcodes being inconsistent
with equals, etc.]

Computing a hash on a serialization of an object graph is problematic as
well. What are the rules for computing the hash of a serialization of a set
of Strings? Must one sort the Strings before serializing them to ensure that
any two Sets with the same values have the same hashcode? Sets are not
required to maintain an ordering, and equals() on sets ignores order, so I
would expect my "distributed" hashcode to be the same for any Set containing
the same Strings. Such a scheme couples my serialization scheme tightly to my
rules for equivalence. As I like to use serialization schemes written by
others when possible, and I don't want to eat the cost of serialization
whenever I want a consistent hashcode, I'd prefer to decouple these ideas.

Recently, I've gone a step further with our domain objects and have written a
generator of builders and instances using dynamic proxies (just the out-of-
the-box JDK 1.3 era stuff). Developers just define interfaces, and unit tests
enforce rules about what objects may be contained within the object graphs.
This makes it really simple to implement things like round-trip JSON
ser/deser. Because the realm of possibilities is so limited, it's easy to
ensure correctness.

[As an aside, I'd love to have a DSL for such domain objects and then write a
bunch of dynamic pieces for serialization, meta-stuff (creating a Guava
function in a literate sort of way for any path through a domain object,
etc.), builders, etc. I guess this is one step down the rabbit hole of model
driven development, but I'm looking for something a bit lighter than, say,
AndroMDA.]

------
moondowner
Isn't this why we very often override with custom hashCode and equals methods?
What's the big deal?

Here's from Hibernate's WIKI on these two methods:
<https://community.jboss.org/wiki/EqualsAndHashCode>

~~~
cbsmith
Nailed It.

------
cbsmith
This article and this discussion are filled with too much "missing it", so I'm
writing this wordy diatribe.

tl;dr: before writing an article advising people about writing a distributed
routing framework, write a distributed routing framework. Also: if your
"distributed system" is also a "trusted system", realize that fusing the two
together requires some genuine expertise.

To paraphrase a comment about something I wrote the other day: "that article
isn't even wrong".

The protobuf reference was just ridiculous. Protobufs are not a distributed
system. They're a serialization framework that is intrinsically copying
(append the same Protobuf twice to a repeated field, and when you read it
back, you get two distinct objects).

    
    
        // Object.clone() is protected, so Demo must expose it to compile;
        // Object.equals() is identity, so a fresh copy is never equal.
        class Demo implements Cloneable {
            @Override public Demo clone() {
                try { return (Demo) super.clone(); }
                catch (CloneNotSupportedException e) { throw new AssertionError(e); }
            }
        }
        
        boolean alwaysFalse(Demo a) {
            return a.equals(a.clone());
        }
    

No Java programmer anywhere near the level of proficiency where they are
writing a distributed routing framework would be surprised by the above.
Really, anyone who has bothered to read how equals(), hashCode() or
HashSet/HashMap/Hashtable work in Java ought not to be surprised. Nobody
expects that you can throw copies of arbitrary subclasses of Object into a
hash and have some notion of what is and is not an equivalent object
magically discerned by the runtime (and if that isn't clear, learning about
IdentityHashMap ought to bring it home).

From the article:

    
    
        On strings, numbers and collection classes, hashCode()
        always returns a consistent value, apparently even
        across different JVM vendors. It’s like that despite the
        documentation for hashCode() explicitly not guaranteeing
        consistency across different processes...
    

The article says "hashCode()" instead of "Object.hashCode()", which just shows
the author is missing what is going on. String.hashCode(), Integer.hashCode(),
etc. are not the same as Object.hashCode() and are all documented with
semantics distinct from Object.hashCode(). This is by design. It's
well-established design that predates Java by a good two decades, too.

The article then goes on:

    
    
        The problem is that although the documentation says
        hashCode() doesn’t provide a consistency guarantee,
        the Java standard library behaves as if it did provide
        the guarantee.
    

No. It behaves as though its users have read the actual documentation for the
classes in question before using them, and so know what the library author
intended the behaviour to be.

The only surprise in protobufs is that they don't generate their own
equals()/hashCode() methods to provide more flexible equivalence relations;
it becomes abundantly clear why once you try to answer the question "well,
just how would you do it without adding additional semantics (at the expense
of complexity and performance) to the protocol buffer language?".

The Java spec is very clear that Object.equals(Object) provides object
identity as the default implementation of equality. Java 101 covers the
difference in both concept and protocol between identity and equality, with
the former immutable while the latter can be user-defined by overriding
Object.equals(Object). Next you learn that if you do override it, you need to
also override hashCode() accordingly. You learn this when you use the
relevant classes for the first time.

It is outright _weird_ for a Java developer to expect equivalence semantics
(as opposed to identity) when invoking equals(Object) or hashCode() unless
those semantics have been defined for an object's type.

This article is coming at it assbackwards from how anyone who has written a
distributed routing framework in Java (or for that matter most other
languages) ends up coming at it.

If you are writing a distributed routing framework, then long before you start
thinking about hash algorithms, you need to define "equality and identity"
semantics for your distributed objects. That invariably comes down to a way of
identifying that two keys represent the same object. You might manage this
entirely for the user, but typically you end up providing user-defined hooks,
because there are plenty of business rules for what is a unique identifier. In
Java (as is the case in the aforementioned Hadoop and Storm), that typically
means your keys override Object.equals(Object), often with a
framework-provided type to tag tighter semantics.

When you override Object.equals(Object), you now have to override
Object.hashCode(). Sometimes this is missed during Java 101, but you tend not
to get far in the Java world without having learned it. So once you have set
up your equivalence relationship, how you correctly tackle Object.hashCode()
becomes very, very straightforward.
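
For instance (a hypothetical key type, not from any particular framework):
equivalence comes from the business identifier, and hashCode() follows
mechanically from the same fields.

        final class OrderKey {
            private final String tenant;
            private final long orderId;

            OrderKey(String tenant, long orderId) {
                this.tenant = tenant;
                this.orderId = orderId;
            }

            @Override public boolean equals(Object o) {
                if (!(o instanceof OrderKey)) return false;
                OrderKey k = (OrderKey) o;
                return orderId == k.orderId && tenant.equals(k.tenant);
            }

            // derived from exactly the fields that define equivalence
            @Override public int hashCode() {
                return 31 * tenant.hashCode() + (int) (orderId ^ (orderId >>> 32));
            }
        }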

In short: if you have reached the point where you are writing a distributed
routing framework in Java that assumes Object.hashCode() is stable across the
framework, you were almost certainly screwed long, long before you made that
mistake.

Eventually, you get to the point where you realize that users might define a
really bad hash, and you don't want to fail horribly when that happens. So,
you might want to defend against it. Turns out, the JDK actually does defend
against poor hashes. HashSet/HashMap/etc. tweak the hash when selecting
buckets to avoid hashes with entropy in the wrong places (or more accurately,
a _lack_ of entropy in the right places ;-).
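
For reference, the kind of tweak in question (JDK 8's HashMap does essentially
this; earlier JDKs used a longer sequence of shifts and XORs):

        // fold the high bits into the low bits so that power-of-two bucket
        // masking doesn't throw away high-bit entropy
        static int spread(int h) {
            return h ^ (h >>> 16);
        }

        static int bucketIndex(Object key, int tableLength) {
            // tableLength is a power of two
            return spread(key.hashCode()) & (tableLength - 1);
        }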

Then _after_ that, you start to realize that not only are user defined hash
functions likely bad in that sense, but if they have user defined data (in
this case, "user" is not the developer, but the developer's "user"... so that
means "incompetent" is no longer our scary adjective... it's now
"untrustworthy") they are probably vulnerable to hash collision attacks. At
that point, you are outside the scope of what Java's hashCode() is intended to
address, so you really have to go a bit further.

There are basically four approaches, and the article appears to miss all of
them (or at least glosses over them like the author doesn't see how it all
fits together).

1) Don't use a hash lookup for user provided keys. Use a tree. Consider a
critbit tree. Most assume this sucks until they have either reflected on the
matter a fair bit, or have been schooled painfully by experience.

2) Don't let the end user define the keys.

3) Don't let the end user control, or have full insight into, the hashCode()
calculation. Typically this involves using an HMAC as your underlying hash
function rather than a straight hash. For real security, you use an HMAC with
a random "password" (kept from the user) for each hash table (a sketch
follows this list).

4) Let the hash go unmodified, but randomize bucket assignment somehow (a
shared secret or a secret per hash) and block a user who generates more than
N distinct keys (key1.equals(key2) is false) whose hashCode()s are equal,
where N is a fixed number or, ideally, a ratio based on their total
# of keys.
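
A sketch of approach 3, using the standard JCE classes (the KeyedHasher name
and its bucketHash method are made up for illustration):

        import java.security.SecureRandom;
        import javax.crypto.Mac;
        import javax.crypto.spec.SecretKeySpec;

        final class KeyedHasher {
            private final Mac mac;

            KeyedHasher() throws java.security.GeneralSecurityException {
                byte[] secret = new byte[32];
                // per-table secret, never exposed to users
                new SecureRandom().nextBytes(secret);
                mac = Mac.getInstance("HmacSHA256");
                mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            }

            // Mac is stateful, hence the synchronization
            synchronized int bucketHash(byte[] key) {
                byte[] d = mac.doFinal(key);
                return (d[0] & 0xff) << 24 | (d[1] & 0xff) << 16
                     | (d[2] & 0xff) << 8 | (d[3] & 0xff);
            }
        }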

What generally doesn't work is simply using a "better" hash. In some cases a
simple secure hash (and by that I definitely don't mean MD5, and really not
SHA-1 either ;-) can work. It works only if the user never, ever has a way to
determine what the hash of a given key is (that means they can never see it,
or side effects of it... even in the network stack). Even then, such a hash
function is computationally expensive enough that the tree approach is likely
far more efficient.

What also doesn't work is using a serialization framework to serialize the
bytes in a deterministic way (it isn't clear, but it reads like the author
might think protobufs are guaranteed to be deterministic, which they aren't).
Serialization frameworks are, by their very nature, a horrid way to add
unpredictable entropy to your data (and determinism just makes it worse). That
is tantamount to padding your data with a secret in a very non-
cryptographically sound way. It depends not only on keeping said secret, but
also keeping the use of the serialization framework secret (effectively the
"cryptographic protocol"). If you think you can keep a secret from an
attacker, then save yourself the trouble and just XOR your keys with it before
you hash. ;-)

In short: at no point in the process should you be encountering problems
where you get the wrong idea about how hashCode() works and how to use it
properly in your distributed routing framework.

