Wednesday, January 12, 2011

Hashing things out

Now that I talked about equals(), I can discuss its partner in crime, the hashCode() method. This is another method on java.lang.Object that is therefore available on every Java object and, like equals(), you'll often want to override it with your own implementation.

In fact, the rule is: if you override equals(), you need to override hashCode() as well. I'll explain why.

First, though, you need to understand how hash tables work. If you don't, I suggest perusing the Wikipedia article for hash table, at least as a starting point.

The purpose of the hashCode() method is to provide the index of the bucket within a hash table where an object should go. At least in Java, it's not necessary for you to work in how big the hash table is when computing the code. For a hash table of size n, an object's hash index can be o.hashCode() modulo n.

One reason why the hashCode() method is part of Object is that it's just so darn useful. Hash tables show up everywhere in computing. Java's own data structures use hash tables internally. So it is convenient to have the method available all the time for any object.

A more important reason is that the concept of a hash code is so closely tied to the concept of equality. If two objects are equal, then they must reduce down to the same hash code. Said in Java terms, if a.equals (b) is true, then a.hashCode() == b.hashCode() must also be true. That's why whenever you override equals() for a class, you have to override hashCode() as well, so that they continue to agree.

All right, so why do two equal objects need to return the same hash code? Let's take the highly useful class java.util.HashSet as an example. This class implements a set of objects. A set cannot contain an object more than once. Sets cannot have duplicates.

This particular class implements its set storage by using an internal hash table. When you add something to a HashSet, that object is stored in the hash table, inside the bucket determined by its hashCode() value. That makes it fast and easy to find again, especially so that the HashSet can stop you from successfully adding the object a second time.

A HashSet, as well as almost all other instances of set implementations in Java, decides if it already contains some object if it finds an equal object within it. That is, equal as in equals(). So here are the steps for adding an object to a HashSet, laid out.
  1. Compute the hash code "c" of the object to be added.
  2. Find the bucket corresponding to "c" in the hash table.
  3. Check every object in the bucket. If any one is equal to the object according to equals(), don't add it. Otherwise, add it.
If two equal objects return different hash codes, then it will be possible to add them both to the same set (because step 2 can target the wrong bucket). That violates the "semantics" for Java sets, and will pretty much screw everything up for you.

It's important to mention that it is still OK for two unequal objects to return the same hash code. A collision like this is good to avoid, since it makes hash tables less effective, but it's not the end of the world. Those two unequal objects will just end up in the same bucket in hash tables.

The default implementation of hashCode() takes into account all the bits and pieces of your class, including all its fields. For classes where every field is "significant" with respect to equals(), this is correct. However, if you redefine equals() to only check some subset of your class's fields, then you can end up with two equal objects having different hash codes because the fields you left out are different. If that happens, you won't be able to use instances of your class in hash tables.

So, when you override equals(), you need to override hashCode() to compute its value based only on those significant fields. There are recipes for creating hashCode() methods, and I'll plug Effective Java by Joshua Bloch again as a great resource. Heck, some Java IDEs will even write them for you. Here is an example for the StringPair class we created in the last post, and since I'm hardcore I wrote it by hand.
@Override public int hashCode() {
    int c = 17;
    c = 37 * c + (s1 != null ? s1.hashCode() : 0);
    c = 37 * c + (s2 != null ? s2.hashCode() : 0);
    return c;
}
Since both string fields s1 and s2 are involved in equating StringPair objects, they both appear here. The important thing to notice is that the hash code for StringPair is calculated based on the hash codes of the String fields within it. As long as String.hashCode() follows the rules (which it does), then this implementation will render the same value for two objects with equal strings.

No comments:

Post a Comment