Saturday, December 3, 2016

Why should you care about equals and hashcode

Equals and hash code are fundamental elements of every Java object. Their correctness and performance are crucial for your applications. However, we often see even experienced programmers ignoring this part of class development. In this post, I will go through some common mistakes and issues related to those two very basic methods.

Contract

What is crucial about the mentioned methods is something called the "contract." There are three rules about hashCode and five about equals (you can find them in the Javadoc for the Object class), but we'll talk about the three essential ones. Let's start with hashCode():

"Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified."

That means the hash code of an object doesn't have to be immutable. So let's take a look at the code of a really simple Java object:

import java.util.Objects;
import java.util.UUID;

public class Customer {

 private UUID id;
 private String email;

 public UUID getId() {
  return id;
 }

 public void setId(final UUID id) {
  this.id = id;
 }

 public String getEmail() {
  return email;
 }

 public void setEmail(final String email) {
  this.email = email;
 }

 @Override
 public boolean equals(final Object o) {
  if (this == o) return true;
  if (o == null || getClass() != o.getClass()) return false;
  final Customer customer = (Customer) o;
  return Objects.equals(id, customer.id) &&
    Objects.equals(email, customer.email);
 }

 @Override
 public int hashCode() {
  return Objects.hash(id, email);
 }
}

As you probably noticed, equals and hashCode were generated automatically by our IDE. We can be sure that the values returned by those methods are not immutable, and such classes are definitely widely used. Maybe if such classes are so common, there is nothing wrong with such an implementation? Let's take a look at a simple usage example:

def "should find cart for given customer after correcting email address"() {
 given:
  Cart sampleCart = new Cart()
  Customer sampleCustomer = new Customer()
  sampleCustomer.setId(UUID.randomUUID())
  sampleCustomer.setEmail("emaill@customer.com")

  HashMap<Customer, Cart> customerToCart = new HashMap<>()

 when:
  customerToCart.put(sampleCustomer, sampleCart)

 then:
  customerToCart.get(sampleCustomer) == sampleCart
 and:
  sampleCustomer.setEmail("email@customer.com")
  customerToCart.get(sampleCustomer) == sampleCart
}

In the above test, we want to ensure that after changing the email of a sample customer we're still able to find its cart. Unfortunately, this test fails. Why? Because HashMap stores keys in "buckets," and the bucket is chosen from the key's hash code. This idea is what makes hash maps so fast. To simplify, imagine that every bucket holds a particular range of hashes. What happens if we store the key in the first bucket (responsible for hashes between 1 and 10), and later the hashCode method returns 11 instead of 5 (because it's mutable)? The hash map tries to find the key, but it checks the second bucket (holding hashes 11 to 20), and that bucket is empty. So there is simply no cart for the given customer. That's why having immutable hash codes is so important! The simplest way to achieve this is to use immutable objects. If for some reason that's impossible in your implementation, then remember to limit the hashCode method to only the immutable elements of your objects.
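The same effect can be reproduced in plain Java. The sketch below is only an illustration: the Key class and the MutableKeyDemo name are made up for this example, and the hash code is simply a mutable number so that the values 5 and 11 from the explanation above can be used directly. A real HashMap derives the bucket from the hash instead of using ranges, but the outcome is the same.

```java
import java.util.HashMap;
import java.util.Map;

class MutableKeyDemo {

 // Hypothetical key whose hash code is simply its mutable number.
 static final class Key {
  private int number;

  Key(final int number) {
   this.number = number;
  }

  void setNumber(final int number) {
   this.number = number;
  }

  @Override
  public boolean equals(final Object o) {
   return o instanceof Key && ((Key) o).number == number;
  }

  @Override
  public int hashCode() {
   return number;
  }
 }

 // Stores a key with hash 5, changes its number to newNumber, then looks the key up.
 static boolean foundAfterChangingTo(final int newNumber) {
  final Map<Key, String> carts = new HashMap<>();
  final Key key = new Key(5);
  carts.put(key, "cart");
  key.setNumber(newNumber);
  return carts.containsKey(key);
 }

 public static void main(final String[] args) {
  System.out.println(foundAfterChangingTo(11)); // false: the lookup uses the new hash
  System.out.println(foundAfterChangingTo(5));  // true: the hash did not change
 }
}
```

Note that the entry is not gone from the map. It is still stored under the old hash, so it only becomes unreachable through the mutated key.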

The second hashCode rule tells us that if two objects are equal (according to the equals method), their hash codes must be the same. That means those two methods must be related, which can be achieved by basing them on the same information (basically the same fields).
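A minimal sketch of what goes wrong when the two methods use different fields. Both classes and their field names (id, version) are hypothetical and exist only for this illustration: Mismatched compares id in equals but mixes version into hashCode, while Consistent bases both methods on id.

```java
import java.util.HashSet;
import java.util.Set;

class EqualsHashCodeMismatchDemo {

 // Hypothetical class: equals looks only at id, hashCode also mixes in version.
 static final class Mismatched {
  final int id;
  final int version;

  Mismatched(final int id, final int version) {
   this.id = id;
   this.version = version;
  }

  @Override
  public boolean equals(final Object o) {
   return o instanceof Mismatched && ((Mismatched) o).id == id;
  }

  @Override
  public int hashCode() {
   return 31 * id + version;
  }
 }

 // Same fields, but equals and hashCode are both based on id only.
 static final class Consistent {
  final int id;
  final int version;

  Consistent(final int id, final int version) {
   this.id = id;
   this.version = version;
  }

  @Override
  public boolean equals(final Object o) {
   return o instanceof Consistent && ((Consistent) o).id == id;
  }

  @Override
  public int hashCode() {
   return id;
  }
 }

 static boolean mismatchedFound() {
  final Set<Mismatched> set = new HashSet<>();
  set.add(new Mismatched(1, 0));
  // Equal according to equals, but the hash codes are 31 and 32.
  return set.contains(new Mismatched(1, 1));
 }

 static boolean consistentFound() {
  final Set<Consistent> set = new HashSet<>();
  set.add(new Consistent(1, 0));
  return set.contains(new Consistent(1, 1));
 }

 public static void main(final String[] args) {
  System.out.println(mismatchedFound()); // false
  System.out.println(consistentFound()); // true
 }
}
```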

The last rule, but not the least important one, is about the transitivity of equals. It looks trivial, but it's not, at least once you think about inheritance. Imagine we have a date object and a date-time object extending it. It's easy to implement the equals method for a date: when both dates are the same, we return true. The same goes for date-times. But what happens when I want to compare a date to a date-time? Is it enough that they have the same day, month and year? Can we compare hours and minutes when this information is not present on a date? If we decide to use such an approach, we're screwed. Please analyze the example below:
 2016-11-28 == 2016-11-28 12:20
 2016-11-28 == 2016-11-28 15:52
Due to the transitive nature of equals, we would have to say that 2016-11-28 12:20 is equal to 2016-11-28 15:52, which is, of course, stupid. But that is exactly what the equals contract requires.
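The date example can be sketched in code as follows. SimpleDate and SimpleDateTime are hypothetical classes written for this illustration (they are not the JDK date types), and the mixed comparison that ignores the time is exactly the questionable approach described above.

```java
import java.util.Objects;

class TransitivityDemo {

 // Hypothetical date type used only for this illustration.
 static class SimpleDate {
  final int year;
  final int month;
  final int day;

  SimpleDate(final int year, final int month, final int day) {
   this.year = year;
   this.month = month;
   this.day = day;
  }

  @Override
  public boolean equals(final Object o) {
   if (!(o instanceof SimpleDate)) return false;
   final SimpleDate other = (SimpleDate) o;
   return year == other.year && month == other.month && day == other.day;
  }

  @Override
  public int hashCode() {
   return Objects.hash(year, month, day);
  }
 }

 // Hypothetical date-time type extending the date.
 static class SimpleDateTime extends SimpleDate {
  final int hour;
  final int minute;

  SimpleDateTime(final int year, final int month, final int day, final int hour, final int minute) {
   super(year, month, day);
   this.hour = hour;
   this.minute = minute;
  }

  @Override
  public boolean equals(final Object o) {
   // Mixed comparison: against a plain date only the date part is compared.
   if (!(o instanceof SimpleDateTime)) return super.equals(o);
   final SimpleDateTime other = (SimpleDateTime) o;
   return super.equals(o) && hour == other.hour && minute == other.minute;
  }
  // hashCode is inherited, so equal objects still share a hash code.
 }

 public static void main(final String[] args) {
  final SimpleDate date = new SimpleDate(2016, 11, 28);
  final SimpleDateTime noon = new SimpleDateTime(2016, 11, 28, 12, 20);
  final SimpleDateTime afternoon = new SimpleDateTime(2016, 11, 28, 15, 52);

  System.out.println(noon.equals(date));      // true
  System.out.println(date.equals(afternoon)); // true
  System.out.println(noon.equals(afternoon)); // false, so transitivity is broken
 }
}
```

Comparing getClass() instead of using instanceof, as the IDE-generated equals in the Customer class does, avoids this problem, at the price that a date is never equal to a date-time.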

JPA use-case

Now let's talk about JPA. It looks like implementing the equals and hashCode methods here is really simple. We have a unique primary key for each entity, so an implementation based on this information seems right. But when is this unique ID assigned? During object creation, or just after flushing changes to the database? If you're assigning the ID manually, it's OK, but if you rely on the underlying engine, you can fall into a trap. Imagine such a situation:

public class Customer {

 @OneToMany(cascade = CascadeType.PERSIST)
 private Set<Address> addresses = new HashSet<>();

 public void addAddress(Address newAddress) {
  addresses.add(newAddress);
 }

 public boolean containsAddress(Address address) {
  return addresses.contains(address);
 }
}

If the hashCode of the Address is based on the ID, then before saving the Customer entity we can assume all hash codes are equal to the same value, such as zero (because there is simply no ID yet). After flushing the changes, the ID is assigned, which also results in a new hash code value. Now when you invoke the containsAddress method, it will unfortunately return false, for the same reasons that were explained in the first section about HashMap. How can we protect against such a problem? As far as I know there is one valid solution - UUID.

class Address {

 @Id
 @GeneratedValue
 private Long id;
 
 private UUID uuid = UUID.randomUUID();

 // all other fields with getters and setters if you need

 @Override
 public boolean equals(final Object o) {
  if (this == o) return true;
  if (o == null || getClass() != o.getClass()) return false;
  final Address address = (Address) o;
  return Objects.equals(uuid, address.uuid);
 }

 @Override
 public int hashCode() {
  return Objects.hash(uuid);
 }
}
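As a quick check of this idea without a database, the sketch below uses a plain stand-in for the entity. TrackedAddress is a hypothetical class without JPA annotations, and the setId call merely simulates the ID being assigned on flush.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class UuidIdentityDemo {

 // Plain stand-in for the Address entity, without JPA annotations.
 static final class TrackedAddress {
  private Long id; // assigned later, as a generated key would be
  private final UUID uuid = UUID.randomUUID();

  Long getId() {
   return id;
  }

  void setId(final Long id) {
   this.id = id;
  }

  @Override
  public boolean equals(final Object o) {
   return o instanceof TrackedAddress && uuid.equals(((TrackedAddress) o).uuid);
  }

  @Override
  public int hashCode() {
   return uuid.hashCode();
  }
 }

 static boolean foundAfterIdAssignment() {
  final Set<TrackedAddress> addresses = new HashSet<>();
  final TrackedAddress address = new TrackedAddress();
  addresses.add(address);
  address.setId(42L); // simulates the ID being assigned on flush
  return addresses.contains(address);
 }

 public static void main(final String[] args) {
  System.out.println(foundAfterIdAssignment()); // true: the hash code does not depend on id
 }
}
```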

The uuid field (which can be a UUID or simply a String) is assigned during object creation and stays immutable during the whole entity lifecycle. It's stored in the database and loaded into the field right after querying for this object. It of course adds some overhead and footprint, but nothing comes for free. If you want to know more about the UUID approach, you can check two brilliant posts talking about that:

Biased locking

For over ten years, the default locking implementation in Java has used something called "biased locking." Brief information about this technique can be found in the flag description (source: Java Tuning White Paper):

-XX:+UseBiasedLocking 
Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.

What is interesting for us in the context of this post is how biased locking is implemented internally. Java uses the object header to store the ID of the thread holding the lock. The problem is that the object header layout is well defined (if you're interested, please refer to the OpenJDK sources, hotspot/src/share/vm/oops/markOop.hpp) and it cannot be "extended" just like that. In a 64-bit JVM the thread ID is 54 bits long, so we must decide whether we want to keep this ID or something else. Unfortunately, "something else" means the object hash code (in fact the identity hash code, which is stored in the object header). This value is used whenever you invoke the hashCode() method on any object whose class doesn't override the implementation from the Object class, or when you directly call the System.identityHashCode() method. That means that when you retrieve the default hash code for any object, you disable biased locking support for this object. It's pretty easy to prove. Take a look at this code:

class BiasedHashCode {

 public static void main(String[] args) {
  Locker locker = new Locker();
  locker.lockMe();
  locker.hashCode();
 }

 static class Locker {
  synchronized void lockMe() {
   // do nothing
  }

  @Override
  public int hashCode() {
   return 1;
  }
 }
}

When you run the main method with the following VM flags:
-XX:BiasedLockingStartupDelay=0 -XX:+TraceBiasedLocking
you can see that... there is nothing interesting :)

However, after removing the hashCode implementation from the Locker class, the situation changes. Now we can find a line like this in the logs:
Revoking bias of object 0x000000076d2ca7e0 , mark 0x00007ff83800a805 , type BiasedHashCode$Locker , prototype header 0x0000000000000005 , allow rebias 0 , requesting thread 0x00007ff83800a800

Why did it happen? Because we have asked for the identity hash code. To sum up this part: if your class doesn't override hashCode, calling hashCode() on its instances means no biased locking for them.
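The bias revocation itself cannot be observed from plain Java code, but which calls reach the identity hash code can. A minimal sketch, with hypothetical class names chosen for this illustration: for a class that doesn't override hashCode, hashCode() returns the same value as System.identityHashCode(), while an overriding class never touches the identity hash code when hashCode() is called.

```java
class IdentityHashDemo {

 // Does not override hashCode, so hashCode() falls back to the identity hash code.
 static class Plain {
 }

 // Overrides hashCode, so hashCode() never asks for the identity hash code.
 static class Overriding {
  @Override
  public int hashCode() {
   return 1;
  }
 }

 static boolean plainUsesIdentityHash() {
  final Plain plain = new Plain();
  return plain.hashCode() == System.identityHashCode(plain);
 }

 static int overridingHash() {
  return new Overriding().hashCode();
 }

 public static void main(final String[] args) {
  System.out.println(plainUsesIdentityHash()); // true
  System.out.println(overridingHash());        // 1
 }
}
```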

Big thanks to Nicolai Parlog from https://www.sitepoint.com/java/ for reviewing this post and pointing out some mistakes.
