Preventing Hash Collisions


Hashing is an irreversible digestion of data into a value of uniform length. In Java, objects are hashed via the hashCode method, which is essential for storing and accessing objects in hash-based data structures (such as a Map or Set).

Because the hashCode method in Java returns an int, it is limited to 32 bits of information. With a large number of objects, hash collisions are therefore likely. Worse, even with a small number of objects, if hashCode does not return values uniformly distributed across the full int range, collisions can be all but inevitable.
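Collisions are easy to produce even by hand. Java's own String class uses the same style of 31-based polynomial hash, and two distinct two-character strings can land on the same value:

```java
public class StringCollision {
    public static void main(String[] args) {
        // For a two-character String, hashCode() is 31 * s[0] + s[1]:
        // 31 * 'A' + 'a' = 31 * 65 + 97 = 2112
        // 31 * 'B' + 'B' = 31 * 66 + 66 = 2112
        System.out.println("Aa".hashCode()); // prints 2112
        System.out.println("BB".hashCode()); // prints 2112
    }
}
```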

Take the following use case: a Line class defined by two end Points:

The Point class:


/**
 * Point class based upon an x and y coordinate
 * @author gcope
 */
public class Point {

    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 17;
        result = prime * result + y;
        result = prime * result + x;
        return result;
    }
}

The Line class:


/**
 * Line class defined by two end Points
 * @author gcope
 */
public class Line {

    private Point p1;
    private Point p2;

    public Line(Point startCell, Point endCell) {
        this.p1 = startCell;
        this.p2 = endCell;
    }

    @Override
    public int hashCode() {
        final int prime = 59;
        int result = 43;
        result = prime * result + ((p1 == null) ? 0 : p1.hashCode());
        result = prime * result + ((p2 == null) ? 0 : p2.hashCode());
        return result;
    }
}

For brevity, accessor and equals methods are omitted, as are comments. Each class defines a simple hashCode method that returns an int based upon its fields. The danger here, of course, comes from hash collisions. A simple example:


Line line1 = new Line(new Point(154, 156), new Point(154, 156));
Line line2 = new Line(new Point(153, 156), new Point(151, 158));
System.out.println(line1.hashCode() + ":" + line2.hashCode());

Both line1 and line2 have the same hashCode: 1429303. In fact, in this particular case the collision rate is extremely high. Consider the test case below, which generates 6,250,000 Lines with different endpoints:
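To see the tie mechanically, expand both hashCode polynomials (a hand check, not part of the classes above): a Point hashes to 31 * (31 * 17 + y) + x, and a Line hashes to 59 * (59 * 43 + h1) + h2:

```java
public class CollisionCheck {
    // Point.hashCode expanded: 31 * (31 * 17 + y) + x
    static int pointHash(int x, int y) {
        return 31 * (31 * 17 + y) + x;
    }

    // Line.hashCode expanded: 59 * (59 * 43 + h1) + h2
    static int lineHash(int h1, int h2) {
        return 59 * (59 * 43 + h1) + h2;
    }

    public static void main(String[] args) {
        int line1 = lineHash(pointHash(154, 156), pointHash(154, 156));
        int line2 = lineHash(pointHash(153, 156), pointHash(151, 158));
        System.out.println(line1 + ":" + line2); // prints 1429303:1429303
    }
}
```

The point hashes differ (21327 versus 21326 and 21386), but 59 * 21327 + 21327 and 59 * 21326 + 21386 both equal 1279620, so the line hashes tie.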


public static void main(String[] args) {
    List<Point> points = new ArrayList<Point>();
    for (int i = 0; i < 50; i++) {
        for (int j = 0; j < 50; j++) {
            points.add(new Point(i, j));
        }
    }
    System.out.println(points.size() + " points generated");
    System.out.println("Testing " + (points.size() * points.size()) + " Lines");
    Set<Integer> set = new HashSet<Integer>();
    int collisions = 0;
    for (int i = 0; i < points.size(); i++) {
        for (int j = 0; j < points.size(); j++) {
            Line r = new Line(points.get(i), points.get(j));
            if (set.contains(r.hashCode())) {
                collisions++;
            }
            set.add(r.hashCode());
        }
    }
    System.out.println(collisions);
}

The above results in an astounding 6,155,919 collisions!
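Why so many? A quick analysis (mine, not the original test's) shows the problem: the Point hash collapses the 2,500 grid points into a narrow band of values, and the Line hash then maps every pair of points into a single contiguous range of integers:

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionAnalysis {
    public static void main(String[] args) {
        // Point hash = 16337 + 31*y + x; with x and y in [0, 50) the
        // term 31*y + x covers every integer in [0, 1568], so the
        // 2,500 points share only 1,569 distinct hash codes.
        Set<Integer> pointHashes = new HashSet<>();
        for (int x = 0; x < 50; x++) {
            for (int y = 0; y < 50; y++) {
                pointHashes.add(31 * (31 * 17 + y) + x);
            }
        }
        System.out.println(pointHashes.size()); // prints 1569

        // Line hash = 149683 + 59*h1 + h2. Because h2 ranges over a
        // block of consecutive integers wider than the multiplier 59,
        // the sums fill one contiguous interval of 94,081 values.
        Set<Integer> lineHashes = new HashSet<>();
        for (int h1 : pointHashes) {
            for (int h2 : pointHashes) {
                lineHashes.add(59 * (59 * 43 + h1) + h2);
            }
        }
        System.out.println(lineHashes.size()); // prints 94081
    }
}
```

Cramming 6,250,000 lines into 94,081 possible hash codes forces at least 6,250,000 - 94,081 = 6,155,919 collisions, exactly the number observed.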

How might one lower the probability of collisions? A hash can be defined not only by the fields of a class, but also by inter-dependent properties of those fields. In simpler terms: a line has a length, and a line has a slope. What happens if we include these calculations in the Line class's hashCode method?


double dy = startCell.getY() - endCell.getY();
double dx = startCell.getX() - endCell.getX();
if (dx == 0) { // vertical line: prevent division by zero
    slope = Double.POSITIVE_INFINITY;
} else {
    slope = dy / dx;
}
length = Math.sqrt(dx * dx + dy * dy);

....

@Override
public int hashCode() {
    final int prime = 59;
    int result = 43;
    result = prime * result + ((p1 == null) ? 0 : p1.hashCode());
    long temp = Double.doubleToLongBits(length);
    result = prime * result + (int) (temp ^ (temp >>> 32));
    temp = Double.doubleToLongBits(slope);
    result = prime * result + (int) (temp ^ (temp >>> 32));
    result = prime * result + ((p2 == null) ? 0 : p2.hashCode());
    return result;
}



With the above changes, the test produces 870,116 collisions: still a lot, but an 85% reduction in hashCode collisions. Something to consider when hashing is an integral part of your application.
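Another option, beyond derived fields (a sketch of mine, not from this article): pass each step of the polynomial through a strong bit scrambler, such as the 32-bit finalizer from MurmurHash3, so that nearby inputs spread across the full int range. The lineHash method below is a hypothetical stand-in for Line.hashCode over the raw coordinates:

```java
import java.util.Arrays;

public class MixedHash {
    // fmix32 from MurmurHash3: a bijective scrambler on int.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hypothetical replacement hash, mixing after every coordinate.
    static int lineHash(int x1, int y1, int x2, int y2) {
        int h = mix(17 + x1);
        h = mix(h + y1);
        h = mix(h + x2);
        h = mix(h + y2);
        return h;
    }

    public static void main(String[] args) {
        // Re-run the 6,250,000-line experiment with the mixed hash,
        // counting duplicates via sort rather than a large HashSet.
        int[] hashes = new int[50 * 50 * 50 * 50];
        int idx = 0;
        for (int x1 = 0; x1 < 50; x1++)
            for (int y1 = 0; y1 < 50; y1++)
                for (int x2 = 0; x2 < 50; x2++)
                    for (int y2 = 0; y2 < 50; y2++)
                        hashes[idx++] = lineHash(x1, y1, x2, y2);
        Arrays.sort(hashes);
        int collisions = 0;
        for (int i = 1; i < hashes.length; i++)
            if (hashes[i] == hashes[i - 1]) collisions++;
        System.out.println(collisions + " collisions");
    }
}
```

For a uniform 32-bit hash the birthday bound predicts roughly n²/2³³ ≈ 4,500 collisions among 6,250,000 values, orders of magnitude below the figures above.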





© 2008-2017 Greg Cope