# Preventing Hash Collisions

Articles —> Preventing Hash Collisions

Hashing is an irreversible digestion of data into a data type if uniform length. In Java, hashing of objects occurs via the hashCode method, and is important for storing and accessing objects in data structures (such as a Map or Set).

Because the hashCode method in java returns an int data type, it is limited to only the size of the int: 32-bits of information. Therefore with a large number of objects hash collisions are likely. This being said, even with a small number of objects, if the hashCode method does not return a number that is uniformly distributed across all plausible int values, hash collisions can be inevitable.

Take the following use case: a Line class defined by two end Point's:

Point

```
/**

* Point class based upon an x and y coordinate

* @author gcope

*

*/

public class Point {

private final int x;

private final int y;

public Point(int east, int south) {

this.x = east;

this.y = south;

}

public int getX(){

return x;

}

public int getY(){

return y;

}

@Override

public int hashCode() {

final int prime = 31;

int result = 17;

result = prime * result + y;

result = prime * result + x;

return result;

}

}

```

The Line class:

```
/**

* Line class defined by two end Points

* @author gcope

*

*/

public class Line {

private Point p1;

private Point p2;

public Line(Point startCell, Point endCell) {

this.p1 = startCell;

this.p2 = endCell;

}

@Override

public int hashCode() {

final int prime = 59;

int result = 43;

result = prime * result + ((p1 == null) ? 0 : p1.hashCode());

result = prime * result + ((p2 == null) ? 0 : p2.hashCode());

return result;

}

}

```

For brevity, accessor and equals methods are omitted, as are comments. Each class defines a simple hashCode method, returning an int value based upon its fields. The danger here of course, comes from hash collisions. A simple example:

```
Line line1 = new Line(new Point(154, 156), new Point(154, 156));

Line line2 = new Line(new Point(153, 156), new Point(151, 158));

System.out.println(line1.hashCode() + ":" + line2.hashCode());

```

Both line1 and line2 have the same hashCode: 1429303. In fact, in this particular case the level of collision is extremely high. Consider the test case below, in which 6,250,000 Lines with different endpoints get generated:

```
public static void main(String[] args) {

List<Point> points = new ArrayList<Point>();

for ( int i = 0; i < 50; i++ ){

for ( int j = 0; j < 50; j++ ){

Point c = new Point(i,j);

}

}

System.out.println(points.size() + " points generated");

System.out.println("Testing " + (points.size()*points.size()) + " number of Lines");

Set<Integer> set = new HashSet<Integer>();

int collisions = 0;

for ( int i = 0; i < points.size(); i++ ){

for ( int j = 0; j < points.size(); j++ ){

Line r = new Line(points.get(i), points.get(j));

if ( set.contains(r.hashCode() ) ){

collisions++;

}

}

}

System.out.println(collisions);

}

}

```

The above results in an astounding 6,155,919 collisions!

How might one lower the probability of collisions? A hash can be defined by the fields of a class, but also inter-dependent properties of those fields. In simpler terms, a line has a length, and a line has a slope. What happens if we include these calculations within the hashCode method of the Line class?

```
slope = (startCell.getY() - endCell.getY());

if ( slope == 0 ){//prevent division by 0

slope = startCell.getX() - endCell.getX();

}else{

slope = slope / (startCell.getX()- endCell.getX()) / slope;

}

length = Math.sqrt(Math.pow(startCell.getX() - endCell.getX(), 2) + Math.pow(startCell.getY() - endCell.getY(), 2));

....

@Override

public int hashCode() {

final int prime = 59;

int result = 43;

result = prime * result + ((p1 == null) ? 0 : p1.hashCode());

long temp;

temp = Double.doubleToLongBits(length);

result = prime * result + ((int) (temp ^ (temp >> 32)));

temp = Double.doubleToLongBits(slope);

result = prime * result + ((int) (temp ^ (temp >> 32)));

result = prime * result + ((p2 == null) ? 0 : p2.hashCode());

return result;

}

```

With the above changes, there are 870116 collisions: still a lot, but an 85% reduction in hashCode collisions. Something to consider when hashing is an integral part of your application.