A collision happens when distinct values produce the same hash code.
If you can artificially create a hash collision, that’s an issue. Because this means you can inject malicious scripts (see Collision Attack).
Given Document A and Document B.
Document A is completely different from Document B.
But hash(Document A) is equal to hash(Document B).
How can we avoid collisions?
It’s important to keep in mind that collisions are possible. And this is expected and normal!
As the Pigeon hole principle says, we can’t avoid collisions: try to put 6 pigeons into 5 holes.
Because hash functions have infinite input length and a predefined output length, there is inevitably going to be the possibility of two different inputs that produce the same output hash.
Pigeon hole principle
“if you have 50 pigeons into 25 pigeon holes, you have to stuff 2 of the pigeons into 1 of the holes”
The Pigeonhole principle says we can’t avoid all collisions:
— try to hash without collision m keys into n slots with m > n
— try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
— open hashing: put little dictionaries in each entry (shove extra pigeons in one hole!)
— closed hashing: pick a next entry to try