Before diving into consistent hashing, let’s understand the naive approach first.
Naive: Distributed hashtable based on modulo
Sharding can be achieved via a simple modulo shardNumber = hash(key) % TOTAL_SHARDS
The hash function might differ based on expected properties of the key.
If the key is an auto-incremental user_id, then hash(key) = key hashing might work well.
One downside to this is that if the total number of shard changes, all the currently cached data becomes invalid and all requests would have to hit the DB to warm up the cache.
Drawbacks
- It is NOT horizontally scalable
- It may NOT be load balanced, especially for non-uniformly distributed
data. Hot shards that gets saturated while others are idle and empty.
What if you add a new shard?
All the currently cached data becomes invalid and all requests would have to hit the DB to warm up the cache.
It’s as if the cache suddenly disappeared. Your DB wil be hit hard and may go down.
You just DOSed yourself when scaling up your cache.
Consistent hashing is ideal for the situation described above.
How does consistent hashing work?
Based on an hash-code too. In Java or C#, hashCode method on Object returns an int, which lies in the range -2^31 to 2^31-1.
A 2 B 3 C 4 1
A value (integers) belongs to the next Node (letters)
Object 2 belongs in cache B.
Object 3 belongs in cache C.
Object 1 and 4 belong in cache A.
Consider what happens if cache C is removed: object 3 now belongs in cache A, and all the other object mappings are unchanged.
Another cache D is added;
A 2 B 3 C 4 D 1
^ new node
D will take objects 3 and 4, leaving only object 1 belonging to A.
Virtual Nodes
This works well, except the size of the intervals assigned to each cache is pretty hit and miss. Since it is essentially random it is possible to have a very non-uniform distribution of objects between caches. The solution to this problem is to introduce the idea of “virtual nodes”, which are replicas of cache points in the circle. So whenever we add a cache we create a number of points in the circle for it.
Implementation?
TreeMap<Integer, MachineNode> nodes;
A 2 B 3 C 4 1
| | |
1 2 3 cache node hash codes
nodes.put(1, machineA);
nodes.put(2, machineB);
nodes.put(3, machineC);
I want to get the machine for 2: nodes.ceilingEntry(2)
FYI ceilingEntry returns a key-value mapping associated with the least key greater than or equal to the given key, or null if there is no such key.
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
public class ConsistentHash<T> {
private final HashFunction hashFunction;
private final int numberOfReplicas;
private final SortedMap<Integer, T> circle =
new TreeMap<Integer, T>();
public ConsistentHash(HashFunction hashFunction,
int numberOfReplicas, Collection<T> nodes) {
this.hashFunction = hashFunction;
this.numberOfReplicas = numberOfReplicas;
for (T node : nodes) {
add(node);
}
}
public void add(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.put(hashFunction.hash(node.toString() + i),
node);
}
}
public void remove(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.remove(hashFunction.hash(node.toString() + i));
}
}
public T get(Object key) {
if (circle.isEmpty()) {
return null;
}
int hash = hashFunction.hash(key);
if (!circle.containsKey(hash)) {
SortedMap<Integer, T> tailMap =
circle.tailMap(hash);
hash = tailMap.isEmpty() ?
circle.firstKey() : tailMap.firstKey();
}
return circle.get(hash);
}
}
Resources
Amazing article: http://tom-e-white.com/2007/11/consistent-hashing.html
Leave a Reply
Want to join the discussion?Feel free to contribute!