Wednesday, December 20, 2017

Closed hashing probes and clustering..

Closed hashing with linear probing has a problem: once with hit a filled slot, there's a good chance we'll be searching O(n) through the cluster of collisions. This is well-known and called primary clustering.

So, we'll use quadratic probing? Using (hash+(i+i*i)>>1)%N with a N=2^n tables we will still search all the slots, but probe sequences starting from different slots will diverge. Unfortunately they will match if we start from the same slot (we're just using n bits of the hash), which is known as secondary clustering and also well-known.

The usually suggested remedy for secondary clustering is double hashing. We take another hash (or even just hash the original full hash; this is enough to avoid the "only using n low-order bits" problem as long as your second hash is good at mixing the bits.. or even just bit-reversing the first hash could work), let's call it hash2 and use it to control the probe sequence. The standard approach is (hash+(hash2|1)*i)%N (where the bitwise-or with 1 will enforce full sequence for 2n tables) and this is now perfect?

What about the case where we are really unlucky and a bunch of entries all get the same low bits in hash2 and then collide? Now they could theoretically cluster just like in linear probing, since they end up using the same linear sequence. This obviously shouldn't happen if hash2 is good, just like we shouldn't get a collision in the first place if hash is good, so maybe we should think about it anyway?

So why not fix that too? It doesn't even seem complicated. See, we can combine the double hash with the quadratic probing and end up with (hash+(hash2|1)*((i+i*i)>>1))%N. That's still guaranteed to hit every slot (in a "randomly" permuted order) of a power-of-two table! If two entries with same low bits of hash2 collide, then the quadratic will force them to diverge as long as their home slots are different. If two entries have the same home slot (low bits of hash) then hash2 will force a different permutation.

Now, it seems like the above would lead to a rather "fool-proof" collision resolution, with little cost over regular double hashing and even somewhat resistant against the second hash being poor (eg. you still get a quadratic probe too). The only way you can then cluster is by hitting both primary AND secondary clustering at the same time (ie. collision in both hashes at the same time).

So .. why have I not seen this mentioned anywhere? Am I missing some non-obvious problem?