Friday, May 30, 2008

More Robust Cuckoo Hashing (ESA paper)

I have a paper at ESA, with Adam Kirsch and Udi Wieder, entitled "More Robust Hashing: Cuckoo Hashing with a Stash." (Here's an extended preprint version.) The starting point of the paper is that cuckoo hashing is wonderful, and everybody should know about it and use it. (If you don't know about it or use it, see here or here. But basically, cuckoo hashing is multiple choice hashing with the added bonus of getting to move items around among their multiple choices as needed to balance load on-line.)

So, of course, we want to make cuckoo hashing even better. And one of the problems of cuckoo hashing is that it "fails" with non-trivial probability. For example, in standard 2-choice cuckoo hashing, putting n items into 2(1+eps)n cells with at most 1 item per cell can be done with probability 1 - Theta(1/n). That gives a Theta(1/n) "failure" rate. This barely matters in terms of "average" behavior, because a failure can always be handled by re-hashing everything with a new hash function, and that's the end of the story in most theory papers.

In practice, that kind of failure rate is a bit high. Re-hashing an entire hash table could just be too expensive an operation to have occur that frequently for many applications. Can we improve the failure rate somehow?

Let's look at where that Theta(1/n) failure rate comes from. When there are 2 choices per item, the simplest type of failure one might imagine is that 3 items try to use the same two cells -- that is, 3 items have the same 2 choices of location. Indeed, such a problem occurs with probability Theta(1/n) [exercise LTR]. But when such a problem occurs, we can imagine taking one of those 3 items and stashing it somewhere -- in a stash as suggested by the paper title. The stash should be a little space on the side, that we may have to check whenever we look for something. If we implement a stash, how does the size of the stash affect the failure rate?
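For readers who want the exercise spelled out, here is a rough first-moment sketch. It assumes a model the post does not state explicitly: the 2(1+eps)n cells are split into two subtables of m = (1+eps)n cells each, and every item picks one uniform, independent cell in each subtable.

```latex
% Assumed model: two subtables of m = (1+\epsilon)n cells each; every item
% picks one uniform, independent cell in each subtable.
% For a fixed triple, the 2nd and 3rd items must each match the 1st item's
% pair of cells, which happens with probability 1/m^2 apiece.
\begin{align*}
\Pr[\text{a fixed triple shares both cells}] &= \left(\frac{1}{m^{2}}\right)^{2} = m^{-4},\\
\mathbb{E}[\#\,\text{bad triples}] &= \binom{n}{3}\, m^{-4}
  = \frac{n(n-1)(n-2)}{6\,(1+\epsilon)^{4}\, n^{4}}
  = \Theta\!\left(\frac{1}{n}\right).
\end{align*}
```

The union bound turns this expectation into an upper bound on the probability that some bad triple exists. For the matching lower bound, note that two distinct bad triples occurring together has probability only O(1/n^2), so inclusion-exclusion leaves the Theta(1/n) term intact.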

If failures were essentially independent, so that each item "fails" independently with probability proportional to 1/n^2, then we'd expect f failed items with probability O(n^{-f}). This turns out to be the right intuition; our analysis shows that a stash sized for s items (for constant s) reduces the failure rate to O(n^{-s-1}). The analysis is a bit trickier, since of course item failures aren't independent, but the result seems natural.
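The back-of-the-envelope version of that intuition can be written out as follows. This is only the heuristic, not the paper's analysis: it takes the independence assumption at face value, and the constant c and the random variable F (the number of failed items) are names introduced here for illustration.

```latex
% Heuristic only: assume each of the n items fails independently with
% probability p = c/n^2 for some constant c, and let F be the number of
% failed items. A stash of size s absorbs up to s failures, so the table
% fails only if F >= s+1.
\begin{align*}
\Pr[F \ge s+1]
  &\le \binom{n}{s+1}\, p^{\,s+1}
   \le \frac{(np)^{s+1}}{(s+1)!}
   = \frac{1}{(s+1)!}\left(\frac{c}{n}\right)^{s+1}
   = O\!\left(n^{-(s+1)}\right).
\end{align*}
```

The first inequality is a union bound over all sets of s+1 items; independence is what lets the probability of each set factor as p^{s+1}. Setting s = 0 recovers the O(1/n) rate with no stash.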

So a small, constant-sized stash -- easily implementable in software or hardware -- reduces the failure probability dramatically, allowing one to avoid rehashing in practice. What I found particularly interesting from the theory+practice perspective is the power in this setting of a constant-sized stash. It's a very natural addition for practitioners -- I don't think it would cause an engineer to blink -- but I think it really changes the potential for cuckoo hashing, making it a structure you could imagine employing on devices used by millions of customers, without having to worry about this otherwise far-too-common failure mode.
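To make "easily implementable in software" concrete, here is a minimal sketch of 2-choice cuckoo hashing with a stash. It is an illustration under assumptions, not the implementation from the paper: the class name `CuckooWithStash`, the `max_kicks` cutoff for giving up on an insertion, and the salted use of Python's built-in `hash` are all choices made up for this example.

```python
import random


class CuckooWithStash:
    """Two-subtable cuckoo hash set with a small stash (illustrative sketch)."""

    def __init__(self, cells_per_table, stash_size=2, max_kicks=50,
                 hashes=None, seed=0):
        self.m = cells_per_table
        self.tables = [[None] * self.m, [None] * self.m]
        self.stash = []
        self.stash_size = stash_size
        self.max_kicks = max_kicks  # assumed cutoff before declaring a failure
        if hashes is None:
            # Assumption: stand-in hash functions built by salting hash().
            rng = random.Random(seed)
            salts = (rng.getrandbits(64), rng.getrandbits(64))
            hashes = tuple(lambda key, s=s: hash((s, key)) for s in salts)
        self.hashes = hashes

    def _slot(self, which, key):
        return self.hashes[which](key) % self.m

    def contains(self, key):
        # One probe per subtable, then a scan of the (constant-sized) stash.
        return (self.tables[0][self._slot(0, key)] == key
                or self.tables[1][self._slot(1, key)] == key
                or key in self.stash)

    def insert(self, key):
        """Insert key (must not be None); raise RuntimeError if a rehash is needed."""
        if self.contains(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(which, key)
            # Place the key in hand and pick up whatever was in the cell.
            key, self.tables[which][i] = self.tables[which][i], key
            if key is None:
                return  # the cell was empty, so nothing is left in hand
            which = 1 - which  # the evicted item tries its other choice
        # Still holding an item after max_kicks moves: use the stash.
        if len(self.stash) < self.stash_size:
            self.stash.append(key)
            return
        # A real table would rebuild with fresh hash functions at this point.
        raise RuntimeError(f"stash full; rehash needed (homeless key: {key!r})")


# Usage: force the bad case from the text, three keys sharing the same two cells.
table = CuckooWithStash(cells_per_table=4, stash_size=1,
                        hashes=(lambda k: 0, lambda k: 1))
for k in ("a", "b", "c"):
    table.insert(k)
print(all(table.contains(k) for k in "abc"), len(table.stash))  # True 1
```

With `stash_size=0` the same three insertions end in the `RuntimeError`, which is the rehash case the stash is meant to avoid. The cost shows up in `contains`: every lookup that misses both cells also has to check the stash, which is why it needs to stay small.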

We show similar results for multiple cuckoo hashing variations. (The analysis, unfortunately, is different for all of the variations; it would be nice for someone to develop a cleaner, nicer, more unified analysis of all the varieties of cuckoo hashing.) The case of 2 choices is of non-trivial interest, however, since Udi, along with Moni Naor and Gil Segev, has recently shown that 2-choice cuckoo hashing can be used to develop a very efficient history-independent hash table (check out Udi's web page for the paper). The main problem with such a structure is, naturally, the non-trivial probability of rehashing, which can be avoided using a stash.

The stash idea also fits nicely with my previous paper with Adam on the Power of One Move for cuckoo-hashing-style schemes, although there the stashes (with suggested implementation as Content Addressable Memories in hardware) were larger than constant-sized. I'm beginning to think the idea of stashes (or CAMs) should be a standard section in practically oriented, hashing-related papers.

3 comments:

David Molnar said...

Thanks for posting this - sounds like an exciting practical result. This reminds me a little bit of a paper we read in an architecture class (CS 252). The authors considered the problem of when to evict data from a cache. They found that if they kept around a small additional holding area of memory for recently evicted data (under several different eviction policies), they could dramatically reduce their cache miss rate and improve performance. I'll see if I can dig up the reference...

Michael Mitzenmacher said...

Hi David. If I remember my CS 252 correctly, what you're talking about is called a victim cache, and there is a similar intuition. If you have a direct-mapped or k-way-set-associative cache for a small k, you may get into a situation where a (hopefully) small number of your cache lines are overloaded, causing a large number of misses, and you generally can't tell which of these cache lines will be overloaded in advance.

With a victim cache -- which handles recent evictions over ALL cache lines -- you can use a small amount of memory to cover all the cache lines, protecting yourself when these small overloads occur.

Again, similar in spirit. Of course, here we get some nice analytical results too. :)

David Molnar said...

Yes, that's exactly it. Thanks for the reminder. Would expect nothing less in the way of analytical results from you and your co-authors. :)