Wednesday, July 25, 2007

Why Simple Hash Functions Work Well, Part II

In my last post, I described a problem that's bothered me for some time: why do simple hash functions work so well -- that is, as well as the analysis for truly random hash functions? The natural answer is that there is some sort of randomness in the data interacting with the randomness in how the hash function is chosen that combines to give these results. In our paper, Salil Vadhan and I try to give a solid foundation for this approach.

To prove things about this interaction, we need models. For hash functions, we naturally look primarily at 2-universal (or pairwise independent, or sometimes 4-wise independent) hash functions. Such families are well-studied in theory and have practical implementations.
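
As a concrete illustration of such a family, here is a minimal sketch of the classic multiply-add construction h(x) = ((a*x + b) mod p) mod m. The names `make_hash` and `P`, and the choice of the Mersenne prime 2^61 - 1, are assumptions made for this example and are not taken from the paper. Keys are assumed to be non-negative integers smaller than `P`; under that assumption two distinct keys collide with probability at most 1/m over the random choice of `a` and `b`.

```python
import random

# Assumed prime modulus; it must exceed every key that will be hashed.
P = (1 << 61) - 1

def make_hash(m, rng=None):
    """Draw one function h: [0, P) -> [0, m) from the multiply-add family."""
    rng = rng or random.Random()
    a = rng.randrange(1, P)  # a is chosen nonzero
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % m

    return h

# Pick one hash function for a table with 1024 buckets and hash a few keys.
h = make_hash(1024, random.Random(42))
buckets = [h(key) for key in (17, 18, 19, 1_000_003)]
```

The only randomness here is in the two numbers `a` and `b`, which is what makes such families cheap to store and evaluate.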

To model the data, we assume that it comes from a block source. This means that each new data item, conditioned on the values of the previous items, still has some entropy to it. In other words, each new piece of data is at least somewhat unpredictable. It can depend on the previous data in essentially arbitrary ways; for example, it might be possible to narrow the possibilities for the next data item to a small range based on the past, or the distribution of what the next item will be could be highly skewed. But as long as there is still enough unpredictability, we are fine. This seems like a perfectly reasonable model for network or streaming data to me, although as far as I know the block source model has not been used in this context previously.
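
For readers who want the condition spelled out, one standard way to formalize a block source uses min-entropy. This is a sketch of the textbook definition; the symbols X_i, T, and k are my notation here, and the paper may work with a different entropy measure.

```latex
% Min-entropy of a random variable X:
H_\infty(X) = \min_{x} \log_2 \frac{1}{\Pr[X = x]}

% (X_1, \dots, X_T) is a block source with k bits of min-entropy per block if,
% for every i and every prefix (x_1, \dots, x_{i-1}) that occurs with
% positive probability,
H_\infty\bigl(X_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}\bigr) \ge k
```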

Once we have the model in place, in some sense, we're done. The Leftover Hash Lemma of Impagliazzo, Levin, and Luby essentially says that, if there's enough entropy in your block stream, you get a near-uniform output over the entire stream when using even just a 2-universal hash function. Our paper improves on and specializes this flavor of result for the setting we're looking at -- hashing items into tables -- and examines the implications for many applications.
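
For reference, the single-item form of the lemma is usually stated roughly as follows. This is the textbook bound, not the sharper version from our paper, and the symbols H, U_m, and Delta (statistical distance) are notation assumed for this sketch.

```latex
% Let \mathcal{H} be a 2-universal family of functions h : \{0,1\}^n \to \{0,1\}^m,
% let H be uniform over \mathcal{H}, and let X be independent of H with
% H_\infty(X) \ge k. Then
\Delta\bigl( (H, H(X)),\ (H, U_m) \bigr) \le \tfrac{1}{2} \sqrt{2^{m-k}}

% In particular, if m \le k - 2\log_2(1/\varepsilon), the pair (H, H(X)) is
% within statistical distance \varepsilon/2 of uniform.
```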

Finally, I feel I have a clear and clean justification for why truly random analysis is suitable even if only simple hash functions will be used. I expect there's still more that one can do to improve this framework, and it raises a number of interesting open questions that lie at the intersection of algorithms, information theory, and networking. For more details, I refer you to the paper.

I'll leave off with an interesting open question. In the "power of two choices" (or Balanced Allocations) scenario, a sequence of n items is hashed into a table with n bins. Each item is hashed twice, with 2 different hash functions (more generally, hashed d times, with d different hash functions), with each hash giving a possible location for the item. The item is actually placed in the least loaded of the 2 choices (breaking ties arbitrarily). Now on a lookup one has to check 2 places for the item, but the load distribution is much more even. Specifically, with high probability the maximum number of items in a bin with d >= 2 choices is only log log n / log d + O(1), instead of the (1 + o(1)) log n / log log n that one gets from one choice. (You might check out this somewhat outdated survey if you're not familiar with the problem.)

As far as I know, we don't have a worst-case analysis (that is, as opposed to our paper, one with no assumptions on the input data) for, say, pairwise-independent or even k-wise independent (for some constant k) hash functions, for some constant number of hash functions d. This has been open for some time.
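
The balanced-allocation process described above is easy to simulate. The sketch below is only an illustration: the function name `allocate` is made up for this example, and the d choices come from Python's pseudorandom generator, which stands in for truly random hash functions. It therefore says nothing about the open question for limited-independence families.

```python
import random

def allocate(n, d, seed=0):
    """Place n items into n bins, each item going to the least loaded of d random bins."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        # The d candidate bins play the role of the d hash values of the item.
        choices = [rng.randrange(n) for _ in range(d)]
        # min() returns the first least-loaded candidate, one arbitrary tie-break.
        target = min(choices, key=lambda i: bins[i])
        bins[target] += 1
    return bins

# Compare the maximum load with one choice and with two choices.
n = 10_000
max_one = max(allocate(n, 1))
max_two = max(allocate(n, 2))
```

Running this for increasing n is a quick way to see the gap between the two bounds quoted above.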


Anonymous said...

Neat. Practically, this implies that compressing your data will significantly improve the efficiency of hashing.

Anonymous said...

I don't think that compressing data will improve the performance of hashing. Compressed data has the same entropy but occupies the same space as uncompressed data. However, for predecessor queries, order-preserving compression may improve performance. With uniformly distributed keys, one can answer predecessor queries in expected constant time. Keys that are perfectly compressed will be uniformly distributed.

Anonymous said...

I recently read in a paper that dynamic hashing methods are rarely used in DBMSs. A quote from the article:
"One of the main reasons was their susceptibility to skewed data, which robust, data-driven index structures like B+ trees were able to handle much better".
This might mean that skew in the input is reflected in the hashed output. My question is whether this is really true, and for which classes of hash functions.