Tuesday, March 06, 2018

Optimizing Learned Bloom Filters with Sandwiching

For the small-ish subset of people out there who care about "learned Bloom filters" (the subject of my last post), I have a small-ish update.  I guess the data structure has been unconsciously sitting around in the back of my head, and when I paged it back into my conscious self, I noticed there was an (in hindsight) obvious improvement. 

Recall that the learned Bloom filter uses a (magic) black-box predictor obtained from learning theory that predicts whether an element is in a set or not;  to avoid false negatives, it then uses a backup Bloom filter to rescue any set elements that have been incorrectly predicted as not being in the set.  This leads to two sources of false positives, from the predictor and the backup Bloom filter.

It turns out it is better to get rid of some of those false positives up front.  That is, you get better performance if you have a regular old Bloom filter up front, followed by the learned predictor, followed by a (much smaller) backup Bloom filter to round up stray false negatives.  Because you have the predictor sandwiched between two Bloom filters, I refer to this as the "sandwiched learned Bloom filter". 

The math is surprisingly easily -- and at the same time, really beautiful in terms of the result.  It turns out that, if you think of distributing "Bloom filter bits" out to the up-front Bloom filter and the backup Bloom filter, the right thing to do is first put bits on the backup Bloom filter, up to a certain fixed amount, which depends on the the parameters of the predictor (false positive rate, false negative rate).  After that, all the bits should go to the up-front Bloom filter.

Like most Bloom filter optimizations, the gains are worthwhile but do not seem to be another-factor-of-two;  it looks like it this can cut the size another 10-25% or so, depending on the settings, and it essentially free.   

The whole writeup is less than 2 pages, so rather than write and explain it all here, if you're interested, check the arxiv draft.  Hopefully, more fun in this setting will be forthcoming....

Thursday, January 25, 2018

Some Notes on "Learned Bloom Filters"

About a month ago, a draft paper was put out on arxiv called The Case for Learned Indexed Structures by the people at Google Brain.  The paper has received some significant attention, perhaps because of its pedigree and perhaps because it makes some rather bold claims.  For example, a line from the abstract states "In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes."

There has already been some pushback on some of the claims from the draft paper (see here for a response to the paper's discussion of B-trees, and here for a discussion regarding that cuckoo hash tables seem to do just as well as "learned" hash tables).  I also read the paper and had some back and forth with the authors about some thoughts and concerns I had;  they have been both responsive and thoughtful, and I think the conversation was fruitful on both sides.  (At least, they helped me understand their work, and I hope my suggestions for them were useful.)  In what follows, I'll explain some of specific thoughts and concerns regarding their proposed learned Bloom filters.  I'll try to keep the blog post self-contained, but you may want to read their paper.  Also, I've tried to write my ideas down more formally in a short write-up.  Here's a link to a draft;  this link may age out or disappear if/when I put this on arxiv.  

What follows is my description (so any issues are mine, not the authors of the Learned Index paper).

Recall that a Bloom filter approximately represents a set S in small space.  One can ask queries of the form:  is y an element of S.  If y is an element of S, the answer will always be yes (no false negatives), if not there is some chance of false positive.  That is, it trades off some amount of error to save space.  A key feature is that the probability that any given input y yields a false positive does not depend on y.  (Well, it does after the Bloom filter has been instantiated -- some elements give false positives, some don't.  But a priori, before you build the Bloom filter using hash functions, if you don't know the hash functions, you don't know what the false positives will be.)

For the learned Bloom filter, suppose we have a very good (black-box) predictor P, which takes an input x and returns P(x), which should be interpreted as an estimate of the probability that x is an element of S.  That is, we have a function that can give a very good "guess" if an input x is an element of S.  Then we can use P as a "pre-filter" to improve our Bloom filter as follows.  When asked if x is an element of S, compute P(x), and simply return that x is in S if P(x) is larger than some suitable threshold z.  This may introduce false positives, but it may also introduce false negatives;  there may be elements x of S for which P(x) < z.  To deal with those, we use a backup filter, which is just a standard Bloom filter that holds the false negatives from the pre-filter using the predictor P.  So now we accept that x is an element of S if P(x) is above the threshold, or x obtains a positive answer from the backup filter.  The backup filter also introduces false positives.  

The authors of the learned index paper suggest that machine learning allows for good predictors P that require small space, thereby improving the Bloom filter in terms of space while achieving the same false positive probability.  Note that the predictor does not necessarily have to be that great;  even if half the elements of S are false negatives, the backup filter would then be a little bigger than half the size of a standard Bloom filter, so if the predictor P is suitably small, it is plausible that one could get the same performance with smaller space.

I don't know if this idea is new.  I don't think I've seen a probabilistic pre-filter for a Bloom filter before;  usually a Bloom filter is a pre-filter to some other part of the system.  It seems like an interesting idea;  certainly, the potential to layer filters is useful, and putting a filter before a Bloom filter is sensible.  

A major issue, though, is that the nature of the guarantees one obtains are quite different for learned Bloom filters and standard Bloom filters, which is a subtlety that I worry may not be clear to all readers of the original paper.  In particular, it seems in part that confusion might arise because the term "false positive rate" appears to have slightly different meanings in the Bloom filter community and the machine learning community;  in the Bloom filter setting, false positive rate generally refers to that false positive probability, which is independent of the query.  In the machine learning setting, the false positive rate is an empirical quantity, which depends on the input data and the queries.  

An example may be helpful;  here's one I used from the writeup.  (This is a "thought-experiment" example, not based on actual experiments.)  Suppose we are working with numbers in the range [0,1000000), and our learned Bloom filter is supposed to hold a set S that contains a 1000 elements.  Half of those elements are random numbers in the range [1000,2000], and half those elements are random over the rest of the range.  A good machine learning function might learn a predictor P that says that if x is in the range [1000,2000], then P(x) is "large" -- say around 1/2 -- and otherwise P(x) is small.  It could use a threshold z of around 0.4, in which case about half the elements of S are false negatives, and must be stored in the backup filter.  This seems like a reasonable outcome for learning.

What is the false positive rate now?  Well, it depends on the queries.  If all the queries are uniform over the range [1000,2000], the false positive rate would be quite high -- every query to a non-set element would be a false positive.  If the queries are uniform over the range [0,1000000), the false positive rate is quite low, since one will rarely get unlucky and ask a query in the range [1000,2000], and otherwise the only false positives will come from the backup Bloom filter.  The point is that the false positive probability depends on the distribution of the queries (as well as the predictor).  But there may be many applications where the queries are naturally more often for elements that are "similar to" set elements -- that is, they have high P values, like the values in [1000,2000] that are not elements of S -- which may lead to high false positive rates.  And if you don't know the query stream in advance, you don't even know what the false positive rate will be.  

The response to this seems to be that if you have data about the queries, you can figure out the false positive rate empirically.  That is, if you have a test set of queries that you assume is a sample from a distribution over all possible queries, you can determine the false positive rate of that test set, and use it to predict the false positive rate of future queries.  My writeup provides basic definitions to formalize this idea, which seems correct.  However, this still assumes that "future queries" will come from the same distribution as the test set.  Returning to our example, if in our test set queries are uniform over [0,1000000), but then in the future queries change so that they are uniform over a smaller interval [0,100000), that significantly changes the false positive rate of the filter.  

My takeaways were the following:
  • The idea of adding learning into data structures is certainly an interesting idea.
  • But it's still important to try to set up a theoretical framework, to understand what kinds of guarantees can be obtained.
  • At the moment, I can see applications where learned Bloom filters might be helpful -- if your data set has significant structure that can be useful to a machine learning function, if you don't have to deal with insertions and deletions frequently, if you have access to query data, and if you have reason to believe the query stream isn't changing (in a distributional sense).  
  • At the same time, I can see applications where I would be very concerned about the use of learned Bloom filters, such as in a security setting.  Also, modern Bloom filter alternatives, such as cuckoo filters (which use less space, and allow insertions/deletions more easily) should be considered and compared when thinking about using learned Bloom filters.  
  • There's probably more to think about here.  In particular, I wonder if there are other places where additional "filter layering" may be useful.  


Tuesday, January 09, 2018

Double-blind, ALENEX

I wanted to point people to a pair of blog posts (Part 1, Part 2) by Suresh Venkatasubramanian at the geomblog discussing the experience of having the ALENEX conference go double-blind for reviewing this year, from the perspective of him and his co-chair Rasmus Pagh.  The high order bits are that they thought it worked fine despite the additional work, and they recommend doing it again in future years.  

I am in favor of double-blind reviewing, which is standard for many other conferences in many other areas, but is somehow considered impossible for CS theory.  (Suresh does an excellent job addressing why that has been historically, and why the arguments against double-blind reviewing do not really hold up.)  The theory community hasn't historically taken "conflicts of interest" issues seriously, as I've written about before (I guess a rather long time ago!).  Double-blind reviewing helps with CoI issues, but as Suresh discusses, also deals with the huge implicit bias issues that I think are also a problem in the theory community. 

(Except that, really, calling it implicit bias is something of a misnomer -- in many cases it's really explicit bias.  As Suresh notes, in providing responses to common arguments:
But how am I supposed to know if the proof is correct if I don't know who the authors are. 
Most theory conferences are now comfortable with asking for full proofs. And if the authors don't provide full proofs, and I need to know the authors to determine if the result is believable, isn't that the very definition of bias?   
This is a real argument commonly presented at PC meetings.  I've had people say, "I haven't verified the proofs, but XXX is one of the authors, so I'm sure it's all right.")

I encourage people to go read Suresh's discussion, and I would like to thank Suresh and Rasmus for taking it upon themselves to perform this experiment. 

(If you're interested, you can also read the paper by Tomkins et al discussing double-blind reviewing experiences recently at WSDM as well.) 

Thursday, November 16, 2017

Harvard CS Concentrators Jump Again

During my time as Area Dean for computer science, the number of computer science majors at Harvard more than doubled.  Growth has continued, and according to the Crimson, we're now clearly the second largest concentration at Harvard.  (It looks like we just barely surpassed Government last year, and now we're clearly larger.  Economics is still bigger.)

When you consider that Applied Math is the 4th largest major group, and that's a group that is largely composed of people who do some mix of Math/Computer Science/Economics, I think the numbers actually underestimate the growth in CS at Harvard. 

It's been an exciting time since I started at Harvard in 1999.  One could see what was coming.  Though, sadly, the administration still seems far from having caught up -- we (like many other CS departments) really could use some more resources coming our way.   Universities are a bit slow to respond (sometimes for good reasons), and the speed of these changes is, in context, just remarkable.  And, I'd have to say, have added to the fun of being a CS professor.  Back in 1999, CS was incorrectly perceived to be a backwater at Harvard, small and unnoticed, but where there was a lot of excitement and energy.  Fast forward less than two decades, and now we're clearly at the core of Harvard and its future. 

Thursday, November 09, 2017

BARC, Copenhagen

A few summers ago, I had an opportunity to visit Copenhagen, and work with Rasmus Pagh for a month.   I (and the family) liked it so much we went back the next summer for a hashing workshop.  While algorithms and data structure people were already familiar with Denmark's Aarhus and MADALGO, it was clear to me that Copenhagen was on the rise as an algorithmics center.

That's now been formalized, thanks to a huge new grant for algorithms research centered around a collection of fantastic researchers in Copenhagen -- and they're looking for graduate students, post-docs, visitors, and faculty.  Rasmus sent me a blurb that I'm posting below, introducing the new Basic Algorithms Research Copenhagen, or BARC.

This is wonderful in so many ways.  First, it is always great generally for our community when algorithmic work is recognized and funded.  Second, this is a fantastic group of researchers -- for example, obviously I am biased, but Rasmus and Mikkel Thorup, both individually and together, have just been knocking it out of the park over and over again for years in algorithms and data structures (particularly with regard to hashing).  Third, for many, this is a marvelous opportunity.  Copenhagen is just a wonderful city, and the chance to go and work there with the top people is something many people will enjoy.  (And don't we do better work when we're someplace we enjoy being?)

With all that, let me put up the blurb on BARC, and point you to the home page at barc.ku.dk and the Facebook page at http://fb.me/barcdk.  (Note there are deadlines coming up for some of the positions.)

BARC (Basic Algorithms Research Copenhagen) is new center for foundational research in design and analysis of algorithms. The center, which will be inaugurated onDecember 1, is led by Mikkel Thorup. It aims to "attract top talent from around the world to an ambitious, creative, collaborative, and fun environment” in Copenhagen, Denmark. Other core scientists in the center are Stephen Alstrup, Thore Husfeldt, and Rasmus Pagh. The center is part of a larger effort increasing the number of faculty members in algorithms research, and making Copenhagen a hub for algorithms research. The BARC center has received funding of over 6 million USD for the next 6 years from VILLUM Fonden to employ graduate students and post-docs, and to attract a stream of long- and short-term visitors. People interested in BARC should visit the web pages at barc.ku.dk or follow @barcdk on social media — the first assistant/associate professor, post-doc, and PhD student positions are already announced, with deadlines in November and December.

Wednesday, September 06, 2017

Simulated Annealing for JPEG Quantization

I have a new paper on the arxiv (web page describing results, and full version** available) with two Harvard undergrads, Max Hopkins and Sebastian Wagner-Carrena (who may both be applying for grad school, and who are both great, hint, hint everybody...), titled "Simulated Annealing for JPEG Quantization", on finding better default JPEG quantization tables.   This work started as their class project for my graduate class on "Algorithms at the end of the wire" last year, and they stuck with it through the year and into the summer to hone it to a paper.  I'm always excited when class projects to turn into papers, especially when it's done by undergrads.

Here's the story of the paper.  To start, as the wikipedia page on JPEG says, in JPEG after you transform 8 by 8 pixel blocks into 64 coefficients in the frequency domain, you "quantize" the coefficients -- that is, divide each coefficient by a fixed number (with a different number for each coefficient), and then round.  This reduces the amount of information needed to represent the block, at the expense of some loss of fidelity.  In particular, high frequency coefficients are quantized with much higher numbers, so that most high frequency coefficients become 0, further improving available compression.  This is generally OK because reducing the higher frequency components has less of an effect on the human interpretation of the image than reducing the lower frequency components.  

There are "default" quantizations tables built into the JPEG standard.  Back when the powers that be developed the original JPEG quantization tables, they were essentially choosing them "by hand", with much less to go on today about models for the human visual system, and much less computing power available.  These days, we have better models, or better said actual metrics, of how "good" a compressed image is with respect to the original in terms of the human visual system, and a great deal more computing power.  This suggests we should be able to effectively explore the very large space of possible JPEG quantization matrices using heuristic techniques, such as simulated annealing.  This idea has been considered before, but we seem to be at a much better point in terms of metrics and computation now than with past attempts.   

Of course it's still not so easy (it is a 64-dimensional space, and the metrics have their own issues), and there was a good amount of work and testing to be done (and some insight to be gained).  But we found what we think are clearly some better "default" JPEG tables (that we've made available at the web page above), and because of the way JPEG works, you can add them in to your JPEG file to use in place of the "standard default".  So it's all backward-compatible, ready-to-use, no need to change JPEG, which is ubiquitous enough that actual changes are unlikely to happen.  Any experts out there, go ahead and try them out and let us know what you think.  Also, I think the work generally opens the door to others to aim for additional improvements, using different metrics and/or different heuristic techniques.

** The arxiv version is, strangely, not the full version.  Apparently, arxiv has "size limits", and because we're doing side-by-side comparison of many images, apparently we're above the size limit, and they wouldn't give us an exception.  I hadn't known that could happen on arxiv, something to keep in mind in the future.  

Tuesday, September 05, 2017

Grace Hopper College

I hadn't seen the news about Yale renaming Calhoun College to Hopper College;  it just popped into one of my newslines here in the New York times.  I guess I hadn't seen that it was in the works months and months ago, as described in this Yale News article.  Consider me behind the times, but wanting to spread the news in case others hadn't seen it.

That's just awesome.

Saturday, August 12, 2017

Best Post I've Read on the Google Memo

After the shout-out to Meena, she suggested I might have more to say on the issue of the Google memo.  I (like I imagine so many others) have been following the array of follow-up articles and posts since this issue erupted.  There is so much to that needs to be said, and I feel that I am not the best person to say it -- what so many others have said in response.  In fact, I'm sad and angry that collectively so many of us are spending so much time thinking about this memo -- to counter that, my hope is that just having the issue of diversity prominently in the news and in the tech-sphere mindset will lead to more discussions and action that will result in actual progress.  

But maybe I have a few things to say, and a pointer to someone who I think has said it best.

1)  Anyone who somehow thinks there doesn't remain active bias against women in mathematics, engineering, and computer science hasn't been talking to actual women in these fields.

2)  In my opinion (and I owe this expression of thought to David Andersen, although these are my words), this memo wasn't about "the science".  My opinion is that the memo author went in with a prior belief of the truth, and then found some science that (at best, does a very poor job) of justifying what he believes is the truth.  That is not how science works.  (David called this "confirmation bias in action".)

3)  While there have been many good posts on the topic, I found this post today says it best, and encourage you all to read it.