Friday, October 24, 2014

Harvard Junior Faculty Job Search

Harvard Computer Science is doing a junior faculty search this year.  We've pretty much been doing one every year, but it's nice to be able to make an announcement and point people to the ad.

So here is the ad, telling you what you need to send in if you're applying.

Also as usual, we're looking in all areas, because we're always interested in great people.  However, in the interest of full disclosure, the focus this year is described as
This is a broad faculty search and we welcome outstanding applicants in all areas of computer science, including applicants whose research and interests connect to such areas as engineering, health and medicine, or the social sciences. Of particular interest are candidates with a focus on systems and data science, including operating systems, networking, software engineering, data storage and management, and computational science.
We'll look forward to your application.

Wednesday, October 22, 2014

Cuckoo Filters

An upcoming paper appearing at CoNext will be of interest to any Bloom Filter user or aficionado:

Cuckoo Filter:  Practically Better than Bloom
(Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher)

Let me describe a cuckoo filter and some of what's in the paper for you.  If you want to avoid a technical discussion, all you need to know is that for reasonably large sets, at the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/construction), and amazingly also allow deletions of keys (which Bloom filters cannot do).  If you want to look at code, there's even a github repository for you with code for cuckoo filters.

Note:  I'll assume you probably know what a Bloom filter is in the discussion that follows.  As a reminder, it's a hash-based data structure for set membership that can give false positives;  by allowing false positives, you can save space in your representation of the set.  I'll also assume you probably know what cuckoo hashing is.  As a reminder, in cuckoo hashing, you have buckets that can contain a fixed number of keys, and each key gets some number d of buckets (obtained by getting d values from hashing the key) where it can be placed.  When you insert a new key, if all of its buckets are full, you can move keys out of one of those buckets to one of their other buckets to make room.  Since apparently there are drinking games based on taking a drink whenever I say "Bloom filter" or "cuckoo hashing" in a talk, I imagine most readers of this blog know about these things.  (Or, as always, Wikipedia.)
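To make the reminder concrete, here is a minimal sketch of cuckoo hashing with two choices and bucketed slots, in Python.  The names and parameters (BUCKET_SIZE, MAX_KICKS, the two hash choices) are illustrative assumptions of mine, not anyone's actual implementation:

    import random

    BUCKET_SIZE = 4          # slots per bucket
    MAX_KICKS = 500          # give up (or resize) after this many evictions
    NUM_BUCKETS = 1 << 16
    table = [[] for _ in range(NUM_BUCKETS)]

    def buckets_for(key):
        # two bucket choices, from two different hashes of the key
        return hash(("first", key)) % NUM_BUCKETS, hash(("second", key)) % NUM_BUCKETS

    def insert(key):
        i1, i2 = buckets_for(key)
        for i in (i1, i2):
            if len(table[i]) < BUCKET_SIZE:
                table[i].append(key)
                return True
        # both buckets full: evict a resident key and re-insert it elsewhere
        i = random.choice((i1, i2))
        for _ in range(MAX_KICKS):
            victim = table[i].pop(random.randrange(BUCKET_SIZE))
            table[i].append(key)
            key = victim
            j1, j2 = buckets_for(key)
            i = j2 if i == j1 else j1       # the victim's other choice
            if len(table[i]) < BUCKET_SIZE:
                table[i].append(key)
                return True
        return False    # table too full; in practice, rehash or grow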

The framework used in the paper comes from the following idea.  One way of implementing Bloom filter functionality when given a static set of items is to use a perfect hash function;  that is, a hash function that maps the n elements of your set into an array of size n in a 1-1 fashion.  Then, you use another hash function to store a fingerprint of the element instead of the element itself in the array location.  To do a set membership test on an element z, you hash z once to find its location in the array, and hash z again to find its fingerprint, and check against the fingerprint in the array location.  For elements not in the set, you get a false positive rate of 2^{-(# of fingerprint bits)}.
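Here's a minimal sketch of that static fingerprint idea, assuming you're handed a perfect hash function (constructing one is its own well-studied problem); the function names and the 8-bit fingerprint are just placeholders of mine:

    FP_BITS = 8    # f fingerprint bits => false positive rate about 2^{-f}

    def fingerprint(key):
        return hash(("fp", key)) & ((1 << FP_BITS) - 1)

    def build(keys, perfect_hash):
        # perfect_hash maps the n keys of the set 1-1 onto 0..n-1
        table = [None] * len(keys)
        for k in keys:
            table[perfect_hash(k)] = fingerprint(k)
        return table

    def maybe_contains(table, perfect_hash, key):
        # a non-member lands in some arbitrary slot; it is a false positive
        # only if its fingerprint happens to match the one stored there
        return table[perfect_hash(key) % len(table)] == fingerprint(key)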

The problem is that the perfect hashing framework doesn't allow you to insert new items naturally.  Or do deletions (though standard Bloom filters don't allow deletions either -- you need a counting Bloom filter for that).  So a natural thing to do is to think of using some other sort of hash table, besides a perfect hash table, and see what happens.  This idea was explored, for instance, in some papers I worked on back in 2006 (here and here) using "d-left" hash tables, which are based on a multiple choice hashing scheme.

If you're going to use multiple choice hashing schemes, though, you should think about using cuckoo hashing.  The ability to move keys around means you should get better space utilization;  for example, even with 2 choices, if your buckets can hold 4 items, cuckoo hashing can get you about 95% space utilization.  The problem with cuckoo hashing in this setting is that, for a Bloom filter, you want to keep just fingerprints of keys, not the keys themselves.  So, when you want to move a key, how do you figure out where to move it, now that you no longer have the key to hash?

The approach used in the paper is a trick called partial-key cuckoo hashing.  To get two buckets for our key, we do a hash to get the first bucket (idealized to be uniformly at random), but to get the second bucket, we do not hash the key, but just the fingerprint of the key;  we XOR that value with the index of the first bucket to get the second.  The upside of this approach is that, given a bucket and a fingerprint stored in that bucket, we can find the other bucket corresponding to the fingerprint.  (XORing with the hash of the fingerprint works whether the fingerprint is currently in the first bucket or the second, as is easily checked.)  The downside is that if our fingerprint has F bits, our second bucket is really only chosen from 2^F possibilities;  it is not uniform conditioned on the first.
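In code, the bucket computation looks something like the following sketch (my naming; it assumes the number of buckets is a power of two so the XOR stays in range, and reserves fingerprint value 0 to mark an empty slot):

    NUM_BUCKETS = 1 << 16    # must be a power of two for the XOR trick
    FP_BITS = 12

    def fingerprint(key):
        f = hash(("fp", key)) & ((1 << FP_BITS) - 1)
        return f or 1                      # reserve 0 for "empty slot"

    def index1(key):
        return hash(("idx", key)) % NUM_BUCKETS

    def alt_index(i, fp):
        # the other bucket depends only on (current bucket, fingerprint),
        # not on the original key; applying it twice returns you to i
        return i ^ (hash(("alt", fp)) % NUM_BUCKETS)

    # insertion computes i1 = index1(key) and i2 = alt_index(i1, fingerprint(key));
    # a later eviction of that fingerprint from either bucket can recover its
    # alternate location via alt_index alone, with no access to the key.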

Because of this limitation, one interesting aspect of the paper is that we show, in theory, this approach just will not work.  Specifically, there is an Omega(log n) lower bound on the size of the fingerprint that must be stored;  this means the cuckoo filter is super-linear in space, unlike standard Bloom filters, which are linear in space.  This lower bound is easy to derive.  Suppose that you have buckets that can hold b keys.  If 2b+1 keys have the same pair of buckets, then there is trouble;  you can't fit 2b+1 keys into 2b bucket slots.  So the question is whether there are collections of 2b+1 keys that all have the same pair of buckets.  With standard cuckoo hashing, this would be a low probability event.  With partial-key cuckoo hashing, the probability depends on the size of the fingerprint, since this determines how many possible second choices of a bucket there are for each key (given their first random choice of a bucket).  If you have too few choices, you will get too many collisions, or too many keys in a pair of buckets.  Some straightforward calculations (sketched below) yield the Omega(log n) lower bound on the size of the fingerprint.
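Here is a back-of-the-envelope version of that calculation (my notation, not the paper's), with n keys, m = Theta(n/b) buckets of b slots each, and F-bit fingerprints.  Since a key's second bucket is determined by its first bucket and its fingerprint, any group of keys sharing a (first bucket, fingerprint) pair shares both buckets:

    \[
    \Pr[\text{a fixed set of } 2b+1 \text{ keys shares one (bucket, fingerprint) pair}]
      \approx \left(\frac{1}{m\,2^{F}}\right)^{2b},
    \]
    \[
    \mathbb{E}[\#\text{bad groups}]
      \approx \binom{n}{2b+1}\left(\frac{1}{m\,2^{F}}\right)^{2b}
      = \Theta\!\left(\frac{n}{2^{2bF}}\right) \text{ for constant } b,
    \]
    \[
    \text{so keeping this below a constant requires } F = \Omega\!\left(\frac{\log n}{2b}\right).
    \]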

So now that we've proven theoretically that this won't work, how could it work in practice?  If you go back to those calculations, you find that the lower bound has some nice constant factors.  The bound you get on the fingerprint size is actually (roughly speaking) (log n)/(2b) bits.  That constant factor in the denominator, even when b = 4, covers a large range of n.  In practice, fingerprints can be quite a reasonable size and still avoid the potential bottleneck of too large a group of keys all hashing to the same small number of buckets.

In any case, this may be the first paper I've written that contains a proof that the described data structure fails in theory but works in practice.  (The other way around I generally find much easier.) 

Big open theoretical question:  try to analyze why partial-key cuckoo hashing gets roughly the same load threshold as regular cuckoo hashing.  (Anyone want to work on that?  Let me know...)

This post is long enough;  there's a lot more in the paper, including comparisons with other schemes, bit-saving tricks, and lots of performance results.  Have a look.  

Postscript/Tangent:  In a recent review (for a different paper, with a different theme) one reviewer made the following comment/criticism:  "Despite years of research, no one has found a variant of cuckoo hashing that can outperform simpler methods in realistic situations."  To which I say, hunh?  Yes, I understand linear probing is wonderful in practice due to cache effects and block sizes, and in many and perhaps even most situations can yield better performance than cuckoo hash tables.  But certainly not everywhere.  Cuckoo hash tables certainly seem at least potentially better for many hardware settings, and for at least some important applications -- such as this one -- cuckoo hashing seems to be a clear win.  (The issue here is that the number of fingerprints you have to check on a lookup affects the false positive probability -- so even if linear probing lookups are "faster" due to cache effects and block sizes, trying to get this sort of space utilization gives you larger contiguous blocks of fingerprints to check, which increases your false positives.)

My co-author David Andersen also violently disagrees with this review comment, and points to these recent papers at NSDI and Eurosys.  

Perhaps I'm missing something, but my thought was that this was a sign of a bad review (both in content, and in its overall rudeness).  I felt obliged to try to correct the record (and I can only hope the reviewer sees this post).
 

Friday, October 03, 2014

Andreessen-Horowitz Again

Andreessen Horowitz had a second Academic Roundtable, gathering together academics, VCs, and people in startups to talk for a few days about a variety of topics.  I wasn't sure why I was invited the first time (which I wrote about last year), and am even less clear on why I was invited back, but I again had a very good time. 

Before a few words about the talks, the high order point:  I'm told that most/all of the talks will be put online over the coming weeks.  So if you're interested, you can experience them yourself.  I think it's great that they're doing that this year;  if you like the talks, you should tell them, so they keep doing it.  (If they don't have comments there, you can always comment here.)  Sadly, they don't seem to be up yet, but I'll post the links when they become available. 

Highlights would include a discussion of Bitcoin -- interesting to hear what Ed Felten, well-known Princeton professor and now ex-Chief Technologist of the FTC, thinks about the Bitcoin economy.  Dawn Song of Berkeley gave a general talk on security issues of the present and future, while Dan Boneh of Stanford gave a talk on the power of program obfuscation.  Raj Rajkumar of CMU gave some history and some peeks into the future of driverless cars -- it's not just Google, you know.  Tuomas Sandholm of CMU talked about his take on starting startups while still being an academic (based on now-multiple experiences), and Josh Bloom of UC Berkeley (and wise.io) described the differences between writing papers about machine learning and building products using machine learning.

Of course, some heated discussion about the variety of issues that arise between academic work and the transfer of technology to startups inevitably ensued.  (Maybe that's why I get invited.)  The key idea repeated by many (on both sides of the fence), in various forms, was that businesses (and in particular startups) are very busy going down their road of development and product; they may see many things off to the sides of that road that are very interesting, but they don't have time to explore off the road.  Academics are interested in things way off the road, often thinking of issues much further out in time-scale.  And (at least in my opinion) the role academics play here is a good thing;  there (obviously) remain a lot of ways the two worlds can interact and cooperate.

Thursday, September 18, 2014

On Academia vs. Industry.... MSR SVC Closing

I'm one of those professor types that ends up defending the joys of life as an academic versus a career in industry -- some of you who read this blog have probably seen me comment at Matt Welsh's blog or Daniel Lemire's blog/Google+ chain.  And to be clear, I'm not anti-industry;  I just want to make sure there's a fair discussion.

In that vein, I'm sad to hear that Microsoft Research Silicon Valley will be closing, as described in this article.  I know many people at MSRSV, and have visited there often.  It's a loss to the community to have it close.  I have sympathy for those who work there -- this must be stressful -- but this sympathy is happily diminished because I know the people there are so talented that they will quickly move on to other employment.  (When Digital's Systems Research Center closed long ago, a number of the people I worked with in the lab moved on to some small company that seemed to be just starting to really get going, called Google.  I wonder how it worked out for them.)

After a moment of silence for the lab, I do feel it necessary to point out that this is one of the "issues" in the academia vs industry life choice.  The life cycle for companies moves rapidly.  That's not a bad thing, but it's a thing.  Disruptions like this are a non-trivial risk in industry, much less so in academia.  (Just look at the history of research labs.)  Again, I'm sure it will work out for the people at MSRSV, but any life shift like this -- even if it ends positively -- is stressful.  Without wanting to overstate the consequences, it's worth pointing out.

Monday, September 15, 2014

Simultaneous Enrollment

The Crimson has an op-ed on simultaneous enrollment that I agree with.

Harvard does not like simultaneous enrollment, which means a student taking two classes that meet at the same time -- any time overlap counts (whether the whole class or half an hour once a week).  If you want to take a class via simultaneous enrollment, you have to petition the Administrative Board, and your professor is supposed to provide direct hour-per-hour instruction for the class you can't attend.  As a previous Crimson article states:
The Faculty Handbook requires “direct and personal compensatory instruction” for simultaneous enrollment, but only recently has the Ad Board refused to recognize videotaped lectures as a stand-in for class time.
The article notes that for the past several years the Ad Board had accepted recorded lectures, under some additional conditions, as a suitable proxy for the direct and personal compensatory instruction.  This apparently represented a change from its past position, and this last year, while I was on sabbatical, some Standing Committee on Education Policy decided to push back and say no more recorded substitutions.

(An aside:  I don't know why they used the dated term "videotaped".  My lectures have been recorded and made available online;  these days, I'm not sure any "videotape" is actually involved in the process.)  

I believe I had a great deal to do with the Ad Board accepting recorded lectures these last several years.  My undergraduate class, CS 124, was recorded with very high quality production for the Harvard Extension School, and I made this video available to the students.  Some years ago, students started approaching me wanting to do simultaneous enrollment for CS 124, and I thought it was fine, because of the availability of the recorded lectures.  (That was, for instance, how the Extension students were taking the class.)  In particular, every year some number of students seemed to want to take CS 124 and a math class that overlapped, which made it very difficult for a small number of students who wanted to pursue both CS theory and mathematics.  (Both classes were "gateways" to more advanced classes;  at least for me, changing my class time would have just introduced other more challenging time conflicts;  neither class appeared poised to change times.)  But other class overlaps came up as well.

I initially faced some opposition from the Ad Board when I supported the students' petitions for simultaneous enrollment because I suggested the recorded lectures were a suitable proxy.  They objected, and I objected to their objection.  After some back and forth, we found an arrangement that we agreed met the language of the faculty rules, with the outcome being that the students could, in the end, rely on the recordings of class lectures.  I admit, I believe it helped that I had served on the Ad Board and had a good working relationship with them, so they were willing to work with me to come to a mutually acceptable solution.  The framework we established was later used for CS 50 and other computer science classes, because I passed on my successful approach with the Ad Board to others.  (David Malan asked me for assistance on the issue, for example.)  There was, however, always some tension, with the Ad Board continually questioning whether using class recordings to justify simultaneous enrollment was suitable.  Meanwhile, the number of such petitions kept growing.

Apparently, while I was sabbaticalling, various powers that be acted to tighten the rules for simultaneous enrollment and disallow my previous framework.  Interestingly, CS 50 -- the class with the most requests for simultaneous enrollment -- seems to have acquired a blanket exception, which I am happy about, but which still leaves the rest of us facing unhappy students who often have to deal with ugly situations.

What I find frustrating is that, to my knowledge, this change is not based on any data about student performance, or any understanding of student behavior.  It's based on a presumption that students are supposed to sit in classes to learn.  David Malan has some great data showing this presumption doesn't seem to have a basis in reality.  Students taking CS 50 under simultaneous enrollment don't do any worse than other students.  In fact, if I recall the data correctly, they do better than average;  this would not be surprising, as they are a self-selected group with apparently high interest in the subject.  David also has great data showing how lecture attendance steadily drops over the semester.  If a substantial fraction of the students are missing class and watching the video anyway, what's the point of blocking simultaneous enrollment?  Maybe it was because of all this data at David's command that CS 50 got its exemption, but I'm sure if we went back we'd find similar results for CS 124 and other classes.

Professors have always had the right to refuse simultaneous enrollment petitions for their class, so if the professor expects student attendance, there is no problem.  I don't think it's a radical suggestion to say that for recorded classes, professors and students should have the ability to jointly decide if simultaneous enrollment is a suitable solution to the small number of inevitable class scheduling conflicts.  I'm not sure what combination of faculty and administrative busybodies decided they had to fix what I see as a non-problem, but I'm sure their rule-making (or in this case rule-interpreting), while it only affects a small number of students, will be a significant net negative.  Kudos to the Crimson for pointing out that this is a poor decision, and I only wish I were optimistic that it could be quickly reversed.

Thursday, September 11, 2014

Biggering

Preliminary course enrollment numbers are in.  Our new theory course, CS 125 (a rapid introduction to theory), has 32 undergrads and 5 others, for an enrollment of 37.  Salil and I had predicted (and desired) something in the 20-40 range for the first year of this course, so we're on target.  We hadn't thought about CS 125 being a class for grad students, but it actually makes sense.  Entering CS grad students who haven't had background in theory, or grad students from other related areas (statistics, mathematics, etc.) who want a fast-paced introduction to the basics of CS theory, would both fit in well in this class.  Maybe next year we'll even advertise it to grad students as well.

Our mainstream required-for-majors theory course, CS 121 (the "here's Turing machines and undecidability and NP-completeness" course), has an enrollment of just over 150, the largest it's ever been.  Given that CS 125 was supposed to draw some students from 121, that's even more amazing.  

And the standard CS introductory course, CS 50, has over 800 undergraduates (and over 850 total) signed up, making it now the largest class at Harvard.  This was so noteworthy that the Crimson had an article this morning about it.  As usual, you can find a great quote from Harry Lewis: 
“Harvard students are smart people,” said Harry R. Lewis ’68, former dean of the College and current director of undergraduate studies for Computer Science. “They have figured out that in pretty much every area of study, computational methods and computational thinking are going to be important to the future.”
So the growth continues, at least for another year.

Wednesday, September 10, 2014

You Can Rename My Family for.....

An interesting article in today's Crimson about the background to the announcement that Harvard's School of Public Health will be re-named the Harvard T. H. Chan School of Public Health, for a $350 million donation.  Feel free to comment on your feelings regarding the issue.  (The Kennedy School of Government notwithstanding, Harvard schools have not been "named" via donations historically, although certainly buildings and such have.)  I'm not publicly taking sides -- indeed, it seems like a wonderful debate challenge to figure out arguments on both sides of the issue as to whether naming schools in this way is a good or bad idea.  Even if you think it's a good idea, you might wonder about the price tag...

In particular, I can't help but wonder what the naming rights of the School of Engineering and Applied Sciences could go for?  In the course of our capital campaign, will I have the chance to find out?  Maybe we could get a bidding war going?  How much would SEAS have to get for me to feel good about having "Harvard's XXXX School of Engineering and Applied Sciences" on my letterhead for some appropriate name XXXX? 

At a baser level, Mitzenmacher has always been a problematic name.  Too long.  (On so many standardized forms, I end up as Michae Mitzenmache....)  Maybe I should offer myself up for naming rights.  Sadly, I don't think I'd get $350 million.  

Sunday, September 07, 2014

Teaching Sorting

For many years now, while teaching the introductory Algorithms and Data Structures course, I have told students that we won't be starting with (or really covering) sorting and searching, because it's boring.  While that description is an exaggeration (the students see mergesort [an example of divide and conquer] and heapsort [heaps are important as an implementation of priority queues for Dijkstra's algorithm]), I've never thought that starting this class with a long unit on sorting/searching algorithms, which most textbooks like to start with, is the best use of time.

So it was a bit odd to find myself for our new "honors course", CS 125, spending the first day-and-a-half or so on sorting algorithms.  The key was that there was the right motivation for it.  We'll be tackling both algorithms/data structures and complexity, and the theme of how to model computation will play a big role throughout the course.  When we thought about what was the right way to introduce the class, and in particular this theme, sorting seemed like the right choice.

After very quickly refreshing the students on some basic comparison sorts (bubblesort and mergesort), we present the standard lower bound argument for comparison-based sorts.  But then we show how this lower bound can be broken, by assumptions about the data that allow one to go beyond comparisons to other operations.  Specifically, we raced through counting sort, bucket sort (for random data), and sorting via van Emde Boas trees.  (The students in particular seemed to be talking through the details of van Emde Boas trees at the class break, which I'd like to think was because they were new and interesting, but possibly was due to the fact that it was the first time I'd ever presented them, and maybe my delivery for that topic needs some tuning.)
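For those who haven't seen it since their own intro course, here's a minimal counting sort sketch (assuming keys are small non-negative integers), showing how the comparison lower bound gets sidestepped:

    def counting_sort(a, k):
        # assumes every key is an integer in 0..k; no comparisons between keys
        counts = [0] * (k + 1)
        for x in a:
            counts[x] += 1
        out = []
        for value, c in enumerate(counts):
            out.extend([value] * c)
        return out          # O(n + k) time overall

    # e.g., counting_sort([3, 1, 4, 1, 5], 5) -> [1, 1, 3, 4, 5]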

Lo and behold, I find myself finding sorting interesting!  We didn't even get to more advanced interesting possibilities, like data-oblivious sorting or "bleeding edge" parallel sorting.  I feel I need to apologize to sorting, for mistakenly suggesting that it's a boring topic.  I still wouldn't want to start an algorithms/data structures class with multiple weeks of sorting and searching algorithms, but sorting is a natural way to introduce some key concepts in algorithms, and there's fun to be had in the large variety of approaches to sorting.

Thursday, August 28, 2014

Shout-out to Sabeti Lab

A shout-out today to my friend and colleague Pardis Sabeti (and her lab) for their Science article on the Ebola virus that appeared earlier today.  Pardis and her group have been studying the genetics of viral diseases, and in particular the Lassa virus.  So they were there and ready when the recent Ebola outbreak began, and they went to work.  They sequenced 99 Ebola virus genomes from 78 patients, and have analyzed the resulting information to gain insight into how the disease is spreading and mutating.  They have released the genetic sequences to the public so that the information is available to groups who could use the sequencing to help find and test cures.  This is both very important work, and (sadly) very serious.  As reported, 5 co-authors of the study (health care workers, a lab technician, and the director of the national Lassa fever program in Sierra Leone) died of Ebola before today's publication.

Numerous news articles appeared today discussing this work, too many for me to link to.  But the SabetiLab page here contains a great deal of information about her research and various projects.  Her work in biology, which makes powerful use of statistics and computation, should serve as an inspiration for future scientists, and for people who want to use algorithmic and mathematical methods in ways that benefit the world.