Tuesday, March 03, 2009

Committee Games, Playing by the Rules, and Double-Blind

There's a phenomenon I've noticed in committee meetings. To avoid any issues related to discussing PCs, I'll discuss it in relation to a thesis prize committee I've served on multiple times at Harvard. It's very much like a PC -- usually about 1/3 of the theses nominated receive the prize.

We're given a bunch of senior theses to read, and asked to score them on a 1-5 scale, with 3 being the nominal desired average. Obviously, it's very hard to give any of these theses a 1 or even a 2 -- faculty are generally good about being selective about nominating a thesis for this prize -- but if you want to give a 3 average, you have to make tough decisions.

At the meeting, your average on your theses is compared to the average of the other reviewers on your theses, and your scores are re-scaled accordingly. So if you worked to maintain a 3 average, but the other reviewers on average gave your theses a 3.5, those theses you read would get a higher "scaled score". Similarly, if you gave your theses a 4 average, they would get a lower scaled score.
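As a rough illustration of this kind of re-scaling, here is a minimal sketch. The committee's actual formula isn't spelled out here, so a simple additive shift is assumed, and the function name `rescale` is made up for the example.

```python
from statistics import mean

def rescale(own_scores, others_scores):
    """Shift one reviewer's scores by the gap between the other reviewers'
    average and the reviewer's own average on the same theses.

    Assumption: an additive shift; the real committee rule may differ.
    """
    shift = mean(others_scores) - mean(own_scores)
    return [s + shift for s in own_scores]

# A reviewer who held to a 3 average while others averaged 3.5 is shifted up.
tough = rescale([2, 3, 4], [3, 3.5, 4])
# A reviewer who gave a 4 average on the same theses is shifted down.
lenient = rescale([3, 4, 5], [3, 3.5, 4])
print(tough, lenient)
```

Under this assumed rule both reviewers end up with the same scaled scores, since only the gap between the averages matters.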

Now, even with the scaled scores, I find in these meetings that when someone has given a much higher average than their peers, their scores are discounted even further. Perhaps they couldn't manage to be tough and give a low score to some theses, and they thought they were being nice by giving high scores (which the students never see) to everyone, but in the end it often makes things worse for the theses they've read. There seems to be a higher-level phenomenon going on here, where committee members who don't play by the rules have their theses suffer some extra punishment. Unconsciously or consciously, the mindset seems to be that if you couldn't be bothered to make the tough decisions before the meeting, why should your judgment be trusted as much as others who did?

There can also be a similar effect, where those who give scores close to a 3 average but find their theses overall have a much higher average appear to have their theses get a slight bump even beyond the scaled score. The theses they read are sometimes rewarded for their playing by the rules.

[I should point out that I have no documented evidence of this effect. I could be imagining it all. I'm just relaying my perceived experience.]

I was originally going to bring this up as a point related to the 1-5 scale I used for STOC. The scale, I think, is naturally self-reinforcing; people who clearly go outside the rules (for whatever reason) by giving higher scores are easily seen, and unless evidence from other reviewers supports that they had exceptionally good papers, their papers will have no benefit -- in fact, they may even suffer a little.

But this also seems relevant to the discussion of what would happen if double-blind reviewing was introduced. People keep pointing out that, assuming papers were posted somewhere, double-blind reviewing wouldn't help anything. But I think it still could. PC members who tried to discuss papers making use of the authors' names would not be playing by the rules; my guess is there would be sufficient reaction from other PC members to squelch such behavior. Indeed, it may even be enough to discourage at least some people from going to look at who wrote the paper. While I'd understand if a PC member happened to know who wrote some of the papers based on their own personal knowledge, I admit I'd be troubled by a PC member who, knowing the rules were for a double-blind PC, felt the need to go outside the rules and find the authors for all of their papers. Would that cause me (and others) to discount their scores, to make up for the implicit bias they were bringing into the system? It would be interesting to find out.

[Side note: I'm greatly enjoying the debates on double-blind reviewing. And honestly, I can't say I think I know what the best answers to the underlying questions are. But I think it's great that the community is having an open, challenging discussion about conference reviewing processes; that's one of the things these blogs are for.]

17 comments:

Anonymous said...

Come on, Michael! Do you mean to say that 7 months after Irit Dinur had posted her PCP proof on ECCC and it had been widely discussed and disseminated, if DBR had been in place then the STOC 2006 PC should have gone through a charade and pretended that they were looking at an anonymous submission from an unknown author or authors?

Anonymous said...

I think that knowing an author's name (and thus track record) provides a *hint* as to how a paper will stack up. It does introduce bias, but I don't think the bias is horribly unfair. Also, the bias should be self-correcting in the long run. It provides a sanity check making the process slightly easier for the lazy/unconfident reviewer. Another problem is that people will guess at the author; it's our nature. No matter one's opinion on the subject, circumventing the agreed upon rules is bad.

The discussion seems to be focused on how this affects the careers of researchers. We must always consider how this affects the science. Does it help or hurt quality? Does it foster or hinder openness and the rapid/wide dissemination of work? Does it matter at all?

Posting Anonymously, hehe.

Anonymous said...

"Come on, Michael! Do you mean to say that 7 months after Irit Dinur had posted her PCP proof on ECCC and it had been widely discussed and disseminated, if DBR had been in place then the STOC 2006 PC should have gone through a charade and pretended that they were looking at an anonymous submission from an unknown author or authors?"

No one is suggesting any charades but you. In the DBR program committees that I know of, it is ok if you happen to know who is the author of the paper. The point is you are not supposed to go out of your way to find out who wrote the paper, so the vast majority of the papers are refereed under blind review.

As well, the PC chair usually peeks into the identity of the authors during the assignment process. The PC chair is also allowed to break the lock at any time during the discussion process at his/her own discretion if there are questions of plagiarism/authorship/duplicated submission, etc. Such power is assumed to be used judiciously and sparingly.

Matt Welsh said...

I've served on many double-blind PCs, and have never seen a situation where a PC member explicitly referred to the authors' identities during the discussion of a paper. Even if "everybody knows" who the authors are, there is a real expectation that everyone will play by the rules, and it generally does work that way.

Frankly I think double-blind reviewing works just fine, and any perceived problems (e.g., that it hinders the ability to review work in context) are, in my opinion, a lot less problematic than the bias that arises when author names are public.

Alice said...

I'm curious about the senior thesis committee process. I assume it's interdisciplinary and double-blind? So the readers on the committee don't know the name of the thesis writer or writer's adviser.

Presumably this means that the readers from other departments are assigned to read the thesis.

I see that the math department has posted senior theses on their website:

http://www.math.harvard.edu/theses/index.html

How many people outside the math department can actually make much sense of the senior theses written in that department?

As a former math major from a liberal arts college who went on to get a PhD from Harvard in a math related field, I am reasonably numerate and have a high tolerance for reading papers full of math-dense notation in my own field.

I look at those math senior theses, however, and my eyes glaze over pretty quickly. I certainly wouldn't feel competent to judge them. Who outside the math department does?

Anonymous said...

Why does Irit Dinur's name keep coming up during the discussion of DBR? Is he (or she?) the same person who was mentioned here?
http://cr.yp.to/cubeattacks.html
If what Bernstein said is correct, that means that DBR is working in Asiacrypt, right? The paper was wrongly accepted in Eurocrypt simply because it was not blind anymore.
But again, I am not working in the area so I don't know who is right here.

Anonymous said...

"Do you mean to say that 7 months after Irit Dinur had posted her PCP proof on ECCC and it had been widely discussed and disseminated, if DBR had been in place then the STOC 2006 PC should have gone through a charade and pretended that they were looking at an anonymous submission from an unknown author or authors?"

Why did Irit Dinur even submit it to STOC? It could have been an invited talk and gone straight to JACM. That is what happened to Primes is in P. To me, there was no reason for that paper to be in STOC. Originally, I guess these conferences were organized to be forums to disseminate ideas, but it now goes to show how much (unnecessary) prestige is based on them if a paper like that, which needed no more publicity, was even submitted.

rrenaud said...

I am not an academic, but have you considered trying to write a name stripping proxy or browser plugin? Unintentional author leakage is mostly a technical problem. How hard could it be to solve? Would this take one dedicated grad student a month of work to get something that filtered out 90% of names in papers/search results? Would this enable a reviewer who wanted to be unbiased to mostly accomplish this?

11011110 said...

In all this discussion of double-blind reviewing, there's another effect I haven't seen much discussion of: with open reviewing, if one author or small group of authors submits three or four papers on closely related topics, those papers are at a disadvantage: it is likely that the committee will discuss them as a group and only include their favorites. If several disjoint sets of authors submit a similar closely related set of papers, however, they are at an advantage: it makes the committee feel that the area is more interesting, if all of these people are writing about it, and the committee may accept some weaker papers in order to prevent their authors from being unfairly scooped.

Therefore, in a double-blind system, it seems likely that there would be two effects that increase rather than decrease the feelings of submitters that the process is unfair:
(1) a relatively larger number of authors with multiple acceptances, and
(2) a relatively larger number of cases in which one paper on topic X is accepted and another by a different author is rejected, giving the first author unfair precedence for what should really count as an independent discovery.

Warren said...

11011110 11:28: one of the PC members, e.g. the chair, should know the author names and hence be able to answer PC member questions such as:
* Is this plagiarized off X?
* Is this cluster of papers by independent authors?

Stefan Savage said...

On the issue of double blind reviewing (which, given the amount of controversy, appears to be the substantive social issue of the day) I should point out that many communities support both approaches. Systems for example (SOSP is double blind, while OSDI is not) and security (Oakland is double blind, USENIX Security is not). While this could be taken to mean that there is no advantage to double-blind (unless one wanted to make a claim that one set of conferences publishes papers superior to the other, which I wouldn't do) I think this isn't the right conclusion to draw. Rather, what it demonstrates is that there are strongly held feelings on both sides within both communities and that it's possible for both approaches to co-exist.

Myself, I think the value of blinding is asymmetric. While it is certainly the case that I've seen benefit of the doubt given to well-known authors, this is the sin I can swallow hard and accept as one of life's imperfections (and indeed, I've seen the bias go strongly in the other direction as well). However, it's far worse when a comparatively unknown author's work is discounted because no one knows them or their institution; in my experience this negative bias can be very strong. I think this is the more significant danger -- particularly for papers that are not in the "clear admit" category but are in the grey "maybe" area -- since I think we have a stronger interest in expanding our communities. For this reason, I've always thought there was room for an "optional blind" approach in which authors choose their status (and I think plenty of well-known authors would choose to be blind for a variety of reasons; I certainly would).

Finally, as an aside, I think that one issue that makes TCS somewhat different from the communities I participate in is the extensive preprint culture. Systems, networking and the non-crypto side of security generally don't have this (indeed, many authors hoard their results before publication). Consequently, most reviewers have not seen any of the papers before. Moreover, I think it's actually much harder to infer paper provenance than people believe (absent substantive efforts to deanonymize). Sure there are particular well-known "tells" (a classic example is in wireless networking where everyone in the field can identify the author's school based on the floorplan pictures that appear in the papers), but in general I find that I have no idea or am wrong about the authorship of at least 2/3 of the papers that turn out to be written by authors who I know of (I could be uniquely bad at this, but I suspect not).

Stefan Savage said...

On the topic of scoring (which this entry started with), I don't believe there is any ideal way to do this. I've used linear scores, I've used telescoping order statistic scores (i.e., the 5 point scale Michael talks about) -- with and without distribution enforcement, and I've used non-numeric scores... and they're all problematic.

Let's consider the order statistic scale mentioned before (e.g., bottom 50%, top 50% but not top 25%, top 25% but not top 10%, etc). In principle, this provides a way for people to norm themselves and make review scores comparable. In practice I find this doesn't work, for a few reasons. First, some people just refuse to abide by the semantics of the scale because they are either overly negative, overly positive or overly polite... so a particular PC member may have 2/3 of their papers in the bottom 50% category, 20% in the top 5% category or (more commonly) 50% of their papers in the middle category (which should only hold 15% of the distribution). If you put such people together with others who treat the scoring scale as canon, it increases variability. So what to do? Well, you can try to enforce that each PC member rescore their reviews until they match the expected distribution. However, this too has problems. In particular, some topic areas simply have more lower quality submissions than others (perhaps because the topic is overpopulated or underpopulated). Consequently, an area expert in such a field will in fact have a larger number of "worse" papers to review. If we effectively normalize their scores by forcing them to abide by a uniform distribution assumption then these papers will get scores that are inflated. At the last SIGCOMM we tried to out-smart this problem... we implemented a bias correction factor -- one based on doing a pairwise linear regression (assuming a linear scale) and another a modified version assuming the order-statistic scale. Thus, if you consistently gave lower scores to papers that others scored higher, this would tend to give you a positive bias. Well... it turns out this works ok at best.
Among the many problems are a) that people have non-linear biases, b) that co-variance between reviewers in particular areas can be significant, c) that there just aren't enough sample points to have that much confidence in the results (either per reviewer or per paper). Frankly, I think this last issue is the tough one and is hard to overcome... This year SOSP is experimenting with a rank-order system -- maybe that will help. However, I tend to think that we're at the limit of the SNR for this process. The top papers always tend to get accepted (with counter-examples to prove the rule :-), the terrible papers tend to not get accepted, and then we participate in some non-deterministic process around the middle set. Call me a cynic, but I don't foresee a silver bullet to get us past this :-)
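To make the regression-based correction concrete, here is a minimal sketch of one possible version. It is not the SIGCOMM implementation, whose details aren't given here. It assumes a linear scale, fits each reviewer's scores against the average score the other reviewers gave the same papers, and uses the fitted line as the corrected score. The names `fit_line` and `correct` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Assumes the reviewer did not give every paper the same score (sxx > 0).
    slope = sxy / sxx
    return slope, my - slope * mx

def correct(own, others_avg):
    """Map a reviewer's scores onto the scale the other reviewers used.

    own[i] is this reviewer's score for paper i; others_avg[i] is the
    average of everyone else's scores for the same paper.
    """
    slope, intercept = fit_line(own, others_avg)
    return [slope * s + intercept for s in own]

# A reviewer who is consistently one point harsher gets shifted up by one.
print(correct([1, 2, 3], [2, 3, 4]))
```

With only a handful of papers per reviewer the fitted line is itself noisy, which is the sample-size problem in point (c) above.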

Michael Mitzenmacher said...

Stefan --

I share your frustration, though it seems the problems you outline are more a problem of individuals behaving badly (the original topic of the post) than of the scoring system itself. I agree there can be small problems in that different areas can have different quality -- but experienced reviewers can shade their scores up or down SLIGHTLY accordingly -- being prepared to justify their deviation -- and inexperienced reviewers are probably best off following the percentages closely.

I find this small problem is dwarfed by having even one (or more) reviewer(s) that take it upon themselves to ignore the quite clear guidelines and, as you say, put 50% of their papers in the top 25% of the scores. (Just so you know, oh guilty PC members, while I may be polite, I'm thinking unpleasant thoughts...) But at least this scoring system makes it clear (to the chairs, and to anyone who looks carefully) what's going on.

I've found using the 1-10 system in times past that people don't even have a clear agreement on the meanings of the numbers; and of course the same anti-social behavior of off-average scoring from a PC member can cause the same problems (and be harder to see clearly).

In short, I agree with the problem you've described; I just think the solution has to be encouragement to get people to go by the rules. I personally find the 5-point scale we've been talking about an easier rule system to deal with.

I do like the idea of a rank-order based system instead of a scoring system. Though I'd be fearful of being the one trying to implement it for the first time...

Paul Beame said...

Stefan and Michael's comments give me a natural place to express my personal PC scoring system views:

My belief is that a scoring system should be a way for PC members to express their planned behavior with regard to a paper. Any scoring system that makes distinctions less fine than this is too coarse and any system that makes more distinctions than this is too fine.

My experience is that there are really 7 different categories of behavior which I will list by letter:

Two versions of definite accept:

(a) A certain accept, and one that I will argue deserves consideration for a special issue or best paper award.

(b) A paper that I will argue should definitely be accepted but not one I will argue should be part of a special issue.

Three versions of borderline, all of which acknowledge that the paper deserves serious debate:

(c) I am unsure but am leaning towards acceptance and would bias my vote that way. I could see that others might be negative and their unfavorable scores (e)-(g) could convince me otherwise, however.

(d) I really am ambivalent. I can be convinced either way.

(e) I am unsure but am leaning towards rejection and would bias my vote that way. I could see that others might be favorable and their (a)-(c) scores could convince me otherwise, however.

Two different versions of rejection:

(f) I will argue for rejection since I believe that this is a certain reject. I am quite confident in that view.

(g) This is not only a certain reject that I will argue against; I think that this isn't even remotely close to acceptance. I will be vehement in my opposition to this paper.
-----------------

On the recent STOC PC with Michael's 5-point scale, the scores of 5, 4, 2, and 1 effectively played the roles of (a), (b), (f), and (g). However, the score of 3 encompassed all of (c), (d), and (e), which meant that there were very large numbers of papers that had identical initial scores despite very different committee opinions in reality.

The main value in the scores is in ordering the consideration of papers and the problem with ignoring the (c),(d),(e) distinctions is that we ended up considering roughly 40 papers with the only order being their submission number; many were out of order with respect to actual committee opinions which made the initial scores much less useful.

There is a second question of how you map the scores above to numerical values. A linear sequencing like 1-7 or -3 to 3 would work OK (though I observed that with Easychair, which uses the -3 to 3 scale, people are very reluctant to give negative scores for borderline papers because of their pejorative nature). In this case the systems that record controversy by absolute score difference might record a (g) vs a (d) as the same as an (f) vs a (c), which is not the same kind of controversy. Alternatively one could do a non-linear map of scores to numbers such as:
1,3,4,5,6,7,9 which might make it easier to detect the level of controversy with such a linear score difference being used to measure controversy. I am not sure if this makes enough of a difference to care.
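A small sketch of the difference between the two mappings, assuming (g) maps to the lowest number and (a) to the highest, and taking "controversy" to be the largest absolute difference among a paper's scores. The orientation and the helper name `controversy` are assumptions made for illustration.

```python
LETTERS = "gfedcba"  # assumed order: (g) worst ... (a) best

LINEAR = dict(zip(LETTERS, [1, 2, 3, 4, 5, 6, 7]))
NONLINEAR = dict(zip(LETTERS, [1, 3, 4, 5, 6, 7, 9]))

def controversy(scores, mapping):
    """Largest absolute gap between any two scores given to one paper."""
    values = [mapping[s] for s in scores]
    return max(values) - min(values)

# On the linear scale, (g) vs (d) and (f) vs (c) look equally controversial.
print(controversy(["g", "d"], LINEAR), controversy(["f", "c"], LINEAR))
# The non-linear map stretches the extremes, so the vehement (g) stands out.
print(controversy(["g", "d"], NONLINEAR), controversy(["f", "c"], NONLINEAR))
```

The first line prints equal gaps for the two pairs; the second prints a larger gap for (g) vs (d) than for (f) vs (c).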

Finally there is the subject of uniformity in PC member scoring:

With the old SIGACT software there was the notion of Normalized Scores. This normalization not only adjusted them based on averages but on standard deviations - some people are more extreme in their distinctions than others as well as more positive or negative. It was a way to see individual PC member tendencies but it wasn't always indicative of relative opinions of papers and I didn't think it was better than weighted average for order of paper consideration.
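For readers who haven't seen this kind of normalization, here is a minimal sketch of adjusting one PC member's scores by both average and standard deviation. It is a standard z-score, assumed here for illustration; the exact formula the old SIGACT software used is not given, and the function name `normalize` is made up.

```python
from statistics import mean, pstdev

def normalize(scores):
    """Center one PC member's scores on their own mean and divide by their
    own standard deviation, so harsh/lenient and timid/extreme scorers
    land on a common scale."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # Every paper got the same score: no spread to normalize.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]

# A timid scorer (2, 3, 4) and an extreme one (1, 3, 5) with the same
# ranking of papers end up with the same normalized scores.
print(normalize([2, 3, 4]))
print(normalize([1, 3, 5]))
```

Note that this only compares a PC member against their own pile of papers, so it cannot tell a lenient reviewer from one who simply drew stronger papers.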

Despite Michael's implication that scores outside the rank norms correspond to PC members behaving badly, one should not assume that the fact that a PC member's scores have higher or lower averages than others means that they are out of bounds. Progress in science is not uniform. Some people will just see stronger papers than others. They should be ready to argue why this is the case, but the assumption should not be that they have done something wrong. The case that most requires readjustment of scores is when someone's scores are systematically higher or lower than those of the other PC members assigned to the same papers.

Stefan Savage said...

I also like scoring that is qualitative as Paul describes. In the most rarefied form this comes down to: "I will champion the paper at the PC meeting", "I will oppose the paper at the PC meeting" and "I'm ambivalent". Since these three states correspond to declarations of actions they tend to be pretty unambiguous. The problem with this system is that some reviewers put everything in the middle category (this is somewhat mitigated by having three levels as in Paul's suggestion -- which is like the natural harmonic of the three state system for timid reviewers).

The problem with all such systems is that the richness of each reviewer's thought process is completely lost when you quantify scores and then average them together (regardless of how you normalize them). The greatest source of noise is the "average overall quality" that is inevitably the strongest component in deciding which papers get discussed at the PC meeting (or which papers get accepted for reviewing systems that don't have interactive PC meetings). I've been on PCs where only papers for which there is a specific champion will be discussed and I like this approach in principle. However, one foible in these cases is that strong papers that are outside the area of reviewers may not receive a champion and can fall through the cracks. I agree with Paul that different PC members perceive the same papers differently and this is, in part, because there is no Platonic ordering on paper quality regardless of our attempts to infer one.

Finally, as a practical matter, I continue to maintain that for a given PC, regardless of reviewing system, generally the same set of top papers will be accepted, the same set of bottom papers will be clearly rejected and that the subset of the acceptable papers that is taken will not be discernibly different in quality to a set of readers outside the process. Thus I think the most important thing is to make sure you have a process that doesn't accidentally reject the top papers or accept those that very clearly need a lot more work.

Boaz said...

I thought I'd put in my 2 cents on anonymous submissions / double-blind reviews. I've been in several PCs with both blind and non-blind reviews. I don't think it makes a huge difference, but overall I prefer non-anonymous submissions.

Some technical reasons:

1) With anonymous submissions, it often happened that I asked the author to review their own paper. It's not a big problem, but it is annoying. The only way you can avoid this is to guess the author, which sort of defeats the purpose, and/or to avoid asking the #1 expert to review the paper.

2) I usually like to contact the author if there's a point in the proof I don't understand. Even if I can work it out myself eventually, it's easier to get an answer from the author. Communicating with authors is more cumbersome (though not impossible) with anonymous submissions.

3) When you submit an anonymous submission, it's sometimes cumbersome to write it when referring to your own previous work. You may have to remove discussion that is beneficial to the reader, but gives away that you are the author of that previous work.

A more fundamental reason is that I don't think the central goal of a conference is to rank the papers and accept the 60 "best ones". A conference has several goals, but perhaps the most important is to serve as a "newspaper" for the community, and advance research by putting a spotlight on certain papers.

Sometimes, the community would be better benefitted by paying attention to a paper putting forward an interesting conjecture (e.g., unique games) than a paper solving an open problem. It doesn't mean that the first paper is "better" than the second one.

For this reason I see fairness as an important issue but not the top priority in selecting a program. This is not a sports competition. (I don't ignore the fact that a conference is also about giving honor and credit to deserving work, I just don't think it's the absolute top goal.)

To me the question is whether knowing the authors' names can make a better program. It's possible that given the very tight time constraints a PC has, knowing the authors allows them to save time, and eventually make better decisions.

Of course probably the only way people save time is by having biases - things such as "author is an expert on this topic and we can trust his word on prior work", "author is known to be very careful with proofs", "author can't write an introduction and his papers are often hard to read but great". There could be of course negative biases as well.

My gut feeling is that using such biases may make the program somewhat less "fair" for the authors, but better for the community as a whole. I think it's a tradeoff worth making.

Boaz Barak

Michael Mitzenmacher said...

Some comments:

I have no significant objection to Paul's scoring scheme; I think it's quite similar to the 5-point scheme I used, and as Stefan points out, it's really a 3-point scheme with some finer distinctions in the middle layer. My guess is it would work very similarly to what I used; those 3 scores in the middle are fine distinctions and (in my experience) PC members often change their mind about them at some point in the process.

My main beef was with the historical 10 point scale, which has too many categories and no consistent meaning to the numbers. If people can be consistent with that 7 point scale, sounds fine to me.

With regard to another of Paul's points, working in probability I'm aware of correlations and the laws of small numbers -- I'd agree that a PC member deviating from the average doesn't have to be a sign of bad behavior. My experience, however, is that usually it is. This is readily apparent when comparing their scores to other scores on the same papers. It also manifests itself with reviews that read, "I think this paper is barely acceptable, so I gave it a score of 4." (On my 5 point scale.) Consistent disconnects of this type happen more frequently than I would have thought, which was the point of my original post.

Also, I agree with Stefan's main points: it's a noisy process, and most systems will do a reasonable job on the top and bottom papers, with challenges in the middle grey area.

Also,