Friday, May 15, 2015

New Work on Equitability and MIC

A few years ago, working with a diverse group of scientists, I was looking at a problem related to big data analysis.  The setting was data exploration, where you are working with a high-dimensional dataset, and you are seeking pairs of dimensions that are related in interesting but a priori unknown ways;  that is, maybe there's a linear relationship, but maybe there's a more complicated relationship (sinusoidal, parabolic, etc.), and you don't know what you're looking for in advance.
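To make the setting concrete, here's a minimal sketch of that kind of pairwise scan.  Everything here is illustrative: the `score` function is a stand-in (just the absolute Pearson correlation), and the data is synthetic, with one nonlinear relationship planted in it.  Notably, the linear-correlation stand-in will score the planted sinusoid poorly -- exactly the failure that motivates looking for better scoring mechanisms.

```python
import itertools
import numpy as np

def score(x, y):
    # Placeholder dependence score; swap in MIC or another measure here.
    return abs(np.corrcoef(x, y)[0, 1])

def rank_pairs(data):
    """Score every pair of columns and return them sorted, strongest first."""
    n_cols = data.shape[1]
    scored = [((i, j), score(data[:, i], data[:, j]))
              for i, j in itertools.combinations(range(n_cols), 2)]
    return sorted(scored, key=lambda t: t[1], reverse=True)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 6))
data[:, 1] = np.sin(3 * data[:, 0]) + 0.1 * rng.normal(size=500)  # planted sinusoid
for (i, j), s in rank_pairs(data)[:3]:
    print(f"columns ({i}, {j}): score = {s:.3f}")
```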

One statistical approach is to throw a so-called measure of dependence at the problem, which will tell you whether variables are related or unrelated.  But in big data sets, you may expect to have lots of dependent pairs;  what you really want is not just a "dependent-independent" test, but a scoring mechanism that ranks the extent of dependence appropriately over a wide range of possible relationships.  This need seemed underserved by the literature.  We dubbed what we were aiming for "equitability", and designed a statistic called MIC (the maximal information coefficient), a "bucketed" version of mutual information that seemed to be a good ranking mechanism for equitability.  The work appeared in Science some years back (the link here will allow you to access it). 
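For intuition, here's a rough sketch of the "bucketed mutual information" idea behind MIC: discretize the data on a grid, compute the mutual information of the induced distribution, normalize, and take the maximum over grid sizes.  To be clear, this is a simplification for exposition -- it uses equal-frequency bins, whereas the actual algorithm searches over grid boundaries -- but the overall structure (maximize normalized mutual information over grids with at most roughly n^0.6 cells) matches the definition from the Science paper.

```python
import numpy as np

def binned_mutual_info(x, y, nx, ny):
    """Mutual information (bits) of x, y discretized on an nx-by-ny grid."""
    x_bins = np.quantile(x, np.linspace(0, 1, nx + 1))
    y_bins = np.quantile(y, np.linspace(0, 1, ny + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

def mic_sketch(x, y, max_cells=None):
    """Max over small grids of normalized mutual information."""
    n = len(x)
    if max_cells is None:
        max_cells = int(n ** 0.6)  # the paper's default grid budget, B(n) = n^0.6
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            mi = binned_mutual_info(x, y, nx, ny)
            best = max(best, mi / np.log2(min(nx, ny)))
    return best

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 300)
print(mic_sketch(x, x ** 2 + 0.05 * rng.normal(size=300)))  # parabola: high score
print(mic_sketch(x, rng.normal(size=300)))  # pure noise: much lower (but biased above 0)
```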

The work has, I think, been quite successful -- a number of people have used MIC in their research to analyze data sets, and the idea of equitability seems to be catching on.  But after it came out, there were some questions and issues raised, primarily by statisticians.  We had always planned to go back and revisit some of these issues, and, after various delays caused by life, a group of us started back on it again.  The project seemed to balloon, and we've been working on and off on it for at least a year now.  (Some earlier drafts pointing to where we were going are on the arXiv.)   Finally, we've reached a stopping point, and we've just put up three papers on the arXiv, all of which are now being submitted to various journals.  In this post, I'll outline what the papers cover, after the links.
  1. Equitability, interval estimation, and statistical power. arXiv

  2. Measuring dependence powerfully and equitably. arXiv

  3. An Empirical Study of Leading Measures of Dependence. arXiv 
Really, there seemed to be three issues that people wanted to hear more about after the initial work, and each ended up being "large" enough to merit its own paper.

First, people wanted further theoretical foundations for equitability.  Our Science paper, being for a general audience, provided an intuitive definition, but we didn't define it as one would in a mathematics paper.  (That was, I should be clear, intentional.)  We thought our definition was pretty clear in plain English, but for some it was not sufficient;  one group, in particular, seemed to ignore our explanations, adding restrictions to "equitability" in their own work and incorrectly ascribing them to us. 

So the first paper formalizes this notion of equitability.  As is usually the case, formalizing it turns out to be helpful.  In particular, it shows that there are (at least) two natural ways of thinking about equitability.  One way formalizes our original conception: an equitable statistic is a statistic that gives similar scores to relationships with the same amount of noise, even if the relationships are of different types.  The other shows that equitability can naturally be seen as an extension of statistical power against a null hypothesis of independence (i.e., relationship strength = 0) to power against null hypotheses representing all levels of dependence (i.e., relationship strength <= x, for any x).  This view clarifies how the two notions relate: equitability generalizes power against independence.  Based on this view, our new papers include some really nice "heat-map" style visualizations of equitability performance that I think are very useful for thinking about and understanding equitability.  
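To give a flavor of the experiments behind such visualizations, here's a sketch of the basic equitability check: generate relationships of different types at varying noise levels, measure relationship strength by the share of variance explained by the noiseless signal (an approximation to the R^2 measure used in the papers), and see whether a statistic assigns similar scores at similar strengths across types.  The `stat` here is again a stand-in for whatever measure is being evaluated.

```python
import numpy as np

def stat(x, y):
    # Stand-in statistic; in the papers this would be MIC, mutual
    # information, distance correlation, etc.
    return abs(np.corrcoef(x, y)[0, 1])

functions = {"linear": lambda x: x,
             "parabola": lambda x: x ** 2,
             "sine": lambda x: np.sin(4 * np.pi * x)}

rng = np.random.default_rng(2)
n = 500
for name, f in functions.items():
    for sigma in (0.1, 0.3, 1.0):
        x = rng.uniform(0, 1, n)
        signal = f(x)
        y = signal + sigma * rng.normal(size=n)
        # Approximate R^2: fraction of y's variance due to the noiseless signal.
        r2 = np.var(signal) / np.var(y)
        print(f"{name:9s} R^2={r2:.2f}  stat={stat(x, y):.2f}")
```

For an equitable statistic, the printed scores should line up as (roughly) the same function of R^2 for all three relationship types; for the linear-correlation stand-in, they clearly do not.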

Second, people wanted faster, better versions of our proposed scoring function, MIC.  We ended up going back to the theory of MIC, examining further its relation to mutual information, and, perhaps unsurprisingly, doing so allowed us to make tangible practical advances: we ended up with algorithms for computing MIC that are both faster and more accurate.  (Our original algorithm in the Science paper used a dynamic-programming heuristic to approximate the MIC score from the data;  the new approaches improve on it in both speed and accuracy.)  We're hopeful that people using MIC will be able to switch over time to these improved algorithms.  Connecting to the first paper, in re-examining MIC we also developed a variation of MIC (called TIC, for the total information coefficient) that is better designed for achieving power against independence as opposed to equitability.  We are hopeful that TIC may prove useful on its own, as well as in conjunction with MIC. 
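In the same simplified setting as the sketch above, the relationship between the two statistics is easy to show: both are built from the same table of normalized grid scores (the "characteristic matrix"), with MIC taking the maximum entry and TIC summing them.  Again, the equal-frequency bins below are a crude stand-in for the optimized grids of the actual algorithms.

```python
import numpy as np

def normalized_grid_score(x, y, nx, ny):
    """Normalized mutual information of x, y on an nx-by-ny equal-frequency grid."""
    x_bins = np.quantile(x, np.linspace(0, 1, nx + 1))
    y_bins = np.quantile(y, np.linspace(0, 1, ny + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])
    p = counts / counts.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    mi = float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())
    return mi / np.log2(min(nx, ny))

def mic_and_tic(x, y, max_cells=None):
    max_cells = max_cells or int(len(x) ** 0.6)
    scores = [normalized_grid_score(x, y, nx, ny)
              for nx in range(2, max_cells // 2 + 1)
              for ny in range(2, max_cells // nx + 1)]
    return max(scores), sum(scores)  # MIC takes the max; TIC sums the entries
```

Since both numbers come from one pass over the same grid scores, computing TIC alongside MIC costs essentially nothing extra -- which is part of its appeal.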

Third, people just wanted to see more comparisons.  Well, really, here I think everyone just has their own favorite measure of dependence, and before they go switching to this new-fangled thing that appeared in Science of all places, they want to see more.  Specifically, subsequent to that initial paper, some people were concerned about the power against independence of our methods -- although, again, to be clear, equitability is a larger notion than power against independence, so one might expect a method designed for equitability not to have as high power against independence as methods designed specifically for that purpose.  Also, some people claimed, based on very limited experiments, that mutual information estimation would be more equitable than MIC.

So the third paper is a large-scale empirical study.  In terms of equitability, we find that MIC still seems to be the best current measure.  (Mutual information sometimes does well -- in some very particular circumstances, it can do better than MIC, which is not especially surprising since MIC is a "scaled" version of mutual information -- but MIC still performs much better overall.)  In terms of power against independence, we find that MIC's power was underrated in previous studies.  The discrepancy seems to be that previous studies used MIC's default parameter settings, which were designed for equitability, not power against independence;  using different parameter settings yields substantially better power against independence.  (The new, more accurate algorithms from the second paper improve the power further.)  Finally, the TIC measure described in the second paper does even better: it comes essentially for free when computing MIC scores, and it appears comparable to other state-of-the-art measures in terms of power against independence.  We also find that the new ways of computing MIC are indeed faster and more accurate than previous methods. 
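For readers who haven't run this kind of experiment, here is roughly how a power-against-independence estimate is computed: simulate the null (independent data) to get a rejection cutoff for your statistic, then see how often the statistic exceeds that cutoff on dependent data.  The statistic, relationship, noise level, and sample sizes below are all placeholder choices, not the settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def stat(x, y):
    return abs(np.corrcoef(x, y)[0, 1])  # stand-in; use MIC/TIC in practice

def power(gen, n=100, trials=200, null_trials=200, alpha=0.05):
    """Fraction of dependent samples whose score beats the null's (1-alpha) quantile."""
    # Simulated independent null; a permutation null would match marginals better.
    null = sorted(stat(rng.uniform(size=n), rng.uniform(size=n))
                  for _ in range(null_trials))
    cutoff = null[int((1 - alpha) * null_trials)]
    return np.mean([stat(*gen(n)) > cutoff for _ in range(trials)])

def noisy_line(n):
    x = rng.uniform(size=n)
    return x, x + 0.5 * rng.normal(size=n)

print(f"power against independence: {power(noisy_line):.2f}")
```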

We're hoping these works, collectively, move the ball forward on this topic.  We like seeing MIC being used, and hope people will start using it more when analyzing their high-dimensional data sets.  To be clear, perhaps tomorrow we will find that there are better scoring mechanisms than MIC for equitability;  perhaps even better mutual information estimators will suffice.  That would be great!  (I think of it like clustering algorithms:  there are lots of good clustering algorithms, and different ones may be better suited to different situations.  The more the merrier.)  We also think equitability continues to be a useful framework, and that there's more to be done with it.  Though for now our group may again take a breather on this topic, and see what arises from our current work. 
