Friday, August 31, 2007

New Results in Trace Reconstruction

I've already talked a lot in this blog about deletion channels. Trace reconstruction involves a similar problem. We start with an original binary string X = X1,X2,...,Xn. A trace consists of a string Y1,Y2,...,Ym obtained from the original string by passing it through a deletion channel, where each bit is independently deleted with probability p. The trace reconstruction problem basically asks how many independent traces do you need to see to reconstruct the original string X with high probability. Unlike the coding setting, where X might be chosen from a codebook of our own design, in this setting two natural models to study are when X is uniform over binary strings (so the high probability is over the choice of X and the traces), and in the worst case (where the high probability is just over the traces). Variations of the problem include operations other than deletions (including, say, insertions and errors). As an example application, a set of sensors might be monitoring a sequence of events. Each individual sensor is weak and might miss a given event, in which case the question is how many sensors are needed to reconstruct the event sequence perfectly, with high probability.

Trace reconstruction has some history in the information theory community, and the first CS-style paper I saw on it was by Batu, Kannan, Khanna, and McGregor in SODA 2004. The main result of this paper dealt with random input X and considered p values that were O(1/log n). It seems to me much more natural for p to be constant, and it has remained an open problem to determine an efficient algorithm for constant p.

I mentioned this problem last time I visited Microsoft, and it seemed to resonate with some of the people there. Thomas Holenstein, Rina Panigrahy, Udi Wieder and I have a submission with several results, including an algorithm that for random X and sufficiently small constant probability p requires only a polynomial number of traces and polynomial time (with high probability).

The SODA 2004 uses a majority voting technique -- the bits are determined sequentially, with each string voting on the next bit. A key idea in our new algorithm is a "smart voting" technique. We only let traces vote if there is good reason (based on the already determined bits) to think that the trace has a good prediction for the subsequent bit. That is, only well-informed strings are allowed to vote. Feel free to make your own political analogies. My intuition is that this smart voting technique is a closer analogue to the full belief propagation (or Bayesian analysis) that we want to do than just majority voting. Because of this, I hope this "smart voting" technique is a general approach that will find other applications.

I don't yet have an analysis of a belief-propagation-based algorithm. Also, currently we can't analyze a maximum-likelihood algorithm, that finds the most likely original string X. I also don't know how to implement maximum-likelihood efficiently in this setting. So there's still plenty of open questions in this area.


Anonymous said...

V. B. Balakirsky, "Joint Source-Channel Coding Using Variable-Length Codes," Probl. Inf. Transm., vol. 37, no. 1, pp. 10--23, 2001.

Michael Mitzenmacher said...

Anonymous --

I'm not sure why you left me this citation. It seems like an interesting paper, but I don't see its relation to the problem described...

In particular, it doesn't seem to deal with probabilistic deletions. (Though it may be related to some other work I'm doing...)