Updates: papers and preprints for the first two topics (prediction, lossy compression, MRFs, entropy estimation) were added 10/10. The others are being updated and should be done by 10/10. The page now uses JavaScript for rendering math and navigation: please turn it on if you have turned it off. If you have emailed me about any paper, I will email the paper back to you as well.
My work spans several applications, but is connected by a set of probabilistic and combinatorial approaches. Much of it, though not all, involves high dimensional problems: genetic data in biology, text data, risk management in large financial instruments such as reinsurance, and rare event detection in very complex systems such as smart grids. Handling data of this sort with traditional methods is essentially futile. Yet insights gained in one of these problems carry over to the others as well, which makes a systematic rather than ad hoc approach desirable.
A starting point for my personal evolution of ideas has been the following paper on probability estimation, which won the prestigious Information Theory Society Paper Award. Since then, I have enjoyed expanding the horizons of undersampled problems using approaches and techniques that have been, at least to me, surprising and elegant.
This is in line with similar observations made by others: in the high dimensional domain, one typically has to rework fundamental statistical problems in estimation, classification, and regression in an entirely new light. Practically, the commercial importance of a systematic theory of large alphabet applications cannot be overstated in the fields mentioned above. Theoretically, these problems often force a deeper insight into even the simplest questions, and often involve unexpectedly beautiful mathematical results.
Workshops: I have co-organized two workshops in the broad areas of large alphabet information theory and statistics, the first sponsored by the American Institute of Mathematics and the NSF (2009), the second held at the Banff International Research Station (2011). Over 60 researchers attended one or the other of these workshops, from areas as diverse as biology, computer science, economics, information theory, mathematics, networking, and statistics.
If you are looking for any of my publications other than those listed here, please send me an email.
This ongoing series of papers attempts to redefine the problem of risk management from the ground up. Consider insurance with two parties, the insurer and the insured. In many instances, such as automobile or homeowner insurance, the losses potentially incurred by the insured are well understood from extensive empirical data, to the point of fine-tuning premiums to demographics, geography, and other factors. In other instances, such as big network outages or the future smart grid, the loss could be catastrophic. Ceilings on loss payments then make such schemes uninteresting to the insured. On the other hand, the insurer asks a legitimate question: is it even possible to run a business that pays out losses in full, yet remains profitable?
We formalize the information the insurer has about possible loss models with a class $\cal P$ of admissible probability distributions on losses. The actual loss model is unknown, losses are unbounded, and there are no ceilings on payments: the insurer always pays out the loss in full. When is the game not futile? That is, can the insurer set premiums so that it is not bankrupted? My papers below characterize exactly when the insurance game is workable. In particular, we see how the topology of the model class determines whether it is insurable.
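As a toy illustration only (the premium rules and the Pareto-like loss model below are my own assumptions for this sketch, not the schemes analyzed in the papers), one can simulate an insurer that always pays losses in full and watch whether a given premium rule staves off ruin:

```python
import random

def simulate_insurer(premium_rule, loss_sampler, horizon=10000, seed=0):
    """Each round, collect a premium (a function of past losses), then pay
    the fresh loss in full. Returns (final capital, round of ruin or None)."""
    rng = random.Random(seed)
    capital, history = 0.0, []
    for t in range(horizon):
        capital += premium_rule(history)
        loss = loss_sampler(rng)
        capital -= loss
        history.append(loss)
        if capital < 0:
            return capital, t  # ruined at round t
    return capital, None

def pareto_loss(rng, alpha=1.5):
    """Heavy-tailed (Pareto-like) losses: unbounded, but with finite mean."""
    return (1.0 - rng.random()) ** (-1.0 / alpha) - 1.0

fixed = lambda history: 2.5                               # constant premium
adaptive = lambda history: 1.0 + 0.1 * max(history, default=1.0)
```

With unbounded losses, whether any premium rule works at all depends on what the insurer is allowed to assume about the loss model, which is exactly the question the papers below formalize.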
N. Santhanam and V. Anantharam.
Prediction over countable alphabets.
In Conference on Information Sciences and Systems, 2012. [ pdf ]
N. Santhanam and V. Anantharam.
What risks lead to ruin?
In Allerton Conference on Communication, Control, and Computing, 2010. [ pdf ]
While risk management is important in itself, it also forms a template for a much more general theory, in effect a new way of looking at statistical problems. Typically, one estimates a quantity (in the above, an upper bound on unseen samples) in a probabilistic setting: data generated by an unknown probability law $p$. A statistician would usually ask whether the estimate is consistent, and if so, whether it is uniformly consistent over the set of all possible models. The latter is important, since it quantifies our performance on the data at hand. Without such a guarantee, it is possible that no matter how much data we have, we may still be doing very badly.
This is not usually an issue in settings that do not involve large alphabets. But with large alphabets, classes of models that admit uniform consistency tend to be limited, and we need to work with richer classes. How does one reconcile the practical need to know whether we are doing well on the sample at hand with the need for rich model classes that do not necessarily allow uniform consistency? My papers below take a first step towards a general theory of such algorithms.
N. Santhanam and V. Anantharam.
Agnostic insurance tasks and their relation to compression.
In International conference on signal processing and
communications (SPCOM), July 2012. [ pdf ]
Compression refers to encoding data in bits, using as few bits as possible. Compression can be lossless (the encoded data can be recovered exactly from its representation) or lossy (the data is compressed more than in the lossless case, but can still be recovered to within a prespecified distortion metric). Algorithms for lossless compression are well understood. But, embarrassingly enough, no simple sequential algorithm for the lossy compression of stationary ergodic sources is known, despite its commercial and theoretical importance, and this task has been a focus of the source coding literature for the better part of three decades. The task is difficult even for the simple case of iid sources.
We propose an algorithm in the spirit of the lossless Lempel-Ziv algorithm, and show that it is optimal for iid sources. Moreover, our algorithm incorporates combinatorial constructions that we intend to exploit, using an extension of the Dvoretzky-Motzkin Cycle Lemma, to extend it to all stationary ergodic sources. For iid sources, the Cycle Lemma improves the complexity of the algorithm from quadratic to linear time. The algorithm is now being adapted to sources with memory, where the savings in complexity are expected to be even greater.
Consider Hamming distortion. Say an iid Bernoulli p sequence X matches a given codeword y if it is within Hamming distortion d of y, and X strongly matches y if every prefix of X is within Hamming distortion d of the corresponding prefix of y. Consider the probability of all iid sequences within Hamming distortion d of y. The Cycle Lemma shows that we lose only a polynomial factor (in the sequence length) if we instead ask for the probability of iid Bernoulli p sequences that strongly match y. Moreover, the Cycle Lemma implies that this relation holds asymptotically for all stationary ergodic sources.
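The two matching notions are easy to make concrete. A minimal sketch, under the assumption that "within Hamming distortion d" means normalized distortion, i.e. at most d*k mismatches in a length-k prefix:

```python
def matches(x, y, d):
    """x matches codeword y if the fraction of mismatched positions
    is at most d (normalized Hamming distortion)."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y)) <= d * len(x)

def strongly_matches(x, y, d):
    """x strongly matches y if every prefix of x is within normalized
    Hamming distortion d of the corresponding prefix of y."""
    mismatches = 0
    for k, (a, b) in enumerate(zip(x, y), start=1):
        mismatches += (a != b)
        if mismatches > d * k:
            return False
    return True
```

A strong match is always a match (take the full-length prefix), but not conversely: an early mismatch can violate the prefix constraint even when the full sequences are close.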
N. Santhanam and D. Modha.
Lossy Lempel-Ziv like compression algorithms for memoryless sources.
In Allerton Conference on Communication, Control, and Computing, Sep 2011.
[ pdf ]
A more up-to-date preprint is available here. Please note that this is a working copy and is updated almost daily. If you need a readable copy, please email me and I can send you a handwritten set of notes. Last updated 10/8/2012.
There are a couple of older papers on this topic in the Data Compression Conference, but the preprint and the conference paper above give a better picture of the algorithms and results involved.
Markov random fields (also known as undirected graphical models) provide a structured representation of the joint distributions of families of random variables. They are used in various application domains, among them image analysis, social network analysis, and computational biology. Any Markov random field is associated with an underlying graph that describes conditional independence properties associated with the joint distribution of the random variables.
The fly in the ointment, however, is that inferring the underlying MRF model is, in general, NP-complete. To counter this, focusing on Ising models, we come up with model classes where inference is possible with low sample complexity. One way to see these results is through a universal compression framework: the redundancy-capacity theorem tells us exactly how much information the samples provide about the underlying model. A different way to look at some of our results is that low sample complexity arises because the local graph around each variable is not coupled to too many other nodes. Effectively, if we can identify the implicit local properties, we will be able to handle these classes in a computationally efficient manner.
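As a toy illustration of why local structure helps (a brute-forced 3-spin Ising chain with correlation thresholding as a stand-in decoder; the coupling value and threshold here are my own choices, not the method of the paper):

```python
import itertools
import math

def ising_chain_correlations(theta=0.8):
    """Exact pairwise correlations E[X_i X_j] for a 3-spin Ising chain with
    edges (0,1) and (1,2) and coupling theta, by brute-force enumeration
    of all 2^3 spin configurations."""
    edges = [(0, 1), (1, 2)]
    states = list(itertools.product([-1, 1], repeat=3))
    weights = [math.exp(theta * sum(s[i] * s[j] for i, j in edges))
               for s in states]
    Z = sum(weights)
    return {(i, j): sum(w * s[i] * s[j] for s, w in zip(states, weights)) / Z
            for i in range(3) for j in range(i + 1, 3)}

corr = ising_chain_correlations()
# On a tree, E[X_i X_j] is the product of tanh(theta) along the path: the
# true edges get tanh(0.8) ~ 0.66, the non-edge (0,2) only tanh(0.8)^2 ~ 0.44,
# so a 0.5 threshold recovers the graph.
recovered = {pair for pair, c in corr.items() if c > 0.5}
```

In the high dimensional regime the point is precisely that such local quantities can be estimated from far fewer samples than the full joint distribution would require.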
N. Santhanam and M. Wainwright.
Information-Theoretic Limits of Selecting Binary Graphical Models in High Dimensions.
In IEEE Transactions on Information Theory, vol.58, no.7, pp.4117-4134, July 2012.
[ pdf ]
There are a couple of conference papers that are subsumed by the above paper. If for some reason you are looking for them anyway, send me an email.
We aim to obtain good models for gene regulatory networks by building on this work. Note that gene networks are not bounded-degree models, are not Eulerian, have several short cycles, and incorporate repeated motifs involving small numbers of genes. To get an idea of the magnitude of the problem, even the simplest bacteria have genes numbering in the 1000s (E. coli has 4.2x10^3 genes). The number of microarray samples (simultaneous measurements of the expression of all genes) is usually in the 10s. It is useful to think of the number of microarray observations as scaling logarithmically with the number of genes.
Ising models do lend themselves to describing these networks well from a statistical point of view, and there is already extensive experimental data on these models. The challenge is to use the insights from the universal compression framework to come up with computationally amenable, yet statistically rich, subclasses of Ising models to describe genetic data. To do so, we will incorporate clustering algorithms based on the large alphabet work, in addition to insights gained from universal compression, to model genetic networks. Watch this space for more in the near future. The following is a preliminary "proof of concept" paper that infers the genes involved in the SOS network of E. coli from microarray data.
N. Santhanam, J. Dingel and O. Milenkovic.
On modeling gene regulatory networks using Markov random fields.
In 2009 IEEE Information Theory Workshop on Networking and Information Theory, June 2009.
[ pdf ]
For cases where the alphabet size is large relative to the sample, the correct asymptotics come from assuming that the alphabet scales with the sample size, or is infinite. As mentioned above, this usually means fundamentally reworking all problems, and entropy estimation is one that is both important (for neuroscience and change detection, for example) and one where we have to think outside the box.
Example 1. It has been appreciated for some time that large alphabet entropy estimation is futile in full generality. For example, consider the sequence of distributions $\{p_i: i\ge 1\}$ on the natural numbers, where $p_i$ places probability $1-1/(i+1)$ on 1, and probability $1/(i+1)$ divided equally among the $2^{(i+1)^2}$ numbers starting with 2. The entropy of $p_i$ is therefore $\ge i+1$. Note that the sequence $\{p_i\}$ of distributions converges in $\ell_1$ distance to the singleton distribution on 1, but the entropies diverge to infinity! Put another way, in the large alphabet setting, distributions at arbitrarily small distances can have arbitrarily different entropies. So, given samples from an unknown distribution in $\{ p_i:i\ge1 \}$, we will never be able to estimate its entropy, no matter how large the sample size! $\Box$
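The divergence in Example 1 is easy to verify numerically. A minimal sketch, using the closed form $H(p_i) = h(1/(i+1)) + (i+1)$ bits, where $h$ is the binary entropy function:

```python
import math

def entropy_bits(i):
    """Entropy of p_i in bits: mass 1 - 1/(i+1) on the symbol 1, and mass
    1/(i+1) split equally over 2**((i+1)**2) further symbols, which adds
    q * (i+1)**2 bits on top of the binary entropy h(q)."""
    q = 1.0 / (i + 1)
    h = -(1 - q) * math.log2(1 - q) - q * math.log2(q)
    return h + q * (i + 1) ** 2  # = h(q) + (i + 1) >= i + 1

def l1_to_singleton(i):
    """l1 distance between p_i and the point mass on 1: 2/(i+1)."""
    return 2.0 / (i + 1)

# Distances shrink while entropies blow up.
for i in (1, 10, 100):
    print(i, l1_to_singleton(i), entropy_bits(i))
```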
Example 2. A second, common approach is to bound the support of every distribution in $\cal P$ by $m$, and ask whether we can estimate entropy (to some specified accuracy) with $m/\log m$ samples, or $\sqrt m$ samples, and so on. In a sense, all distributions in $\cal P$ behave uniformly: we can get a sufficiently accurate estimate, no matter what the unknown distribution is, with ${\cal O}\bigl(m/\log m\bigr)$ samples. $\Box$
The approach in Example 2 continues the same line of attack as the finite alphabet case, but with the additional twist of asking for the minimal number of samples needed. To model with richer classes, we take a fundamentally different approach. For starters, let us not bound the supports of the distributions at all. It is quite possible now that different distributions in the set $\cal P$ of possible models require different sample sizes to estimate entropy to a given accuracy. The question we ask, then, is this: when is it possible, without knowledge of the underlying distribution, to tell whether our estimate is good just by looking at the samples? In the case of Example 1, it never is.
Example 3. But consider the set of all uniform distributions over the natural numbers (any finite support size is possible, so there is no uniform upper bound on supports as in Example 2). Suppose we take samples from a distribution until some element repeats. By an argument similar to the Birthday Problem/"Paradox", the sample size at the first repeat is ${\cal O}(\sqrt{m})$ with high probability, where $m$ is the support size of the true distribution (and the confidence is independent of the underlying distribution). It is thus possible to estimate the size of the support, and hence the entropy, though the number of samples needed may be arbitrarily different for different distributions. If we need more accuracy or higher confidence, we sample until we see more repeats. $\Box$
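A minimal sketch of the first-repeat estimator in Example 3 (the constant $2/\pi$ comes from the birthday-problem asymptotic $E[k] \approx \sqrt{\pi m/2}$ for the first-repeat time $k$; averaging over trials is just a convenience for the illustration):

```python
import math
import random

def samples_until_repeat(m, rng):
    """Sample uniformly from {0,...,m-1} until some element repeats;
    return the total number of samples drawn."""
    seen = set()
    while True:
        x = rng.randrange(m)
        if x in seen:
            return len(seen) + 1
        seen.add(x)

def estimate_support(m, trials=3000, seed=0):
    """Estimate the support size of a uniform distribution from the average
    first-repeat time k, inverting E[k] ~ sqrt(pi * m / 2)."""
    rng = random.Random(seed)
    mean_k = sum(samples_until_repeat(m, rng) for _ in range(trials)) / trials
    return 2 * mean_k ** 2 / math.pi
```

The estimate of $m$ (and hence of the entropy $\log m$) converges for every uniform distribution, but the sample size at which it becomes reliable depends on the unknown $m$: pointwise, not uniform, convergence.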
Example 3 illustrates a new paradigm for entropy estimation, one based on pointwise convergence of the estimate (rather than uniform convergence), while avoiding the pitfalls of Example 1. It turns out that these formulations are closely related to the insurance task above, and indeed have very interesting and nontrivial connections with prediction (or weak compression). One of my goals is to build a theory of pointwise convergent algorithms, not only for entropy estimation but for a more general class of problems (any admissible function of the next unseen sample), and to formalize the hierarchy of difficulty of these various problems.
Universal algorithms: making a case for pointwise convergence.
In Allerton Conference on Communication, Control, and Computing, Oct 2012.
The paper will be posted shortly; I presented this on 10/2/2012.
(A. Orlitsky, N. Santhanam, and K. Viswanathan)*.
Population estimation with performance guarantees.
In Proceedings of IEEE Symposium on Information Theory, 2007.
[ pdf ]
Please email me for a preprint of the full version.
In statistics and theoretical computer science, the notion of exchangeability provides a framework for the study of large alphabet scenarios. This idea has been developed in an important line of work starting with Kingman's study of population genetics, and leading on to Kingman's paintbox processes, the Chinese restaurant process, and their generalizations. In information theory, the notion of the pattern of a sequence provides a framework for the study of large alphabet scenarios, as developed in my work on large alphabet probability estimation. The pattern is a statistic that captures all the information present in the data, yet is universally compressible regardless of the alphabet size. The 2009 AIM workshop I co-organized made the connection that patterns and Kingman's paintbox processes are equivalent, which opens up alternate representations of patterns in terms of graph limits.
To put this observation in context, the information-theoretic approach to large alphabet problems via patterns has been primarily frequentist, whereas the statistics community has approached the analogous problems primarily from a Bayesian point of view. There are hence differences in the kinds of estimators that have been proposed, and we are in the process of building on the cross-fertilization of these two approaches.