The Deterministic Information Bottleneck


How does one cluster data in a maximally informative way?

The information bottleneck (IB) method is about finding the pieces of one signal, $X$, that are informative about another signal, $Y$. It can be described both in the language of compression and in the language of clustering. From a (lossy) compression perspective, the idea is to compress $X$ in a way such that, as information is thrown away, the bits most informative about $Y$ are selectively retained as long as possible. From a clustering perspective, the IB seeks to cluster inputs, $X$, such that the cluster IDs are as informative as possible about $Y$; this stands in contrast to clustering algorithms such as, say, $k$-means, which cluster data purely based on geometry, without regard for which dimensions might be more or less important for the task at hand.
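Concretely, both views can be captured by a single variational problem: map $X$ (possibly stochastically) to a compressed variable $T$ by solving

$$\min_{q(t\mid x)} \; I(X;T) - \beta\, I(T;Y),$$

where the trade-off parameter $\beta$ controls how aggressively $X$ is compressed (small $I(X;T)$) versus how much information about $Y$ is retained (large $I(T;Y)$). The notation here, with $T$ as the compressed/cluster variable and $q(t\mid x)$ as the encoder, is just one common convention.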

The IB was originally formulated by Tishby, Pereira, & Bialek as an information-theoretic optimization problem whose solution is a set of self-consistent equations that, iterated to convergence, yield a stochastic encoder, or soft clustering.
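For reference, those equations (in the notation above) are

$$q(t\mid x) \;=\; \frac{q(t)}{Z(x,\beta)}\,\exp\!\big[-\beta\, D_{\mathrm{KL}}\!\left[\,p(y\mid x)\,\Vert\, q(y\mid t)\,\right]\big],$$

with $q(t) = \sum_x p(x)\,q(t\mid x)$ and $q(y\mid t) = \frac{1}{q(t)}\sum_x p(y\mid x)\,p(x)\,q(t\mid x)$; the exponential (Boltzmann-like) form of the encoder is what makes the resulting clustering soft.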

Our contribution (with David Schwab) has been to introduce an alternative formulation of the IB, based on a modification to the cost function that we believe better captures the biologically relevant goal of minimizing representational cost: storage size, in the language of compression, or number of clusters, in the language of clustering. Because our approach results in a deterministic encoder, or hard clustering, we call it the deterministic information bottleneck (DIB).
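In outline, the modification replaces the compression term $I(X;T)$ with the entropy $H(T)$ of the compressed variable,

$$\min_{q(t\mid x)} \; H(T) - \beta\, I(T;Y),$$

whose optimum collapses onto a hard assignment, $t^*(x) = \arg\max_t \big[\log q(t) - \beta\, D_{\mathrm{KL}}[\,p(y\mid x)\,\Vert\, q(y\mid t)\,]\big]$, rather than spreading probability mass across clusters.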

We presented early versions of this work at UAI, the APS March Meeting, and the Physics-Informed Machine Learning meeting in Santa Fe. An updated version is currently under review and should be available here soon.

Currently, we are extending this work in a number of directions.

(1) We are trying to understand the relationship between the DIB/IB, which cluster based on distributional information about an external variable, and geometric clustering algorithms such as $k$-means and Gaussian mixture models. More generally, we are motivated by understanding how to blend distributional and geometric information in clustering problems.

(2) In our derivation of the DIB solution, we introduced a “generalized IB” approach which naturally interpolates between the IB and the DIB (sketched after this list). We are currently investigating whether methods lying “between” IB and DIB might be optimal for certain problems. In particular, the generalized IB can be viewed as a regularized version of the DIB, and we’re testing whether it performs better on finitely sampled data.

(3) We are interested in the DIB as a model of predictive coding in early sensory populations, and are collaborating with Stephanie Palmer to test that hypothesis (Stephanie and colleagues wrote a beautiful PNAS paper testing the IB in this same setting).

(4) We are testing the DIB and IB as methods for user segmentation in collaboration with Spotify (especially with Zack Nichols). Here, the idea is to compute groupings of users such that those groupings are most informative about musical taste, feature use, responses to ads, or other user behavior. This collaboration has also been a major driver of computational and practical improvements to the use of the IB and DIB.
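As a rough sketch of the interpolation mentioned in (2), with $\alpha \in [0,1]$ as the interpolation parameter, the generalized objective can be written

$$\min_{q(t\mid x)} \; H(T) - \alpha\, H(T\mid X) - \beta\, I(T;Y),$$

so that $\alpha = 1$ recovers the original IB (since $H(T) - H(T\mid X) = I(X;T)$), while $\alpha \to 0$ recovers the DIB; intermediate values act like softened, regularized versions of the hard DIB assignment.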

Code to run the DIB, the IB, and interpolations between the two is available here.
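To give a feel for what the iterations look like, here is a minimal NumPy sketch of the generalized update above, operating on a tabulated joint distribution $p(x,y)$. It is illustrative only: the function name, the smoothing constant, and the use of a small $\alpha$ to approximate the deterministic limit are choices made here, not part of the released code.

```python
import numpy as np

def generalized_ib(p_xy, beta, n_clusters, alpha=1.0, n_iters=200, seed=0):
    """Illustrative (generalized) IB iterations on a joint distribution p(x, y).

    alpha = 1 gives the standard (soft) IB update; a small alpha (e.g. 0.01-0.05)
    approximates the deterministic (DIB) limit, where each x gets one cluster.
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                          # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]               # conditional p(y|x)

    # random initial soft assignments q(t|x)
    q_t_given_x = rng.random((p_xy.shape[0], n_clusters))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    eps = 1e-12                                     # numerical smoothing
    for _ in range(n_iters):
        q_t = q_t_given_x.T @ p_x                   # cluster marginal q(t)
        # decoder q(y|t) = sum_x p(y|x) p(x) q(t|x) / q(t)
        q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x / (q_t[:, None] + eps)

        # KL[p(y|x) || q(y|t)] for every (x, t) pair
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(q_y_given_t[None, :, :] + eps)
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=-1)

        # generalized update: q(t|x) proportional to exp{ [log q(t) - beta*KL] / alpha }
        logits = (np.log(q_t + eps) - beta * kl) / alpha
        logits -= logits.max(axis=1, keepdims=True) # stabilize the softmax
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    return q_t_given_x

# toy example: 6 inputs, 3 output symbols; small alpha gives (near-)hard clusters
p_xy = np.random.default_rng(1).random((6, 3))
p_xy /= p_xy.sum()
q = generalized_ib(p_xy, beta=5.0, n_clusters=3, alpha=0.05)
print(q.argmax(axis=1))   # one cluster label per input
```

In practice one would also sweep $\beta$ (and prune empty clusters), but even this small sketch shows how the same update covers the soft and hard regimes by varying $\alpha$.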

This work originally stems from discussions with Rich Turner while we were preparing a tutorial on the IB for Cambridge’s CBL and Microsoft Research Cambridge.