Data Clustering using the Information Bottleneck

This application of the bottleneck method to non-Gaussian sampled data is described in [1]. The concept, as treated there, is not without complication, because the exercise has two independent phases: first, estimation of the unknown parent probability densities from which the data samples are drawn, and second, the use of these densities within the information-theoretic framework of the bottleneck.

Density Estimation

Since the bottleneck method is framed in probabilistic rather than statistical terms, we first need to estimate the underlying probability density at the sample points <math>X = \{x_i\}</math>. This is a well-known problem with a number of solutions [2]. In the present method, probability densities at the sample points are found by a Markov transition matrix method, which has some mathematical synergy with the bottleneck method itself.

Define an arbitrarily increasing distance metric <math>f</math> between all sample pairs and define the distance matrix <math>d_{i,j} = f(|x_i - x_j|)</math>. Then compute transition probabilities between sample pairs <math>P_{i,j} = \exp(-\lambda d_{i,j})</math> for some <math>\lambda > 0</math>. Treating samples as states, and <math>P</math> (with each row normalised to sum to one) as a Markov state transition probability matrix, the row vector of probabilities of the 'states' after <math>t</math> steps, conditioned on the initial state <math>p(0)</math>, is <math>p(t) = p(0) P^t</math>. We are here interested only in the equilibrium probability vector <math>p(\infty)</math>, which is given, in the usual way, by the dominant left eigenvector of matrix <math>P</math> and is independent of the initialising vector <math>p(0)</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probability densities there.
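The density estimation step can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of [1]: the function name `equilibrium_density`, the squared Euclidean choice for the metric, the row normalisation of the transition matrix and the random test data are all assumptions made here.

```python
import numpy as np

def equilibrium_density(samples, lam=3.0):
    """Estimate probabilities at the sample points from a Markov chain.

    samples: array of shape (n, dim); lam: the lambda of the text (> 0).
    Returns the equilibrium vector and the row-normalised transition matrix.
    """
    # Assumed metric: squared Euclidean distance between all sample pairs.
    diff = samples[:, None, :] - samples[None, :, :]
    d = np.sum(diff ** 2, axis=-1)
    # Transition weights exp(-lambda * d), rows normalised to sum to one.
    P = np.exp(-lam * d)
    P /= P.sum(axis=1, keepdims=True)
    # The dominant left eigenvector of P is the dominant right eigenvector of P.T.
    eigvals, eigvecs = np.linalg.eig(P.T)
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    p = np.abs(v)
    return p / p.sum(), P

rng = np.random.default_rng(0)             # arbitrary seed for the illustration
X = rng.uniform(-1.0, 1.0, size=(20, 2))   # 20 hypothetical 2-D samples
p_x, P = equilibrium_density(X)
```

With a symmetric weight matrix and row normalisation, the equilibrium vector is proportional to the row sums of the unnormalised weights, which is one way to see why it behaves like a kernel density estimate at the sample points.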

Clusters

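In the following, the reference vector <math>Y</math> contains sample categories and the joint probability <math>p(X,Y)</math> is assumed known. A cluster <math>c_k</math> is defined by its probability distribution over the data samples <math>x_i</math>: <math>p(c_k|x_i)</math>. In [1] Tishby et al. present the following iterative set of equations to determine the clusters:

<math>\begin{cases} p(c|x) = K p(c) \exp\big(-\beta D^{KL}\big[p(y|x) \,\|\, p(y|c)\big]\big) \\ p(y|c) = \sum_x p(y|x) p(c|x) p(x) / p(c) \\ p(c) = \sum_x p(c|x) p(x) \end{cases}</math>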

The function of each line of the iteration is expanded as follows.

Line 1: This is a matrix-valued set of conditional probabilities

<math>p(c_i|x_j) = K p(c_i) \exp\big(-\beta D^{KL}\big[p(y|x_j) \,\|\, p(y|c_i)\big]\big)</math>

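The Kullback-Leibler distance <math>D^{KL}</math> between the <math>Y</math> vectors generated by the sample data <math>x</math> and those generated by its reduced information proxy <math>c</math> is applied to assess the fidelity of the compressed vector with respect to the categorical data <math>Y</math>, in accordance with the fundamental bottleneck equation. <math>D^{KL}(a\|b)</math> is the Kullback-Leibler distance between distributions <math>a, b</math>

<math>D^{KL}(a\|b) = \sum_i p(a_i) \log\left(\frac{p(a_i)}{p(b_i)}\right)</math>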

and <math>K</math> is a scalar normalization. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback-Leibler distance is large; thus successful clusters grow in probability while unsuccessful ones decay.
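As an illustration of the distance term in line 1, the sketch below computes the Kullback-Leibler distance between every sample column and every cluster column. The function name `kl_matrix`, the array layout (categories along rows) and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def kl_matrix(p_y_given_x, p_y_given_c, eps=1e-12):
    """Return D[i, j] = KL( p(y|x_j) || p(y|c_i) ).

    p_y_given_x: shape (n_y, n_x), each column sums to one.
    p_y_given_c: shape (n_y, n_c), each column sums to one.
    """
    # Clip to eps so the logarithm stays finite; terms with p(y|x) = 0
    # then contribute a negligible amount instead of producing NaN.
    a = np.clip(p_y_given_x, eps, None)[:, None, :]   # (n_y, 1, n_x)
    b = np.clip(p_y_given_c, eps, None)[:, :, None]   # (n_y, n_c, 1)
    return np.sum(a * np.log(a / b), axis=0)          # (n_c, n_x)

# Two samples, one cluster: the first sample is certain of category 0,
# the second matches the cluster distribution exactly.
p_y_given_x = np.array([[1.0, 0.5],
                        [0.0, 0.5]])
p_y_given_c = np.array([[0.5],
                        [0.5]])
D = kl_matrix(p_y_given_x, p_y_given_c)
weights = np.exp(-2.5 * D)   # the exponential weighting of line 1 with beta = 2.5
```

The sample that matches the cluster distribution gets distance zero and weight one, while the mismatched sample is downweighted.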

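Line 2: This is a second matrix-valued set of conditional probabilities

<math>p(y_i|c_k) = \sum_j p(y_i|x_j) p(c_k|x_j) p(x_j) / p(c_k)</math>

The steps in deriving this are as follows. We have, by definition

<math>\begin{align} p(y_i|c_k) &= \int_x p(y_i|x) p(x|c_k) \, dx \\ &= \int_x p(y_i|x) \frac{p(x, c_k)}{p(c_k)} \, dx \\ &= \int_x p(y_i|x) \frac{p(c_k|x) p(x)}{p(c_k)} \, dx \end{align}</math>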

where the Bayes identities <math>p(a,b) = p(a|b) p(b) = p(b|a) p(a)</math> are used. Finally, the integral is rewritten as the summation over the sample points <math>x_j</math>, as in the first equation above.

Line 3: This line finds the marginal distribution of the clusters <math>c</math>

<math>p(c_i) = \sum_j p(c_i, x_j) = \sum_j p(c_i|x_j) p(x_j)</math>

This is also derived from standard results.

Further inputs to the algorithm are the marginal sample distribution <math>p(x)</math>, which has already been determined by the dominant eigenvector of <math>P</math>, and the matrix-valued Kullback-Leibler distance function

<math>D^{KL}_{i,j} = D^{KL}\big[p(y|x_j) \,\|\, p(y|c_i)\big]</math>

derived from the sample spacings and transition probabilities.

The matrices <math>p(c_i|x_j)</math> and <math>p(y_i|c_j)</math> can be initialised randomly.
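The three-line iteration can be sketched as follows. This is an illustrative reading of the equations rather than the code of [1]: the function names `marginals` and `ib_cluster`, the fixed iteration count, the clipping constant and the toy inputs are assumptions, and only <math>p(c|x)</math> is initialised randomly here, with <math>p(c)</math> and <math>p(y|c)</math> derived from it.

```python
import numpy as np

def marginals(p_c_given_x, p_x, p_y_given_x, eps=1e-12):
    """Lines 3 and 2: return p(c) and p(y|c) for the current p(c|x)."""
    p_c = p_c_given_x @ p_x                        # Line 3: sum_x p(c|x) p(x)
    joint = (p_y_given_x * p_x) @ p_c_given_x.T    # sum_x p(y|x) p(c|x) p(x)
    return p_c, joint / np.maximum(p_c, eps)       # Line 2: divide by p(c)

def ib_cluster(p_x, p_y_given_x, n_clusters=2, beta=2.5, n_iter=50, seed=0):
    """Iterate the three bottleneck equations for a fixed number of steps."""
    eps = 1e-12
    rng = np.random.default_rng(seed)
    # Random initial p(c|x); each column is a distribution over clusters.
    p_c_given_x = rng.random((n_clusters, p_x.shape[0]))
    p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)
    p_c, p_y_given_c = marginals(p_c_given_x, p_x, p_y_given_x)
    for _ in range(n_iter):
        # D[i, j] = KL( p(y|x_j) || p(y|c_i) ), clipped to avoid log(0).
        a = np.clip(p_y_given_x, eps, None)[:, None, :]
        b = np.clip(p_y_given_c, eps, None)[:, :, None]
        D = np.sum(a * np.log(a / b), axis=0)
        # Line 1: p(c|x) = K p(c) exp(-beta D), with K normalising over clusters.
        w = p_c[:, None] * np.exp(-beta * D)
        p_c_given_x = w / w.sum(axis=0, keepdims=True)
        p_c, p_y_given_c = marginals(p_c_given_x, p_x, p_y_given_x)
    return p_c_given_x, p_y_given_c, p_c

# Hypothetical toy input: six equally likely samples, two categories.
p_x = np.full(6, 1.0 / 6.0)
p_y_given_x = np.array([[0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
                        [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]])
p_c_given_x, p_y_given_c, p_c = ib_cluster(p_x, p_y_given_x)
```

A fixed iteration count keeps the sketch short; a practical version would more likely stop when the change in <math>p(c|x)</math> falls below a tolerance.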

Defining Decision Contours

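To categorize a new sample <math>x'</math> external to the training set <math>X</math>, first calculate the probabilities that it belongs to each of the various clusters, that is, the conditional probabilities <math>p(c_i|x')</math>. To find these, apply the previous distance metric to obtain the transition probabilities between <math>x'</math> and all samples in <math>X</math>, <math>\tilde p(x_i) = p(x_i|x') = K \exp(-\lambda f(|x_i - x'|))</math>, where <math>K</math> normalizes the probabilities to sum to one. Then apply the last two lines of the 3-line algorithm to get the cluster and conditional category probabilities.

<math>\tilde p(c_i) = p(c_i|x') = \sum_j p(c_i|x_j) p(x_j|x') = \sum_j p(c_i|x_j) \tilde p(x_j)</math>

<math>p(y_i|c_j) = \sum_k p(y_i|x_k) p(c_j|x_k) p(x_k|x') / p(c_j|x') = \sum_k p(y_i|x_k) p(c_j|x_k) \tilde p(x_k) / \tilde p(c_j)</math>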

Finally we have

<math>p(y_i|x') = \sum_j p(y_i|c_j) p(c_j|x') = \sum_j p(y_i|c_j) \tilde p(c_j)</math>

Generally the algorithm converges rapidly, often in tens of iterations. However, the parameter <math>\beta</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space click into focus at certain critical values.
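The categorization of a new sample can be sketched as below. The function name `classify`, the squared Euclidean metric, the array layout and the small hand-made training set are assumptions for illustration; in practice the matrix <math>p(c|x)</math> would come from the converged iteration.

```python
import numpy as np

def classify(x_new, samples, p_c_given_x, p_y_given_x, lam=3.0, eps=1e-12):
    """Return the category probabilities p(y|x') for a new sample x'."""
    # Transition probabilities from x' to every training sample (assumed
    # squared Euclidean metric), normalised to sum to one.
    d = np.sum((samples - x_new) ** 2, axis=1)
    p_tilde_x = np.exp(-lam * d)
    p_tilde_x /= p_tilde_x.sum()
    # Cluster probabilities for x': sum_j p(c|x_j) p~(x_j).
    p_tilde_c = p_c_given_x @ p_tilde_x
    # Conditional category probabilities given each cluster.
    p_y_given_c = (p_y_given_x * p_tilde_x) @ p_c_given_x.T
    p_y_given_c /= np.maximum(p_tilde_c, eps)
    # Category probabilities for x': sum_j p(y|c_j) p~(c_j).
    return p_y_given_c @ p_tilde_c

# Hypothetical training set: two samples per category.
samples = np.array([[0.5, 0.5], [0.6, 0.4], [-0.5, 0.5], [-0.4, 0.6]])
p_y_given_x = np.array([[1.0, 1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 1.0]])
p_c_given_x = np.array([[0.9, 0.8, 0.2, 0.1],
                        [0.1, 0.2, 0.8, 0.9]])
p_y_new = classify(np.array([0.55, 0.45]), samples, p_c_given_x, p_y_given_x)
```

Scanning `x_new` over a grid and taking the ratio of the two entries of the result gives the likelihood ratio used to draw a decision contour.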

There is some analogy between this algorithm and a neural network with a single hidden layer. The nodes are represented by the clusters <math>c_j</math>. The first and second layers of network weights are the conditional probabilities <math>p(c_j|x_i)</math> and <math>p(y_k|c_j)</math> respectively. However, unlike a standard neural network, the present algorithm always uses probabilities of samples as inputs rather than the sample values themselves, and the nonlinear functions are encapsulated in the Kullback-Leibler distances and the transition probabilities rather than in sigmoid functions. Compared to a neural network, this algorithm seems to converge much more quickly, and by varying <math>\lambda</math> and <math>\beta</math> various levels of focus on features can be achieved. There are also similarities to some varieties of fuzzy logic algorithms.

For blind classification and clustering, the transient behaviour of   is analysed and this is discussed in more detail in [2] but this extra complication is not necessary for the supervised training described here.

An Example

In the following simple case we investigate clustering in a four-quadrant multiplier with random inputs <math>u, v</math> and two categories of output, <math>\pm 1</math>, generated by <math>y = \operatorname{sign}(uv)</math>. This function has the property that there are two spatially separated clusters for each category, and so it demonstrates that the method can handle such distributions.

20 samples are taken, uniformly distributed on the square <math>[-1,1]^2</math>. The number of clusters used beyond the number of categories, two in this case, has little effect on performance, and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math> and the distance function <math>d_{i,j} = |x_i - x_j|^2</math> where <math>x_i = (u_i, v_i)^T</math>. The figure shows the locations of the twenty samples with '0' representing <math>Y = 1</math> and 'x' representing <math>Y = -1</math>. The contour at the unity likelihood ratio level, <math>L = \Pr(1)/\Pr(-1) = 1</math>, is shown as a new sample <math>x'</math> is scanned over the square. Theoretically the contour should align with the <math>u = 0</math> and <math>v = 0</math> coordinates, but for such small sample numbers it has instead followed the spurious clusterings of the sample points.

 
Figure: Decision contours
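The input data for this example can be set up roughly as follows. The random seed, the variable names and the encoding of <math>p(y|x)</math> as a 2 × 20 indicator matrix are assumptions of this sketch, so the samples will not match those in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, not from the source
n = 20
x = rng.uniform(-1.0, 1.0, size=(n, 2))    # rows are x_i = (u_i, v_i)
y = np.sign(x[:, 0] * x[:, 1])             # four-quadrant multiplier output, +1 or -1

# Assumed encoding: row 0 holds Pr(y = +1 | x_i), row 1 holds Pr(y = -1 | x_i).
p_y_given_x = np.vstack([y > 0, y < 0]).astype(float)

# Distance matrix d_ij = |x_i - x_j|^2 and transition weights with lambda = 3.
lam = 3.0
d = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
P = np.exp(-lam * d)
```

Each category occupies two opposite quadrants of the square, which is what makes the clusters for a single category spatially separated.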

Bibliography

[1] N. Tishby, N. Slonim: “Data clustering by Markovian Relaxation and the Information Bottleneck Method”, Neural Information Processing Systems (NIPS) 2000, pp. 640-646.

[2] B.W. Silverman: “Density Estimation for Statistical Data Analysis”, Chapman and Hall, 1986.