In probability theory, an $f$-divergence is a certain type of function $D_f(P \parallel Q)$ that measures the difference between two probability distributions $P$ and $Q$. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of $f$-divergence.

History


These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances.

Definition


Non-singular case


Let $P$ and $Q$ be two probability distributions over a space $\Omega$, such that $P \ll Q$, that is, $P$ is absolutely continuous with respect to $Q$. Then, for a convex function $f : [0, +\infty) \to (-\infty, +\infty]$ such that $f(x)$ is finite for all $x > 0$, $f(1) = 0$, and $f(0) = \lim_{t \to 0^+} f(t)$ (which could be infinite), the $f$-divergence of $P$ from $Q$ is defined as

$$D_f(P \parallel Q) := \int_\Omega f\!\left(\frac{dP}{dQ}\right) dQ.$$

We call $f$ the generator of $D_f$.

In concrete applications, there is usually a reference distribution $\mu$ on $\Omega$ (for example, when $\Omega = \mathbb{R}^n$, the reference distribution is the Lebesgue measure), such that $P, Q \ll \mu$; then we can use the Radon–Nikodym theorem to take their probability densities $p$ and $q$, giving

$$D_f(P \parallel Q) = \int_\Omega f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, d\mu(x).$$

When there is no such reference distribution ready at hand, we can simply define $\mu = \tfrac{1}{2}(P + Q)$, and proceed as above. This is a useful technique in more abstract proofs.

The above definition can be extended to cases where $P \ll Q$ is no longer satisfied (Definition 7.1 of [2]).

Since $f$ is convex and $f(1) = 0$, the function $x \mapsto \frac{f(x)}{x - 1}$ must be nondecreasing, so there exists $f'(\infty) := \lim_{x \to +\infty} \frac{f(x)}{x}$, taking value in $(-\infty, +\infty]$.

Since for any $p(x) > 0$ we have $\lim_{q(x) \to 0^+} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) = p(x)\, f'(\infty)$, we can extend the f-divergence to the case $P \not\ll Q$.
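
For a concrete feel of the definition, the following minimal sketch (in Python with NumPy; the helper names are ours, not standard) evaluates the discrete form $D_f(P \parallel Q) = \sum_i q_i f(p_i/q_i)$, charging any P-mass on $\{q_i = 0\}$ at the rate $f'(\infty)$ as in the extension just described.

    import numpy as np

    def f_divergence(p, q, f, f_prime_inf=np.inf):
        """Discrete D_f(P||Q) = sum_i q_i f(p_i/q_i), with any P-mass on {q_i = 0}
        charged at the rate f'(inf), following the extension described above."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        pos = q > 0
        d = np.sum(q[pos] * f(p[pos] / q[pos]))
        mass_outside = p[~pos].sum()
        if mass_outside > 0:
            d += f_prime_inf * mass_outside
        return d

    def kl_gen(t):                              # f(t) = t ln t, with f(0) = 0
        t = np.asarray(t, float)
        out = np.zeros_like(t)
        out[t > 0] = t[t > 0] * np.log(t[t > 0])
        return out

    tv_gen = lambda t: 0.5 * np.abs(t - 1.0)    # f(t) = |t - 1| / 2, with f'(inf) = 1/2

    p, q = [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]
    print(f_divergence(p, q, kl_gen))                      # KL(P||Q)
    print(f_divergence(p, q, tv_gen, f_prime_inf=0.5))     # TV(P||Q) = 0.5 * sum_i |p_i - q_i|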

Properties


Basic relations between f-divergences

  • Linearity: $\sum_i a_i D_{f_i} = D_{\sum_i a_i f_i}$ given a finite sequence of nonnegative real numbers $a_i$ and generators $f_i$.
  • $D_f = D_g$ if and only if $f(x) = g(x) + c\,(x - 1)$ for some $c \in \mathbb{R}$.
Proof

If $f(x) = g(x) + c\,(x - 1)$, then $D_f = D_g$ by definition, since $E_Q\!\left[\frac{dP}{dQ} - 1\right] = 0$.

Conversely, if $D_f = D_g$, then let $h = f - g$, so that $D_h = D_f - D_g = 0$. For any two probability measures $P, Q$ on the two-point set $\{1, 2\}$, since $D_h(P \parallel Q) = 0$, we get

$$q_1\, h\!\left(\frac{p_1}{q_1}\right) + q_2\, h\!\left(\frac{p_2}{q_2}\right) = 0.$$

Since each probability measure $P, Q$ has one degree of freedom, we can solve $\frac{p_1}{q_1} = a$, $\frac{p_2}{q_2} = b$ for every choice of $0 < a < 1 < b$.

Linear algebra yields $q_1 = \frac{b - 1}{b - a}$, $q_2 = \frac{1 - a}{b - a}$, which is a valid probability measure. Then we obtain

$$h(a)\,\frac{b - 1}{b - a} + h(b)\,\frac{1 - a}{b - a} = 0.$$

Thus $\frac{h(a)}{a - 1} = \frac{h(b)}{b - 1}$ for all $0 < a < 1 < b$, so $h(x) = c\,(x - 1)$ for some constant $c$. Plugging the formula into $f = g + h$ yields $f(x) = g(x) + c\,(x - 1)$.
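
A quick numerical sanity check of this equivalence, under a made-up discrete setup: adding $c\,(t - 1)$ to a generator leaves the divergence unchanged, because $E_Q\!\left[\frac{dP}{dQ} - 1\right] = 0$.

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])
    t = p / q                                  # likelihood ratios (all q_i > 0 here)

    c = 2.7
    f       = lambda x: x * np.log(x)          # KL generator
    shifted = lambda x: f(x) + c * (x - 1.0)   # differs from f by c*(t - 1)

    print(np.sum(q * f(t)), np.sum(q * shifted(t)))   # same value up to float rounding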

Basic properties of f-divergences

  • Non-negativity: the f-divergence is always nonnegative; it is zero if the measures P and Q coincide. This follows immediately from Jensen's inequality:
     $$D_f(P \parallel Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ \;\geq\; f\!\left(\int \frac{dP}{dQ}\, dQ\right) = f(1) = 0.$$
  • Data-processing inequality: if κ is an arbitrary transition probability that transforms measures P and Q into Pκ and Qκ correspondingly, then
     $$D_f(P \parallel Q) \geq D_f(P\kappa \parallel Q\kappa).$$
    The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}.
  • Joint convexity: for any 0 ≤ λ ≤ 1,
     $$D_f\big(\lambda P_1 + (1 - \lambda) P_2 \,\big\|\, \lambda Q_1 + (1 - \lambda) Q_2\big) \;\leq\; \lambda D_f(P_1 \parallel Q_1) + (1 - \lambda) D_f(P_2 \parallel Q_2).$$
    This follows from the convexity of the mapping $(p, q) \mapsto q\, f(p/q)$ on $\mathbb{R}_{>0}^2$.
  • Reversal by convex inversion: for any function $f$, its convex inversion is defined as $g(t) := t\, f(1/t)$. When $f$ satisfies the defining features of an f-divergence generator ($f(x)$ is finite for all $x > 0$, $f(1) = 0$, and $f(0) = \lim_{t \to 0^+} f(t)$), then $g$ satisfies the same features, and thus defines an f-divergence $D_g$. This is the "reverse" of $D_f$, in the sense that $D_g(P \parallel Q) = D_f(Q \parallel P)$ for all $P, Q$ that are absolutely continuous with respect to each other. In this way, every f-divergence $D_f$ can be turned symmetric by taking $D_f + D_g$. For example, performing this symmetrization turns the KL-divergence into the Jeffreys divergence (a numerical check is sketched below).
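
The following sketch (discrete distributions made up for the example; the helper is a simplified variant of the one in the definition section, assuming all $q_i > 0$) illustrates the data-processing inequality under a row-stochastic kernel, the reversal by convex inversion, and the symmetrization of KL into the Jeffreys divergence.

    import numpy as np

    def Df(p, q, f):
        """Discrete D_f(P||Q) = sum_i q_i f(p_i/q_i); assumes q_i > 0 everywhere."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sum(q * f(p / q))

    kl_gen = lambda t: t * np.log(t)

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])

    # Data-processing inequality: push P and Q through a row-stochastic kernel K.
    K = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    print(Df(p @ K, q @ K, kl_gen) <= Df(p, q, kl_gen))            # True

    # Reversal by convex inversion: g(t) = t*f(1/t) gives D_g(P||Q) = D_f(Q||P).
    rev_kl_gen = lambda t: t * kl_gen(1.0 / t)                     # equals -ln t
    print(np.isclose(Df(p, q, rev_kl_gen), Df(q, p, kl_gen)))      # True

    # Symmetrization f + g turns the KL-divergence into the Jeffreys divergence.
    jeffreys_gen = lambda t: kl_gen(t) + rev_kl_gen(t)
    print(np.isclose(Df(p, q, jeffreys_gen),
                     Df(p, q, kl_gen) + Df(q, p, kl_gen)))         # True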

In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution $P^*$, then $D_f(P(t) \parallel P^*)$ is a monotonic (non-increasing) function of time, where the probability distribution $P(t)$ is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences $D_f(P(t) \parallel P^*)$ are Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if $H(P)$ is a Lyapunov function for all Markov chains with positive equilibrium $P^*$ and is of the trace form ($H(P) = \sum_i h(p_i, p_i^*)$), then $H(P) = D_f(P \parallel P^*)$ for some convex function $f$.[3][4] For example, Bregman divergences in general do not have such a property and can increase in Markov processes.[5]
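
As an illustration of this monotonicity, a discrete-time sketch (a made-up transition matrix standing in for the continuous-time Kolmogorov equations): the KL-divergence to the equilibrium distribution never increases along the chain.

    import numpy as np

    def kl(p, q):
        return np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0]))

    # A row-stochastic transition matrix (made up for the example).
    T = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                        # stationary distribution: pi @ T == pi

    p_t = np.array([1.0, 0.0, 0.0])           # start from a point mass
    divs = []
    for _ in range(20):
        divs.append(kl(p_t, pi))
        p_t = p_t @ T                         # one step of the (discrete-time) master equation

    print(np.all(np.diff(divs) <= 1e-12))     # KL to equilibrium is non-increasing: True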

Analytic properties


The f-divergences can be expressed using Taylor series and rewritten as a weighted sum of chi-type distances (Nielsen & Nock (2013)).

Variational representations


Let $f^*$ be the convex conjugate of $f$. Let $\operatorname{effdom}(f^*)$ be the effective domain of $f^*$, that is, $\operatorname{effdom}(f^*) = \{y : f^*(y) < +\infty\}$. Then we have two variational representations of $D_f$, which we describe below.

Basic variational representation


Under the above setup,

Theorem — $D_f(P \parallel Q) = \sup_{g : \Omega \to \operatorname{effdom}(f^*)} \; E_P[g(X)] - E_Q\big[f^*(g(X))\big]$.

This is Theorem 7.24 in [2].
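
For instance, on a discrete space any witness $g$ with values in $\operatorname{effdom}(f^*)$ gives a lower bound on $D_f$, with equality at $g = f'(dP/dQ)$. A small sketch (distributions made up for the example) with the generator $f(t) = (t-1)^2$, whose conjugate is $f^*(y) = y + y^2/4$:

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])

    f      = lambda t: (t - 1.0) ** 2          # chi-squared generator
    f_star = lambda y: y + y ** 2 / 4.0        # its convex conjugate

    chi2 = np.sum(q * f(p / q))                # D_f(P||Q) computed directly

    def lower_bound(g):
        """E_P[g] - E_Q[f*(g)] from the basic variational representation."""
        return np.sum(p * g) - np.sum(q * f_star(g))

    g_opt = 2.0 * (p / q - 1.0)                # g = f'(dP/dQ) attains the supremum
    print(np.isclose(lower_bound(g_opt), chi2))                 # True
    rng = np.random.default_rng(0)
    print(all(lower_bound(rng.normal(size=3)) <= chi2 + 1e-12
              for _ in range(1000)))                            # True: always a lower bound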

Example applications


Using this theorem on total variation distance, with generator $f(x) = \frac{1}{2}|x - 1|$, its convex conjugate is $f^*(x^*) = x^*$ on $\left[-\frac{1}{2}, \frac{1}{2}\right]$, and we obtain
$$TV(P, Q) = \sup_{|g| \leq 1/2} E_P[g] - E_Q[g].$$
For the chi-squared divergence, defined by $f(x) = (x - 1)^2$, we obtain
$$\chi^2(P \parallel Q) = \sup_{g} E_P[g] - E_Q\!\left[g + \frac{g^2}{4}\right].$$
Since the variational term is not affine-invariant in $g$, even though the domain over which $g$ varies is affine-invariant, we can use up the affine-invariance to obtain a leaner expression.

Replacing $g$ by $a g + b$ and taking the maximum over $a, b \in \mathbb{R}$, we obtain
$$\chi^2(P \parallel Q) = \sup_{g} \frac{\big(E_P[g] - E_Q[g]\big)^2}{\operatorname{Var}_Q[g]},$$
which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound (Theorem 29.1 and its corollary in [2]).
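
Numerically, this supremum is attained at $g = dP/dQ$ (or any affine transformation of it); a quick sketch under the same discrete setup as before:

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])
    chi2 = np.sum((p - q) ** 2 / q)

    def ratio(g):
        """(E_P[g] - E_Q[g])^2 / Var_Q[g], the affine-invariant objective."""
        mean_q = np.sum(q * g)
        return (np.sum(p * g) - mean_q) ** 2 / np.sum(q * (g - mean_q) ** 2)

    print(np.isclose(ratio(p / q), chi2))               # the witness dP/dQ attains chi^2
    print(np.isclose(ratio(3.0 * p / q - 7.0), chi2))   # affine-invariance of the objective
    rng = np.random.default_rng(1)
    print(all(ratio(rng.normal(size=3)) <= chi2 + 1e-12 for _ in range(1000)))  # True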

For $\alpha$-divergence with $\alpha \in (-\infty, 0) \cup (0, 1)$, we have $f_\alpha(x) = \frac{x^\alpha - \alpha x - (1 - \alpha)}{\alpha(\alpha - 1)}$, with range $x \in [0, \infty)$. Its convex conjugate is $f_\alpha^*(y) = \frac{x(y)^\alpha - 1}{\alpha}$ with range $y \in \left(-\infty, \frac{1}{1 - \alpha}\right)$, where $x(y) = \big((\alpha - 1) y + 1\big)^{\frac{1}{\alpha - 1}}$.

Applying this theorem yields, after substitution with $h = \big((\alpha - 1) g + 1\big)^{\frac{1}{\alpha - 1}}$,
$$D_\alpha(P \parallel Q) = \frac{1}{\alpha(1 - \alpha)} - \inf_{h : \Omega \to (0, \infty)}\left[E_Q\!\left[\frac{h^\alpha}{\alpha}\right] + E_P\!\left[\frac{h^{\alpha - 1}}{1 - \alpha}\right]\right],$$
or, releasing the constraint on $h$,
$$D_\alpha(P \parallel Q) = \frac{1}{\alpha(1 - \alpha)} - \inf_{h : \Omega \to \mathbb{R}}\left[E_Q\!\left[\frac{|h|^\alpha}{\alpha}\right] + E_P\!\left[\frac{|h|^{\alpha - 1}}{1 - \alpha}\right]\right].$$
Setting $\alpha = -1$ yields the variational representation of the $\chi^2$-divergence obtained above.

The domain over which $h$ varies is not affine-invariant in general, unlike the $\chi^2$-divergence case. The $\chi^2$-divergence is special, since in that case we can remove the $|\cdot|$ from $|h|$.

For general $\alpha \in (-\infty, 0) \cup (0, 1)$, the domain over which $h$ varies is merely scale-invariant. Similar to above, we can replace $h$ by $a h$, and take the minimum over $a > 0$ to obtain
$$D_\alpha(P \parallel Q) = \frac{1}{\alpha(1 - \alpha)}\left(1 - \inf_{h : \Omega \to (0, \infty)} E_P\!\left[h^{\alpha - 1}\right]^{\alpha} E_Q\!\left[h^{\alpha}\right]^{1 - \alpha}\right).$$
Setting $\alpha = \frac{1}{2}$, and performing another substitution by $g = \sqrt{h}$, yields two variational representations of the squared Hellinger distance:
$$H^2(P \parallel Q) = 2 - \inf_{g : \Omega \to (0, \infty)}\left(E_P\!\left[g^{-1}\right] + E_Q[g]\right),$$
$$H^2(P \parallel Q) = 2 - 2 \inf_{g : \Omega \to (0, \infty)}\sqrt{E_P\!\left[g^{-1}\right]\, E_Q[g]}.$$
Applying this theorem to the KL-divergence, defined by $f(x) = x \ln x$, yields
$$D_{KL}(P \parallel Q) = \sup_{g}\; E_P[g] - E_Q\!\left[e^{g - 1}\right].$$
This is strictly less efficient than the Donsker–Varadhan representation
$$D_{KL}(P \parallel Q) = \sup_{g}\; E_P[g] - \ln E_Q\!\left[e^{g}\right].$$
This defect is fixed by the next theorem.
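
The gap can be seen numerically: for the same witness $g$, the Donsker–Varadhan objective is never smaller than the naive one (since $\ln x \leq x/e$), and the two representations are tight at different optimal witnesses. A small sketch:

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])
    kl = np.sum(p * np.log(p / q))

    naive = lambda g: np.sum(p * g) - np.sum(q * np.exp(g - 1.0))    # E_P[g] - E_Q[e^{g-1}]
    dv    = lambda g: np.sum(p * g) - np.log(np.sum(q * np.exp(g)))  # E_P[g] - ln E_Q[e^g]

    g = np.log(p / q)                       # the Donsker-Varadhan optimizer
    print(np.isclose(dv(g), kl))            # True: DV is tight at g = ln(dP/dQ)
    print(naive(g) <= dv(g) <= kl + 1e-12)  # True: naive bound is weaker for this witness
    print(np.isclose(naive(g + 1.0), kl))   # naive rep needs the shifted witness 1 + ln(dP/dQ)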

Improved variational representation


Assume the setup in the beginning of this section ("Variational representations").

Theorem — If $f(x) = +\infty$ on $x < 0$ (redefine $f$ if necessary), then

$$D_f(P \parallel Q) = f'(\infty)\, P[q = 0] \;+\; \sup_{g}\; E_P\big[g\, 1_{\{q > 0\}}\big] - \Psi^*_Q(g),$$

where $\Psi^*_Q(g) := \inf_{a \in \mathbb{R}}\; E_Q\big[f^*(g(X) - a)\big] + a$ and $1_{\{q > 0\}}$ is the indicator of the set $\{q > 0\}$, where $q$ is the probability density function of $Q$ with respect to some underlying measure.

In the special case of $P \ll Q$, we have

$$D_f(P \parallel Q) = \sup_{g}\; E_P[g] - \Psi^*_Q(g).$$

This is Theorem 7.25 in [2].

Example applications


Applying this theorem to KL-divergence yields the Donsker–Varadhan representation.
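
A sketch of why this works out, assuming the form of $\Psi^*_Q$ stated above: for $f(x) = x \ln x$ one has $f^*(y) = e^{y - 1}$, and minimizing $E_Q[f^*(g - a)] + a$ over the shift $a$ gives exactly $\ln E_Q[e^g]$, which a crude grid search confirms.

    import numpy as np

    q = np.array([0.1, 0.4, 0.5])
    g = np.array([0.7, -0.2, 1.3])            # an arbitrary witness function

    f_star = lambda y: np.exp(y - 1.0)        # conjugate of f(x) = x ln x

    a_grid = np.linspace(-10, 10, 20001)
    psi_grid = np.array([np.sum(q * f_star(g - a)) + a for a in a_grid])
    psi_star = psi_grid.min()                 # Psi*_Q(g) = inf_a E_Q[f*(g - a)] + a

    print(psi_star, np.log(np.sum(q * np.exp(g))))   # both approximately ln E_Q[e^g]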

Attempting to apply this theorem to the general $\alpha$-divergence with $\alpha \in (-\infty, 0) \cup (0, 1)$ does not yield a closed-form solution.

Common examples of f-divergences


The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of $\alpha$-divergence, or linear sums of $\alpha$-divergences.

For each f-divergence $D_f$, its generating function is not uniquely defined, but only up to $c\,(t - 1)$, where $c$ is any real constant. That is, for any $f$ that generates an f-divergence, we have $D_{f(t)} = D_{f(t) + c\,(t - 1)}$. This freedom is not only convenient, but actually necessary.

Divergence | Corresponding f(t) | Discrete Form
$\chi^\alpha$-divergence, $\alpha \geq 1$ | $\frac{1}{2}|t - 1|^\alpha$ | $\frac{1}{2}\sum_i \frac{|p_i - q_i|^\alpha}{q_i^{\alpha - 1}}$
Total variation distance ($\alpha = 1$) | $\frac{1}{2}|t - 1|$ | $\frac{1}{2}\sum_i |p_i - q_i|$
$\alpha$-divergence | $\frac{t^\alpha - \alpha t - (1 - \alpha)}{\alpha(\alpha - 1)}$ if $\alpha \neq 0, 1$; $\; t\ln t - t + 1$ if $\alpha = 1$; $\; t - 1 - \ln t$ if $\alpha = 0$ | $\frac{1}{\alpha(\alpha - 1)}\left(\sum_i \frac{p_i^\alpha}{q_i^{\alpha - 1}} - 1\right)$ if $\alpha \neq 0, 1$
KL-divergence ($\alpha = 1$) | $t\ln t$ | $\sum_i p_i \ln\frac{p_i}{q_i}$
reverse KL-divergence ($\alpha = 0$) | $-\ln t$ | $\sum_i q_i \ln\frac{q_i}{p_i}$
Jensen–Shannon divergence | $\frac{t}{2}\ln t - \frac{t + 1}{2}\ln\frac{t + 1}{2}$ | $\frac{1}{2}\sum_i \left(p_i \ln\frac{2 p_i}{p_i + q_i} + q_i \ln\frac{2 q_i}{p_i + q_i}\right)$
Jeffreys divergence (KL + reverse KL) | $(t - 1)\ln t$ | $\sum_i (p_i - q_i)\ln\frac{p_i}{q_i}$
squared Hellinger distance (rescaling of $\alpha = \tfrac{1}{2}$) | $(\sqrt{t} - 1)^2$ | $\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$
Pearson $\chi^2$-divergence (rescaling of $\alpha = 2$) | $(t - 1)^2$ | $\sum_i \frac{(p_i - q_i)^2}{q_i}$
Neyman $\chi^2$-divergence (reverse Pearson) (rescaling of $\alpha = -1$) | $\frac{(1 - t)^2}{t}$ | $\sum_i \frac{(p_i - q_i)^2}{p_i}$

[Figure: comparison between the generators of the $\alpha$-divergences, as $\alpha$ varies from −1 to 2.]

Let $f_\alpha$ be the generator of the $\alpha$-divergence, then $f_\alpha$ and $f_{1 - \alpha}$ are convex inversions of each other, so $D_\alpha(P \parallel Q) = D_{1 - \alpha}(Q \parallel P)$. In particular, this shows that the squared Hellinger distance and the Jensen–Shannon divergence are symmetric.
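
A sketch computing a few of the table's divergences for a made-up discrete pair and checking two of the relations above: Jeffreys = KL + reverse KL, and $D_\alpha(P \parallel Q) = D_{1 - \alpha}(Q \parallel P)$.

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])
    t = p / q                                   # likelihood ratios (all positive here)

    def alpha_div(p, q, a):
        """alpha-divergence in this page's parametrization (a != 0, 1)."""
        r = p / q
        return np.sum(q * (r ** a - a * r - (1.0 - a)) / (a * (a - 1.0)))

    kl         = np.sum(q * t * np.log(t))                 # f(t) = t ln t
    reverse_kl = np.sum(q * -np.log(t))                    # f(t) = -ln t
    jeffreys   = np.sum(q * (t - 1.0) * np.log(t))         # f(t) = (t - 1) ln t
    hellinger2 = np.sum(q * (np.sqrt(t) - 1.0) ** 2)       # f(t) = (sqrt t - 1)^2

    print(np.isclose(jeffreys, kl + reverse_kl))                            # True
    print(np.isclose(hellinger2, np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))   # True
    for a in (0.3, 0.5, -1.0, 2.0):
        print(np.isclose(alpha_div(p, q, a), alpha_div(q, p, 1.0 - a)))     # True each time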

In the literature, the $\alpha$-divergences are sometimes parametrized as

$$f_\alpha(t) = \begin{cases} \dfrac{4}{1 - \alpha^2}\left(1 - t^{\frac{1 + \alpha}{2}}\right) & \text{if } \alpha \neq \pm 1, \\ t\ln t & \text{if } \alpha = 1, \\ -\ln t & \text{if } \alpha = -1, \end{cases}$$

which is equivalent to the parametrization in this page by substituting $\alpha \leftarrow \frac{\alpha + 1}{2}$.

Relations to other statistical divergences


Here, we compare f-divergences with other statistical divergences.

Rényi divergence


The Rényi divergences are a family of divergences defined by

$$R_\alpha(P \parallel Q) = \frac{1}{\alpha - 1}\,\ln\!\left(E_Q\!\left[\left(\frac{dP}{dQ}\right)^{\!\alpha}\right]\right)$$

when $\alpha \in (0, 1) \cup (1, +\infty)$. It is extended to the cases of $\alpha = 0, 1, +\infty$ by taking the limit.

Simple algebra shows that $R_\alpha(P \parallel Q) = \frac{1}{\alpha - 1}\ln\big(1 + \alpha(\alpha - 1) D_\alpha(P \parallel Q)\big)$, where $D_\alpha$ is the $\alpha$-divergence defined above.
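
A quick numerical check of this identity (same made-up discrete setup as in the earlier sketches):

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.4, 0.5])

    def alpha_div(p, q, a):                    # alpha-divergence, this page's parametrization
        r = p / q
        return np.sum(q * (r ** a - a * r - (1.0 - a)) / (a * (a - 1.0)))

    def renyi(p, q, a):                        # Renyi divergence of order a (a != 1)
        return np.log(np.sum(q * (p / q) ** a)) / (a - 1.0)

    for a in (0.3, 0.5, 2.0, 5.0):
        lhs = renyi(p, q, a)
        rhs = np.log(1.0 + a * (a - 1.0) * alpha_div(p, q, a)) / (a - 1.0)
        print(np.isclose(lhs, rhs))            # True for each order a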

Bregman divergence


The only f-divergence that is also a Bregman divergence is the KL divergence.[6]

Integral probability metrics


The only f-divergence that is also an integral probability metric is the total variation.[7]

Financial interpretation


A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the ƒ-divergence.[8]


References

  1. Rényi, Alfréd (1961). On measures of entropy and information (PDF). The 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960. Berkeley, CA: University of California Press. pp. 547–561. Eq. (4.20).
  2. Polyanskiy, Yury; Wu, Yihong (2022). Information Theory: From Coding to Learning (draft of October 20, 2022) (PDF). Cambridge University Press. Archived from the original (PDF) on 2023-02-01.
  3. Gorban, Pavel A. (15 October 2003). "Monotonically equivalent entropies and solution of additivity equation". Physica A. 328 (3–4): 380–390. arXiv:cond-mat/0304131. Bibcode:2003PhyA..328..380G. doi:10.1016/S0378-4371(03)00578-8. S2CID 14975501.
  4. Amari, Shun'ichi (2009). Leung, C.S.; Lee, M.; Chan, J.H. (eds.). Divergence, Optimization, Geometry. The 16th International Conference on Neural Information Processing (ICONIP 2009), Bangkok, Thailand, 1–5 December 2009. Lecture Notes in Computer Science, vol. 5863. Berlin, Heidelberg: Springer. pp. 185–193. doi:10.1007/978-3-642-10677-4_21.
  5. Gorban, Alexander N. (29 April 2014). "General H-theorem and Entropies that Violate the Second Law". Entropy. 16 (5): 2408–2432. arXiv:1212.6767. Bibcode:2014Entrp..16.2408G. doi:10.3390/e16052408.
  6. Jiao, Jiantao; Courtade, Thomas; No, Albert; Venkat, Kartik; Weissman, Tsachy (December 2014). "Information Measures: the Curious Case of the Binary Alphabet". IEEE Transactions on Information Theory. 60 (12): 7616–7626. arXiv:1404.6810. doi:10.1109/TIT.2014.2360184. ISSN 0018-9448. S2CID 13108908.
  7. Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698 [cs.IT].
  8. Soklakov, Andrei N. (2020). "Economics of Disagreement—Financial Intuition for the Rényi Divergence". Entropy. 22 (8): 860. arXiv:1811.08308. Bibcode:2020Entrp..22..860S. doi:10.3390/e22080860. PMC 7517462. PMID 33286632.