In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If P and Q are probability distributions on the real line, such that P is absolutely continuous with respect to Q, i.e. P << Q, and whose first moments exist, then
\[ D_{KL}(P\parallel Q)\geq \Psi_{Q}^{*}(\mu'_{1}(P)), \]
where \(\Psi_{Q}^{*}\) is the rate function, i.e. the convex conjugate of the cumulant-generating function, of \(Q\), and \(\mu'_{1}(P)\) is the first moment of \(P\).
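For example (a standard illustration, not part of the statement above): if \(Q\) is the standard normal distribution, then \(\Psi_{Q}(\theta)=\theta^{2}/2\) and \(\Psi_{Q}^{*}(\mu)=\mu^{2}/2\), so for any \(P\ll Q\) whose first moment exists,
\[ D_{KL}(P\parallel Q)\geq \tfrac{1}{2}\,\mu'_{1}(P)^{2}. \]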
The Cramér–Rao bound is a corollary of this result.
Proof of the theorem

Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P << Q. Consider the natural exponential family of Q given by
\[ Q_{\theta}(A)=\frac{\int_{A}e^{\theta x}\,Q(dx)}{\int_{-\infty}^{\infty}e^{\theta x}\,Q(dx)}=\frac{1}{M_{Q}(\theta)}\int_{A}e^{\theta x}\,Q(dx) \]
for every measurable set A, where \(M_{Q}\) is the moment-generating function of Q. (Note that \(Q_{0}=Q\).) Then
\[ D_{KL}(P\parallel Q)=D_{KL}(P\parallel Q_{\theta})+\int_{\operatorname{supp}P}\left(\log\frac{\mathrm{d}Q_{\theta}}{\mathrm{d}Q}\right)\mathrm{d}P. \]
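One way to see this decomposition (assuming, as above, that the relevant Radon–Nikodym derivatives exist on supp P) is the chain rule \(\frac{\mathrm{d}P}{\mathrm{d}Q}=\frac{\mathrm{d}P}{\mathrm{d}Q_{\theta}}\cdot\frac{\mathrm{d}Q_{\theta}}{\mathrm{d}Q}\), which gives
\[ \int_{\operatorname{supp}P}\log\frac{\mathrm{d}P}{\mathrm{d}Q}\,\mathrm{d}P=\int_{\operatorname{supp}P}\log\frac{\mathrm{d}P}{\mathrm{d}Q_{\theta}}\,\mathrm{d}P+\int_{\operatorname{supp}P}\log\frac{\mathrm{d}Q_{\theta}}{\mathrm{d}Q}\,\mathrm{d}P. \]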
By Gibbs' inequality we have
\[ D_{KL}(P\parallel Q_{\theta})\geq 0 \]
so that
\[ D_{KL}(P\parallel Q)\geq \int_{\operatorname{supp}P}\left(\log\frac{\mathrm{d}Q_{\theta}}{\mathrm{d}Q}\right)\mathrm{d}P=\int_{\operatorname{supp}P}\left(\log\frac{e^{\theta x}}{M_{Q}(\theta)}\right)P(dx). \]
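Here the second equality uses the definition of \(Q_{\theta}\), namely \(\frac{\mathrm{d}Q_{\theta}}{\mathrm{d}Q}(x)=\frac{e^{\theta x}}{M_{Q}(\theta)}\); writing out the logarithm and integrating term by term (P being a probability measure) gives
\[ \int_{\operatorname{supp}P}\left(\theta x-\log M_{Q}(\theta)\right)P(dx)=\theta\,\mu'_{1}(P)-\log M_{Q}(\theta). \]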
Simplifying the right side, we have, for every real θ where \(M_{Q}(\theta)<\infty\):
\[ D_{KL}(P\parallel Q)\geq \mu'_{1}(P)\,\theta-\Psi_{Q}(\theta), \]
where \(\mu'_{1}(P)\) is the first moment, or mean, of P, and \(\Psi_{Q}=\log M_{Q}\) is called the cumulant-generating function. Taking the supremum completes the process of convex conjugation and yields the rate function:
\[ D_{KL}(P\parallel Q)\geq \sup_{\theta}\left\{\mu'_{1}(P)\,\theta-\Psi_{Q}(\theta)\right\}=\Psi_{Q}^{*}(\mu'_{1}(P)). \]
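As a further illustration (not part of the proof): if Q is a Bernoulli(q) distribution, then \(\Psi_{Q}(\theta)=\log(1-q+qe^{\theta})\), and its convex conjugate is the binary relative entropy
\[ \Psi_{Q}^{*}(p)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q},\qquad 0\leq p\leq 1. \]
Any P with P << Q is itself a Bernoulli(p) distribution with mean p, and \(D_{KL}(P\parallel Q)\) equals exactly this value, so Kullback's inequality holds with equality in this case.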
Corollary: the Cramér–Rao bound
Start with Kullback's inequality
Let \(X_{\theta}\) be a family of probability distributions on the real line indexed by the real parameter θ, and satisfying certain regularity conditions. Then
\[ \lim_{h\to 0}\frac{D_{KL}(X_{\theta+h}\parallel X_{\theta})}{h^{2}}\geq \lim_{h\to 0}\frac{\Psi_{\theta}^{*}(\mu_{\theta+h})}{h^{2}}, \]
where \(\Psi_{\theta}^{*}\) is the convex conjugate of the cumulant-generating function of \(X_{\theta}\) and \(\mu_{\theta+h}\) is the first moment of \(X_{\theta+h}\).
The left side of this inequality can be simplified as follows:
\[
\begin{aligned}
\lim_{h\to 0}\frac{D_{KL}(X_{\theta+h}\parallel X_{\theta})}{h^{2}}
&=\lim_{h\to 0}\frac{1}{h^{2}}\int_{-\infty}^{\infty}\log\left(\frac{\mathrm{d}X_{\theta+h}}{\mathrm{d}X_{\theta}}\right)\mathrm{d}X_{\theta+h}\\
&=-\lim_{h\to 0}\frac{1}{h^{2}}\int_{-\infty}^{\infty}\log\left(\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)\mathrm{d}X_{\theta+h}\\
&=-\lim_{h\to 0}\frac{1}{h^{2}}\int_{-\infty}^{\infty}\log\left(1-\left(1-\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)\right)\mathrm{d}X_{\theta+h}\\
&=\lim_{h\to 0}\frac{1}{h^{2}}\int_{-\infty}^{\infty}\left[\left(1-\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)+\frac{1}{2}\left(1-\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)^{2}+o\left(\left(1-\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)^{2}\right)\right]\mathrm{d}X_{\theta+h} && \text{Taylor series for }\log(1-t)\\
&=\lim_{h\to 0}\frac{1}{h^{2}}\int_{-\infty}^{\infty}\left[\frac{1}{2}\left(1-\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)^{2}\right]\mathrm{d}X_{\theta+h}\\
&=\lim_{h\to 0}\frac{1}{h^{2}}\int_{-\infty}^{\infty}\left[\frac{1}{2}\left(\frac{\mathrm{d}X_{\theta+h}-\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)^{2}\right]\mathrm{d}X_{\theta+h}\\
&=\frac{1}{2}\mathcal{I}_{X}(\theta)
\end{aligned}
\]
which is half the Fisher information of the parameter θ.
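The first-order term in the Taylor expansion above integrates to zero because both \(X_{\theta}\) and \(X_{\theta+h}\) are probability measures:
\[ \int_{-\infty}^{\infty}\left(1-\frac{\mathrm{d}X_{\theta}}{\mathrm{d}X_{\theta+h}}\right)\mathrm{d}X_{\theta+h}=1-1=0. \]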
The right side of the inequality can be developed as follows:
\[ \lim_{h\to 0}\frac{\Psi_{\theta}^{*}(\mu_{\theta+h})}{h^{2}}=\lim_{h\to 0}\frac{1}{h^{2}}\sup_{t}\{\mu_{\theta+h}\,t-\Psi_{\theta}(t)\}. \]
This supremum is attained at a value t = τ where the first derivative of the cumulant-generating function satisfies
\[ \Psi'_{\theta}(\tau)=\mu_{\theta+h}, \]
but we have
\[ \Psi'_{\theta}(0)=\mu_{\theta}, \]
so that
\[ \Psi''_{\theta}(0)=\frac{d\mu_{\theta}}{d\theta}\lim_{h\to 0}\frac{h}{\tau}. \]
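This last identity can be read as a difference quotient: since \(\Psi'_{\theta}(\tau)=\mu_{\theta+h}\), \(\Psi'_{\theta}(0)=\mu_{\theta}\), and \(\tau\to 0\) as \(h\to 0\),
\[ \Psi''_{\theta}(0)=\lim_{h\to 0}\frac{\Psi'_{\theta}(\tau)-\Psi'_{\theta}(0)}{\tau}=\lim_{h\to 0}\frac{\mu_{\theta+h}-\mu_{\theta}}{h}\cdot\frac{h}{\tau}=\frac{d\mu_{\theta}}{d\theta}\lim_{h\to 0}\frac{h}{\tau}. \]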
Moreover,
\[ \lim_{h\to 0}\frac{\Psi_{\theta}^{*}(\mu_{\theta+h})}{h^{2}}=\frac{1}{2\Psi''_{\theta}(0)}\left(\frac{d\mu_{\theta}}{d\theta}\right)^{2}=\frac{1}{2\operatorname{Var}(X_{\theta})}\left(\frac{d\mu_{\theta}}{d\theta}\right)^{2}. \]
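One way to obtain the first equality (a sketch, under the same regularity assumptions) is to expand \(\Psi_{\theta}\) to second order around 0, which gives \(\Psi_{\theta}^{*}(\mu_{\theta+h})=\tfrac{1}{2}\Psi''_{\theta}(0)\,\tau^{2}+o(\tau^{2})\), and then use the preceding relation between h and τ. The second equality uses the fact that the second derivative of the cumulant-generating function at zero is the second cumulant:
\[ \Psi''_{\theta}(0)=\operatorname{Var}(X_{\theta}). \]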
Putting both sides back together
We have:
\[ \tfrac{1}{2}\mathcal{I}_{X}(\theta)\geq \frac{1}{2\operatorname{Var}(X_{\theta})}\left(\frac{d\mu_{\theta}}{d\theta}\right)^{2}, \]
which can be rearranged as:
\[ \operatorname{Var}(X_{\theta})\geq \frac{(d\mu_{\theta}/d\theta)^{2}}{\mathcal{I}_{X}(\theta)}. \]
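As a sanity check (an illustration, not part of the derivation): for the normal location family \(X_{\theta}=N(\theta,\sigma^{2})\) with fixed variance, \(\mu_{\theta}=\theta\) and \(\mathcal{I}_{X}(\theta)=1/\sigma^{2}\), so the bound reads
\[ \operatorname{Var}(X_{\theta})\geq \frac{1}{1/\sigma^{2}}=\sigma^{2}, \]
which holds with equality.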
Notes and references