Uniform convergence in probability

Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.

The law of large numbers says that, for each single event, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class from one and the same sample. Moreover, it is required that the relative frequency of the events converge to the probability uniformly over the entire class of events.[1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event family is sufficiently simple (its VC dimension is sufficiently small), then uniform convergence holds.

Definitions

For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_1, x_2, \dots, x_m)$, where $x_i \in X$, the empirical frequency of $h \in H$ on $x$ is

$\widehat{Q}_x(h) = \frac{1}{m}|\{\, i : 1 \le i \le m,\ h(x_i) = 1 \,\}|.$

The theoretical probability of $h \in H$ is defined as $Q_P(h) = P\{\, y \in X : h(y) = 1 \,\}.$
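
As a hedged illustration of these two definitions (the predicate, distribution, and sample size below are assumptions chosen for the example, not from the source), the following sketch estimates the empirical frequency of a single threshold predicate under the uniform distribution on $[0, 1]$ and compares it with the theoretical probability.

```python
import random

def h(y):
    # Hypothetical predicate on X = [0, 1]: h(y) = 1 iff y <= 0.3.
    return 1 if y <= 0.3 else 0

random.seed(0)
m = 10_000
x = [random.random() for _ in range(m)]         # m i.i.d. draws from P = Uniform[0, 1]

empirical_frequency = sum(h(y) for y in x) / m  # \hat{Q}_x(h)
theoretical_probability = 0.3                   # Q_P(h) = P{y : h(y) = 1}

print(empirical_frequency, theoretical_probability)
```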

The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability.[2]

Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.

The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function.

Uniform convergence theorem

The statement of the uniform convergence theorem is as follows:[3]

If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon > 0$ and $m$ a positive integer, we have:

$P^m\{\, |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H \,\} \le 4\,\Pi_H(2m)\, e^{-\varepsilon^2 m/8},$
where, for any $x \in X^m$,
$Q_P(h) = P\{\, y \in X : h(y) = 1 \,\},$
$\widehat{Q}_x(h) = \frac{1}{m}|\{\, i : 1 \le i \le m,\ h(x_i) = 1 \,\}|$
and $|x| = m$. $P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
$\Pi_H$ is defined as follows: for any $\{0,1\}$-valued function class $H$ over $X$ and any finite $D \subseteq X$,
$\Pi_H(D) = \{\, h|_D : h \in H \,\},$ the set of distinct restrictions of functions in $H$ to $D$.

And for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as:

$\Pi_H(m) = \max\{\, |\Pi_H(D)| : D \subseteq X,\ |D| = m \,\}.$
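
As a small illustrative sketch (the threshold class and the random choice of candidate sets $D$ are assumptions made for the example, not part of the definition), the following code estimates $\Pi_H(m)$ for the class of threshold predicates $h_a(y) = 1[y \le a]$ on $[0, 1]$, for which the shattering number equals $m + 1$.

```python
import random

def threshold_restrictions(points):
    """Distinct 0/1 labelings that threshold predicates h_a(y) = 1[y <= a] induce on `points`."""
    patterns = set()
    # It suffices to try one threshold below all points and one at each sample point.
    for a in [min(points) - 1.0] + sorted(points):
        patterns.add(tuple(1 if y <= a else 0 for y in points))
    return patterns

random.seed(0)
m = 6
# Maximise |Pi_H(D)| over randomly chosen m-point sets D (any set of distinct points attains the maximum here).
shattering_number = max(len(threshold_restrictions([random.random() for _ in range(m)]))
                        for _ in range(200))
print(shattering_number)   # expected output: 7, i.e. m + 1
```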

From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem, we state the Sauer–Shelah lemma, which will be needed in the proof.

Sauer–Shelah lemma

The Sauer–Shelah lemma[4] relates the shattering number $\Pi_H(m)$ to the VC dimension.

Lemma: $\Pi_H(m) \le \sum_{i=0}^{d}\binom{m}{i}$, where $d$ is the VC dimension of the concept class $H$.

Corollary: $\Pi_H(m) \le m^d + 1$, so the shattering number grows only polynomially in $m$ when the VC dimension $d$ is finite.
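
To illustrate how the corollary is typically used together with the theorem (the VC dimension, accuracy, and confidence values below are assumptions made for the example), the sketch searches for a sample size $m$ at which the right-hand side $4\Pi_H(2m)e^{-\varepsilon^2 m/8}$, bounded via $\Pi_H(2m) \le (2m)^d + 1$, drops below a target failure probability $\delta$.

```python
import math

def sample_size(d, eps, delta):
    """Smallest m (found by direct search) with 4 * ((2m)^d + 1) * exp(-eps^2 * m / 8) <= delta."""
    m = 1
    while 4 * ((2 * m) ** d + 1) * math.exp(-eps ** 2 * m / 8) > delta:
        m += 1
    return m

# Assumed example values: VC dimension d = 3, accuracy eps = 0.1, confidence delta = 0.05.
print(sample_size(d=3, eps=0.1, delta=0.05))
```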

Proof of uniform convergence theorem

The proof below follows [1] and [3]. Before going into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview.

  1. Symmetrization: We transform the problem of analyzing $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$ for some $h \in H$ into the problem of analyzing $|\widehat{Q}_x(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$, where $x$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $x$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$ (see the illustrative sketch after this list).
  2. Permutation: Since $x$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $x$ and $s$. So, we will try to bound the probability of $|\widehat{Q}_x(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x\|s$. Specifically, we consider permutations $\sigma$ which swap $x_i$ and $s_i$ for $i$ in some subset of $\{1, 2, \dots, m\}$. The symbol $x\|s$ means the concatenation of $x$ and $s$.
  3. Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
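
The following Monte Carlo sketch illustrates the quantities handled by the symmetrization step (the finite grid of thresholds, the distribution, and the sample size are assumptions made for the example): it compares the largest deviation of empirical from true frequencies over the class with the largest deviation between the two halves of a double sample.

```python
import random

random.seed(0)
m = 200
thresholds = [k / 10 for k in range(1, 10)]   # hypothetical finite class h_a(y) = 1[y <= a]

def emp(sample, a):
    # Empirical frequency of h_a on the sample.
    return sum(1 for y in sample if y <= a) / len(sample)

x = [random.random() for _ in range(m)]   # original sample, P = Uniform[0, 1]
s = [random.random() for _ in range(m)]   # independent testing sample

# sup_h |Q_P(h) - \hat{Q}_x(h)|; for Uniform[0, 1], Q_P(h_a) = a.
sup_true_deviation = max(abs(a - emp(x, a)) for a in thresholds)
# sup_h |\hat{Q}_x(h) - \hat{Q}_s(h)|, the two-sample quantity of the symmetrization step.
sup_two_sample_deviation = max(abs(emp(x, a) - emp(s, a)) for a in thresholds)

print(sup_true_deviation, sup_two_sample_deviation)
```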

We present the technical details of the proof.

Symmetrization

Lemma: Let $V = \{\, x \in X^m : |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H \,\}$ and

$R = \{\, (x, s) \in X^m \times X^m : |\widehat{Q}_x(h) - \widehat{Q}_s(h)| \ge \tfrac{\varepsilon}{2} \text{ for some } h \in H \,\}.$

Then for $m \ge \frac{2}{\varepsilon^2}$, $P^m(V) \le 2P^{2m}(R)$.

Proof: By the triangle inequality,
if $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$ and $|Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}$ then $|\widehat{Q}_x(h) - \widehat{Q}_s(h)| \ge \frac{\varepsilon}{2}$.

Therefore,

$P^{2m}(R) \ge P^{2m}\{\, (x, s) : \exists h \in H,\ |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \tfrac{\varepsilon}{2} \,\},$

since $x$ and $s$ are independent.

Now for $x \in V$ fix an $h \in H$ such that $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$. For this $h$, we shall show that

$P^m\left\{\, s : |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2} \,\right\} \ge \frac{1}{2}.$

Thus for any $x \in V$, $P^m\{\, s : (x, s) \in R \,\} \ge \frac{1}{2}$ and hence $P^{2m}(R) \ge \frac{1}{2}P^m(V)$. This establishes the first step of the high-level idea.

Notice that $m \cdot \widehat{Q}_s(h)$ is a binomial random variable with expectation $m \cdot Q_P(h)$ and variance $m \cdot Q_P(h)(1 - Q_P(h))$. By Chebyshev's inequality we get

$P^m\left\{\, |Q_P(h) - \widehat{Q}_s(h)| > \frac{\varepsilon}{2} \,\right\} \le \frac{m \cdot Q_P(h)(1 - Q_P(h))}{(\varepsilon m / 2)^2} \le \frac{1}{\varepsilon^2 m} \le \frac{1}{2}$

for the mentioned bound on $m$. Here we use the fact that $z(1 - z) \le \frac{1}{4}$ for any $z$.
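
A quick numeric check of the Chebyshev step (illustrative only; the value of $\varepsilon$ is an assumption): for $m \ge 2/\varepsilon^2$, the bound $\frac{m\, q(1-q)}{(\varepsilon m/2)^2}$ stays at or below $\frac{1}{2}$ for every value of $q = Q_P(h)$.

```python
import math

def chebyshev_bound(q, eps, m):
    """Chebyshev bound on P{ |Q_P(h) - \hat{Q}_s(h)| > eps/2 } when m * \hat{Q}_s(h) ~ Binomial(m, q)."""
    return (m * q * (1 - q)) / (eps * m / 2) ** 2

eps = 0.1
m = math.ceil(2 / eps ** 2)     # the sample-size assumption of the lemma
worst = max(chebyshev_bound(q / 100, eps, m) for q in range(101))
print(m, worst)                  # worst case occurs at q = 1/2 and equals 1/2 here
```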

Permutations

Let $\Gamma_m$ be the set of all permutations of $\{1, 2, 3, \dots, 2m\}$ that swap $i$ and $m + i$ for all $i$ in some subset of $\{1, 2, \dots, m\}$.

Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,

$P^{2m}(R) = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}} \Pr[\sigma(x) \in R],$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.

Proof: For any $\sigma \in \Gamma_m$,

$P^{2m}(R) = P^{2m}\{\, x : \sigma(x) \in R \,\}$

(since coordinate permutations preserve the product distribution $P^{2m}$).

$P^{2m}(R) = \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} P^{2m}\{\, x : \sigma(x) \in R \,\} = E\left[\frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} 1_R(\sigma(x))\right] = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}} \Pr[\sigma(x) \in R].$

The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
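
To make the permutation family $\Gamma_m$ concrete, the short sketch below (the sample entries are placeholders) enumerates all $2^m$ swap permutations for a small $m$ and applies one of them to a concatenated double sample.

```python
import itertools
import random

def swap_permutations(m):
    """Yield every permutation of indices 0..2m-1 that swaps i and m+i for i in some subset."""
    for mask in itertools.product([False, True], repeat=m):
        perm = list(range(2 * m))
        for i, swap in enumerate(mask):
            if swap:
                perm[i], perm[m + i] = perm[m + i], perm[i]
        yield perm

m = 3
x = ["x1", "x2", "x3", "s1", "s2", "s3"]   # placeholder concatenated double sample x || s
perms = list(swap_permutations(m))
print(len(perms))                           # |Gamma_m| = 2^m = 8

random.seed(0)
sigma = random.choice(perms)
print([x[sigma[i]] for i in range(2 * m)])  # the permuted sample sigma(x)
```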

Reduction to a finite class

Lemma: Building on the previous lemma,

$P^{2m}(R) \le \max_{x \in X^{2m}} \Pr[\sigma(x) \in R] \le 2\,\Pi_H(2m)\, e^{-\varepsilon^2 m / 8}.$

Proof: Write $x = (x_1, x_2, \dots, x_{2m})$ and let $t$ be the number of distinct restrictions of functions in $H$ to the points of $x$, which is at most $\Pi_H(2m)$. This means there are functions $h_1, h_2, \dots, h_t \in H$ such that for any $h \in H$ there is an $i$ between $1$ and $t$ with $h_i(x_k) = h(x_k)$ for all $1 \le k \le 2m$.

We see that $\sigma(x) \in R$ iff some $h$ in $H$ satisfies $\left|\frac{1}{m}\left(|\{\, 1 \le i \le m : h(x_{\sigma(i)}) = 1 \,\}| - |\{\, m + 1 \le i \le 2m : h(x_{\sigma(i)}) = 1 \,\}|\right)\right| \ge \frac{\varepsilon}{2}$. Hence define $w_i^j = 1$ if $h_j(x_i) = 1$ and $w_i^j = 0$ otherwise, for $1 \le i \le 2m$ and $1 \le j \le t$.

Then $\sigma(x) \in R$ iff some $j$ in $\{1, \dots, t\}$ satisfies $\left|\frac{1}{m}\left(\sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=1}^{m} w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}$. By the union bound we get

$\Pr[\sigma(x) \in R] \le t \cdot \max_{1 \le j \le t} \Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=1}^{m} w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right]$
$\le \Pi_H(2m) \cdot \max_{1 \le j \le t} \Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^{m} w_{\sigma(i)}^j - \sum_{i=1}^{m} w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right].$

Since the distribution over the permutations $\sigma$ is uniform, for each $i$ the difference $w_{\sigma(i)}^j - w_{\sigma(m+i)}^j$ equals $|w_i^j - w_{m+i}^j|$ or $-|w_i^j - w_{m+i}^j|$ with equal probability.

Thus,

$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}\left(w_{\sigma(i)}^j - w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right] = \Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}|w_i^j - w_{m+i}^j|\,\beta_i\right| \ge \frac{\varepsilon}{2}\right],$

where the probability on the right is over the independent uniform $\pm 1$ random variables $\beta_i$, and both possibilities are equally likely. By Hoeffding's inequality, this is at most $2e^{-m\varepsilon^2/8}$.
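
As an illustrative check of this final Hoeffding step (the coefficients, sample size, and $\varepsilon$ below are assumptions made for the example), the sketch compares a simulated tail probability for sums of independent uniform $\pm 1$ signs with the bound $2e^{-m\varepsilon^2/8}$.

```python
import math
import random

random.seed(0)
m, eps, trials = 100, 0.5, 20_000
coeffs = [random.randint(0, 1) for _ in range(m)]   # plays the role of |w_i^j - w_{m+i}^j|

hits = 0
for _ in range(trials):
    betas = [random.choice((-1, 1)) for _ in range(m)]              # uniform random signs
    if abs(sum(c * b for c, b in zip(coeffs, betas))) / m >= eps / 2:
        hits += 1

print(hits / trials, 2 * math.exp(-m * eps ** 2 / 8))   # empirical tail vs. Hoeffding bound
```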

Finally, combining all three parts of the proof, $P^m(V) \le 2P^{2m}(R) \le 4\,\Pi_H(2m)\, e^{-\varepsilon^2 m / 8}$, which is the Uniform Convergence Theorem.

References

  1. ^ a b c Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025. This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
  2. ^ Williams, David (1991). "Martingales". Probability with Martingales. Cambridge University Press. pp. 93–105. doi:10.1017/cbo9780511813658.014. ISBN 978-0-521-40455-6.
  3. ^ a b Anthony, Martin; Bartlett, Peter L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press. pp. 46–50. ISBN 0-521-57353-X.
  4. ^ Kakade, Sham; Tewari, Ambuj (Spring 2008). CMSC 35900 Learning Theory, Lecture 11.