This section is concerned with the convergence of probability distributions, a topic of basic importance in probability theory. Since we will be almost exclusively concerned with the convergence of sequences of various kinds, it's helpful to introduce the notation \(\N_+^* = \N_+ \cup \{\infty\} = \{1, 2, \ldots\} \cup \{\infty\}\).
We start with the most important and basic setting, the measurable space \((\R, \ms R)\), where \(\R\) is the set of real numbers of course, and \(\ms R\) is the Borel \(\sigma\)-algebra of subsets of \(\R\). Since this \(\sigma\)-algebra is understood, we will usually not bother to reference it explicitly. Recall that if \(P\) is a probability measure on \(\R\), then the function \(F: \R \to [0, 1]\) defined by \(F(x) = P(-\infty, x]\) for \(x \in \R\) is the (cumulative) distribution function of \(P\). Recall also that \(F\) completely determines \(P\). Here is the definition for convergence of probability measures in this setting:
Suppose \(P_n\) is a probability measure on \(\R\) with distribution function \(F_n\) for each \(n \in \N_+^*\). Then \(P_n\) converges (weakly) to \(P_\infty\) as \(n \to \infty\) if \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for every \(x \in \R\) where \(F_\infty\) is continuous. We write \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).
Recall that the distribution function \(F\) of a probability measure \(P\) is continuous at \(x \in \R\) if and only if \(P\{x\} = 0\), so that \(x\) is not an atom of the distribution (a point of positive probability). We will see shortly why this condition on \(F_\infty\) is appropriate. Of course, a probability measure on \(\R\) is usually associated with a real-valued random variable for some random experiment that is modeled by a probability space \((\Omega, \ms F, \P)\). So to review, \(\Omega\) is the set of outcomes, \(\ms F\) is the \(\sigma\)-algebra of events and \(\P\) is the probability measure on the sample space \((\Omega, \ms F)\). If \(X\) is a real-valued random variable defined on the probability space, then the probability distribution of \(X\) is the probability measure \(P\) on \(\R\) defined by \(P(A) = \P(X \in A)\) for \(A \in \ms R\), and then of course, the distribution function of \(X\) is the function \(F\) defined by \(F(x) = \P(X \le x)\) for \(x \in \R\). Here is the convergence terminology used in this setting:
Suppose that \(X_n\) is a real-valued random variable with distribution \(P_n\) for each \(n \in \N_+^*\). If \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) then we say that \(X_n\) converges in distribution to \(X_\infty\) as \(n \to \infty\). We write \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
So if \(F_n\) is the distribution function of \(X_n\) for \(n \in \N_+^*\), then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution if \(F_n(x) \to F_\infty(x)\) at every point \(x \in \R\) where \(F_\infty\) is continuous. On the one hand, the terminology and notation are helpful, since again most probability measures are associated with random variables (and every probability measure can be). On the other hand, the terminology and notation can be a bit misleading since the random variables, as functions, do not converge in any sense, and indeed the random variables need not be defined on the same probability spaces. It is only the distributions that converge. However, often the random variables are defined on the same probability space \((\Omega, \ms F, \P)\), in which case we can compare convergence in distribution with the other modes of convergence we have studied or will study: convergence with probability 1, convergence in probability, and convergence in mean.
We will show, in fact, that convergence in distribution is the weakest of all of these modes of convergence. However, strength of convergence should not be confused with importance. Convergence in distribution is one of the most important modes of convergence; the central limit theorem, one of the two fundamental theorems of probability, is a theorem about convergence in distribution.
The examples below show why the definition is given in terms of distribution functions, rather than probability density functions, and why convergence is only required at the points of continuity of the limiting distribution function. Note that the distributions considered are probability measures on \((\R, \ms R)\), even though the support of the distribution may be a much smaller subset. For the first example, note that if a deterministic sequence converges in the ordinary calculus sense, then naturally we want the sequence (thought of as random variables) to converge in distribution.
Suppose that \(x_n \in \R\) for \(n \in \N_+^*\). Define the random variable \(X_n = x_n\) with probability 1 for each \(n \in \N_+^*\). Then \(x_n \to x_\infty\) as \(n \to \infty\) if and only if \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
For \(n \in \N_+^*\), the distribution function \(F_n\) of \(X_n\) is given by \(F_n(x) = 0\) for \(x \lt x_n\) and \(F_n(x) = 1\) for \(x \ge x_n\). Suppose first that \(x_n \to x_\infty\) as \(n \to \infty\). If \(x \lt x_\infty\) then \(x \lt x_n\), and if \(x \gt x_\infty\) then \(x \ge x_n\), for all but finitely many \(n \in \N_+\). Hence \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for every \(x \ne x_\infty\), and \(x_\infty\) is the only point where \(F_\infty\) fails to be continuous. Conversely, suppose that \(X_n \to X_\infty\) as \(n \to \infty\) in distribution. If \(x \lt x_\infty\) then \(F_n(x) \to 0\), so \(x_n \gt x\) for all but finitely many \(n\); if \(x \gt x_\infty\) then \(F_n(x) \to 1\), so \(x_n \le x\) for all but finitely many \(n\). Hence \(x_n \to x_\infty\) as \(n \to \infty\).
The proof is finished, but let's look at the probability density functions to see that these are not the proper objects of study. For \(n \in \N_+^*\), the density function \(f_n\) of \(X_n\) (with respect to counting measure) is given by \(f_n(x_n) = 1\) and \(f_n(x) = 0\) for \(x \in \R \setminus \{x_n\}\). Only when \(x_n = x_\infty\) for all but finitely many \(n \in \N_+\) do we have \(f_n(x) \to f_\infty(x)\) for \(x \in \R\).
For the example below, recall that \( \Q \) denotes the set of rational numbers.
For \(n \in \N_+\), let \(P_n\) denote the discrete uniform distribution on \(\left\{\frac{1}{n}, \frac{2}{n}, \ldots \frac{n-1}{n}, 1\right\}\) and let \(P_\infty\) denote the continuous uniform distribution on the interval \([0, 1]\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).
As usual, let \(F_n\) denote the distribution function of \(P_n\) for \(n \in \N_+^*\). For \(n \in \N_+\) and \(x \in [0, 1]\) we have \(F_n(x) = \lfloor n x \rfloor / n\), while \(F_\infty(x) = x\). Since \(n x - 1 \lt \lfloor n x \rfloor \le n x\), it follows that \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for \(x \in [0, 1]\). Of course, \(F_n(x) = F_\infty(x) = 0\) for \(x \lt 0\) and \(F_n(x) = F_\infty(x) = 1\) for \(x \gt 1\), so in fact \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) for every \(x \in \R\).
The proof is finished, but let's look at the probability density functions. For \(n \in \N_+\), the density function \(f_n\) of \(P_n\) (again with respect to counting measure) is given by \(f_n(x) = \frac{1}{n}\) for \(x \in \left\{\frac{1}{n}, \frac{2}{n}, \ldots \frac{n-1}{n}, 1\right\}\) and \(f_n(x) = 0\) otherwise. Hence \( 0 \le f_n(x) \le \frac{1}{n} \) for \(n \in \N_+\) and \(x \in \R\), so \(f_n(x) \to 0\) as \(n \to \infty\) for every \( x \in \R \).
The point of the example is that it's reasonable for the discrete uniform distribution on \(\left\{\frac{1}{n}, \frac{2}{n}, \ldots \frac{n-1}{n}, 1\right\}\) to converge to the continuous uniform distribution on \([0, 1]\), but once again, the probability density functions are evidently not the correct objects of study. Note also that \(P_n(\Q) = 1\) for every \(n \in \N_+\) while \(P_\infty(\Q) = 0\), so we cannot expect \(P_n(A) \to P_\infty(A)\) for every measurable set \(A\).
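To make the contrast concrete, here is a minimal numerical sketch (in Python with NumPy; the helper names are ours and not part of the text) showing that the distribution functions converge pointwise while the density values collapse to 0:

```python
import numpy as np

def F_n(x, n):
    # distribution function of the discrete uniform distribution on {1/n, 2/n, ..., 1}
    return np.clip(np.floor(n * x) / n, 0.0, 1.0)

def F_inf(x):
    # distribution function of the continuous uniform distribution on [0, 1]
    return np.clip(x, 0.0, 1.0)

x = np.array([0.1, 0.25, 1/3, 0.5, 0.9])
for n in [10, 100, 1000, 10000]:
    print(n, np.max(np.abs(F_n(x, n) - F_inf(x))))  # -> 0 as n grows

# The density of P_n (with respect to counting measure) is at most 1/n at any point,
# so it converges to 0 everywhere -- not to the density of the continuous uniform distribution.
```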
As the example shows, it is quite possible for a sequence of discrete distributions to converge to a continuous distribution (or the other way around). Recall that probability density functions have very different meanings in the discrete and continuous cases: density with respect to counting measure in the first case, and density with respect to Lebesgue measure in the second case. This is another indication that distribution functions, rather than density functions, are the correct objects of study. However, if probability density functions of a fixed type converge, then the distributions converge. Recall again that we are thinking of our probability distributions as measures on \(\R\), even when they are supported on a smaller subset of \(\R\).
Convergence in distribution in terms of probability density functions: suppose that \(P_n\) is a probability measure on \(\R\) with density function \(f_n\) for each \(n \in \N_+^*\), where the densities are either all discrete (with respect to counting measure on a countable set) or all continuous (with respect to Lebesgue measure). If \(f_n(x) \to f_\infty(x)\) as \(n \to \infty\) for all \(x\) (or, in the continuous case, for almost all \(x\)), then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).
Naturally, we would like to compare convergence in distribution with other modes of convergence we have studied.
Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+^*\), all defined on the same probability space. If \(X_n \to X_\infty\) as \(n \to \infty\) in probability then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
Let \(F_n\) denote the distribution function of \(X_n\) for \(n \in \N_+^*\). Fix \(\epsilon \gt 0\). Note first that \(\P(X_n \le x) = \P(X_n \le x, X_\infty \le x + \epsilon) + \P(X_n \le x, X_\infty \gt x + \epsilon) \). Hence \(F_n(x) \le F_\infty(x + \epsilon) + \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right)\). Next, note that \(\P(X_\infty \le x - \epsilon) = \P(X_\infty \le x - \epsilon, X_n \le x) + \P(X_\infty \le x - \epsilon, X_n \gt x)\). Hence \(F_\infty(x - \epsilon) \le F_n(x) + \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right)\). From the last two results it follows that \[ F_\infty(x - \epsilon) - \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right) \le F_n(x) \le F_\infty(x + \epsilon) + \P\left(\left|X_n - X_\infty\right| \gt \epsilon\right) \] Letting \(n \to \infty\) and using convergence in probability gives \[ F_\infty(x - \epsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le F_\infty(x + \epsilon) \] Finally, letting \(\epsilon \downarrow 0\) we see that if \(F_\infty\) is continuous at \(x\) then \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\).
Our next example shows that even when the variables are defined on the same probability space, a sequence can converge in distribution, but not in any other way.
Let \(X\) be an indicator variable with \(\P(X = 0) = \P(X = 1) = \frac{1}{2}\), so that \(X\) is the result of tossing a fair coin. Let \(X_n = 1 - X \) for \(n \in \N_+\). Then \(X_n\) has the same distribution as \(X\) for every \(n \in \N_+\), so trivially \(X_n \to X\) as \(n \to \infty\) in distribution. But \(\left|X_n - X\right| = 1\) for every \(n \in \N_+\), so \(X_n\) does not converge to \(X\) as \(n \to \infty\) in probability, with probability 1, or in mean.
The critical fact that makes this counterexample work is that \(1 - X\) has the same distribution as \(X\). Any random variable with this property would work just as well, so if you prefer a counterexample with continuous distributions, let \(X\) have probability density function \(f\) given by \(f(x) = 6 x (1 - x)\) for \(0 \le x \le 1\). The distribution of \(X\) is an example of a beta distribution.
The following summary gives the implications for the various modes of convergence; no other implications hold in general.
Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+^*\), all defined on a common probability space.
(a) If \(X_n \to X_\infty\) as \(n \to \infty\) with probability 1 then \(X_n \to X_\infty\) as \(n \to \infty\) in probability.
(b) If \(X_n \to X_\infty\) as \(n \to \infty\) in mean then \(X_n \to X_\infty\) as \(n \to \infty\) in probability.
(c) If \(X_n \to X_\infty\) as \(n \to \infty\) in probability then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
It follows that convergence with probability 1, convergence in probability, and convergence in mean all imply convergence in distribution, so the latter mode of convergence is indeed the weakest. However, our next theorem gives an important converse to part (c), when the limiting variable is a constant. Of course, a constant can be viewed as a random variable defined on any probability space.
Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+\), defined on the same probability space, and that \(c \in \R\). If \(X_n \to c\) as \(n \to \infty\) in distribution then \(X_n \to c\) as \(n \to \infty\) in probability.
Assume that the probability space is \((\Omega, \ms F, \P)\). The distribution function of the constant \(c\) is given by \(F_\infty(x) = 0\) for \(x \lt c\) and \(F_\infty(x) = 1\) for \(x \ge c\), and this function is continuous everywhere except at \(c\). Hence \(\P(X_n \le x) \to 0\) as \(n \to \infty\) if \(x \lt c\) and \(\P(X_n \le x) \to 1\) as \(n \to \infty\) if \(x \gt c\). For \(\epsilon \gt 0\), \[ \P\left(\left|X_n - c\right| \le \epsilon\right) \ge \P(X_n \le c + \epsilon) - \P(X_n \le c - \epsilon) \to 1 - 0 = 1 \text{ as } n \to \infty \] so \(X_n \to c\) as \(n \to \infty\) in probability.
As noted in the summary above, convergence in distribution does not imply convergence with probability 1, even when the random variables are defined on the same probability space. However, the next theorem, known as the Skorohod representation theorem, gives an important partial result in this direction.
Suppose that \(P_n\) is a probability measure on \(\R\) for each \(n \in \N_+^*\) and that \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). Then there exist real-valued random variables \(X_n\) for \(n \in \N_+^*\), defined on the same probability space, such that \(X_n\) has distribution \(P_n\) for each \(n \in \N_+^*\), and \(X_n \to X_\infty\) as \(n \to \infty\) with probability 1.
Let \((\Omega, \ms F, \P)\) be a probability space and \(U\) a random variable defined on this space that is uniformly distributed on the interval \((0, 1)\). For a specific construction, we could take \(\Omega = (0, 1)\), \(\ms F\) the \(\sigma\)-algebra of Borel measurable subsets of \((0, 1)\), and \(\P\) Lebesgue measure on \((\Omega, \ms F)\) (the uniform distribution on \((0, 1)\)). Then let \(U\) be the identity function on \(\Omega\), so that \(U(\omega) = \omega\) for \(\omega \in \Omega\) and hence \(U\) has probability distribution \(\P\). We have seen this construction many times before. Now, for \(n \in \N_+^*\), let \(F_n\) denote the distribution function of \(P_n\) and \(F_n^{-1}\) the corresponding quantile function, and define \(X_n = F_n^{-1}(U)\). Then \(X_n\) has distribution \(P_n\) for each \(n \in \N_+^*\), and it can be shown that \(F_n^{-1}(u) \to F_\infty^{-1}(u)\) as \(n \to \infty\) at every \(u \in (0, 1)\) where \(F_\infty^{-1}\) is continuous. Since \(F_\infty^{-1}\) is increasing, its set of discontinuities is countable and hence has Lebesgue measure 0, so \(X_n \to X_\infty\) as \(n \to \infty\) with probability 1.
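Here is a minimal sketch of the quantile-function construction in Python (the exponential family and the function names are our own choices, purely for illustration): a single uniform variable \(U\) is pushed through each quantile function, so all of the constructed variables are defined on the same probability space, and pointwise convergence of the quantile functions gives convergence of the variables themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform()  # one draw of U ~ uniform(0, 1): the common source of randomness

def exp_quantile(u, rate):
    # quantile function of the exponential distribution with the given rate
    return -np.log(1.0 - u) / rate

# X_n = F_n^{-1}(U); here the exponential rates converge to 1, so the constructed
# values converge for this omega (and in fact for every omega).
for rate in [1 + 1 / n for n in (1, 2, 5, 10, 100)] + [1.0]:
    print(rate, exp_quantile(U, rate))
```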
The following theorem illustrates the value of the Skorohod representation and the usefulness of random variable notation for convergence in distribution. The theorem is also quite intuitive, since a basic idea is that continuity should preserve convergence.
Suppose that \(X_n\) is a real-valued random variable for each \(n \in \N_+^*\) (not necessarily defined on the same probability space). Suppose also that \(g: \R \to \R\) is measurable, and let \(D_g\) denote the set of discontinuities of \(g\), and \(P_\infty\) the distribution of \(X_\infty\). If \(X_n \to X_\infty\) as \(n \to \infty\) in distribution and \(P_\infty(D_g) = 0\), then \(g(X_n) \to g(X_\infty)\) as \(n \to \infty\) in distribution.
By Skorohod's theorem, there exist random variables \(Y_n\) for \(n \in \N_+^*\), defined on the same probability space \((\Omega, \ms F, \P)\), such that \(Y_n\) has the same distribution as \(X_n\) for \(n \in \N_+^*\), and \(Y_n \to Y_\infty\) as \(n \to \infty\) with probability 1. Since \(\P(Y_\infty \in D_g) = P_\infty(D_g) = 0\) it follows that \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) with probability 1. Hence, since convergence with probability 1 implies convergence in distribution, \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) in distribution. But \(g(Y_n)\) has the same distribution as \(g(X_n)\) for each \(n \in \N_+^*\), so \(g(X_n) \to g(X_\infty)\) as \(n \to \infty\) in distribution.
As a simple corollary, if \(X_n\) converges to \(X_\infty\) as \(n \to \infty\) in distribution, and if \(a, \, b \in \R\), then \(a + b X_n\) converges to \(a + b X_\infty\) as \(n \to \infty\) in distribution. But we can do a little better:
Suppose that \(X_n\) is a real-valued random variable and that \(a_n, \, b_n \in \R\) for each \(n \in \N_+^*\). If \(X_n \to X_\infty\) as \(n \to \infty\) in distribution and if \(a_n \to a_\infty\) and \(b_n \to b_\infty\) as \(n \to \infty\), then \(a_n + b_n X_n \to a_\infty + b_\infty X_\infty\) as \(n \to \infty\) in distribution.
Again by Skorohod's representation theorem, there exist random variables \(Y_n\) for \(n \in \N_+^*\), defined on the same probability space \((\Omega, \ms F, \P)\), such that \(Y_n\) has the same distribution as \(X_n\) for \(n \in \N_+^*\) and \(Y_n \to Y_\infty\) as \(n \to \infty\) with probability 1. Hence also \(a_n + b_n Y_n \to a_\infty + b_\infty Y_\infty\) as \(n \to \infty\) with probability 1, and therefore also in distribution. But \(a_n + b_n Y_n\) has the same distribution as \(a_n + b_n X_n\) for \(n \in \N_+^*\), so \(a_n + b_n X_n \to a_\infty + b_\infty X_\infty\) as \(n \to \infty\) in distribution.
The definition of convergence in distribution requires that the sequence of probability measures converge on sets of the form \((-\infty, x]\) for \(x \in \R\) when the limiting distribution has probability 0 at \(x\). It turns out that the probability measures will converge on lots of other sets as well, and this result points the way to extending convergence in distribution to more general spaces. To state the result, recall that if \(A\) is a subset of a topological space, then the boundary of \(A\) is \(\partial A = \cl(A) \setminus \interior(A)\) where \(\cl(A)\) is the closure of \(A\) (the smallest closed set that contains \(A\)) and \(\interior(A)\) is the interior of \(A\) (the largest open set contained in \(A\)).
Suppose that \(P_n\) is a probability measure on \(\R\) for \(n \in \N_+^*\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) if and only if \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) for every \(A \in \ms R\) with \(P_\infty(\partial A) = 0\).
Suppose that \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). Let \(X_n\) be a random variable with distribution \(P_n\) for \(n \in \N_+^*\). (We don't care about the underlying probability spaces.) If \(A \in \ms R\) then the set of discontinuities of \(\bs 1_A\), the indicator function of \(A\), is \(\partial A\). So, suppose \(P_\infty(\partial A) = 0\). By the continuity theorem above, \(\bs 1_A(X_n) \to \bs 1_A(X_\infty)\) as \(n \to \infty\) in distribution. Let \(G_n\) denote the distribution function of \(\bs 1_A(X_n)\) for \(n \in \N_+^*\). The only possible points of discontinuity of \(G_\infty\) are 0 and 1. Hence \(G_n\left(\frac 1 2\right) \to G_\infty\left(\frac 1 2\right) \) as \(n \to \infty\). But \(G_n\left(\frac 1 2\right) = P_n(A^c)\) for \(n \in \N_+^*\). Hence \(P_n(A^c) \to P_\infty(A^c)\) and so also \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\).
Conversely, suppose that the condition in the theorem holds. If \(x \in \R\), then the boundary of \((-\infty, x]\) is \(\{x\}\), so if \(P_\infty\{x\} = 0\) then \(P_n(-\infty, x] \to P_\infty(-\infty, x]\) as \(n \to \infty\). So by definition, \(P_n \Rightarrow P_\infty\) as \(n \to \infty\).
In the context of this result, suppose that \(a, \, b \in \R\) with \(a \lt b\). If \(P_\infty\{a\} = P_\infty\{b\} = 0\), then as \(n \to \infty\) we have \(P_n(a, b) \to P_\infty(a, b)\), \(P_n[a, b) \to P_\infty[a, b)\), \(P_n(a, b] \to P_\infty(a, b]\), and \(P_n[a, b] \to P_\infty[a, b]\). Of course, the limiting values are all the same.
Next we will explore several interesting examples of the convergence of distributions on \((\R, \ms R)\). There are several important cases where a special distribution converges to another special distribution as a parameter approaches a limiting value. Indeed, such convergence results are part of the reason why such distributions are special in the first place.
Recall that the hypergeometric distribution with parameters \(m\), \(r\), and \(n\) is the distribution that governs the number of type 1 objects in a sample of size \(n\), drawn without replacement from a population of \(m\) objects with \(r\) objects of type 1. It has discrete probability density function \(f\) given by \[ f(k) = \frac{\binom{r}{k} \binom{m - r}{n - k}}{\binom{m}{n}}, \quad k \in \{0, 1, \ldots, n\} \] The parameters \(m\), \(r\), and \(n\) are positive integers with \(n \le m\) and \(r \le m\).
Recall next that Bernoulli trials are independent trials, each with two possible outcomes, generically called success and failure. The probability of success \(p \in [0, 1]\) is the same for each trial. The binomial distribution with parameters \(n \in \N_+\) and \(p\) is the distribution of the number of successes in \(n\) Bernoulli trials. This distribution has probability density function \(g\) given by \[ g(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\} \] Note that the binomial distribution with parameters \(n\) and \(p = r / m\) is the distribution that governs the number of type 1 objects in a sample of size \(n\), drawn with replacement from a population of \(m\) objects with \(r\) objects of type 1. This fact is motivation for the following result:
Suppose that \(r_m \in \{0, 1, \ldots, m\}\) for each \(m \in \N_+\) and that \(r_m / m \to p\) as \(m \to \infty\). For fixed \(n \in \N_+\), the hypergeometric distribution with parameters \(m\), \(r_m\), and \(n\) converges to the binomial distribution with parameters \(n\) and \(p\) as \(m \to \infty\).
Recall that for \( a \in \R \) and \( j \in \N \), we let \( a^{(j)} = a \, (a - 1) \cdots [a - (j - 1)] \) denote the falling power of \( a \) of order \( j \). The hypergeometric density function can be written as \[ f_m(k) = \binom{n}{k} \frac{r_m^{(k)} (m - r_m)^{(n - k)}}{m^{(n)}}, \quad k \in \{0, 1, \ldots, n\} \] In the fraction above, the numerator and denominator both have \( n \) factors. Suppose that we group the \( k \) factors in \( r_m^{(k)} \) with the first \( k \) factors of \( m^{(n)} \), and the \( n - k \) factors of \( (m - r_m)^{(n-k)} \) with the last \( n - k \) factors of \( m^{(n)} \), to form a product of \( n \) fractions. The first \( k \) fractions have the form \( (r_m - j) \big/ (m - j) \) for some \( j \) that does not depend on \( m \). Each of these converges to \( p \) as \( m \to \infty \). The last \( n - k \) fractions have the form \( (m - r_m - j) \big/ (m - k - j) \) for some \( j \) that does not depend on \( m \). Each of these converges to \( 1 - p \) as \( m \to \infty \). Hence \[f_m(k) \to \binom{n}{k} p^k (1 - p)^{n-k} \text{ as } m \to \infty \text{ for each } k \in \{0, 1, \ldots, n\}\] The result now follows from the theorem on density functions above.
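The limit can be checked numerically. Here is a short sketch using SciPy (our own illustration, not part of the text; note SciPy's argument order for the hypergeometric distribution):

```python
import numpy as np
from scipy.stats import binom, hypergeom

n, p = 5, 0.3                        # fixed sample size and limiting proportion of type 1 objects
k = np.arange(n + 1)
for m in [20, 100, 1000, 10000]:     # population size m grows
    r = round(p * m)                 # number of type 1 objects, so r / m -> p
    hg = hypergeom.pmf(k, m, r, n)   # SciPy order: (k, population size, type 1 objects, sample size)
    bn = binom.pmf(k, n, p)
    print(m, np.max(np.abs(hg - bn)))  # maximum difference of the density functions -> 0
```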
From a practical point of view, the last result means that if the population size \(m\) is large compared to the sample size \(n\), then the hypergeometric distribution with parameters \(m\), \(r\), and \(n\) (which corresponds to sampling without replacement) is well approximated by the binomial distribution with parameters \(n\) and \(p = r / m\) (which corresponds to sampling with replacement). This is often a useful result, not computationally, but rather because the binomial distribution has fewer parameters than the hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the limiting binomial distribution, we do not need to know the population size \(m\) and the number of type 1 objects \(r\) individually, but only the ratio \(r / m\).
In the ball and urn experiment, set \(m = 100\) and \(r = 30\). For each of the following values of \(n\) (the sample size), switch between sampling without replacement (the hypergeometric distribution) and sampling with replacement (the binomial distribution). Note the difference in the probability density functions. Run the simulation 1000 times for each sampling mode and compare the relative frequency function to the probability density function.
Recall again that the binomial distribution with parameters \(n \in \N_+\) and \(p \in [0, 1]\) is the distribution of the number of successes in \(n\) Bernoulli trials, when \(p\) is the probability of success on a trial. This distribution has probability density function \(f\) given by
\[ f(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\} \]
Recall also that the Poisson distribution with parameter \(r \in (0, \infty)\) has probability density function \(g\) given by
\[g(k) = e^{-r} \frac{r^k}{k!}, \quad k \in \N\]
The distribution is named for Simeon Poisson and governs the number of random points in a region of time or space, under certain ideal conditions. The parameter \(r\) is proportional to the size of the region of time or space.
Suppose that \(p_n \in [0, 1]\) for \(n \in \N_+\) and that \(n p_n \to r \in (0, \infty)\) as \(n \to \infty\). Then the binomial distribution with parameters \(n\) and \(p_n\) converges to the Poisson distribution with parameter \(r\) as \(n \to \infty\).
For \( k, \, n \in \N \) with \( k \le n \), the binomial density function can be written as \[ f_n(k) = \frac{n^{(k)}}{k!} p_n^k (1 - p_n)^{n - k} = \frac{1}{k!} (n p_n) \left[(n - 1) p_n\right] \cdots \left[(n - k + 1) p_n\right] (1 - p_n)^{n - k} \] First, \( (n - j) p_n \to r \) as \(n \to \infty\) for each fixed \(j \in \{0, 1, \ldots, k - 1\}\). Next, by a famous limit from calculus, \( (1 - p_n)^n = (1 - n p_n / n)^n \to e^{-r} \) as \( n \to \infty \), and hence \((1 - p_n)^{n-k} \to e^{-r}\) as \(n \to \infty\) for fixed \(k \in \N\). Therefore \(f_n(k) \to e^{-r} r^k / k!\) as \(n \to \infty\) for each \(k \in \N\). The result now follows from the theorem on density functions above.
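As a quick numerical check (again a sketch in Python with SciPy, not part of the text), the binomial densities with \(p_n = r / n\) approach the Poisson density:

```python
import numpy as np
from scipy.stats import binom, poisson

r = 5.0
k = np.arange(21)                    # k = 0, 1, ..., 20 carries essentially all of the mass
for n in [10, 100, 1000, 10000]:
    p_n = r / n                      # so that n * p_n = r exactly
    diff = np.abs(binom.pmf(k, n, p_n) - poisson.pmf(k, r))
    print(n, diff.max())             # -> 0 as n grows
```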
From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials \(n\) is large and the probability of success \(p\) is small, so that \(n p^2\) is small, then the binomial distribution with parameters \(n\) and \(p\) is well approximated by the Poisson distribution with parameter \(r = n p\). This is often a useful result, again not computationally, but rather because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials \(n\) and the probability of success \(p\) individually, but only the product \(n p\). As we will see in the next chapter, the condition that \(n p^2\) be small means that the variance of the binomial distribution, namely \(n p (1 - p) = n p - n p^2\), is approximately \(r = n p\), which is the variance of the approximating Poisson distribution.
In the binomial timeline experiment, set the parameter values as follows, and observe the graph of the probability density function. (Note that \(n p = 5\) in each case.) Run the experiment 1000 times in each case and compare the relative frequency function and the probability density function. Note also the successes represented as random points in discrete time.
In the Poisson experiment, set \(r = 5\) and \(t = 1\), to get the Poisson distribution with parameter 5. Note the shape of the probability density function. Run the experiment 1000 times and compare the relative frequency function to the probability density function. Note the similarity between this experiment and the one in the previous exercise.
Recall that the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\) has probability density function \(f\) given by \[ f(k) = p (1 - p)^{k-1}, \quad k \in \N_+\] The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.
Suppose that \(U\) has the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\). For \( n \in \N_+ \), the conditional distribution of \( U \) given \( U \le n \) converges to the uniform distribution on \(\{1, 2, \ldots, n\}\) as \(p \downarrow 0\).
The distribution function \(F\) of \( U \) is given by \( F(k) = 1 - (1 - p)^k \) for \(k \in \N_+\). Hence for \(n \in \N_+\), the conditional distribution function of \( U \) given \( U \le n \) is \[ F_n(k) = \P(U \le k \mid U \le n) = \frac{\P(U \le k)}{\P(U \le n)} = \frac{1 - (1 - p)^k}{1 - (1 - p)^n}, \quad k \in \{1, 2, \ldots, n\} \] Using L'Hospital's rule gives \( F_n(k) \to k / n \) as \( p \downarrow 0 \) for \(k \in \{1, 2, \ldots, n\}\). As a function of \(k\) this is the distribution function of the uniform distribution on \( \{1, 2, \ldots, n\} \).
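A quick numerical sketch of this limit (in Python; the helper name is ours, not part of the text):

```python
import numpy as np

def F_cond(k, n, p):
    # conditional distribution function P(U <= k | U <= n) for the geometric distribution
    return (1 - (1 - p) ** k) / (1 - (1 - p) ** n)

n = 10
k = np.arange(1, n + 1)
for p in [0.5, 0.1, 0.01, 0.001]:
    print(p, np.max(np.abs(F_cond(k, n, p) - k / n)))  # -> 0 as p decreases to 0
```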
Next, recall that the exponential distribution with rate parameter \(r \in (0, \infty)\) has distribution function \(G\) given by
\[ G(t) = 1 - e^{-r t}, \quad 0 \le t \lt \infty \]
The exponential distribution governs the time between arrivals in the Poisson model of random points in time.
Suppose that \(U_n\) has the geometric distribution on \(\N_+\) with success parameter \(p_n \in (0, 1]\) for \(n \in \N_+\), and that \(n p_n \to r \in (0, \infty)\) as \(n \to \infty\). The distribution of \(U_n / n\) converges to the exponential distribution with parameter \(r\) as \(n \to \infty\).
Let \( F_n \) denote the distribution function of \( U_n / n \). Then for \( x \in [0, \infty) \) \[ F_n(x) = \P\left(\frac{U_n}{n} \le x\right) = \P(U_n \le n x) = \P\left(U_n \le \lfloor n x \rfloor\right) = 1 - \left(1 - p_n\right)^{\lfloor n x \rfloor} \] We showed in the proof of the convergence of the binomial distribution to the Poisson above that \( (1 - p_n)^n \to e^{-r} \) as \( n \to \infty \), and hence \( \left(1 - p_n\right)^{n x} \to e^{-r x} \) as \( n \to \infty \). But by definition, \( \lfloor n x \rfloor \le n x \lt \lfloor n x \rfloor + 1\), or equivalently, \( n x - 1 \lt \lfloor n x \rfloor \le n x \), so it follows from the squeeze theorem that \( \left(1 - p_n \right)^{\lfloor n x \rfloor} \to e^{- r x} \) as \( n \to \infty \). Hence \( F_n(x) \to 1 - e^{-r x} \) as \( n \to \infty \). As a function of \(x \in [0, \infty)\), this is the distribution function of the exponential distribution with parameter \(r\).
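Here is a corresponding numerical sketch (Python, our own illustration) comparing the distribution function of \(U_n / n\) with the exponential limit:

```python
import numpy as np

r = 2.0
x = np.linspace(0.1, 3.0, 6)
F_inf = 1 - np.exp(-r * x)                      # exponential distribution function
for n in [10, 100, 1000, 10000]:
    p_n = r / n                                 # so that n * p_n = r
    F_n = 1 - (1 - p_n) ** np.floor(n * x)      # distribution function of U_n / n at x
    print(n, np.max(np.abs(F_n - F_inf)))       # -> 0 as n grows
```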
Note that the limiting condition on \(n\) and \(p_n\) in the last result is precisely the same as the condition in the convergence of the binomial distribution to the Poisson above. For a deeper interpretation of both of these results, see the section on the Poisson distribution.
In the negative binomial experiment, set \(k = 1\) to get the geometric distribution. Then decrease the value of \(p\) and note the shape of the probability density function. With \(p = 0.5\) run the experiment 1000 times and compare the relative frequency function to the probability density function.
In the gamma experiment, set \(k = 1\) to get the exponential distribution, and set \(r = 5\). Note the shape of the probability density function. Run the experiment 1000 times and compare the empirical density function and the probability density function. Compare this experiment with the one in the previous exercise, and note the similarity, up to a change in scale.
For \(n \in \N_+\), consider a random permutation \((X_1, X_2, \ldots, X_n)\) of the elements in the set \(\{1, 2, \ldots, n\}\). We say that a match occurs at position \(i\) if \(X_i = i\).
\(\P\left(X_i = i\right) = \frac{1}{n}\) for each \(i \in \{1, 2, \ldots, n\}\).
The number of permutations of \(\{1, 2, \ldots, n\}\) is \(n!\). For \(i \in \{1, 2, \ldots, n\}\), the number of such permutations with \(i\) in position \(i\) is \((n - 1)!\). Hence \(\P(X_i = i) = (n - 1)! / n! = 1 / n\). A more direct argument is that \(i\) is no more or less likely to end up in position \(i\) than any other number.
So the matching events all have the same probability, which varies inversely with the number of trials.
\(\P\left(X_i = i, X_j = j\right) = \frac{1}{n (n - 1)}\) for \(i, \, j \in \{1, 2, \ldots, n\}\) with \(i \ne j\).
Again, the number of permutations of \(\{1, 2, \ldots, n\}\) is \(n!\). For distinct \(i, \, j \in \{1, 2, \ldots, n\}\), the number of such permutations with \(i\) in position \(i\) and \(j\) in position \(j\) is \((n - 2)!\). Hence \(\P(X_i = i, X_j = j) = (n - 2)! / n! = 1 / [n (n - 1)]\).
So the matching events are dependent, and in fact are positively correlated. In particular, the matching events do not form a sequence of Bernoulli trials. In our detailed discussion of the matching problem we show that the number of matches \(N_n\) has probability density function \(f_n\) given by: \[ f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \]
The distribution of \(N_n\) converges to the Poisson distribution with parameter 1 as \(n \to \infty\).
For \( k \in \N \), \[ f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!} \to \frac{1}{k!} \sum_{j=0}^\infty \frac{(-1)^j}{j!} = \frac{1}{k!} e^{-1} \] As a function of \(k \in \N\), this is the density function of the Poisson distribution with parameter 1. So the result follows from the theorem on density functions above.
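The convergence is easy to see in simulation as well. Below is a minimal sketch (Python with NumPy and SciPy; the simulation parameters are our own choices) comparing the empirical distribution of the number of matches with the Poisson distribution with parameter 1:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
n, reps = 50, 20_000
positions = np.arange(1, n + 1)

# number of matches in each of many random permutations of {1, 2, ..., n}
matches = np.array([np.sum(rng.permutation(positions) == positions) for _ in range(reps)])

for k in range(6):
    print(k, np.mean(matches == k), poisson.pmf(k, 1))  # empirical frequency vs Poisson(1) density
```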
In the matching experiment, increase \(n\) and note the apparent convergence of the probability density function for the number of matches. With selected values of \(n\), run the experiment 1000 times and compare the relative frequency function and the probability density function.
Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent random variables, each with the standard exponential distribution. So the common distribution function \(G\) is given by \[ G(x) = 1 - e^{-x}, \quad 0 \le x \lt \infty \]
As \(n \to \infty\), the distribution of \(Y_n = \max\{X_1, X_2, \ldots, X_n\} - \ln n \) converges to the distribution with distribution function \(F\) given by \[ F(x) = e^{-e^{-x}}, \quad x \in \R\]
Let \( X_{(n)} = \max\{X_1, X_2, \ldots, X_n\} \) and recall that \( X_{(n)} \) has distribution function \( G^n \). Let \( F_n \) denote the distribution function of \( Y_n \). For \( x \in \R \) \[ F_n(x) = \P(Y_n \le x) = \P\left(X_{(n)} \le x + \ln n \right) = G^n(x + \ln n) = \left[1 - e^{-(x + \ln n) }\right]^n = \left(1 - \frac{e^{-x}}{n} \right)^n \] By our famous limit from calculus again, \( F_n(x) \to e^{-e^{-x}} \) as \( n \to \infty \).
This limiting distribution is the standard extreme value distribution, also known as the standard Gumbel distribution in honor of Emil Gumbel.
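As a numerical check (a Python sketch, not part of the text), the distribution function computed in the proof converges quickly to the Gumbel distribution function:

```python
import numpy as np

x = np.linspace(-2.0, 4.0, 7)
F_inf = np.exp(-np.exp(-x))                # standard Gumbel distribution function
for n in [10, 100, 1000, 100000]:
    F_n = (1 - np.exp(-x) / n) ** n        # distribution function of max(X_1, ..., X_n) - ln n
    print(n, np.max(np.abs(F_n - F_inf)))  # -> 0 as n grows
```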
Recall that the Pareto distribution with shape parameter \(a \in (0, \infty)\) has distribution function \(F\) given by \[F(x) = 1 - \frac{1}{x^a}, \quad 1 \le x \lt \infty\] The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution sometimes used to model financial variables.
Suppose that \(X_n\) has the Pareto distribution with shape parameter \(n\) for each \(n \in \N_+\). Then \(X_n \to 1\) as \(n \to \infty\) in distribution. To see this, note that the distribution function \(F_n\) of \(X_n\) satisfies \(F_n(x) = 0\) for \(x \lt 1\) and \(F_n(x) = 1 - x^{-n} \to 1\) as \(n \to \infty\) for \(x \gt 1\). Hence \(F_n(x) \to F_\infty(x)\) as \(n \to \infty\) at every continuity point of \(F_\infty\), the distribution function of the constant 1.
The two fundamental theorems of basic probability theory, the law of large numbers and the central limit theorem, are studied in detail in the chapter on random samples. For this reason we will simply state the results in this section. So suppose that \((X_1, X_2, \ldots)\) is a sequence of independent, identically distributed, real-valued random variables (defined on the same probability space) with mean \(\mu \in (-\infty, \infty)\) and standard deviation \(\sigma \in (0, \infty)\). For \(n \in \N_+\), let \( Y_n = \sum_{i=1}^n X_i \) denote the sum of the first \(n\) variables, \( M_n = Y_n \big/n \) the average of the first \( n \) variables, and \( Z_n = (Y_n - n \mu) \big/ \sqrt{n} \sigma \) the standard score of \( Y_n \).
The fundamental theorems of probability:
(a) \(M_n \to \mu\) as \(n \to \infty\) with probability 1, in probability, and in distribution (the law of large numbers).
(b) \(Z_n \to Z\) as \(n \to \infty\) in distribution, where \(Z\) has the standard normal distribution (the central limit theorem).
In part (a), convergence with probability 1 is the strong law of large numbers while convergence in probability and in distribution are the weak laws of large numbers.
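Here is a small simulation sketch (Python with NumPy and SciPy; the exponential sampling distribution and the evaluation point are our own choices) illustrating both limits: the sample means settle near \(\mu\), and the distribution function of the standard score approaches the standard normal distribution function:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                 # mean and standard deviation of the exponential(1) distribution
reps = 100_000
for n in [1, 10, 100, 1000]:
    X = rng.exponential(scale=1.0, size=(reps, n))
    M = X.mean(axis=1)                                     # sample means M_n
    Z = (X.sum(axis=1) - n * mu) / (np.sqrt(n) * sigma)    # standard scores Z_n
    # law of large numbers: M_n concentrates at mu; central limit theorem: P(Z_n <= 1) -> Phi(1)
    print(n, np.mean(np.abs(M - mu)), np.mean(Z <= 1.0), norm.cdf(1.0))
```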
Our next goal is to define convergence of probability distributions on more general measurable spaces.
First we need to define the type of measurable spaces that we will use in this subsection.
We assume that \((S, d)\) is a complete, separable metric space and let \(\ms S\) denote the Borel \(\sigma\)-algebra of subsets of \(S\), that is, the \(\sigma\)-algebra generated by the topology. The standard spaces that we often use are special cases: discrete spaces, where \(S\) is countable, \(d\) is the discrete metric, and \(\ms S\) is the collection of all subsets of \(S\); and Euclidean spaces, where \(S = \R^n\) for some \(n \in \N_+\), \(d\) is the standard Euclidean distance, and \(\ms S = \ms R^n\) is the Borel \(\sigma\)-algebra.
Recall that the metric space \((S, d)\) is complete if every Cauchy sequence in \(S\) converges to a point in \(S\). The space is separable if there exists a countable subset that is dense. A complete, separable metric space is sometimes called a Polish space because such spaces were extensively studied by a group of Polish mathematicians in the 1930s, including Kazimierz Kuratowski.
As suggested by our setup, the definition for convergence in distribution involves both measure theory and topology. The motivation is for the one-dimensional Euclidean space \((\R, \ms R)\).
Convergence in distribution: suppose that \(P_n\) is a probability measure on \((S, \ms S)\) for each \(n \in \N_+^*\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) if \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) for every \(A \in \ms S\) with \(P_\infty(\partial A) = 0\). If \(X_n\) is a random variable in \(S\) with distribution \(P_n\) for each \(n \in \N_+^*\), we say as before that \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
Let's consider our two special cases. In the discrete case, as usual, the measure theory and topology are not really necessary.
Suppose that \(P_n\) is a probability measure on a discrete space \((S, \ms S)\) for each \(n \in \N_+^*\). Then \(P_n \Rightarrow P_\infty\) as \(n \to \infty\) if and only if \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) for every \(A \subseteq S\).
This follows from the definition. Every subset is both open and closed so \(\partial A = \emptyset\) for every \(A \subseteq S\).
In the Euclidean case, it suffices to consider distribution functions, as in the one-dimensional case. If \(P\) is a probability measure on \((\R^n, \ms R^n)\), recall that the distribution function \(F\) of \(P\) is given by \[F(x_1, x_2, \ldots, x_n) = P\left((-\infty, x_1] \times (-\infty, x_2] \times \cdots \times (-\infty, x_n]\right), \quad (x_1, x_2, \ldots, x_n) \in \R^n\]
Suppose that \(P_k\) is a probability measure on \((\R^n, \ms R^n)\) with distribution function \(F_k\) for each \(k \in \N_+^*\). Then \(P_k \Rightarrow P_\infty\) as \(k \to \infty\) if and only if \(F_k(\bs x) \to F_\infty(\bs x)\) as \(k \to \infty\) for every \(\bs x \in \R^n\) where \(F_\infty\) is continuous.
As in the case of \((\R, \ms R)\), convergence in probability implies convergence in distribution.
Suppose that \(X_n\) is a random variable in \(S\) for each \(n \in \N_+^*\), all defined on the same probability space. If \(X_n \to X_\infty\) as \(n \to \infty\) in probability then \(X_n \to X_\infty\) as \(n \to \infty\) in distribution.
Assume that the common probability space is \((\Omega, \ms F, \P)\). Recall that convergence in probability means that \(\P[d(X_n, X_\infty) \gt \epsilon] \to 0\) as \(n \to \infty\) for every \(\epsilon \gt 0\). Suppose that \(A \in \ms S\) with \(P_\infty(\partial A) = 0\), and for \(\epsilon \gt 0\) let \(A^\epsilon\) denote the set of points within distance \(\epsilon\) of \(A\). Then \(\P(X_n \in A) \le \P(X_\infty \in A^\epsilon) + \P[d(X_n, X_\infty) \gt \epsilon]\). Letting \(n \to \infty\) and then \(\epsilon \downarrow 0\) gives \(\limsup_{n \to \infty} \P(X_n \in A) \le \P[X_\infty \in \cl(A)]\). Applying the same argument to \(A^c\) gives \(\liminf_{n \to \infty} \P(X_n \in A) \ge \P[X_\infty \in \interior(A)]\). Since \(P_\infty(\partial A) = 0\), both bounds equal \(P_\infty(A)\), so \(\P(X_n \in A) \to P_\infty(A)\) as \(n \to \infty\).
So as before, convergence with probability 1 implies convergence in probability which in turn implies convergence in distribution.
As you might guess, Skorohod's representation theorem for the one-dimensional Euclidean space \((\R, \ms R)\) can be extended to the more general spaces. However the proof is not nearly as straightforward, because we no longer have the quantile function for constructing random variables on a common probability space.
Suppose that \(P_n\) is a probability measure on \((S, \ms S)\) for each \(n \in \N_+^*\) and that \(P_n \Rightarrow P_\infty\) as \(n \to \infty\). Then there exists a random variable \(X_n\) in \(S\) for each \(n \in \N_+^*\), defined on a common probability space, such that \(X_n\) has distribution \(P_n\) for each \(n \in \N_+^*\), and \(X_n \to X_\infty\) as \(n \to \infty\) with probability 1.
One of the main consequences of Skorohod's representation, the preservation of convergence in distribution under continuous functions, is still true and has essentially the same proof. For the general setup, suppose that \((S, d, \ms S)\) and \((T, e, \ms T)\) are spaces of the type described above.
Suppose that \(X_n\) is a random variable in \(S\) for each \(n \in \N_+^*\) (not necessarily defined on the same probability space). Suppose also that \(g: S \to T\) is measurable, and let \(D_g\) denote the set of discontinuities of \(g\), and \(P_\infty\) the distribution of \(X_\infty\). If \(X_n \to X_\infty\) as \(n \to \infty\) in distribution and \(P_\infty(D_g) = 0\), then \(g(X_n) \to g(X_\infty)\) as \(n \to \infty\) in distribution.
By Skorohod's theorem, there exist random variables \(Y_n\) in \(S\) for \(n \in \N_+^*\), defined on the same probability space \((\Omega, \ms F, \P)\), such that \(Y_n\) has the same distribution as \(X_n\) for \(n \in \N_+^*\), and \(Y_n \to Y_\infty\) as \(n \to \infty\) with probability 1. Since \(\P(Y_\infty \in D_g) = P_\infty(D_g) = 0\) it follows that \(g(Y_n) \to g(Y_\infty)\) as \(n \to \infty\) with probability 1, and hence also in distribution. But \(g(Y_n)\) has the same distribution as \(g(X_n)\) for each \(n \in \N_+^*\), so \(g(X_n) \to g(X_\infty)\) as \(n \to \infty\) in distribution.
A simple consequence of the continuity theorem is that if a sequence of random vectors in \(\R^n\) converges in distribution, then each coordinate sequence also converges in distribution. Let's just consider the two-dimensional case to keep the notation simple.
Suppose that \((X_n, Y_n)\) is a random variable in \(\R^2\) for \(n \in \N_+^*\) and that \((X_n, Y_n) \to (X_\infty, Y_\infty)\) as \(n \to \infty\) in distribution. Then \(X_n \to X_\infty\) and \(Y_n \to Y_\infty\) as \(n \to \infty\) in distribution. This follows from the continuity theorem, since the coordinate projections \((x, y) \mapsto x\) and \((x, y) \mapsto y\) are continuous.
Our next discussion concerns an important result known as Scheffé's theorem, named after Henry Scheffé. To state our theorem, suppose that \( (S, \ms S, \mu) \) is a measure space, so that \( S \) is a set, \( \ms S \) is a \( \sigma \)-algebra of subsets of \( S \), and \( \mu \) is a positive measure on \( (S, \ms S) \). Further, suppose that \( P_n \) is a probability measure on \( (S, \ms S) \) that has density function \( f_n \) with respect to \( \mu \) for each \( n \in \N_+^* \).
If \(f_n(x) \to f_\infty(x)\) as \(n \to \infty\) for almost all \( x \in S \) (with respect to \( \mu \)) then \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) uniformly in \(A \in \ms S\).
From basic properties of the integral it follows that for \( A \in \ms S \), \[\left|P_\infty(A) - P_n(A)\right| = \left|\int_A f_\infty \, d\mu - \int_A f_n \, d\mu \right| = \left| \int_A (f_\infty - f_n) \, d\mu\right| \le \int_A \left|f_\infty - f_n\right| \, d\mu \le \int_S \left|f_\infty - f_n\right| \, d\mu\] For \(n \in \N_+\) let \(g_n = f_\infty - f_n\), and let \(g_n^+\) denote the positive part of \(g_n\) and \(g_n^-\) the negative part of \(g_n\). Note that \(g_n^+ \le f_\infty\) and \(g_n^+ \to 0\) as \(n \to \infty\) almost everywhere on \( S \). Since \( f_\infty \) is a probability density function, it is trivially integrable, so by the dominated convergence theorem, \(\int_S g_n^+ \, d\mu \to 0\) as \(n \to \infty\). But \(\int_S g_n \, d\mu = 0\), so \(\int_S g_n^+ \, d\mu = \int_S g_n^- \, d\mu\). Therefore \(\int_S \left|g_n\right| \, d\mu = 2 \int_S g_n^+ \, d\mu \to 0\) as \(n \to \infty\). Hence \(P_n(A) \to P_\infty(A)\) as \(n \to \infty\) uniformly in \(A \in \ms S\).
Of course, the most important special cases of Scheffé's theorem are to discrete distributions and to continuous distributions on a subset of \( \R^n \), as in the theorem on density functions above.
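In the discrete case, the quantity \(\frac{1}{2} \int_S \left|f_n - f_\infty\right| \, d\mu\) appearing in the proof is the total variation distance, which bounds \(\left|P_n(A) - P_\infty(A)\right|\) uniformly in \(A\). Here is a small numerical sketch (Python with SciPy, our own illustration) for the binomial and Poisson densities from the earlier example:

```python
import numpy as np
from scipy.stats import binom, poisson

r = 5.0
k = np.arange(201)                   # the mass beyond k = 200 is negligible here
for n in [250, 1000, 10000]:
    p_n = r / n                      # so that n * p_n = r
    tv = 0.5 * np.sum(np.abs(binom.pmf(k, n, p_n) - poisson.pmf(k, r)))
    print(n, tv)                     # total variation distance -> 0 as n grows
```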
Generating functions are studied in the chapter on expected value. In part, the importance of generating functions stems from the fact that ordinary (pointwise) convergence of a sequence of generating functions corresponds to the convergence of the distributions in the sense of this section. Often it is easier to show convergence in distribution using generating functions than directly from the definition.
In addition, convergence in distribution has elegant characterizations in terms of the convergence of the expected values of certain types of functions of the underlying random variables.