Our random experiment is to perform a sequence of Bernoulli trials \(\bs{X} = (X_1, X_2, \ldots) \). Recall that \(\bs{X}\) is a sequence of independent, identically distributed indicator random variables, and in the usual language of reliability, 1 denotes success and 0 denotes failure. The common probability of success \( p = \P(X_i = 1) \) is the basic parameter of the process. In statistical terms, the first \(n\) trials \((X_1, X_2, \ldots, X_n)\) form a random sample of size \(n\) from the Bernoulli distribution.
In this section we will study the random variable that gives the number of successes in the first \(n\) trials and the random variable that gives the proportion of successes in the first \(n\) trials. The underlying distribution, the binomial distribution, is one of the most important in probability theory, and so deserves to be studied in considerable detail. As you will see, some of the results in this section have two or more proofs. In almost all cases, note that the proof from Bernoulli trials is the simplest and most elegant.
For \( n \in \N \), the number of successes in the first \(n\) trials is the random variable \[ Y_n = \sum_{i=1}^n X_i, \quad n \in \N\] The distribution of \( Y_n \) is the binomial distribution with trial parameter \( n \) and success parameter \( p \).
Note that \(\bs{Y} = (Y_0, Y_1, \ldots)\) is the partial sum process associated with the Bernoulli trials sequence \(\bs{X}\). In particular, \( Y_0 = 0 \), so point mass at 0 is considered a degenerate form of the binomial distribution.
The probability density function \( f_n \) of \(Y_n\) is given by \[ f_n(y) = \binom{n}{y} p^y (1 - p)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \]
Recall that if \((x_1, x_2 \ldots) \in \{0, 1\}^n\) with \(\sum_{i=1}^n x_i = y\) (that is, a bit string of length \(n\) with 1 occurring exactly \(y\) times), then by independence, \[ \P\left[(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\right] = p^y (1 - p)^{n - y} \] Moreover, the number of bit strings of length \(n\) with 1 occurring exactly \(y\) times is the binomial coefficient \(\binom{n}{y}\). By the additive property of probability \[ \P(Y_n = y) = \binom{n}{y} p^y (1 - p)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \]
We can check that \(f_n\) is a valid PDF. Clearly \( f_n(y) \ge 0 \) for \( y \in \{0, 1, \ldots, n\} \). From the binomial theorem \[ \sum_{y=0}^n f_n(y) = \sum_{y=0}^n \binom{n}{y} p^y (1 - p)^{n-y} = \left[p + (1 - p)\right]^n = 1 \]
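As a quick numerical check, here is a minimal Python sketch (parameter values chosen only for illustration) that computes the density directly from the formula above and verifies that it sums to 1.

```python
from math import comb

def binomial_pdf(n, p):
    """Return [P(Y_n = y) for y = 0, ..., n] computed from the formula."""
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

f = binomial_pdf(10, 0.3)
print(f[3])        # P(Y_10 = 3) = C(10, 3) 0.3^3 0.7^7, about 0.2668
print(sum(f))      # sums to 1, up to floating point rounding
```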
In the binomial coin experiment, vary \(n\) and \(p\) with the scrollbars, and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the relative frequency function to the probability density function.
The binomial distribution is unimodal: for \( k \in \{1, 2, \ldots, n\} \), \( f_n(k) \gt f_n(k - 1) \) if and only if \( k \lt (n + 1) p \), and \( f_n(k) = f_n(k - 1) \) if and only if \( k = (n + 1) p \).
So the density function at first increases and then decreases, reaching its maximum value at \(\lfloor (n + 1) p \rfloor\). This integer is a mode of the distribution. In the case that \(m = (n + 1) p\) is an integer between 1 and \(n\), there are two consecutive modes, at \(m - 1\) and \(m\).
Now let \( F_n \) denote the distribution function of \( Y_n \), so that \[ F_n(y) = \P(Y_n \le y) = \sum_{k=0}^y f_n(k) = \sum_{k=0}^y \binom{n}{k} p^k (1 - p)^{n-k}, \quad y \in \{0, 1, \ldots, n\} \] The distribution function \( F_n \) and the quantile function \( F_n^{-1} \) do not have simple, closed forms, but values of these functions can be computed from mathematical and statistical software.
Open the quantile app and select the binomial distribution. Vary \( n \) and \( p \) and note the shape and location of the distribution function. For various values of the parameters, compute the median and the first and third quartiles.
The binomial distribution function also has a nice relationship to the beta distribution function.
The distribution function \( F_n \) can be written in the form \[ F_n(k) = \frac{n!}{(n - k - 1)! k!} \int_0^{1-p} x^{n-k-1} (1 - x)^k \, dx, \quad k \in \{0, 1, \ldots, n - 1\} \]
Let \( G_n(k) \) denote the expression on the right. Substitution and simple integration show that \( G_n(0) = (1 - p)^n = f_n(0) = F_n(0) \). For \( k \in \{1, 2, \ldots, n - 1\} \), integrating by parts with \( u = (1 - x)^k \) and \( dv = x^{n - k - 1} dx \) gives \[ G_n(k) = \binom{n}{k} p^k (1 - p)^{n-k} + \frac{n!}{(n - k)! (k - 1)!} \int_0^{1-p} x^{n-k} (1 - x)^{k-1} \, dx = f_n(k) + G_n(k - 1) \] It follows that \( G_n(k) = \sum_{j=0}^k f_n(j) = F_n(k) \) for \( k \in \{0, 1, \ldots, n - 1\} \).
The expression on the right in the previous result is the beta distribution function, with left parameter \( n - k \) and right parameter \( k + 1 \), evaluated at \( 1 - p \).
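The identity is easy to verify numerically. The following sketch assumes SciPy is available and compares the two distribution functions for an illustrative choice of \( n \) and \( p \).

```python
from scipy.stats import binom, beta

n, p = 12, 0.35
for k in range(n):  # k = n is excluded; F_n(n) = 1 trivially
    lhs = binom.cdf(k, n, p)
    rhs = beta.cdf(1 - p, n - k, k + 1)
    assert abs(lhs - rhs) < 1e-12, (k, lhs, rhs)
print("binomial CDF matches the beta CDF identity for k = 0, ..., n - 1")
```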
The mean, variance and other moments of the binomial distribution can be computed in several different ways. Again let \(Y_n = \sum_{i=1}^n X_i\) where \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of Bernoulli trials with success parameter \(p\).
The mean and variance of \( Y_n \) are (a) \( \E(Y_n) = n p \) and (b) \( \var(Y_n) = n p (1 - p) \).
The best proof is based on the representation of \(Y_n\) as a sum of independent indicator variables: since \( \E(X_i) = p \) and \( \var(X_i) = p (1 - p) \) for each \( i \), \[ \E(Y_n) = \sum_{i=1}^n \E(X_i) = n p, \quad \var(Y_n) = \sum_{i=1}^n \var(X_i) = n p (1 - p) \]
For a proof using the probability density function, recall the identity \(y \binom{n}{y} = n \binom{n - 1}{y - 1}\) for \( n, \, y \in \N_+ \). Using the binomial theorem, \begin{align} \E(Y_n) & = \sum_{y=0}^n y \binom{n}{y} p^y (1 - p)^{n-y} = \sum_{y=1}^n y \binom{n}{y} p^y (1 - p)^{n-y} \\ & = \sum_{y=1}^n n \binom{n - 1}{y - 1} p^y (1 - p)^{n-y} = n p \sum_{y=1}^n \binom{n - 1}{y - 1} p^{y-1} (1 - p)^{(n - 1) - (y - 1)} \\ & = n p \sum_{k=0}^{n-1} \binom{n - 1}{k} p^k (1 - p)^{n - 1 - k} = n p \left[p + (1 - p)\right]^{n-1} = n p \end{align} A similar, but more complicated proof can be used for part (b).
The expected value of \(Y_n\) also makes intuitive sense, since \(p\) should be approximately the proportion of successes in a large number of trials. We will discuss the point further in the subsection below on the proportion of successes. Note that the graph of \(\var(Y_n)\) as a function of \(p \in [0, 1]\) is a parabola opening downward. In particular the maximum value of the variance is \(n / 4\) when \(p = 1 / 2\), and the minimum value is 0 when \(p = 0\) or \(p = 1\). Of course, in the last two cases, \( Y_n \) is deterministic, taking just the value 0 if \( p = 0 \) and just the value \( n \) when \( p = 1 \).
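For a simple numerical check of the mean and variance formulas, computed directly from the density function (illustrative parameters only):

```python
from math import comb

n, p = 15, 0.4
pdf = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
mean = sum(y * pdf[y] for y in range(n + 1))
var = sum((y - mean)**2 * pdf[y] for y in range(n + 1))
print(mean, n * p)             # both 6.0
print(var, n * p * (1 - p))    # both 3.6
```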
In the binomial coin experiment, vary \(n\) and \(p\) with the scrollbars and note the location and size of the mean\(\pm\)standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.
The probability generating function of \(Y_n\) is \( P_n(t) = \E \left(t^{Y_n} \right) = \left[(1 - p) + p t\right]^n\) for \( t \in \R \).
The best proof uses the fact that the probability generating function of a sum of independent variables is the product of the probability generating functions of the terms. Recall that the PGF of \(X_i\) is \( P(t) = \E\left(t^{X_i}\right) = (1 - p) + p t\) for each \(i\). Hence \( P_n(t) = \left[P(t)\right]^n \).
For a direct proof we use the binomial theorem yet again, \[ \E\left(t^{Y_n}\right) = \sum_{y=0}^n t^y \binom{n}{y} p^y (1 - p)^{n-y} = \sum_{y=0}^n \binom{n}{y} (p t)^y (1 - p)^{n-y} = \left[ p t + (1 - p)\right]^n\]
Recall that for \( x \in \R \) and \( k \in \N \), the falling power of \( x \) of order \( k \) is \( x^{(k)} = x (x - 1) \cdots (x - k + 1) \). If \( X \) is a random variable and \( k \in \N \), then \( \E\left[X^{(k)}\right] \) is the factorial moment of \( X \) of order \( k \). The probability generating function above provides an easy way to compute the factorial moments of the binomial distribution.
\( \E\left[Y_n^{(k)}\right] = n^{(k)} p^k \) for \( k \in \N \).
Recall that \( P_n^{(k)}(1) = \E\left[Y_n^{(k)}\right] \) where \( P_n^{(k)} \) denotes the \( k \)th derivative of \( P_n \). By simple calculus, \(P_n^{(k)}(t) = n^{(k)} \left[(1 - p) + p t\right]^{n-k} p^k \), so \( P_n^{(k)}(1) = n^{(k)} p^k \).
Our next result gives a recursion equation and initial conditions for the moments of the binomial distribution.
Recursion relation and initial conditions: \( \E\left(Y_n^0\right) = 1 \) for \( n \in \N \), \( \E\left(Y_0^k\right) = 0 \) for \( k \in \N_+ \), and \[ \E\left(Y_n^k\right) = n p \, \E\left[\left(Y_{n-1} + 1\right)^{k-1}\right], \quad n, \, k \in \N_+ \]
Recall again the identity \(y \binom{n}{y} = n \binom{n - 1}{y - 1}\). \begin{align} \E\left(Y_n^k\right) & = \sum_{y=0}^n y^k \binom{n}{y} p^y (1 - p)^{n-y} = \sum_{y=1}^n y^{k-1} y \binom{n}{y} p^y (1 - p)^{n-y} \\ & = \sum_{y=1}^n y^{k-1} n \binom{n-1}{y-1} p^y (1 - p)^{n-y} = n p \sum_{y=1}^n y^{k-1} \binom{n-1}{y-1} p^{y-1} (1 - p)^{(n-1)-(y-1)} \\ & = n p \sum_{j=0}^{n-1} (j + 1)^{k-1} \binom{n-1}{j} p^j (1 - p)^{n-1 - j} = n p \E\left[(Y_{n-1} + 1)^{k-1}\right] \end{align}
The ordinary (raw) moments of \( Y_n \) can be computed from the factorial moments and from the recursion relation. Here are the first four, which will be needed below.
The first four moments of \( Y_n \) are \begin{align} \E\left(Y_n\right) & = n p \\ \E\left(Y_n^2\right) & = n (n - 1) p^2 + n p \\ \E\left(Y_n^3\right) & = n (n - 1)(n - 2) p^3 + 3 n (n - 1) p^2 + n p \\ \E\left(Y_n^4\right) & = n (n - 1)(n - 2)(n - 3) p^4 + 6 n (n - 1)(n - 2) p^3 + 7 n (n - 1) p^2 + n p \end{align}
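The recursion above is easy to implement: expanding \( (Y_{n-1} + 1)^{k-1} \) with the binomial theorem reduces it to lower-order moments of \( Y_{n-1} \). The sketch below (pure Python, with an illustrative value of \( p \)) reproduces the first three raw moments listed above; the fourth works the same way.

```python
from functools import lru_cache
from math import comb

p = 0.3

@lru_cache(maxsize=None)
def moment(n, k):
    """E(Y_n^k) via the recursion E(Y_n^k) = n p E[(Y_{n-1} + 1)^{k-1}]."""
    if k == 0:
        return 1.0
    if n == 0:
        return 0.0
    # (Y + 1)^(k-1) = sum_j C(k-1, j) Y^j, so take expectations term by term
    return n * p * sum(comb(k - 1, j) * moment(n - 1, j) for j in range(k))

n = 10
print(moment(n, 1), n * p)
print(moment(n, 2), n * (n - 1) * p**2 + n * p)
print(moment(n, 3), n * (n - 1) * (n - 2) * p**3 + 3 * n * (n - 1) * p**2 + n * p)
```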
Our final moment results give the skewness and kurtosis of the binomial distribution.
For \( p \in (0, 1) \), the skewness of \( Y_n \) is \[ \skw(Y_n) = \frac{1 - 2 p}{\sqrt{n p (1 - p)}} \]
Open the binomial timeline experiment. For each of the following values of \( n \), vary \( p \) from 0 to 1 and note the shape of the probability density function in light of the previous results on skewness.
For \( p \in (0, 1) \), the kurtosis of \( Y_n \) is \[ \kur(Y_n) = 3 - \frac{6}{n} + \frac{1}{n p (1 - p)} \]
Note that the excess kurtosis is \( \kur(Y_n) - 3 = \frac{1}{n p (1 - p)} - \frac{6}{n} \to 0\) as \( n \to \infty \). This is related to the convergence of the binomial distribution to the normal, which we will discuss below.
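Here is a quick numerical check of the skewness and kurtosis formulas, computed directly from the density function (illustrative parameters only):

```python
from math import comb, sqrt

n, p = 20, 0.3
pdf = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
mean = sum(y * pdf[y] for y in range(n + 1))
var = sum((y - mean)**2 * pdf[y] for y in range(n + 1))
skew = sum((y - mean)**3 * pdf[y] for y in range(n + 1)) / var**1.5
kurt = sum((y - mean)**4 * pdf[y] for y in range(n + 1)) / var**2

print(skew, (1 - 2 * p) / sqrt(n * p * (1 - p)))
print(kurt, 3 - 6 / n + 1 / (n * p * (1 - p)))
```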
Several important properties of the random process \(\bs{Y} = (Y_0, Y_1, Y_2, \ldots)\) stem from the fact that it is a partial sum process corresponding to the sequence \(\bs{X} = (X_1, X_2, \ldots)\) of independent, identically distributed indicator variables.
\(\bs{Y}\) has stationary, independent increments: if \( m \le n \) then \( Y_n - Y_m \) has the same distribution as \( Y_{n-m} \), namely binomial with parameters \( n - m \) and \( p \), and if \( n_1 \le n_2 \le n_3 \le \cdots \) then \( \left(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots\right) \) is a sequence of independent variables.
Every partial sum process corresponding to a sequence of independent, identically distributed variables has stationary, independent increments.
The following result gives the finite dimensional distributions of \( \bs{Y} \).
The joint probability density functions of the sequence \(\bs{Y}\) are given as follows: \[ \P\left(Y_{n_1} = y_1, Y_{n_2} = y_2, \ldots, Y_{n_k} = y_k\right) = \binom{n_1}{y_1} \binom{n_2 - n_1}{y_2 - y_1} \cdots \binom{n_k - n_{k-1}}{y_k - y_{k-1}} p^{y_k} (1 - p)^{n_k - y_k} \] where \( n_1, n_2, \ldots, n_k \in \N_+ \) with \( n_1 \lt n_2 \lt \cdots \lt n_k \) and where \( y_1, y_2, \ldots, y_k \in \N \) with \( 0 \le y_j - y_{j-1} \le n_j - n_{j-1} \) for each \( j \in \{1, 2, \ldots, k\} \) (with the conventions \( n_0 = 0 \) and \( y_0 = 0 \)).
From the stationary and independent increments properties, \[ \P\left(Y_{n_1} = y_1, Y_{n_2} = y_2, \ldots, Y_{n_k} = y_k\right) = f_{n_1}(y_1) f_{n_2 - n_1}(y_2 - y_1) \cdots f_{n_k - n_{k-1}}(y_k - y_{k-1}) \] The result then follows from substitution and simplification.
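For small \( n \), the joint density formula can be checked by brute force. The sketch below (pure Python) enumerates all bit strings of length 5 and compares the exact joint distribution of \( (Y_2, Y_5) \) with the formula above.

```python
from itertools import product
from math import comb

p = 0.3
n1, n2 = 2, 5

# exact joint distribution by brute-force enumeration of bit strings
joint = {}
for bits in product([0, 1], repeat=n2):
    prob = 1.0
    for b in bits:
        prob *= p if b == 1 else 1 - p
    key = (sum(bits[:n1]), sum(bits))
    joint[key] = joint.get(key, 0.0) + prob

# compare with the product formula from the independent increments property
for (y1, y2), prob in joint.items():
    formula = comb(n1, y1) * comb(n2 - n1, y2 - y1) * p**y2 * (1 - p)**(n2 - y2)
    assert abs(prob - formula) < 1e-12
print("joint PDF of (Y_2, Y_5) matches the formula")
```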
There are two simple but important transformations that preserve the binomial distribution.
If \(U\) is a random variable having the binomial distribution with parameters \(n\) and \(p\), then \(n - U\) has the binomial distribution with parameters \(n\) and \(1 - p\).
Recall that if \( (X_1, X_2, \ldots) \) is a Bernoulli trials sequence with parameter \( p \), then \( (1 - X_1, 1 - X_2, \ldots) \) is a Bernoulli trials sequence with parameter \( 1 - p \). Also \( U \) has the same distribution as \( \sum_{i=1}^n X_i \) (binomial with parameters \( n \) and \( p \)) so \( n - U \) has the same distribution as \( \sum_{i=1}^n (1 - X_i) \) (binomial with parameters \( n \) and \( 1 - p \)).
For a direct proof, note that \(\P(n - U = k) = \P(U = n - k) = \binom{n}{k} p^{n-k} (1 - p)^k\) for \(k \in \{0, 1, \ldots, n\}\).
The sum of two independent binomial variables with the same success parameter also has a binomial distribution.
Suppose that \(U\) and \(V\) are independent random variables, and that \(U\) has the binomial distribution with parameters \(m\) and \(p\), and \(V\) has the binomial distribution with parameters \(n\) and \(p\). Then \(U + V\) has the binomial distribution with parameters \(m + n\) and \(p\).
Our first proof uses Bernoulli trials. So let \( (X_1, X_2, \ldots) \) be a Bernoulli trials sequence with parameter \( p \), and let \( Y_k = \sum_{i=1}^k X_i\) for \( k \in \N \). Then \(U\) has the same distribution as \(Y_m\) and \(V\) has the same distribution as \(Y_{m+n} - Y_m\). Since \( Y_m \) and \( Y_{m+n} - Y_m \) are independent, \( U + V \) has the same distribution as \( Y_m + (Y_{m+n} - Y_m) = Y_{m+n} \).
Our second proof uses probability density functions. Let \( f \) denote the PDF of an indicator variable \( X \) with parameter \( p \), so that \( f(x) = p^x (1 - p)^{1 - x} \) for \( x \in \{0, 1\} \). The binomial distribution with parameters \( k \in \N_+ \) and \( p \) has PDF \( f_k = f^{* k} \), the \( k \)-fold convolution power of \( f \). In particular, \( U \) has PDF \( f^{* m} \), \( V \) has PDF \( f^{* n} \) and hence \( U + V \) has PDF \( f^{* m} * f^{* n} = f^{* (m + n)} \).
Our third proof uses probability generating functions. By the result above, \( U \) and \( V \) have PGFs \( P_m(t) = (1 - p + p t)^m \) and \( P_n(t) = (1 - p + p t)^n \) for \( t \in \R \), respectively. Hence by independence, \( U + V \) has PGF \(P_m P_n = P_{m+n} \).
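The convolution argument can also be checked numerically; the following sketch (pure Python, illustrative parameters) convolves the two binomial densities and compares with the binomial density with parameters \( m + n \) and \( p \).

```python
from math import comb

def binomial_pdf(n, p):
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

def convolve(f, g):
    h = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] += a * b
    return h

m, n, p = 4, 7, 0.25
lhs = convolve(binomial_pdf(m, p), binomial_pdf(n, p))
rhs = binomial_pdf(m + n, p)
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
print("PDF of U + V matches the binomial PDF with parameters m + n and p")
```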
Suppose that we have a dichotomous population, that is a population of two types of objects. Specifically, suppose that we have \( m \) objects, and that \( r \) of the objects are type 1 and the remaining \( m - r \) objects are type 0. Thus \( m \in \N_+ \) and \( r \in \{0, 1, \ldots, m\} \). We select \( n \) objects at random from the population, so that all samples of size \( n \) are equally likely. If the sampling is with replacement, the sample size \( n \) can be any positive integer. If the sampling is without replacement, then we must have \( n \in \{1, 2, \ldots, m\} \).
In either case, let \( X_i \) denote the type of the \( i \)'th object selected for \( i \in \{1, 2, \ldots, n\} \) so that \( Y = \sum_{i=1}^n X_i\) is the number of type 1 objects in the sample. As noted in the Introduction, if the sampling is with replacement, \( (X_1, X_2, \ldots, X_n) \) is a sequence of Bernoulli trials, and hence \( Y \) has the binomial distribution with parameters \( n \) and \( p = r / m \). If the sampling is without replacement, then \( Y \) has the hypergeometric distribution with parameters \( m \), \( r \), and \( n \). The probability density function of \( Y \) is given by \[ \P(Y = y) = \frac{\binom{r}{y} \binom{m - r}{n - y}}{\binom{m}{n}} = \binom{n}{y} \frac{r^{(y)} (m - r)^{(n-y)}}{m^{(n)}}, \quad y \in \{0, 1, \ldots, n\} \] and the mean and variance of \( Y \) are \[ \E(Y) = n \frac{r}{m}, \quad \var(Y) = n \frac{r}{m} \left(1 - \frac{r}{m}\right) \frac{m - n}{m - 1} \] If the population size \( m \) is large compared to the sample size \( n \), then the dependence between the indicator variables is slight, and so the hypergeometric distribution should be close to the binomial distribution. The following theorem makes this precise.
Suppose that \( r_m \in \{0, 1, \ldots, m\} \) for each \( m \in \N_+ \) and that \( r_m/m \to p \in [0, 1] \) as \( m \to \infty \). Then for fixed \( n \in \N_+ \), the hypergeometric distribution with parameters \( m \), \( r_m \) and \( n \) converges to the binomial distribution with parameters \( n \) and \( p \) as \( m \to \infty \).
The hypergeometric PDF has the form \[ g_m(y) = \binom{n}{y} \frac{r_m^{(y)} (m - r_m)^{(n-y)}}{m^{(n)}}, \quad y \in \{0, 1, \ldots, n\} \] Note that the fraction above has \( n \) factors in the numerator and \( n \) factors in the denominator. We can group these, in order, to form a product of \( n \) fractions. The first \( y \) fractions have the form \[ \frac{r_m - i}{m - i} \] where \( i \in \{0, 1, \ldots, y - 1\} \). Each of these converges to \( p \) as \( m \to \infty \). The remaining \( n - y \) fractions have the form \[ \frac{m - r_m - j}{m - y - j} \] where \( j \in \{0, 1, \ldots, n - y - 1\} \). For fixed \( y \) and \( n \), each of these converges to \( 1 - p \) as \( m \to \infty \). Hence \( g_m(y) \to \binom{n}{y} p^y (1 - p)^{n -y} \) as \( m \to \infty \) for each \( y \in \{0, 1, \ldots, n\} \)
Under the conditions in the previous theorem, the mean and variance of the hypergeometric distribution converge to the mean and variance of the limiting binomial distribution: \[ n \frac{r_m}{m} \to n p \text{ and } n \frac{r_m}{m}\left(1 - \frac{r_m}{m}\right) \frac{m - n}{m - 1} \to n p (1 - p) \text{ as } m \to \infty \]
By assumption \( r_m / m \to p \) as \( m \to \infty \) and \( n \) is fixed, so also \( (m - n) \big/ (m - 1) \to 1\) as \( m \to \infty \).
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial probability density function. In particular, note the similarity when \(m\) is large and \(n\) small. For selected values of the parameters, and for both sampling modes, run the experiment 1000 times.
From a practical point of view, the convergence of the hypergeometric distribution to the binomial means that if the population size \(m\) is large
compared to the sample size, then the hypergeometric distribution with parameters \( m \), \(r\) and \(n\) is well approximated by the binomial distribution with parameters \( n \) and \(p = r / m\). This is often a useful result, because the binomial distribution has fewer parameters than the hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating binomial distribution, we do not need to know the population size \( m \) and the number of type 1 objects \(r\) individually, but only in the ratio \(r / m\). Generally, the approximation works well if \( m \) is large enough compared to \( n \) that \( \frac{m - n}{m - 1} \) is close to 1. This ensures that the variance of the hypergeometric distribution is close to the variance of the approximating binomial distribution.
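Here is a small numerical illustration of the approximation, assuming SciPy is available (note that SciPy's hypergeometric distribution takes the population size, the number of type 1 objects, and the sample size, in that order):

```python
from scipy.stats import hypergeom, binom

m, r, n = 1000, 400, 10          # large population, small sample
p = r / m
for y in range(n + 1):
    h = hypergeom.pmf(y, m, r, n)
    b = binom.pmf(y, n, p)
    print(y, round(h, 4), round(b, 4))   # the two columns agree closely
```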
Now let's return to our usual sequence of Bernoulli trials \( \bs{X} = (X_1, X_2, \ldots) \), with success parameter \( p \), and to the binomial variables \( Y_n = \sum_{i=1}^n X_i \) for \( n \in \N \). Our next result shows that given \( k \) successes in the first \( n \) trials, the trials on which the successes occur is simply a random sample of size \( k \) chosen without replacement from \( \{1, 2, \ldots, n\} \).
Suppose that \( n \in \N_+ \) and \( k \in \{0, 1, \ldots, n\} \). Then for \( (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \) with \( \sum_{i=1}^n x_i = k \), \[ \P\left[(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n) \mid Y_n = k\right] = \frac{1}{\binom{n}{k}} \]
From the definition of conditional probability, \[ \P\left[(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n) \mid Y_n = k\right] = \frac{\P\left[(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\right]}{\P(Y_n = k)} = \frac{p^k (1 - p)^{n-k}}{\binom{n}{k} p^k (1 - p)^{n-k}}= \frac{1}{\binom{n}{k}} \]
Note in particular that the conditional distribution above does not depend on \( p \). In statistical terms, this means that relative to \( (X_1, X_2, \ldots, X_n) \), the random variable \( Y_n \) is a sufficient statistic for \( p \). Roughly, \( Y_n \) contains all of the information about \( p \) that is available in the entire sample \( (X_1, X_2, \ldots, X_n) \). Next, if \( m \le n \) then the conditional distribution of \( Y_m \) given \( Y_n = k \) is hypergeometric, with population size \( n \), type 1 size \( m \), and sample size \( k \).
Suppose that \( p \in (0, 1) \) and that \(m, \, n, \, k \in \N_+\) with \(m \le n\) and \(k \le n\). Then \[ \P(Y_m = j \mid Y_n = k) = \frac{\binom{m}{j} \binom{n - m}{k - j}}{\binom{n}{k}}, \quad j \in \{0, 1, \ldots, k\} \]
Our first proof uses . Given \( Y_n = k \), the trial numbers of the successes form a random sample of size \( k \) chosen without replacement from \( \{1, 2, \ldots, n\} \). Designate trials \( \{1, 2, \ldots, m\} \) as type 1 and trials \( \{m + 1, m+2, \ldots, n\} \) as type 0. Then \( Y_m \) is the number of type 1 trials in the sample, and hence (given \( Y_n = k \)) has the hypergeometric distribution with population size \( n \), type 1 size \( m \), and sample size \( k \).
Our second proof uses the definition of conditional probability: \[ \P(Y_m = j \mid Y_n = k) = \frac{\P(Y_m = j, Y_n = k)}{\P(Y_n = k)} = \frac{\P(Y_m = j, Y_n - Y_m = k - j)}{\P(Y_n = k)} \] But \( Y_m \) and \( Y_n - Y_m \) are independent. Both variables have binomial distributions; the first with parameters \( m \) and \( p \), and the second with parameters \( n - m \) and \( p \). Hence
\[ \P(Y_m = j \mid Y_n = k) = \frac{\binom{m}{j} p^j (1 - p)^{m - j} \binom{n - m}{k - j} p^{k - j} (1 - p)^{n - m - (k - j)}}{\binom{n}{k} p^k (1 - p)^{n - k}} = \frac{\binom{m}{j} \binom{n - m}{k - j}}{\binom{n}{k}}\] Once again, note that the conditional distribution does not depend on the success parameter \( p \).
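The cancellation of \( p \) is easy to see numerically as well. The sketch below (assuming SciPy) compares the ratio of binomial probabilities from the second proof with the hypergeometric density:

```python
from math import comb
from scipy.stats import hypergeom

n, m, k, p = 20, 8, 5, 0.3   # p cancels, as the result shows
for j in range(k + 1):
    num = comb(m, j) * p**j * (1 - p)**(m - j) \
        * comb(n - m, k - j) * p**(k - j) * (1 - p)**(n - m - (k - j))
    den = comb(n, k) * p**k * (1 - p)**(n - k)
    direct = num / den
    hyper = hypergeom.pmf(j, n, m, k)   # population n, type 1 size m, sample size k
    assert abs(direct - hyper) < 1e-12
print("conditional distribution of Y_m given Y_n = k is hypergeometric")
```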
The Poisson process on \( [0, \infty) \), named for Simeon Poisson, is a model for random points in continuous time. There are many deep and interesting connections between the Bernoulli trials process (which can be thought of as a model for random points in discrete time) and the Poisson process. These connections are explored in detail in the chapter referenced above. In this section we just give the most famous and important result—the convergence of the binomial distribution to the Poisson distribution.
The Poisson distribution with rate parameter \( r \in (0, \infty) \) has probability density function \[ g(n) = e^{-r} \frac{r^n}{n!}, \quad n \in \N \] The parameter \( r \) is both the mean and the variance of the distribution. In addition, the probability generating function is \( t \mapsto e^{r (t - 1)} \).
Suppose that \( p_n \in (0, 1) \) for \( n \in \N_+ \) and that \( n p_n \to r \in (0, \infty) \) as \( n \to \infty \). Then the binomial distribution with parameters \(n\) and \(p_n\) converges to the Poisson distribution with parameter \(r\) as \(n \to \infty\).
Our first proof uses probability density functions. Let \( f_n \) denote the binomial PDF with parameters \( n \) and \( p_n \). Then for \( k \in \{0, 1, \ldots, n\} \) \[ f_n(k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{1}{k!}\left[n p_n\right]\left[(n - 1) p_n\right] \cdots \left[(n - k + 1) p_n\right] \left(1 - n \frac{p_n}{n}\right)^{n-k} \] But \((n - j) p_n \to r\) as \(n \to \infty\) for fixed \(j\). Also, using a basic theorem from calculus, \(\left(1 - n p_n \big/ n\right)^{n-k} \to e^{-r}\) as \(n \to \infty\). Hence \(f_n(k) \to e^{-r} \frac{r^k}{k!} \) as \( n \to \infty \).
Our second proof uses probability generating functions. For \(t \in \R\), using the same basic limit from calculus, \[ \left[(1 - p_n) + p_n t\right]^n = \left[1 + n \frac{p_n}{n}(t - 1)\right]^n \to e^{r(t - 1)} \text{ as } n \to \infty \] The left side is the PGF of the binomial distribution with parameters \( n \) and \( p_n \), while the right side is the PGF of the Poisson distribution with parameter \( r \).
Under the same conditions as in the previous theorem, the mean and variance of the binomial distribution converge to the mean and variance of the limiting Poisson distribution, respectively.
By assumption, \( n p_n \to r \) as \( n \to \infty \), and so it also follows that \( p_n \to 0 \) as \( n \to \infty \). Hence the mean \( n p_n \to r \) and the variance \( n p_n (1 - p_n) \to r \) as \( n \to \infty \).
Compare the Poisson experiment and the binomial timeline experiment.
From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials \(n\) is large and the probability of success \(p\) small, so that \(n p^2\) is small, then the binomial distribution with parameters \(n\) and \(p\) is well approximated by the Poisson distribution with parameter \(r = n p\). This is often a useful result, because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials \(n\) and the probability of success \(p\) individually, but only in the product \(n p\). The condition that \(n p^2\) be small means that the variance of the binomial distribution, namely \(n p (1 - p) = n p - n p^2\), is approximately \(r\), the variance of the approximating Poisson distribution.
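Here is a brief numerical illustration of the Poisson approximation, assuming SciPy is available (parameters chosen so that \( n \) is large and \( n p^2 \) is small):

```python
from scipy.stats import binom, poisson

n, p = 500, 0.006            # n large, p small, n p^2 = 0.018 small
r = n * p                    # r = 3
for k in range(10):
    print(k, round(binom.pmf(k, n, p), 4), round(poisson.pmf(k, r), 4))
# the two columns agree to about two or three decimal places
```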
For a fixed number of trials \( n \), the binomial distribution is a member of two general families of distributions. First, it is a general exponential distribution.
Suppose that \(Y\) has the binomial distribution with parameters \(n\) and \(p\), where \(n \in \N_+\) is fixed and \(p \in (0, 1)\). The distribution of \( Y \) is a one-parameter exponential family with natural parameter \(\ln \left( \frac{p}{1 - p} \right)\) and natural statistic \(Y\).
This follows from the definition of the general exponential family. The support set \(\{0, 1, \ldots, n\} \) does not depend on \( p \), and for \( y \) in this set, \[ f_n(y) = \binom{n}{y} p^y (1 - p)^{n-y} = \binom{n}{y} (1 - p)^n \left(\frac{p}{1 - p}\right)^y = \binom{n}{y} (1 - p)^n \exp\left[y \ln\left(\frac{p}{1 - p}\right)\right] \]
Note that the natural parameter is the logarithm of the odds ratio corresponding to \(p\). This function is sometimes called the logit function. The binomial distribution is also a power series distribution.
Suppose again that \( Y \) has the binomial distribution with parameters \( n \) and \( p \), where \( n \in \N_+ \) is fixed and \( p \in (0, 1) \). The distribution of \( Y \) is a power series distribution in the parameter \( \theta = \frac{p}{1 - p} \), corresponding to the function \( \theta \mapsto (1 + \theta)^n \).
This follows from the definition of the power series distribution. As before, for \( y \in \{0, 1, \ldots, n\} \), \[ f_n(y) = \binom{n}{y} p^y (1 - p)^{n-y} = \binom{n}{y} (1 - p)^n \left(\frac{p}{1 - p}\right)^y = \frac{1}{(1 + \theta)^n} \binom{n}{y} \theta^y\] where \( \theta = \frac{p}{1 - p} \). This is the power series distribution in \( \theta \), with coefficients \( \binom{n}{y} \), corresponding to the function \(\theta \mapsto (1 + \theta)^n \).
Suppose again that \( \bs{X} = (X_1, X_2, \ldots) \) is a sequence of Bernoulli trials with success parameter \( p \), and that as usual, \( Y_n = \sum_{i=1}^n X_i \) is the number of successes in the first \( n \) trials for \( n \in \N \).
The proportion of successes in the first \(n\) trials is the random variable \[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \]
It is easy to express the probability density function of the proportion of successes \(M_n\) in terms of the probability density function of the number of successes \(Y_n\). First, note that \(M_n\) takes the values \(k / n\) where \(k \in \{0, 1, \dots, n\}\).
The probability density function of \(M_n\) is given by \[ \P \left( M_n = \frac{k}{n} \right) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \]
Trivially, \( M_n = k / n \) if and only if \( Y_n = k \) for \( k \in \{0, 1, \ldots, n\} \).
In the binomial coin experiment, select the proportion of heads. Vary \(n\) and \(p\) with the scrollbars and note the shape of the probability density function. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the probability density function.
The mean and variance of the proportion of successes \( M_n \) are easy to compute from the mean and variance of the number of successes \( Y_n \).
The mean and variance of \( M_n \) are (a) \( \E(M_n) = p \) and (b) \( \var(M_n) = p (1 - p) / n \).
From the scaling properties of expected value and variance, \[ \E(M_n) = \frac{1}{n} \E(Y_n) = p, \quad \var(M_n) = \frac{1}{n^2} \var(Y_n) = \frac{p (1 - p)}{n} \]
In the binomial coin experiment, select the proportion of heads. Vary \(n\) and \(p\) and note the size and location of the mean\(\pm\)standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the empirical moments to the distribution moments.
Recall that skewness and kurtosis are standardized measures. Since \( M_n \) and \( Y_n \) have the same standard score, the skewness and kurtosis of \( M_n \) are the same as the skewness and kurtosis of \( Y_n \) given above.
In statistical terms, part (a) of the moment result above means that \(M_n\) is an unbiased estimator of \(p\). From part (b) note that \(\var(M_n) \le \frac{1}{4 n}\) for any \(p \in [0, 1]\). In particular, \(\var(M_n) \to 0\) as \(n \to \infty\) and the convergence is uniform in \(p \in [0, 1]\). Thus, the estimate improves as \(n\) increases; in statistical terms, this is known as consistency.
For every \(\epsilon \gt 0\), \(\P\left(\left|M_n - p\right| \ge \epsilon\right) \to 0\) as \(n \to \infty\) and the convergence is uniform in \(p \in [0, 1]\).
This follows from the variance result above and Chebyshev's inequality.
This result is a special case of the weak law of large numbers and means that \(M_n \to p\) as \(n \to \infty\) in probability. The strong law of large numbers states that the convergence actually holds with probability 1.
The proportion of successes \( M_n \) has a number of nice properties as an estimator of the probability of success \( p \). As already noted, it is unbiased and consistent. As shown above, \( Y_n \) is a sufficient statistic for \( p \), based on the sample \( (X_1, X_2, \ldots, X_n) \). It follows that \( M_n \) is sufficient for \( p \) as well. Since \( \E(X_i) = p \), \( M_n \) is trivially the method of moments estimator of \( p \). Assuming that the parameter space for \( p \) is \( [0, 1] \), it is also the maximum likelihood estimator of \( p \).
The likelihood function for \( p \in [0, 1] \), based on the observed sample \( (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \), is \( L_y(p) = p^y (1 - p)^{n-y} \), where \( y = \sum_{i=1}^n x_i \). The likelihood is maximized at \( m = y / n \).
By definition, the likelihood function is simply the joint PDF of \( (X_1, X_2, \ldots, X_n) \) thought of as a function of the parameter \( p \), for fixed \( (x_1, x_2, \ldots, x_n) \). Thus the form of the likelihood function follows from the joint PDF given in the introduction. If \( y = 0 \), \( L_y \) is decreasing and hence is maximized at \( p = 0 = y / n \). If \( y = n \), \( L_y \) is increasing and is maximized at \( p = 1 = y / n \). If \( 0 \lt y \lt n \), the log-likelihood function is \[ \ln\left[L_y(p)\right] = y \ln(p) + (n - y) \ln(1 - p) \] and the derivative is \[ \frac{d}{dp} \ln\left[L_y(p)\right] = \frac{y}{p} - \frac{n - y}{1 - p} \] There is a single critical point at \( y / n \). The second derivative of the log-likelihood function is negative, so the maximum on \( (0, 1) \) occurs at the critical point.
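The maximization can be sanity-checked numerically by evaluating the log-likelihood on a grid. The sketch below assumes NumPy and uses simulated data with an illustrative value of \( p \):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_true = 50, 0.3                      # illustrative sample size and parameter
x = rng.binomial(1, p_true, size=n)      # simulated Bernoulli trials
y = x.sum()

p_grid = np.linspace(0.001, 0.999, 999)
log_lik = y * np.log(p_grid) + (n - y) * np.log(1 - p_grid)
print("grid maximizer:", p_grid[np.argmax(log_lik)])
print("y / n         :", y / n)          # the two should essentially agree
```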
See the section on estimation in the Bernoulli model in the chapter on set estimation for a different approach to the problem of estimating \(p\).
Open the binomial timeline experiment. For selected values of \(p \in (0, 1)\), start with \(n = 1\) and successively increase \(n\) by 1. For each value of \(n\), note the shape of the probability density function of the number of successes and the proportion of successes. With \(n = 100\), run the experiment 1000 times and compare the empirical density function to the probability density function for the number of successes and the proportion of successes.
The characteristic bell shape that you should observe is an example of the central limit theorem, because the binomial variable can be written as a sum of \(n\) independent, identically distributed random variables (the indicator variables).
The standard score \( Z_n \) of \(Y_n\) is the same as the standard score of \(M_n\): \[ Z_n = \frac{Y_n - n p}{\sqrt{n p (1 - p)}} = \frac{M_n - p}{\sqrt{p (1 - p) / n}} \] The distribution of \( Z_n \) converges to the standard normal distribution as \(n \to \infty\).
This version of the central limit theorem is known as the DeMoivre-Laplace theorem, and is named after Abraham DeMoivre and Pierre-Simon Laplace. From a practical point of view, this result means that, for large \(n\), the distribution of \(Y_n\) is approximately normal, with mean \(n p\) and standard deviation \(\sqrt{n p (1 - p)}\), and the distribution of \(M_n\) is approximately normal, with mean \(p\) and standard deviation \(\sqrt{p (1 - p) / n}\). Just how large \(n\) needs to be for the normal approximation to work well depends on the value of \(p\). The rule of thumb is that we need \(n p \ge 5\) and \(n (1 - p) \ge 5\) (the first condition is the significant one when \( p \le \frac{1}{2} \) and the second condition is the significant one when \( p \ge \frac{1}{2} \)). Finally, when using the normal approximation, we should remember to use the continuity correction, since the binomial is a discrete distribution.
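Here is a sketch of the normal approximation with the continuity correction, assuming SciPy, compared with the exact binomial probability of an interval (illustrative parameters satisfying the rule of thumb):

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 100, 0.3
mu, sigma = n * p, sqrt(n * p * (1 - p))

a, b = 25, 35                                  # P(25 <= Y_100 <= 35)
exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)
approx = norm.cdf(b + 0.5, mu, sigma) - norm.cdf(a - 0.5, mu, sigma)
print(round(exact, 4), round(approx, 4))       # the approximation is quite close
```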
A student takes a multiple choice test with 20 questions, each with 5 choices (only one of which is correct). Suppose that the student blindly guesses. Let \(X\) denote the number of questions that the student answers correctly. Find each of the following:
A certain type of missile has failure probability 0.02. Let \(Y\) denote the number of failures in 50 tests. Find each of the following:
Suppose that in a certain district, 40% of the registered voters prefer candidate \(A\). A random sample of 50 registered voters is selected. Let \(Z\) denote the number in the sample who prefer \(A\). Find each of the following:
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability \(\frac{1}{4}\) each, and faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\).
A standard, fair die is tossed 10 times. Let \(N\) denote the number of aces. Find each of the following:
A coin is tossed 100 times and results in 30 heads. Find the probability density function of the number of heads in the first 20 tosses.
Let \(Y_n\) denote the number of heads in the first \(n\) tosses.
\[ \P(Y_{20} = y \mid Y_{100} = 30) = \frac{\binom{20}{y} \binom{80}{30 - y}}{\binom{100}{30}}, \quad y \in \{0, 1, \ldots, 20\} \]

An ace-six flat die is rolled 1000 times. Let \(Z\) denote the number of times that a score of 1 or 2 occurred. Find each of the following:
In the binomial coin experiment, select the proportion of heads. Set \(n = 10\) and \(p = 0.4\). Run the experiment 100 times. Over all 100 runs, compute the square root of the average of the squares of the errors, when \(M\) is used to estimate \(p\). This number is a measure of the quality of the estimate.
In the binomial coin experiment, select the number of heads \(Y\), and set \(p = 0.5\) and \(n = 15\). Run the experiment 1000 times and compute the following:
In the binomial coin experiment, select the proportion of heads \(M\) and set \(n = 30\), \(p = 0.6\). Run the experiment 1000 times and compute each of the following:
In 1693, Samuel Pepys asked Isaac Newton whether it is more likely to get at least one 6 in 6 rolls of a die, or at least two 6's in 12 rolls, or at least three 6's in 18 rolls. This problem is known as Pepys' problem; naturally, Pepys had fair dice in mind.
Solve Pepys' problem using the binomial distribution.
Let \(Y_n\) denote the number of 6's in \(n\) rolls of a fair die, so that \( Y_n \) has the binomial distribution with parameters \( n \) and \( p = \frac{1}{6} \). Using the binomial PDF (and complements to simplify the computations), \begin{align} \P(Y_6 \ge 1) & = 1 - \left(\frac{5}{6}\right)^6 \approx 0.6651 \\ \P(Y_{12} \ge 2) & = 1 - \left(\frac{5}{6}\right)^{12} - 12 \frac{1}{6} \left(\frac{5}{6}\right)^{11} \approx 0.6187 \\ \P(Y_{18} \ge 3) & = 1 - \sum_{k=0}^{2} \binom{18}{k} \left(\frac{1}{6}\right)^k \left(\frac{5}{6}\right)^{18-k} \approx 0.5973 \end{align}
So the first event, at least one 6 in 6 rolls, is the most likely, and in fact the three events, in the order given, decrease in probability.
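The three probabilities are easy to compute directly; a short pure-Python sketch:

```python
from math import comb

def prob_at_least(n, k, p=1/6):
    """P(Y_n >= k) for the binomial distribution, via the complement."""
    return 1 - sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))

print(prob_at_least(6, 1))    # about 0.6651
print(prob_at_least(12, 2))   # about 0.6187
print(prob_at_least(18, 3))   # about 0.5973
```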
With fair dice, run the simulation of the dice experiment 500 times and compute the relative frequency of the event of interest. Compare the results with the theoretical probabilities above.
It appears that Pepys had some sort of misguided linearity in mind, given the comparisons that he wanted. Let's solve the general problem.
For \( k \in \N_+ \), let \( Y_k \) denote the number of 6's in \(6 k\) rolls of a fair die.
So on average, the number of 6's in \( 6 k \) rolls of a fair die is \( k \), and that fact might have influenced Pepys' thinking. The next problem is known as De Mere's problem, named after the Chevalier De Mere.
Which is more likely: at least one 6 with 4 throws of a fair die, or at least one double 6 in 24 throws of two fair dice?
Let \(Y_n\) denote the number of 6's in \(n\) rolls of a fair die, and let \(Z_n\) denote the number of double 6's in \(n\) rolls of a pair of fair dice. Then \( Y_n \) has the binomial distribution with parameters \( n \) and \( p = \frac{1}{6} \), and \( Z_n \) has the binomial distribution with parameters \( n \) and \( p = \frac{1}{36} \). Using the binomial PDF, \[ \P(Y_4 \ge 1) = 1 - \left(\frac{5}{6}\right)^4 \approx 0.5177, \quad \P(Z_{24} \ge 1) = 1 - \left(\frac{35}{36}\right)^{24} \approx 0.4914 \] So at least one 6 in 4 throws of a single die is the more likely event.
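The two probabilities reduce to a two-line computation:

```python
print(1 - (5 / 6)**4)      # at least one 6 in 4 throws: about 0.5177
print(1 - (35 / 36)**24)   # at least one double 6 in 24 throws: about 0.4914
```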
In the cicada data, compute the proportion of males in the entire sample, and the proportion of males of each species in the sample.
In the M&M data, pool the bags to create a large sample of M&Ms. Now compute the sample proportion of red M&Ms.
\(m_{\text{red}} = 0.168\)
The Galton board is a triangular array of pegs. The rows are numbered by the natural numbers \(\N = \{0, 1, \ldots\}\) from top downward. Row \(n\) has \(n + 1\) pegs numbered from left to right by the integers \(\{0, 1, \ldots, n\}\). Thus a peg can be uniquely identified by the ordered pair \((n, k)\) where \(n\) is the row number and \(k\) is the peg number in that row. The Galton board is named after Francis Galton.
Now suppose that a ball is dropped from above the top peg \((0, 0)\). Each time the ball hits a peg, it bounces to the right with probability \(p\) and to the left with probability \(1 - p\), independently from bounce to bounce.
The number of the peg that the ball hits in row \(n\) has the binomial distribution with parameters \(n\) and \(p\).
In the Galton board experiment, select random variable \(Y\) (the number of moves right). Vary the parameters \(n\) and \(p\) and note the shape and location of the probability density function and the mean\(\pm\)standard deviation bar. For selected values of the parameters, click single step several times and watch the ball fall through the pegs. Then run the experiment 1000 times and watch the path of the ball. Compare the relative frequency function and empirical moments to the probability density function and distribution moments, respectively.
Recall the discussion of structural reliability given in the introduction. In particular, we have a system of \(n\) similar components that function independently, each with reliability \(p\). Suppose now that the system as a whole functions properly if and only if at least \(k\) of the \(n\) components are good. Such a system is called, appropriately enough, a \(k\) out of \(n\) system. Note that the series and parallel systems considered in the previous section are \(n\) out of \(n\) and 1 out of \(n\) systems, respectively.
Consider the \(k\) out of \(n\) system. The system functions if and only if \( Y_n \ge k \), where \( Y_n \) is the number of working components, so the reliability function is \[ r_{n,k}(p) = \P(Y_n \ge k) = \sum_{i=k}^n \binom{n}{i} p^i (1 - p)^{n-i}, \quad p \in [0, 1] \]
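Here is a sketch of the reliability function \( r_{n,k} \) in pure Python, which can also be used for the exercises that follow:

```python
from math import comb

def reliability(n, k, p):
    """Reliability of a k out of n system: P(Y_n >= k)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(reliability(10, 10, 0.9))   # series (10 out of 10) system
print(reliability(10, 1, 0.9))    # parallel (1 out of 10) system
print(reliability(4, 2, 0.7))     # 2 out of 4 system
```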
In the binomial coin experiment, set \(n = 10\) and \(p = 0.9\) and run the simulation 1000 times. Compute the empirical reliability and compare with the true reliability in each of the following cases:
Consider a system with \(n = 4\) components. Sketch the graphs of \(r_{4,1}\), \(r_{4,2}\), \(r_{4,3}\), and \(r_{4,4}\) on the same set of axes.
An \(n\) out of \(2 n - 1\) system is a majority rules system.
In the binomial coin experiment, compute the empirical reliability, based on 100 runs, in each of the following cases. Compare your results to the true probabilities.
Consider the transmission of bits (0s and 1s) through a noisy channel. Specifically, suppose that when bit \(i \in \{0, 1\}\) is transmitted, bit \(i\) is received with probability \(p_i \in (0, 1)\) and the complementary bit \(1 - i\) is received with probability \(1 - p_i\). Given the bits transmitted, bits are received correctly or incorrectly independently of one-another. Suppose now, that to increase reliability, a given bit \(I\) is repeated \(n \in \N_+\) times in the transmission. A priori, we believe that \(\P(I = 1) = \alpha \in (0, 1)\) and \(\P(I = 0) = 1 - \alpha\). Let \(X\) denote the number of 1s received when bit \(I\) is transmitted \(n\) times.
Find each of the following:
Simplify the results in the previous exercise in the symmetric case where \(p_1 = p_0 =: p\) (so that the bits are equally reliable) and with \(\alpha = \frac{1}{2}\) (so that we have no prior information).
Our interest, of course, is predicting the bit transmitted given the bits received.
Find the posterior probability that \(I = 1\) given \(X = k \in \{0, 1, \ldots, n\}\).
\[\P(I = 1 \mid X = k) = \frac{\alpha p_1^k (1 - p_1)^{n-k}}{\alpha p_1^k (1 - p_1)^{n-k} + (1 - \alpha) (1 - p_0)^k p_0^{n-k}}\]
Presumably, our decision rule would be to conclude that 1 was transmitted if the posterior probability in the previous exercise is greater than \(\frac{1}{2}\) and to conclude that 0 was transmitted if this probability is less than \(\frac{1}{2}\). If the probability equals \(\frac{1}{2}\), we have no basis to prefer one bit over the other.
Give the decision rule in the symmetric case where \(p_1 = p_0 =: p\), so that the bits are equally reliable. Assume that \(p \gt \frac{1}{2}\), so that we at least have a better than even chance of receiving the bit transmitted.
Given \(X = k\), we conclude that bit 1 was transmitted if \[k \gt \frac{n}{2} - \frac{1}{2} \frac{\ln(\alpha) - \ln(1 - \alpha)}{\ln(p) - \ln(1 - p)}\] and we conclude that bit 0 was transmitted if the reverse inequality holds.
Not surprisingly, in the symmetric case with no prior information, so that \(\alpha = \frac{1}{2}\), we conclude that bit \(i\) was transmitted if a majority of bits received are \(i\).
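The posterior probability and the resulting decision rule can be coded directly; the sketch below uses pure Python with illustrative parameter values and reproduces the majority rule in the symmetric case with \( \alpha = \frac{1}{2} \):

```python
def posterior_one(k, n, p1, p0, alpha):
    """P(I = 1 | X = k) from Bayes' theorem, as derived above."""
    num = alpha * p1**k * (1 - p1)**(n - k)
    den = num + (1 - alpha) * (1 - p0)**k * p0**(n - k)
    return num / den

n, p, alpha = 5, 0.9, 0.5          # symmetric channel, no prior information
for k in range(n + 1):
    post = posterior_one(k, n, p, p, alpha)
    decision = 1 if post > 0.5 else 0 if post < 0.5 else None
    print(k, round(post, 4), decision)   # majority rule: decide 1 when k > n/2
```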
The Weierstrass Approximation Theorem, named after Karl Weierstrass, states that any real-valued function that is continuous on a closed, bounded interval can be uniformly approximated on that interval, to any degree of accuracy, with a polynomial. The theorem is important, since polynomials are simple and basic functions, and a bit surprising, since continuous functions can be quite strange. In 1911, Sergi Bernstein gave an explicit construction of polynomials that uniformly approximate a given continuous function, using Bernoulli trials. Bernstein's result is a beautiful example of the probabilistic method, the use of probability theory to obtain results in other areas of mathematics that are seemingly unrelated to probability.
Suppose that \(f\) is a real-valued function that is continuous on the interval \([0, 1]\). The Bernstein polynomial of degree \(n\) for \(f\) is defined by \[ b_n(p) = \E_p\left[f(M_n)\right], \quad p \in [0, 1] \] where \(M_n\) is the proportion of successes in the first \(n\) Bernoulli trials with success parameter \(p\).
Note that we are emphasizing the dependence on \(p\) in the expected value operator. Our next result gives a more explicit representation, and shows that the Bernstein polynomial is, in fact, a polynomial.
The Bernstein polynomial of degree \(n\) can be written as follows: \[ b_n(p) = \sum_{k=0}^n f \left(\frac{k}{n} \right) \binom{n}{k} p^k (1 - p)^{n-k}, \quad p \in [0, 1] \]
This follows from the change of variables theorem for expected value.
The Bernstein polynomials satisfy the following properties: (a) \( b_n(0) = f(0) \) and \( b_n(1) = f(1) \), (b) \( b_1(p) = f(0) + \left[f(1) - f(0)\right] p \), and (c) \( b_2(p) = f(0) + 2 \left[f\left(\frac{1}{2}\right) - f(0)\right] p + \left[f(1) - 2 f\left(\frac{1}{2}\right) + f(0)\right] p^2 \).
From part (a) of the result above, the graph of \(b_n\) passes through the endpoints \(\left(0, f(0)\right)\) and \(\left(1, f(1)\right)\). From part (b), the graph of \(b_1\) is a line connecting the endpoints. From (c), the graph of \(b_2\) is a parabola passing through the endpoints and the point \(\left( \frac{1}{2}, \frac{1}{4} \, f(0) + \frac{1}{2} \, f\left(\frac{1}{2}\right) + \frac{1}{4} \, f(1) \right)\).
Our next result gives Bernstein's theorem explicitly.
\(b_n \to f\) as \(n \to \infty\) uniformly on \([0, 1]\).
Since \(f\) is continuous on the closed, bounded interval \([0, 1]\), it is bounded on this interval. Thus, there exists a constant \(C\) such that \(\left|f(p)\right| \le C\) for all \(p \in [0, 1]\). Also, \(f\) is uniformly continuous on \([0, 1]\). Thus, for any \(\epsilon \gt 0\) there exists \(\delta \gt 0\) such that if \(p, \, q \in [0, 1]\) and \(\left|p - q\right| \lt \delta\) then \(\left|f(p) - f(q)\right| \lt \epsilon\). From basic properties of expected value, \[ \left|b_n(p) - f(p)\right| \le \E_p\left[\left|f(M_n) - f(p)\right|, \left|M_n - p\right| \lt \delta\right] + \E_p\left[\left|f(M_n) - f(p)\right|, \left|M_n - p\right| \ge \delta\right] \] Hence \(\left|b_n(p) - f(p)\right| \le \epsilon + 2 C \P_p\left(\left|M_n - p\right| \ge \delta\right)\) for any \(p \in [0, 1]\). But by the weak law of large numbers above, \(\P_p\left(\left|M_n - p\right| \ge \delta\right) \to 0\) as \(n \to \infty\) uniformly in \(p \in [0, 1]\).
Compute the Bernstein polynomials of orders 1, 2, and 3 for the function \(f\) defined by \(f(x) = \cos(\pi x)\) for \( x \in [0, 1]\). Graph \(f\) and the three polynomials on the same set of axes.
Use a computer algebra system to compute the Bernstein polynomials of orders 10, 20, and 30 for the function \(f\) defined below. Use the CAS to graph the function and the three polynomials on the same axes. \[ f(x) = \begin{cases} 0, & x = 0 \\ x \sin\left(\frac{\pi}{x}\right), & x \in (0, 1] \end{cases} \]
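Here is a sketch (assuming NumPy and Matplotlib are available) that computes Bernstein polynomials of a given order from the explicit representation above and plots them against \( f \); it can be adapted for either of the two exercises above.

```python
import numpy as np
import matplotlib.pyplot as plt
from math import comb

def bernstein(f, n, p):
    """Bernstein polynomial b_n(p) = sum_k f(k/n) C(n,k) p^k (1-p)^(n-k)."""
    p = np.asarray(p, dtype=float)
    return sum(f(k / n) * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

def f(x):
    # f(x) = x sin(pi / x) with f(0) = 0; the inner where avoids division by zero
    return np.where(x == 0, 0.0, x * np.sin(np.pi / np.where(x == 0, 1.0, x)))

p = np.linspace(0, 1, 500)
plt.plot(p, f(p), label="f")
for n in (10, 20, 30):
    plt.plot(p, bernstein(f, n, p), label=f"b_{n}")
plt.legend()
plt.show()
```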