The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate; indeed it is the reason that many statistical procedures work.
Partial Sum Processes
Definitions
Suppose that $\boldsymbol{X} = (X_1, X_2, \ldots)$ is a sequence of independent, identically distributed, real-valued random variables with common probability density function $f$, mean $\mu$, and variance $\sigma^2$. We assume that $\sigma > 0$, so that in particular, the random variables really are random and not constants. Let
$$Y_n = \sum_{i=1}^n X_i, \quad n \in \mathbb{N}$$
Note that by convention, $Y_0 = 0$, since the sum is over an empty index set. The random process $\boldsymbol{Y} = (Y_0, Y_1, Y_2, \ldots)$ is called the partial sum process associated with $\boldsymbol{X}$. Special types of partial sum processes have been studied in many places in this text.
Recall that in statistical terms, the sequence $\boldsymbol{X}$ corresponds to sampling from the underlying distribution. In particular, $(X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the distribution, and the corresponding sample mean is
$$M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i$$
By the law of large numbers, $M_n \to \mu$ as $n \to \infty$ with probability 1.
Stationary, Independent Increments
The partial sum process corresponding to a sequence of independent, identically distributed variables has two important properties, and these properties essentially characterize such processes.
If $m \le n$ then $Y_n - Y_m$ has the same distribution as $Y_{n - m}$. Thus the process $\boldsymbol{Y}$ has stationary increments.
Details:
Note that $Y_n - Y_m = \sum_{i=m+1}^n X_i$, so $Y_n - Y_m$ is the sum of $n - m$ independent variables, each with the common distribution. Of course, $Y_{n-m} = \sum_{i=1}^{n-m} X_i$ is also the sum of $n - m$ independent variables, each with the common distribution.
Note however that $Y_n - Y_m$ and $Y_{n-m}$ are very different random variables; the theorem simply states that they have the same distribution.
If $n_1 \le n_2 \le n_3 \le \cdots$ then $\left(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots\right)$ is a sequence of independent random variables. Thus the process $\boldsymbol{Y}$ has independent increments.
Details:
The terms in the sequence of increments are sums over disjoint collections of terms in the sequence $\boldsymbol{X}$. Since the sequence $\boldsymbol{X}$ is independent, so is the sequence of increments.
Conversely, suppose that $\boldsymbol{V} = (V_0, V_1, V_2, \ldots)$ is a random process with $V_0 = 0$ and with stationary, independent increments. Define $U_i = V_i - V_{i-1}$ for $i \in \mathbb{N}_+$. Then $\boldsymbol{U} = (U_1, U_2, \ldots)$ is a sequence of independent, identically distributed variables, and $\boldsymbol{V}$ is the partial sum process associated with $\boldsymbol{U}$.
So partial sum processes are the only discrete-time random processes that have stationary, independent increments. An interesting, and much harder, problem is to characterize the continuous-time processes that have stationary, independent increments. The Poisson counting process has stationary, independent increments, as does the Brownian motion process.
Moments
If $n \in \mathbb{N}$ then
- $\mathbb{E}(Y_n) = n \mu$
- $\operatorname{var}(Y_n) = n \sigma^2$
Details:
The results follow from basic properties of expected value and variance. Expected value is a linear operation, so $\mathbb{E}(Y_n) = \sum_{i=1}^n \mathbb{E}(X_i) = n \mu$. By independence, $\operatorname{var}(Y_n) = \sum_{i=1}^n \operatorname{var}(X_i) = n \sigma^2$.
If $m, n \in \mathbb{N}$ with $m \le n$ then
- $\operatorname{cov}(Y_m, Y_n) = m \sigma^2$
- $\operatorname{cor}(Y_m, Y_n) = \sqrt{m / n}$
- $\mathbb{E}(Y_m Y_n) = m \sigma^2 + m n \mu^2$
Details:
- Note that $Y_n = Y_m + (Y_n - Y_m)$, and that $Y_m$ and $Y_n - Y_m$ are independent. So this result follows from basic properties of covariance, [1], and [2]: $\operatorname{cov}(Y_m, Y_n) = \operatorname{cov}(Y_m, Y_m) + \operatorname{cov}(Y_m, Y_n - Y_m) = \operatorname{var}(Y_m) = m \sigma^2$
- This result follows from part (a) of [4]: $\operatorname{cor}(Y_m, Y_n) = \dfrac{\operatorname{cov}(Y_m, Y_n)}{\operatorname{sd}(Y_m) \operatorname{sd}(Y_n)} = \dfrac{m \sigma^2}{\sigma \sqrt{m} \, \sigma \sqrt{n}} = \sqrt{m / n}$
- This result also follows from part (a) of [4]: $\mathbb{E}(Y_m Y_n) = \operatorname{cov}(Y_m, Y_n) + \mathbb{E}(Y_m) \mathbb{E}(Y_n) = m \sigma^2 + m n \mu^2$
If $X$ has moment generating function $G$ then $Y_n$ has moment generating function $G^n$.
Details:
This follows from a basic property of generating functions: the generating function of a sum of independent variables is the product of the generating functions of the terms.
Distributions
Suppose that $X$ has either a discrete distribution or a continuous distribution with probability density function $f$. Then the probability density function of $Y_n$ is $f^{*n}$, the convolution power of $f$ of order $n$.
Details:
This follows from a basic property of PDFs: the PDF of a sum of independent variables is the convolution of the PDFs of the terms.
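For a distribution on the integers, the convolution power is easy to compute numerically. Below is a minimal sketch in Python (using NumPy; the fair die is an arbitrary example, not part of the text) that computes the PDF of the sum of $n$ fair dice by repeated convolution.

```python
import numpy as np

# PDF of one fair die on {1, 2, ..., 6}
f = np.full(6, 1 / 6)

# PDF of the sum of n dice, computed as the convolution power f^{*n}
n = 3
pdf = f.copy()
for _ in range(n - 1):
    pdf = np.convolve(pdf, f)

# pdf[j] is P(Y_n = j + n), since the smallest possible sum is n
support = np.arange(n, 6 * n + 1)
for y, p in zip(support, pdf):
    print(y, round(p, 4))
```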
More generally, we can use the stationary and independence properties to find the joint distributions of the partial sum process. For example, $(Y_1, Y_2, \ldots, Y_n)$ has joint probability density function $g_n$ given by
$$g_n(y_1, y_2, \ldots, y_n) = f(y_1) f(y_2 - y_1) \cdots f(y_n - y_{n-1})$$
The Central Limit Theorem
First, let's make the central limit theorem more precise. From the moment results in [4], we cannot expect $Y_n$ itself to have a limiting distribution. Note that $\operatorname{var}(Y_n) = n \sigma^2 \to \infty$ as $n \to \infty$ since $\sigma > 0$, and $\mathbb{E}(Y_n) = n \mu \to \infty$ as $n \to \infty$ if $\mu > 0$ while $\mathbb{E}(Y_n) \to -\infty$ as $n \to \infty$ if $\mu < 0$. Similarly, we know that $M_n \to \mu$ as $n \to \infty$ with probability 1, so the limiting distribution of the sample mean is degenerate. Thus, to obtain a limiting distribution of $Y_n$ or $M_n$ that is not degenerate, we need to consider, not these variables themselves, but rather the common standard score. Thus, let
$$Z_n = \frac{Y_n - n \mu}{\sqrt{n} \, \sigma} = \frac{M_n - \mu}{\sigma / \sqrt{n}}$$
$Z_n$ has mean 0 and variance 1.
Details:
These results follow from basic properties of expected value and variance, and are true for the standard score associated with any random variable (with finite, nonzero variance). Recall also that the standard score of a variable is invariant under a linear transformation with positive slope. The fact that the standard score of $Y_n$ and the standard score of $M_n$ are the same is a special case of this, since $M_n = Y_n / n$.
The precise statement of the central limit theorem is that the distribution of the standard score $Z_n$ converges to the standard normal distribution as $n \to \infty$. Recall that the standard normal distribution has probability density function $\phi$ given by
$$\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-z^2 / 2}, \quad z \in \mathbb{R}$$
A special case of the central limit theorem (for Bernoulli trials) dates to Abraham de Moivre. The term central limit theorem was coined by George Pólya in 1920. By definition of convergence in distribution, the central limit theorem states that $F_n(z) \to \Phi(z)$ as $n \to \infty$ for each $z \in \mathbb{R}$, where $F_n$ is the distribution function of $Z_n$ and $\Phi$ is the standard normal distribution function:
$$\Phi(z) = \int_{-\infty}^z \phi(x) \, dx, \quad z \in \mathbb{R}$$
An equivalent statement of the central limit theorem involves convergence of the corresponding characteristic functions. This is the version that we will give and prove, but first we need a generalization of a famous limit from calculus.
Suppose that $(a_1, a_2, \ldots)$ is a sequence of real numbers and that $a_n \to a$ as $n \to \infty$. Then
$$\left(1 + \frac{a_n}{n}\right)^n \to e^a \text{ as } n \to \infty$$
Now let $\chi$ denote the characteristic function of the standard score of the sample variable $X$, and let $\chi_n$ denote the characteristic function of the standard score $Z_n$:
$$\chi(t) = \mathbb{E}\left[\exp\left(i t \frac{X - \mu}{\sigma}\right)\right], \quad \chi_n(t) = \mathbb{E}\left[\exp\left(i t Z_n\right)\right], \quad t \in \mathbb{R}$$
Recall that $t \mapsto e^{-t^2 / 2}$ is the characteristic function of the standard normal distribution. We can now give a proof.
The central limit theorem. The distribution of $Z_n$ converges to the standard normal distribution as $n \to \infty$. That is, $\chi_n(t) \to e^{-t^2 / 2}$ as $n \to \infty$ for each $t \in \mathbb{R}$.
Details:
Note that $\chi(0) = 1$, $\chi'(0) = 0$, and $\chi''(0) = -1$, since the standard score $(X - \mu) / \sigma$ has mean 0 and variance 1. Next,
$$Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma}$$
From properties of characteristic functions, $\chi_n(t) = \left[\chi\left(t / \sqrt{n}\right)\right]^n$ for $t \in \mathbb{R}$. By Taylor's theorem (named after Brook Taylor), for each $n$ there exists $t_n$ with $|t_n| \le |t| / \sqrt{n}$ such that
$$\chi\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{t^2}{2 n} \chi''(t_n)$$
But $t_n \to 0$ and hence $\chi''(t_n) \to \chi''(0) = -1$ as $n \to \infty$. Finally,
$$\chi_n(t) = \left[1 + \frac{t^2 \chi''(t_n)}{2 n}\right]^n \to e^{-t^2 / 2} \text{ as } n \to \infty$$
by the limit result above.
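Although the proof is analytic, the convergence is easy to see empirically. The following sketch (Python with NumPy and SciPy; the exponential distribution and the sample sizes are arbitrary choices, not part of the text) estimates $\mathbb{P}(Z_n \le 1)$ by simulation and compares it with $\Phi(1) \approx 0.8413$.

```python
import numpy as np
from scipy.stats import norm

# Empirical illustration of the CLT: standard scores of partial sums of
# exponential(1) variables (mu = sigma = 1), compared with the standard
# normal CDF at z = 1.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0

for n in [1, 5, 30, 200]:
    x = rng.exponential(scale=1.0, size=(100_000, n))
    z = (x.sum(axis=1) - n * mu) / (np.sqrt(n) * sigma)
    print(n, (z <= 1.0).mean(), norm.cdf(1.0))
```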
Normal Approximations
The central limit theorem implies that if the sample size $n$ is large, then the distribution of the partial sum $Y_n$ is approximately normal with mean $n \mu$ and variance $n \sigma^2$. Equivalently, the sample mean $M_n$ is approximately normal with mean $\mu$ and variance $\sigma^2 / n$. The central limit theorem is of fundamental importance, because it means that we can approximate the distribution of certain statistics, even if we know very little about the underlying sampling distribution.
Of course, the term large is relative. Roughly, the more abnormal the basic distribution, the larger $n$ must be for normal approximations to work well. The rule of thumb is that a sample size $n$ of at least 30 will usually suffice if the basic distribution is not too weird, although for many distributions a smaller $n$ will do.
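As an illustration of the rule of thumb, the following sketch (Python with NumPy and SciPy; the particular basic distributions and the probability being approximated are arbitrary choices) compares a simulated probability for the partial sum with $n = 30$ terms against its normal approximation, for a symmetric and a skewed basic distribution.

```python
import numpy as np
from scipy.stats import norm

# How good is the normal approximation with n = 30? Compare a simulated
# probability for the partial sum with the normal approximation.
rng = np.random.default_rng(1)
n = 30

cases = [
    ("uniform(0,1)", lambda size: rng.uniform(0.0, 1.0, size), 0.5, np.sqrt(1 / 12)),
    ("exponential(1)", lambda size: rng.exponential(1.0, size), 1.0, 1.0),
]
for name, sampler, mu, sigma in cases:
    y = sampler((200_000, n)).sum(axis=1)
    a, b = n * mu - np.sqrt(n) * sigma, n * mu + np.sqrt(n) * sigma  # mean of Y_n +/- one sd
    simulated = ((y >= a) & (y <= b)).mean()
    approx = norm.cdf(b, n * mu, np.sqrt(n) * sigma) - norm.cdf(a, n * mu, np.sqrt(n) * sigma)
    print(name, round(simulated, 4), round(approx, 4))
```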
Let $Y$ denote the sum of the variables in a random sample of size 30 from the uniform distribution on $[0, 1]$. Find normal approximations to each of the following:
- $\mathbb{P}(12 \le Y \le 17)$
- The 90th percentile of $Y$
Details:
- 0.8682
- 17.03
The random variable $Y$ in [12] has the Irwin-Hall distribution of order 30, named for Joseph Irwin and Philip Hall.
In the special distribution simulator, select the Irwin-Hall distribution. Vary $n$ from 1 to 10 and note the shape of the probability density function. With $n = 10$, run the experiment 1000 times and compare the empirical density function to the true probability density function.
Let $M$ denote the sample mean of a random sample of size 50 from the distribution with probability density function $f(x) = \frac{3}{x^4}$ for $1 \le x < \infty$. This is a Pareto distribution, named for Vilfredo Pareto. Find normal approximations to each of the following:
- $\mathbb{P}(M > 1.6)$
- The 60th percentile of $M$
Details:
- 0.2071
- 1.531
The Continuity Correction
A slight technical problem arises when the sampling distribution is discrete. In this case, the partial sum also has a discrete distribution, and hence we are approximating a discrete distribution with a continuous one. Suppose that $X$ takes integer values (the most common case), and hence so does the partial sum $Y_n$. For any $k \in \mathbb{Z}$ and $h \in [0, 1)$, note that the event $\{k - h \le Y_n \le k + h\}$ is equivalent to the event $\{Y_n = k\}$. Different values of $h$ lead to different normal approximations, even though the events are equivalent. The smallest approximation would be 0 when $h = 0$, and the approximations increase as $h$ increases. It is customary to split the difference by using $h = \frac{1}{2}$ for the normal approximation. This is sometimes called the half-unit continuity correction or the histogram correction. The continuity correction is extended to other events in the natural way, using the additivity of probability.
Suppose that $j, k \in \mathbb{Z}$ with $j \le k$.
- For the event $\{j \le Y_n \le k\}$, use $\{j - \frac{1}{2} \le Y_n \le k + \frac{1}{2}\}$ in the normal approximation.
- For the event $\{Y_n \le k\}$, use $\{Y_n \le k + \frac{1}{2}\}$ in the normal approximation.
- For the event $\{Y_n \ge j\}$, use $\{Y_n \ge j - \frac{1}{2}\}$ in the normal approximation.
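The following sketch (Python with NumPy and SciPy; the interval is an arbitrary choice for illustration) shows the effect of the half-unit continuity correction for the sum of 20 fair dice, comparing the exact probability (computed by convolution) with the normal approximation with and without the correction.

```python
import numpy as np
from scipy.stats import norm

# Continuity correction: for an integer-valued sum Y_n, approximate
# P(j <= Y_n <= k) by the normal probability of [j - 1/2, k + 1/2].
n, j, k = 20, 65, 75
mu, sd = 3.5 * n, np.sqrt(n * 35 / 12)

# Exact PDF of the sum of n dice by repeated convolution (as in the earlier sketch)
f = np.full(6, 1 / 6)
pdf = f.copy()
for _ in range(n - 1):
    pdf = np.convolve(pdf, f)
support = np.arange(n, 6 * n + 1)
exact = pdf[(support >= j) & (support <= k)].sum()

no_correction = norm.cdf(k, mu, sd) - norm.cdf(j, mu, sd)
with_correction = norm.cdf(k + 0.5, mu, sd) - norm.cdf(j - 0.5, mu, sd)
print(round(exact, 4), round(no_correction, 4), round(with_correction, 4))
```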
Let denote the sum of the scores of 20 fair dice. Compute the normal approximation to .
Details:
0.6741
In the dice experiment, set the die distribution to fair, select the sum random variable $Y$, and set $n = 20$. Run the simulation 1000 times and find each of the following. Compare with the result in the previous exercise:
- The relative frequency of the event (from the simulation)
Normal Approximation to the Gamma Distribution
Recall that the gamma distribution with shape parameter $k \in (0, \infty)$ and scale parameter $b \in (0, \infty)$ is a continuous distribution on $(0, \infty)$ with probability density function $f$ given by
$$f(t) = \frac{1}{\Gamma(k) \, b^k} t^{k - 1} e^{-t / b}, \quad t \in (0, \infty)$$
The mean is $k b$ and the variance is $k b^2$. The gamma distribution is widely used to model random times and other positive random variables. In the context of the Poisson model (where $k \in \mathbb{N}_+$), the gamma distribution is also known as the Erlang distribution, named for Agner Erlang. Suppose now that $Y_k$ has the gamma (Erlang) distribution with shape parameter $k \in \mathbb{N}_+$ and scale parameter $b$. Then
$$Y_k = \sum_{i=1}^k X_i$$
where $(X_1, X_2, \ldots, X_k)$ is a sequence of independent variables, each having the exponential distribution with scale parameter $b$. (The exponential distribution is a special case of the gamma distribution, with shape parameter 1.) It follows that if $k$ is large, the gamma distribution can be approximated by the normal distribution with mean $k b$ and variance $k b^2$. The same statement actually holds when $k$ is not an integer. Here is the precise statement:
Suppose that $Y_k$ has the gamma distribution with scale parameter $b \in (0, \infty)$ and shape parameter $k \in (0, \infty)$. Then the distribution of the standardized variable $Z_k$ below converges to the standard normal distribution as $k \to \infty$:
$$Z_k = \frac{Y_k - k b}{\sqrt{k} \, b}$$
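Numerically, the quality of the approximation is easy to check with standard distribution functions. Here is a minimal sketch (Python with SciPy; the shape, scale, and evaluation point are arbitrary choices for illustration) comparing the exact gamma CDF with its normal approximation.

```python
import numpy as np
from scipy.stats import gamma, norm

# Normal approximation to the gamma distribution: mean k*b, variance k*b^2.
k, b = 10, 2
x = 25.0
print(gamma.cdf(x, a=k, scale=b))                    # exact CDF
print(norm.cdf(x, loc=k * b, scale=np.sqrt(k) * b))  # normal approximation
```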
In the special distribution simulator, select the gamma distribution. Vary $k$ and $b$ and note the shape of the probability density function. With a moderately large value of $k$ and various values of $b$, run the experiment 1000 times and compare the empirical density function to the true probability density function.
Suppose that $Y$ has the gamma distribution with shape parameter $k = 10$ and scale parameter $b = 2$. Find normal approximations to each of the following:
- The 80th percentile of $Y$
Details:
- 0.3063
- 25.32
Normal Approximation to the Chi-Square Distribution
Recall that the chi-square distribution with $n \in (0, \infty)$ degrees of freedom is a special case of the gamma distribution, with shape parameter $n / 2$ and scale parameter 2. Thus, the chi-square distribution with $n$ degrees of freedom has probability density function
$$f(t) = \frac{1}{\Gamma(n / 2) \, 2^{n / 2}} t^{n / 2 - 1} e^{-t / 2}, \quad t \in (0, \infty)$$
When $n$ is a positive integer, the chi-square distribution governs the sum of the squares of $n$ independent, standard normal variables. For this reason, it is one of the most important distributions in statistics. From the previous discussion on gamma distributions, it follows that if $n$ is large, the chi-square distribution can be approximated by the normal distribution with mean $n$ and variance $2 n$. Here is the precise statement:
Suppose that $Y_n$ has the chi-square distribution with $n \in (0, \infty)$ degrees of freedom. Then the distribution of the standardized variable $Z_n$ below converges to the standard normal distribution as $n \to \infty$:
$$Z_n = \frac{Y_n - n}{\sqrt{2 n}}$$
In the special distribution simulator, select the chi-square distribution. Vary $n$ and note the shape of the probability density function. With a moderately large value of $n$, run the experiment 1000 times and compare the empirical density function to the probability density function.
Suppose that $Y$ has the chi-square distribution with $n = 20$ degrees of freedom. Find normal approximations to each of the following:
- The 75th percentile of $Y$
Details:
- 0.4107
- 24.3
Normal Approximation to the Binomial Distribution
Recall that a Bernoulli trials sequence, named for Jacob Bernoulli, is a sequence $\boldsymbol{X} = (X_1, X_2, \ldots)$ of independent, identically distributed indicator variables with $\mathbb{P}(X_i = 1) = p$ for each $i$, where $p \in (0, 1)$ is the parameter. In the usual language of reliability, $X_i$ is the outcome of trial $i$, where 1 means success and 0 means failure. The common mean is $p$ and the common variance is $p (1 - p)$.
Let $Y_n = \sum_{i=1}^n X_i$, so that $Y_n$ is the number of successes in the first $n$ trials. Recall that $Y_n$ has the binomial distribution with parameters $n$ and $p$, and has probability density function $f_n$ given by
$$f_n(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \ldots, n\}$$
It follows from the central limit theorem that if $n$ is large, the binomial distribution with parameters $n$ and $p$ can be approximated by the normal distribution with mean $n p$ and variance $n p (1 - p)$. The rule of thumb is that $n$ should be large enough for $n p \ge 5$ and $n (1 - p) \ge 5$. (The first condition is the important one when $p$ is small and the second condition is the important one when $p$ is close to 1.) Here is the precise statement:
Suppose that $Y_n$ has the binomial distribution with trial parameter $n \in \mathbb{N}_+$ and success parameter $p \in (0, 1)$. Then the distribution of the standardized variable $Z_n$ given below converges to the standard normal distribution as $n \to \infty$:
$$Z_n = \frac{Y_n - n p}{\sqrt{n p (1 - p)}}$$
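Here is a minimal sketch (Python with SciPy; the parameters and the event are arbitrary choices for illustration) comparing an exact binomial probability with its normal approximation, using the half-unit continuity correction.

```python
import numpy as np
from scipy.stats import binom, norm

# Normal approximation to the binomial distribution, with continuity correction.
n, p = 50, 0.3
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.cdf(18, n, p) - binom.cdf(11, n, p)         # P(12 <= Y <= 18)
approx = norm.cdf(18.5, mu, sd) - norm.cdf(11.5, mu, sd)  # with continuity correction
print(round(exact, 4), round(approx, 4))
```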
In the binomial timeline experiment, vary $n$ and $p$ and note the shape of the probability density function. For selected values of $n$ and $p$, run the simulation 1000 times and compute the following:
- The relative frequency of the event (from the simulation)
Details:
- 0.5448
Suppose that $Y$ has the binomial distribution with the same parameters $n$ and $p$ as in the previous exercise. Compute the normal approximation to the probability of the same event (don't forget the continuity correction) and compare with the results of the previous exercise.
Details:
0.5383
Normal Approximation to the Poisson Distribution
Recall that the Poisson distribution, named for Siméon Poisson, is a discrete distribution on $\mathbb{N}$ with probability density function $f$ given by
$$f(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \quad k \in \mathbb{N}$$
where $\lambda \in (0, \infty)$ is a parameter. The parameter $\lambda$ is both the mean and the variance of the distribution. The Poisson distribution is widely used to model the number of random points in a region of time or space, particularly in the context of the Poisson process. In this context, the parameter $\lambda$ is proportional to the size of the region.
Suppose now that $Y_n$ has the Poisson distribution with parameter $n \in \mathbb{N}_+$. Then
$$Y_n = \sum_{i=1}^n X_i$$
where $(X_1, X_2, \ldots, X_n)$ is a sequence of independent variables, each with the Poisson distribution with parameter 1. It follows from the central limit theorem that if $n$ is large, the Poisson distribution with parameter $n$ can be approximated by the normal distribution with mean $n$ and variance $n$. The same statement holds when the parameter is not an integer. Here is the precise statement:
Suppose that $Y_n$ has the Poisson distribution with parameter $n \in (0, \infty)$. Then the distribution of the standardized variable $Z_n$ below converges to the standard normal distribution as $n \to \infty$:
$$Z_n = \frac{Y_n - n}{\sqrt{n}}$$
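Here is a minimal sketch (Python with SciPy; the parameter value and the event are arbitrary choices for illustration) comparing an exact Poisson probability with its normal approximation, again with the continuity correction.

```python
import numpy as np
from scipy.stats import norm, poisson

# Normal approximation to the Poisson distribution: mean and variance are both n.
n = 30
exact = poisson.cdf(35, n)                        # P(Y <= 35)
approx = norm.cdf(35.5, loc=n, scale=np.sqrt(n))  # with continuity correction
print(round(exact, 4), round(approx, 4))
```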
Suppose that $Y$ has the Poisson distribution with mean 20.
- Compute the true value of $\mathbb{P}(16 \le Y \le 23)$.
- Compute the normal approximation to $\mathbb{P}(16 \le Y \le 23)$.
Details:
- 0.6310
- 0.6259
In the Poisson experiment, vary the time and rate parameters $t$ and $r$ (the parameter of the Poisson distribution in the experiment is the product $r t$). Note the shape of the probability density function. For large values of $r t$, run the experiment 1000 times and compare the empirical density function to the true probability density function.
Normal Approximation to the Negative Binomial Distribution
The general version of the negative binomial distribution is a discrete distribution on $\mathbb{N}$, with shape parameter $k \in (0, \infty)$ and success parameter $p \in (0, 1)$. The probability density function $f$ is given by
$$f(n) = \binom{n + k - 1}{n} p^k (1 - p)^n, \quad n \in \mathbb{N}$$
The mean is $k (1 - p) / p$ and the variance is $k (1 - p) / p^2$. If $k \in \mathbb{N}_+$, the distribution governs the number of failures before success number $k$ in a sequence of Bernoulli trials with success parameter $p$. So in this case,
$$Y_k = \sum_{i=1}^k X_i$$
where $(X_1, X_2, \ldots, X_k)$ is a sequence of independent variables, each having the geometric distribution on $\mathbb{N}$ with parameter $p$. (The geometric distribution is a special case of the negative binomial distribution, with parameters 1 and $p$.) In the context of the Bernoulli trials, $X_1$ is the number of failures before the first success, and for $i \in \{2, 3, \ldots, k\}$, $X_i$ is the number of failures between success number $i - 1$ and success number $i$. It follows that if $k$ is large, the negative binomial distribution can be approximated by the normal distribution with mean $k (1 - p) / p$ and variance $k (1 - p) / p^2$. The same statement holds if $k$ is not an integer. Here is the precise statement:
Suppose that $Y_k$ has the negative binomial distribution with shape parameter $k \in (0, \infty)$ and success parameter $p \in (0, 1)$. Then the distribution of the standardized variable $Z_k$ below converges to the standard normal distribution as $k \to \infty$:
$$Z_k = \frac{p Y_k - k (1 - p)}{\sqrt{k (1 - p)}}$$
Another version of the negative binomial distribution is the distribution of the trial number $V_k$ of success number $k$, so that $V_k = Y_k + k$; thus $V_k$ has mean $k / p$ and variance $k (1 - p) / p^2$. The normal approximation applies to the distribution of $V_k$ as well, if $k$ is large, and since the two variables are related by a location transformation, the standard scores are the same. That is,
$$\frac{p Y_k - k (1 - p)}{\sqrt{k (1 - p)}} = \frac{p V_k - k}{\sqrt{k (1 - p)}}$$
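The representation as a sum of independent geometric variables also makes the approximation easy to check by simulation. Here is a minimal sketch (Python with NumPy and SciPy; the parameter values and the event are arbitrary choices for illustration).

```python
import numpy as np
from scipy.stats import norm

# The negative binomial variable Y_k (number of failures before success k),
# built as a sum of k independent geometric variables and compared with its
# normal approximation.
rng = np.random.default_rng(2)
k, p = 40, 0.4
mu, sd = k * (1 - p) / p, np.sqrt(k * (1 - p)) / p

# rng.geometric returns the trial number of the first success (1, 2, ...),
# so subtract 1 to count the failures before the first success.
y = (rng.geometric(p, size=(100_000, k)) - 1).sum(axis=1)
print(round((y <= mu + sd).mean(), 4))      # simulated P(Y_k <= mu + sd)
print(round(norm.cdf(mu + sd, mu, sd), 4))  # normal approximation (no continuity correction)
```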
In the negative binomial experiment, vary $k$ and $p$ and note the shape of the probability density function. For selected values of $k$ and $p$ with $k$ large, run the experiment 1000 times and compare the empirical density function to the true probability density function.
Suppose that has the negative binomial distribution with trial parameter and success parameter . Find normal approximations to each of the following:
- The 80th percentile of
Details:
- 0.6318
- 30.1
Partial Sums with a Random Number of Terms
Our last topic is a bit more esoteric, but still fits with the general setting of this section. Recall that $\boldsymbol{X} = (X_1, X_2, \ldots)$ is a sequence of independent, identically distributed real-valued random variables with common mean $\mu$ and variance $\sigma^2$. Suppose now that $N$ is a random variable (defined on the same probability space) taking values in $\mathbb{N}$, also with finite mean and variance. Then
$$Y_N = \sum_{i=1}^N X_i$$
is a random sum of the independent, identically distributed variables. That is, the terms are random of course, but so also is the number of terms $N$. We are primarily interested in the moments of $Y_N$.
Independent Number of Terms
Suppose first that $N$, the number of terms, is independent of $\boldsymbol{X}$, the sequence of terms. Computing the moments of $Y_N$ is a good exercise in conditional expectation.
The conditional expected value of $Y_N$ given $N$, and the expected value of $Y_N$, are
- $\mathbb{E}(Y_N \mid N) = N \mu$
- $\mathbb{E}(Y_N) = \mathbb{E}(N) \mu$
The conditional variance of $Y_N$ given $N$, and the variance of $Y_N$, are
- $\operatorname{var}(Y_N \mid N) = N \sigma^2$
- $\operatorname{var}(Y_N) = \mathbb{E}(N) \sigma^2 + \operatorname{var}(N) \mu^2$
Let $P$ denote the probability generating function of $N$ and, as before, let $G$ denote the moment generating function of $X$. Show that the moment generating function of $Y_N$ is $P \circ G$.
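The moment formulas above are easy to check by simulation when $N$ is independent of the terms. Here is a minimal sketch (Python with NumPy; the choice of a Poisson number of terms and exponential terms is arbitrary, for illustration only).

```python
import numpy as np

# Moments of a random sum Y_N when N is independent of the terms:
# E(Y_N) = E(N) * mu and var(Y_N) = E(N) * sigma^2 + var(N) * mu^2.
rng = np.random.default_rng(3)
lam, scale = 7.0, 2.0            # N ~ Poisson(lam); X_i exponential with mean `scale`
mu, sigma2 = scale, scale ** 2   # mean and variance of a single term

counts = rng.poisson(lam, size=100_000)
y = np.array([rng.exponential(scale, size=m).sum() for m in counts])

print(round(y.mean(), 3), lam * mu)                     # E(Y_N) vs E(N) * mu
print(round(y.var(), 3), lam * sigma2 + lam * mu ** 2)  # var(Y_N) vs the formula (E(N) = var(N) = lam)
```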
Wald's Equation
The result in part (b) of [33] generalizes to the case where the random number of terms $N$ is a stopping time for the sequence $\boldsymbol{X}$. This means that the event $\{N = n\}$ depends only on (technically, is measurable with respect to) $(X_1, X_2, \ldots, X_n)$ for each $n \in \mathbb{N}$. The generalization is known as Wald's equation, named for Abraham Wald.
If $N$ is a stopping time for $\boldsymbol{X}$ then $\mathbb{E}(Y_N) = \mathbb{E}(N) \mu$.
Details:
First note that $Y_N = \sum_{i=1}^\infty X_i \mathbf{1}(N \ge i)$. But the event $\{N \ge i\}$ depends only on $(X_1, X_2, \ldots, X_{i-1})$ and hence is independent of $X_i$. Thus $\mathbb{E}\left[X_i \mathbf{1}(N \ge i)\right] = \mu \, \mathbb{P}(N \ge i)$. Suppose that $X_i \ge 0$ for each $i$. Taking expected values term by term gives Wald's equation in this special case:
$$\mathbb{E}(Y_N) = \sum_{i=1}^\infty \mu \, \mathbb{P}(N \ge i) = \mu \, \mathbb{E}(N)$$
The interchange of sum and expected value is justified by the monotone convergence theorem. Wald's equation can then be established in general by using the dominated convergence theorem.
An elegant proof of Wald's equation is given in the chapter on Martingales.
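Wald's equation can also be checked by simulation with a genuine stopping time. In the sketch below (Python with NumPy; the setup is an arbitrary choice, not an example from the text), the terms are uniform on $(0, 1)$ and $N$ is the first index at which the partial sum exceeds 1, so $N$ clearly depends on the terms but is a stopping time.

```python
import numpy as np

# Wald's equation: E(Y_N) = E(N) * mu for a stopping time N.
# Here mu = 1/2 and E(N) = e, so both printed values should be near e/2.
rng = np.random.default_rng(4)
mu = 0.5
totals, counts = [], []

for _ in range(100_000):
    s, n = 0.0, 0
    while s <= 1.0:        # stop the first time the partial sum exceeds 1
        s += rng.uniform()
        n += 1
    totals.append(s)
    counts.append(n)

totals, counts = np.array(totals), np.array(counts)
print(round(totals.mean(), 4), round(counts.mean() * mu, 4))  # E(Y_N) vs E(N) * mu
```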
Suppose that the number of customers arriving at a store during a given day has the Poisson distribution with parameter 50. Each customer, independently of the others (and independently of the number of customers), spends an amount of money that is uniformly distributed on the interval $[0, 20]$. Find the mean and standard deviation of the amount of money that the store takes in during a day.
Details:
500, 81.65
When a certain critical component in a system fails, it is immediately replaced by a new, statistically identical component. The components are independent, and the lifetime of each (in hours) is exponentially distributed with scale parameter $b$. During the life of the system, the number of critical components used has a geometric distribution on $\mathbb{N}_+$ with parameter $p$. For the total lifetime of the critical components,
- Find the mean.
- Find the standard deviation.
- Find the moment generating function.
- Identify the distribution by name.
Details:
- $b / p$
- $b / p$
- $t \mapsto \dfrac{p}{p - b t}$ for $t < p / b$
- Exponential distribution with scale parameter $b / p$