Random samples from normal distributions are the most important special cases of the topics in this chapter. As we will see, many of the results simplify significantly when the underlying sampling distribution is normal. In addition, we will derive the distributions of a number of random variables constructed from normal samples that are of fundamental importance in inferential statistics.
The One Sample Model
Suppose that $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$ is a random sample from the normal distribution with mean $\mu \in \mathbb{R}$ and standard deviation $\sigma \in (0, \infty)$. Recall that the term random sample means that $\boldsymbol{X}$ is a sequence of independent, identically distributed random variables. Recall also that the normal distribution has probability density function
\[ f(x) = \frac{1}{\sqrt{2 \pi} \, \sigma} \exp\left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \mathbb{R} \]
In the notation that we have used elsewhere in this chapter, $\sigma_3 = 0$ (equivalently, the skewness of the normal distribution is 0) and $\sigma_4 = 3 \sigma^4$ (equivalently, the kurtosis of the normal distribution is 3). Since the sample (and in particular the sample size $n$) is fixed in this subsection, it will be suppressed in the notation.
The Sample Mean
First recall that the sample mean is
\[ M = \frac{1}{n} \sum_{i=1}^n X_i \]
$M$ is normally distributed with mean and variance given by
- (a) $\mathbb{E}(M) = \mu$
- (b) $\operatorname{var}(M) = \sigma^2 / n$
Details:
This follows from basic properties of the normal distribution. Recall that the sum of independent normally distributed variables also has a normal distribution, and a linear transformation of a normally distributed variable is also normally distributed. The mean and variance of $M$ hold in general, and were derived in the section on the Law of Large Numbers.
Of course, by the central limit theorem, the distribution of $M$ is approximately normal if $n$ is large, even if the underlying sampling distribution is not normal.
The standard score of $M$ is
\[ Z = \frac{M - \mu}{\sigma / \sqrt{n}} \]
$Z$ has the standard normal distribution.
The standard score $Z$ in [2] plays a critical role in constructing interval estimates and hypothesis tests for the distribution mean $\mu$ when the distribution standard deviation $\sigma$ is known. The random variable $Z$ will also appear in several derivations in this section.
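As an informal numerical check, the following Python sketch (assuming NumPy and SciPy are available; the parameter values and seed are arbitrary illustrative choices) simulates many samples and compares the standard scores with the standard normal distribution:

```python
import numpy as np
from scipy import stats

# Simulate N samples of size n from the normal distribution and check that
# Z = (M - mu) / (sigma / sqrt(n)) behaves like a standard normal variable.
rng = np.random.default_rng(0)
mu, sigma, n, N = 50.0, 4.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(N, n))
M = samples.mean(axis=1)                      # sample means
Z = (M - mu) / (sigma / np.sqrt(n))           # standard scores

print(Z.mean(), Z.std(ddof=1))                # should be near 0 and 1
print(stats.kstest(Z, "norm"))                # compare with N(0, 1)
```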
The Sample Variance
The main goal of this subsection is to show that certain multiples of the two versions of the sample variance that we have studied have chi-square distributions. Recall that the chi-square distribution with $n$ degrees of freedom has probability density function
\[ f(x) = \frac{1}{\Gamma(n/2) \, 2^{n/2}} x^{n/2 - 1} e^{-x/2}, \quad x \in (0, \infty) \]
and has mean $n$ and variance $2n$. The moment generating function is
\[ m(t) = \frac{1}{(1 - 2t)^{n/2}}, \quad t < \frac{1}{2} \]
The most important result to remember is that the chi-square distribution with $n$ degrees of freedom governs $\sum_{i=1}^n Z_i^2$, where $(Z_1, Z_2, \ldots, Z_n)$ is a sequence of independent, standard normal random variables.
Recall that if $\mu$ is known, a natural estimator of the variance $\sigma^2$ is the statistic
\[ W^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \]
Although the assumption that $\mu$ is known is almost always artificial, $W^2$ is very easy to analyze and it will be used in some of the derivations below. Our first result is the distribution of a simple multiple of $W^2$.
The random variable
\[ \frac{n W^2}{\sigma^2} \]
has the chi-square distribution with $n$ degrees of freedom.
Details:
Note that
\[ \frac{n W^2}{\sigma^2} = \sum_{i=1}^n \left(\frac{X_i - \mu}{\sigma}\right)^2 \]
and the terms in the sum are squares of independent, standard normal variables. The result then follows from the definition of the chi-square distribution.
The variable in [3] plays a critical role in constructing interval estimates and hypothesis tests for the distribution standard deviation $\sigma$ when the distribution mean $\mu$ is known (although again, this assumption is usually not realistic).
The mean and variance of $W^2$ are
- (a) $\mathbb{E}(W^2) = \sigma^2$
- (b) $\operatorname{var}(W^2) = 2 \sigma^4 / n$
Details:
These results follow from the chi-square distribution of $n W^2 / \sigma^2$ and standard properties of expected value and variance.
As an estimator of $\sigma^2$, part (a) means that $W^2$ is unbiased and part (b) means that $W^2$ is consistent. Of course, these moment results are special cases of the general results obtained in the section on Sample Variance. In that section, we also showed that $M$ and $W^2$ are uncorrelated if the underlying sampling distribution has skewness 0 ($\sigma_3 = 0$), as is the case here.
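A similar sketch (again with arbitrary illustrative parameters) checks the moments above and the chi-square distribution of $n W^2 / \sigma^2$ by simulation:

```python
import numpy as np
from scipy import stats

# Check E(W^2) = sigma^2, var(W^2) = 2 sigma^4 / n, and the chi-square
# distribution of n W^2 / sigma^2 with n degrees of freedom.
rng = np.random.default_rng(1)
mu, sigma, n, N = 0.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(N, n))
W2 = ((x - mu) ** 2).mean(axis=1)             # special sample variance

print(W2.mean(), sigma**2)                    # unbiasedness
print(W2.var(ddof=1), 2 * sigma**4 / n)       # variance formula
print(stats.kstest(n * W2 / sigma**2, stats.chi2(n).cdf))
```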
Recall now that the standard version of the sample variance is the statistic
\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2 \]
The sample variance $S^2$ is the usual estimator of $\sigma^2$ when $\mu$ is unknown (which is usually the case). We showed earlier that in general, the sample mean $M$ and the sample variance $S^2$ are uncorrelated if the underlying sampling distribution has skewness 0 ($\sigma_3 = 0$). It turns out that if the sampling distribution is normal, these variables are in fact independent, a very important and useful property, and at first blush, a very surprising result since $S^2$ appears to depend explicitly on $M$.
The sample mean $M$ and the sample variance $S^2$ are independent.
Details:
The proof is based on the vector of deviations from the sample mean. Let
\[ \boldsymbol{D} = (X_1 - M, X_2 - M, \ldots, X_{n-1} - M) \]
Note that $S^2$ can be written as a function of $\boldsymbol{D}$ since $\sum_{i=1}^n (X_i - M) = 0$. Next, $M$ and the vector $\boldsymbol{D}$ have a joint multivariate normal distribution. We showed earlier that $M$ and $X_i - M$ are uncorrelated for each $i$, and hence it follows that $M$ and $\boldsymbol{D}$ are independent. Finally, since $S^2$ is a function of $\boldsymbol{D}$, it follows that $M$ and $S^2$ are independent.
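Independence implies zero correlation, so a quick (necessarily partial) numerical check is to estimate the correlation of $M$ and $S^2$ from simulated samples, as in the sketch below (arbitrary illustrative parameters, NumPy assumed):

```python
import numpy as np

# Estimate cor(M, S^2) from many simulated normal samples; independence
# implies that the correlation should be near 0.
rng = np.random.default_rng(2)
mu, sigma, n, N = 5.0, 3.0, 8, 200_000

x = rng.normal(mu, sigma, size=(N, n))
M = x.mean(axis=1)
S2 = x.var(axis=1, ddof=1)                    # standard sample variance

print(np.corrcoef(M, S2)[0, 1])               # should be near 0
```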
We can now determine the distribution of a simple multiple of the sample variance $S^2$.
The random variable
\[ V = \frac{(n - 1) S^2}{\sigma^2} \]
has the chi-square distribution with $n - 1$ degrees of freedom.
Details:
We first show that $\frac{n W^2}{\sigma^2} = V + Z^2$, where $V$ is the chi-square variable in [6] associated with $S^2$ and where $Z$ is the standard score in [2] associated with $M$. To see this, note that
\[ \frac{n W^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 = \frac{1}{\sigma^2} \sum_{i=1}^n [(X_i - M) + (M - \mu)]^2 = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - M)^2 + \frac{2 (M - \mu)}{\sigma^2} \sum_{i=1}^n (X_i - M) + \frac{n (M - \mu)^2}{\sigma^2} \]
In the right side of the last equation, the first term is $V$. The second term is 0 because $\sum_{i=1}^n (X_i - M) = 0$. The last term is $Z^2$. Now, from [3], $\frac{n W^2}{\sigma^2}$ has the chi-square distribution with $n$ degrees of freedom, and of course $Z^2$ has the chi-square distribution with 1 degree of freedom. From [5], $V$ and $Z^2$ are independent. Recall that the moment generating function of a sum of independent variables is the product of the MGFs. Thus, taking moment generating functions in the equation $\frac{n W^2}{\sigma^2} = V + Z^2$ gives
\[ \frac{1}{(1 - 2t)^{n/2}} = \mathbb{E}\left(e^{t V}\right) \frac{1}{(1 - 2t)^{1/2}}, \quad t < \frac{1}{2} \]
Solving, we have $\mathbb{E}\left(e^{t V}\right) = \frac{1}{(1 - 2t)^{(n-1)/2}}$ for $t < \frac{1}{2}$, and therefore $V$ has the chi-square distribution with $n - 1$ degrees of freedom.
The variable $V$ in [6] plays a critical role in constructing interval estimates and hypothesis tests for the distribution standard deviation $\sigma$ when the distribution mean $\mu$ is unknown (almost always the case).
The mean and variance of $S^2$ are
- (a) $\mathbb{E}(S^2) = \sigma^2$
- (b) $\operatorname{var}(S^2) = \frac{2 \sigma^4}{n - 1}$
Details:
These results follow from the chi-square distribution of $V$ and standard properties of expected value and variance.
As before, these moment results are special cases of the general results obtained in the section on Sample Variance. Again, as an estimator of $\sigma^2$, part (a) means that $S^2$ is unbiased, and part (b) means that $S^2$ is consistent. Note also that $\operatorname{var}(S^2)$ is larger than $\operatorname{var}(W^2)$ (not surprising), by a factor of $\frac{n}{n - 1}$.
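The chi-square distribution of $V$ and the moments above can also be checked by simulation, as in this sketch (illustrative parameters):

```python
import numpy as np
from scipy import stats

# Check that V = (n - 1) S^2 / sigma^2 has the chi-square distribution
# with n - 1 degrees of freedom.
rng = np.random.default_rng(3)
mu, sigma, n, N = 1.0, 2.0, 12, 100_000

x = rng.normal(mu, sigma, size=(N, n))
V = (n - 1) * x.var(axis=1, ddof=1) / sigma**2

print(V.mean(), n - 1)                        # chi-square mean
print(V.var(ddof=1), 2 * (n - 1))             # chi-square variance
print(stats.kstest(V, stats.chi2(n - 1).cdf))
```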
In the special distribution simulator, select the chi-square distribution. Vary the degree of freedom parameter and note the shape and location of the probability density function and the mean/standard deviation bar. For selected values of the parameter, run the experiment 1000 times and compare the empirical density function and moments to the true probability density function and moments.
The covariance and correlation between the special sample variance and the standard sample variance are
- (a) $\operatorname{cov}(W^2, S^2) = \frac{2 \sigma^4}{n}$
- (b) $\operatorname{cor}(W^2, S^2) = \sqrt{\frac{n - 1}{n}}$
Details:
These results follow from general results obtained in the section on Sample Variance and the fact that $\sigma_4 = 3 \sigma^4$.
Note that the correlation does not depend on the parameters $\mu$ and $\sigma$, and converges to 1 as $n \to \infty$.
The $T$ Variable
Recall that the Student $t$ distribution with $n$ degrees of freedom has probability density function
\[ f(t) = C_n \left(1 + \frac{t^2}{n}\right)^{-(n + 1)/2}, \quad t \in \mathbb{R} \]
where $C_n$ is the appropriate normalizing constant. The distribution has mean 0 if $n > 1$ and variance $\frac{n}{n - 2}$ if $n > 2$. In this subsection, the main point to remember is that the $t$ distribution with $n$ degrees of freedom is the distribution of
\[ \frac{Z}{\sqrt{V / n}} \]
where $Z$ has the standard normal distribution; $V$ has the chi-square distribution with $n$ degrees of freedom; and $Z$ and $V$ are independent.
Note that the variable
\[ T = \frac{M - \mu}{S / \sqrt{n}} \]
is similar to the standard score $Z$ associated with $M$, but with the sample standard deviation $S$ replacing the distribution standard deviation $\sigma$. The variable $T$ plays a critical role in constructing interval estimates and hypothesis tests for the distribution mean $\mu$ when the distribution standard deviation $\sigma$ is unknown.
Let $Z$ denote the standard score in [2], and let $V$ denote the chi-square variable in [6]. Then
\[ T = \frac{Z}{\sqrt{V / (n - 1)}} \]
and hence $T$ has the student $t$ distribution with $n - 1$ degrees of freedom.
Details:
In the definition of $T$, divide the numerator and denominator by $\sigma / \sqrt{n}$. The numerator is then $Z$, and the denominator is $S / \sigma = \sqrt{V / (n - 1)}$. Since $Z$ and $V$ are independent, $Z$ has the standard normal distribution, and $V$ has the chi-square distribution with $n - 1$ degrees of freedom, it follows from the definition above that $T$ has the student $t$ distribution with $n - 1$ degrees of freedom.
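The following sketch (illustrative parameters) simulates the $T$ statistic directly and compares it with the $t$ distribution with $n - 1$ degrees of freedom:

```python
import numpy as np
from scipy import stats

# Simulate T = (M - mu) / (S / sqrt(n)) and compare with the t distribution
# with n - 1 degrees of freedom.
rng = np.random.default_rng(4)
mu, sigma, n, N = 10.0, 4.0, 6, 100_000

x = rng.normal(mu, sigma, size=(N, n))
M = x.mean(axis=1)
S = x.std(axis=1, ddof=1)
T = (M - mu) / (S / np.sqrt(n))

print(stats.kstest(T, stats.t(n - 1).cdf))
```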
In the special distribution simulator, select the student $t$ distribution. Vary the degree of freedom parameter and note the shape and location of the probability density function and the mean/standard deviation bar. For selected values of the parameter, run the experiment 1000 times and compare the empirical density function and moments to the distribution density function and moments.
The Two Sample Model
Suppose that $\boldsymbol{X} = (X_1, X_2, \ldots, X_m)$ is a random sample of size $m$ from the normal distribution with mean $\mu \in \mathbb{R}$ and standard deviation $\sigma \in (0, \infty)$, and that $\boldsymbol{Y} = (Y_1, Y_2, \ldots, Y_n)$ is a random sample of size $n$ from the normal distribution with mean $\nu \in \mathbb{R}$ and standard deviation $\tau \in (0, \infty)$. Finally, suppose that $\boldsymbol{X}$ and $\boldsymbol{Y}$ are independent. Of course, all of the results above in the one sample model apply to $\boldsymbol{X}$ and $\boldsymbol{Y}$ separately, but now we are interested in statistics that are helpful in inferential procedures that compare the two normal distributions. We will use the basic notation established above, but we will indicate the dependence on the sample.
The two-sample model (or more generally the multi-sample model) occurs naturally when a basic variable in the statistical experiment is filtered according to one or more other variables (often nominal variables). For example, in the cicada data, the weights of the male cicadas and the weights of the female cicadas may fit observations from the two-sample normal model. The basic variable weight is filtered by the variable gender. If weight is filtered by gender and species, we might have observations from the 6-sample normal model.
The Difference in the Sample Means
We know from [1] that $M(\boldsymbol{X})$ and $M(\boldsymbol{Y})$ have normal distributions. Moreover, these sample means are independent because the underlying samples $\boldsymbol{X}$ and $\boldsymbol{Y}$ are independent. Hence, it follows from a basic property of the normal distribution that any linear combination of $M(\boldsymbol{X})$ and $M(\boldsymbol{Y})$ will be normally distributed as well. For inferential procedures that compare the distribution means $\mu$ and $\nu$, the linear combination that is most important is the difference.
$M(\boldsymbol{X}) - M(\boldsymbol{Y})$ has a normal distribution with mean and variance given by
- (a) $\mathbb{E}[M(\boldsymbol{X}) - M(\boldsymbol{Y})] = \mu - \nu$
- (b) $\operatorname{var}[M(\boldsymbol{X}) - M(\boldsymbol{Y})] = \frac{\sigma^2}{m} + \frac{\tau^2}{n}$
Hence the standard score
\[ Z = \frac{[M(\boldsymbol{X}) - M(\boldsymbol{Y})] - (\mu - \nu)}{\sqrt{\sigma^2 / m + \tau^2 / n}} \]
has the standard normal distribution. This standard score plays a fundamental role in constructing interval estimates and hypothesis tests for the difference $\mu - \nu$ when the distribution standard deviations $\sigma$ and $\tau$ are known.
Ratios of Sample Variances
Next we will show that the ratios of certain multiples of the sample variances (both versions) of $\boldsymbol{X}$ and $\boldsymbol{Y}$ have $F$ distributions. Recall that the $F$ distribution with $j$ degrees of freedom in the numerator and $k$ degrees of freedom in the denominator is the distribution of
\[ \frac{U / j}{V / k} \]
where $U$ has the chi-square distribution with $j$ degrees of freedom; $V$ has the chi-square distribution with $k$ degrees of freedom; and $U$ and $V$ are independent. The $F$ distribution is named in honor of Ronald Fisher and has probability density function
\[ f(x) = C_{j,k} \, \frac{x^{j/2 - 1}}{\left(1 + \frac{j}{k} x\right)^{(j + k)/2}}, \quad x \in (0, \infty) \]
where $C_{j,k}$ is the appropriate normalizing constant. The mean is $\frac{k}{k - 2}$ if $k > 2$, and the variance is $\frac{2 k^2 (j + k - 2)}{j (k - 2)^2 (k - 4)}$ if $k > 4$.
The random variable given below has the $F$ distribution with $m$ degrees of freedom in the numerator and $n$ degrees of freedom in the denominator:
\[ \frac{W^2(\boldsymbol{X}) / \sigma^2}{W^2(\boldsymbol{Y}) / \tau^2} \]
Details:
Using the notation in [3], note that
\[ \frac{W^2(\boldsymbol{X}) / \sigma^2}{W^2(\boldsymbol{Y}) / \tau^2} = \frac{U / m}{V / n} \]
where $U = m W^2(\boldsymbol{X}) / \sigma^2$ and $V = n W^2(\boldsymbol{Y}) / \tau^2$. The result then follows immediately since $U$ and $V$ are independent chi-square variables with $m$ and $n$ degrees of freedom, respectively.
The random variable given below has the $F$ distribution with $m - 1$ degrees of freedom in the numerator and $n - 1$ degrees of freedom in the denominator:
\[ \frac{S^2(\boldsymbol{X}) / \sigma^2}{S^2(\boldsymbol{Y}) / \tau^2} \]
Details:
Using the notation in [6], note that
\[ \frac{S^2(\boldsymbol{X}) / \sigma^2}{S^2(\boldsymbol{Y}) / \tau^2} = \frac{U / (m - 1)}{V / (n - 1)} \]
where $U = (m - 1) S^2(\boldsymbol{X}) / \sigma^2$ and $V = (n - 1) S^2(\boldsymbol{Y}) / \tau^2$. The result then follows immediately since $U$ and $V$ are independent chi-square variables with $m - 1$ and $n - 1$ degrees of freedom, respectively.
These variables are useful for constructing interval estimates and hypothesis tests of the ratio of the standard deviations $\sigma / \tau$. The choice of variable depends on whether the means $\mu$ and $\nu$ are known or unknown. Usually, of course, the means are unknown and so the statistic in [15] is used.
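The sketch below (illustrative parameters) checks the variable in [15] by simulation, comparing the ratio with the $F$ distribution with $m - 1$ and $n - 1$ degrees of freedom:

```python
import numpy as np
from scipy import stats

# Simulate the ratio of scaled sample variances and compare with the
# F distribution with m - 1 and n - 1 degrees of freedom.
rng = np.random.default_rng(5)
mu, nu, sigma, tau = 0.0, 1.0, 2.0, 3.0
m, n, N = 8, 13, 100_000

x = rng.normal(mu, sigma, size=(N, m))
y = rng.normal(nu, tau, size=(N, n))
F = (x.var(axis=1, ddof=1) / sigma**2) / (y.var(axis=1, ddof=1) / tau**2)

print(stats.kstest(F, stats.f(m - 1, n - 1).cdf))
```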
In the special distribution simulator, select the $F$ distribution. Vary the degrees of freedom parameters and note the shape and location of the probability density function and the mean/standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the empirical density function and moments to the true distribution density function and moments.
The $T$ Variable
Our final construction in the two sample normal model will result in a variable that has the student $t$ distribution. This variable plays a fundamental role in constructing interval estimates and hypothesis tests for the difference $\mu - \nu$ when the distribution standard deviations $\sigma$ and $\tau$ are unknown. The construction requires the additional assumption that the distribution standard deviations are the same: $\sigma = \tau$. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when different treatments are applied to the objects in the population.
The standard score associated with the difference in the sample means is
\[ Z = \frac{[M(\boldsymbol{X}) - M(\boldsymbol{Y})] - (\mu - \nu)}{\sigma \sqrt{1/m + 1/n}} \]
To construct our desired variable, we first need an estimate of $\sigma^2$. A natural approach is to consider a weighted average of the sample variances $S^2(\boldsymbol{X})$ and $S^2(\boldsymbol{Y})$, with the degrees of freedom as the weight factors.
The pooled estimate of $\sigma^2$ is
\[ S^2(\boldsymbol{X}, \boldsymbol{Y}) = \frac{(m - 1) S^2(\boldsymbol{X}) + (n - 1) S^2(\boldsymbol{Y})}{m + n - 2} \]
The random variable given below has the chi-square distribution with $m + n - 2$ degrees of freedom:
\[ V = \frac{(m + n - 2) S^2(\boldsymbol{X}, \boldsymbol{Y})}{\sigma^2} \]
Details:
The variable $V$ can be expressed as the sum of independent chi-square variables:
\[ V = \frac{(m - 1) S^2(\boldsymbol{X})}{\sigma^2} + \frac{(n - 1) S^2(\boldsymbol{Y})}{\sigma^2} \]
By [6], the two terms on the right have chi-square distributions with $m - 1$ and $n - 1$ degrees of freedom, respectively, and they are independent since $\boldsymbol{X}$ and $\boldsymbol{Y}$ are independent. Hence $V$ has the chi-square distribution with $m + n - 2$ degrees of freedom.
The variables $Z$ and $V$ are independent.
Details:
The following pairs of variables are independent: $M(\boldsymbol{X})$ and $S^2(\boldsymbol{X})$; $M(\boldsymbol{Y})$ and $S^2(\boldsymbol{Y})$; $\boldsymbol{X}$ and $\boldsymbol{Y}$. Since $Z$ is a function of $(M(\boldsymbol{X}), M(\boldsymbol{Y}))$ and $V$ is a function of $(S^2(\boldsymbol{X}), S^2(\boldsymbol{Y}))$, it follows that $Z$ and $V$ are independent.
The random variable given below has the student $t$ distribution with $m + n - 2$ degrees of freedom:
\[ T = \frac{[M(\boldsymbol{X}) - M(\boldsymbol{Y})] - (\mu - \nu)}{S(\boldsymbol{X}, \boldsymbol{Y}) \sqrt{1/m + 1/n}} \]
Details:
The random variable $T$ can be written as
\[ T = \frac{Z}{\sqrt{V / (m + n - 2)}} \]
where $Z$ is the standard normal variable given in [17] and $V$ is the chi-square variable given in [19]. Moreover, $Z$ and $V$ are independent by theorem [20]. The result then follows from the definition of the student $t$ distribution.
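As before, the result can be checked numerically; the sketch below (illustrative parameters, with the common standard deviation assumption built in) compares the simulated $T$ with the $t$ distribution with $m + n - 2$ degrees of freedom:

```python
import numpy as np
from scipy import stats

# Simulate the two-sample T variable with the pooled variance estimate,
# assuming a common standard deviation sigma = tau.
rng = np.random.default_rng(6)
mu, nu, sigma = 3.0, 1.0, 2.0
m, n, N = 7, 11, 100_000

x = rng.normal(mu, sigma, size=(N, m))
y = rng.normal(nu, sigma, size=(N, n))
S2_pooled = ((m - 1) * x.var(axis=1, ddof=1)
             + (n - 1) * y.var(axis=1, ddof=1)) / (m + n - 2)
T = (((x.mean(axis=1) - y.mean(axis=1)) - (mu - nu))
     / np.sqrt(S2_pooled * (1 / m + 1 / n)))

print(stats.kstest(T, stats.t(m + n - 2).cdf))
```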
The Bivariate Sample Model
Suppose now that $((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$ is a random sample of size $n$ from the bivariate normal distribution with means $\mu \in \mathbb{R}$ and $\nu \in \mathbb{R}$, standard deviations $\sigma \in (0, \infty)$ and $\tau \in (0, \infty)$, and correlation $\rho \in (-1, 1)$. Of course, $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the normal distribution with mean $\mu$ and standard deviation $\sigma$, and $\boldsymbol{Y} = (Y_1, Y_2, \ldots, Y_n)$ is a random sample of size $n$ from the normal distribution with mean $\nu$ and standard deviation $\tau$, so the results above in the one sample model apply to $\boldsymbol{X}$ and $\boldsymbol{Y}$ individually. Thus our interest in this section is in the relation between various $\boldsymbol{X}$ and $\boldsymbol{Y}$ statistics, and in properties of the sample covariance.
The bivariate (or more generally multivariate) model occurs naturally when considering two (or more) variables in the statistical experiment. For example, the heights of the fathers and the heights of the sons in Pearson's height data may well fit observations from the bivariate normal model.
In the notation that we have used previously, recall that $\sigma_3(X) = \sigma_3(Y) = 0$, $\sigma_4(X) = 3 \sigma^4$, $\sigma_4(Y) = 3 \tau^4$, $\operatorname{cov}(X, Y) = \rho \sigma \tau$, and $\operatorname{cor}(X, Y) = \rho$.
The data vector $((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$ has a multivariate normal distribution.
- (a) The mean vector has a block form, with each block being $(\mu, \nu)$.
- (b) The variance-covariance matrix has a block-diagonal form, with each block being $\begin{bmatrix} \sigma^2 & \rho \sigma \tau \\ \rho \sigma \tau & \tau^2 \end{bmatrix}$.
Details:
This follows from standard results for the multivariate normal distribution. Of course, the blocks in parts (a) and (b) are simply the mean vector and variance-covariance matrix of a single observation $(X, Y)$.
Sample Means
$(M(\boldsymbol{X}), M(\boldsymbol{Y}))$ has a bivariate normal distribution. The covariance and correlation are
- (a) $\operatorname{cov}[M(\boldsymbol{X}), M(\boldsymbol{Y})] = \frac{\rho \sigma \tau}{n}$
- (b) $\operatorname{cor}[M(\boldsymbol{X}), M(\boldsymbol{Y})] = \rho$
Details:
The bivariate normal distribution follows from [22] since $(M(\boldsymbol{X}), M(\boldsymbol{Y}))$ can be obtained from the data vector by a linear transformation. Parts (a) and (b) follow from the section on sample correlation.
Of course, we know the individual means and variances of $M(\boldsymbol{X})$ and $M(\boldsymbol{Y})$ from the one-sample model above. Hence we know the complete distribution of $(M(\boldsymbol{X}), M(\boldsymbol{Y}))$.
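The correlation result above can be checked by simulating bivariate normal samples, as in this sketch (illustrative parameters):

```python
import numpy as np

# Check that cor[M(X), M(Y)] = rho for bivariate normal samples.
rng = np.random.default_rng(7)
mu, nu, sigma, tau, rho = 0.0, 0.0, 1.0, 2.0, 0.6
n, N = 20, 100_000

mean = [mu, nu]
cov = [[sigma**2, rho * sigma * tau],
       [rho * sigma * tau, tau**2]]
data = rng.multivariate_normal(mean, cov, size=(N, n))  # shape (N, n, 2)
MX = data[..., 0].mean(axis=1)
MY = data[..., 1].mean(axis=1)

print(np.corrcoef(MX, MY)[0, 1], rho)         # should be close
```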
Sample Variances
The covariance and correlation between the special sample variances are
- (a) $\operatorname{cov}[W^2(\boldsymbol{X}), W^2(\boldsymbol{Y})] = \frac{2 \rho^2 \sigma^2 \tau^2}{n}$
- (b) $\operatorname{cor}[W^2(\boldsymbol{X}), W^2(\boldsymbol{Y})] = \rho^2$
Details:
These results follow from the section on sample correlation and the special form of $\sigma_4(X)$, $\sigma_4(Y)$, and $\operatorname{cov}[(X - \mu)^2, (Y - \nu)^2]$ for the bivariate normal distribution.
The covariance and correlation between the standard sample variances are
- (a) $\operatorname{cov}[S^2(\boldsymbol{X}), S^2(\boldsymbol{Y})] = \frac{2 \rho^2 \sigma^2 \tau^2}{n - 1}$
- (b) $\operatorname{cor}[S^2(\boldsymbol{X}), S^2(\boldsymbol{Y})] = \rho^2$
Details:
These results follow from the section on sample correlation and the special form of $\sigma_4(X)$, $\sigma_4(Y)$, $\operatorname{cov}[(X - \mu)^2, (Y - \nu)^2]$, and $\operatorname{cov}(X, Y)$ for the bivariate normal distribution.
Sample Covariance
If $\mu$ and $\nu$ are known (again usually an artificial assumption), a natural estimator of the distribution covariance $\operatorname{cov}(X, Y)$ is the special version of the sample covariance
\[ W(\boldsymbol{X}, \boldsymbol{Y}) = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)(Y_i - \nu) \]
The mean and variance of $W(\boldsymbol{X}, \boldsymbol{Y})$ are
- (a) $\mathbb{E}[W(\boldsymbol{X}, \boldsymbol{Y})] = \rho \sigma \tau$
- (b) $\operatorname{var}[W(\boldsymbol{X}, \boldsymbol{Y})] = \frac{\sigma^2 \tau^2 (1 + \rho^2)}{n}$
Details:
These results follow from the section on sample correlation and the special form of $\operatorname{cov}(X, Y)$ and $\operatorname{var}[(X - \mu)(Y - \nu)]$ for the bivariate normal distribution.
If $\mu$ and $\nu$ are unknown (again usually the case), then a natural estimator of the distribution covariance is the standard sample covariance
\[ S(\boldsymbol{X}, \boldsymbol{Y}) = \frac{1}{n - 1} \sum_{i=1}^n [X_i - M(\boldsymbol{X})][Y_i - M(\boldsymbol{Y})] \]
The mean and variance of the sample covariance are
- (a) $\mathbb{E}[S(\boldsymbol{X}, \boldsymbol{Y})] = \rho \sigma \tau$
- (b) $\operatorname{var}[S(\boldsymbol{X}, \boldsymbol{Y})] = \frac{\sigma^2 \tau^2 (1 + \rho^2)}{n - 1}$
Details:
These results follow from our previous general results and the special form of $\operatorname{cov}(X, Y)$ and $\operatorname{var}[(X - \mu)(Y - \nu)]$ for the bivariate normal distribution.
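As a final numerical check, the sketch below (illustrative parameters) verifies the mean and variance of the standard sample covariance by simulation:

```python
import numpy as np

# Check E[S(X, Y)] = rho * sigma * tau and
# var[S(X, Y)] = sigma^2 tau^2 (1 + rho^2) / (n - 1).
rng = np.random.default_rng(8)
mu, nu, sigma, tau, rho = 1.0, -1.0, 2.0, 1.5, 0.4
n, N = 15, 100_000

mean = [mu, nu]
cov = [[sigma**2, rho * sigma * tau],
       [rho * sigma * tau, tau**2]]
data = rng.multivariate_normal(mean, cov, size=(N, n))
X, Y = data[..., 0], data[..., 1]
SXY = ((X - X.mean(axis=1, keepdims=True))
       * (Y - Y.mean(axis=1, keepdims=True))).sum(axis=1) / (n - 1)

print(SXY.mean(), rho * sigma * tau)
print(SXY.var(ddof=1), sigma**2 * tau**2 * (1 + rho**2) / (n - 1))
```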
Computational Exercises
We use the basic notation established above for the samples $\boldsymbol{X}$ and $\boldsymbol{Y}$, and for the statistics $M$, $W^2$, $S^2$, $T$, and so forth.
Suppose that the net weights (in grams) of 25 bags of M&Ms form a random sample from the normal distribution with mean 50 and standard deviation 4. Find each of the following:
- The mean and standard deviation of $M$.
- The mean and standard deviation of $W^2$.
- The mean and standard deviation of $S^2$.
- The mean and standard deviation of $T$.
- .
- .
Details:
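For the mean and standard deviation parts, the values follow directly from the formulas of the one-sample model with $n = 25$, $\mu = 50$, and $\sigma = 4$; a short Python computation (the remaining parts of the exercise are not reproduced here):

```python
import numpy as np

n, mu, sigma = 25, 50.0, 4.0

print(mu, sigma / np.sqrt(n))                     # M: mean 50, sd 0.8
print(sigma**2, np.sqrt(2 * sigma**4 / n))        # W^2: mean 16, sd ~4.525
print(sigma**2, np.sqrt(2 * sigma**4 / (n - 1)))  # S^2: mean 16, sd ~4.619
print(0.0, np.sqrt((n - 1) / (n - 3)))            # T (t with 24 df): 0, ~1.044
```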
Suppose that the SAT math scores from 16 Alabama students form a random sample $\boldsymbol{X}$ from the normal distribution with mean 550 and standard deviation 20, while the SAT math scores from 25 Georgia students form a random sample $\boldsymbol{Y}$ from the normal distribution with mean 540 and standard deviation 15. The two samples are independent. Find each of the following:
- The mean and standard deviation of $M(\boldsymbol{X})$.
- The mean and standard deviation of $M(\boldsymbol{Y})$.
- The mean and standard deviation of $M(\boldsymbol{X}) - M(\boldsymbol{Y})$.
- .
- The mean and standard deviation of $S^2(\boldsymbol{X})$.
- The mean and standard deviation of $S^2(\boldsymbol{Y})$.
- The mean and standard deviation of $\frac{S^2(\boldsymbol{X}) / \sigma^2}{S^2(\boldsymbol{Y}) / \tau^2}$.
- .
Details:
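For the mean and standard deviation parts, the values again follow from the formulas above with $m = 16$, $\mu = 550$, $\sigma = 20$ and $n = 25$, $\nu = 540$, $\tau = 15$; a short Python computation (the probability parts are not reproduced here):

```python
import numpy as np

m, mu, sigma = 16, 550.0, 20.0
n, nu, tau = 25, 540.0, 15.0
j, k = m - 1, n - 1                                 # F degrees of freedom

print(mu, sigma / np.sqrt(m))                       # M(X): 550, 5
print(nu, tau / np.sqrt(n))                         # M(Y): 540, 3
print(mu - nu, np.sqrt(sigma**2 / m + tau**2 / n))  # difference: 10, ~5.83
print(sigma**2, np.sqrt(2 * sigma**4 / (m - 1)))    # S^2(X): 400, ~146.06
print(tau**2, np.sqrt(2 * tau**4 / (n - 1)))        # S^2(Y): 225, ~64.95
print(k / (k - 2),                                  # F ratio: mean ~1.09
      np.sqrt(2 * k**2 * (j + k - 2) / (j * (k - 2)**2 * (k - 4))))
```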