Recall that the expected value of a real-valued random variable is the mean of the variable, and is a measure of the center of the distribution. Recall also that by taking the expected value of various transformations of the variable, we can measure other interesting characteristics of the distribution. In this section, we will study expected values that measure the spread of the distribution about the mean.
As usual, we start with a random experiment, modeled by a probability space \((\Omega, \mathscr F, \P)\). So to review, \(\Omega\) is the set of outcomes, \(\mathscr F\) the collection of events, and \(\P\) the probability measure on the sample space \((\Omega, \mathscr F)\). Suppose that \(X\) is a random variable for the experiment, with values in \(S \subseteq \R\). Recall that \( \E(X) \), the expected value (or mean) of \(X\) gives the center of the distribution of \(X\).
The variance and standard deviation of \( X \) are defined by \[ \var(X) = \E\left([X - \E(X)]^2\right), \qquad \sd(X) = \sqrt{\var(X)} \]
Implicit in the definition is the assumption that the mean \( \E(X) \) exists, as a real number. If this is not the case, then \( \var(X) \) (and hence also \( \sd(X) \)) are undefined. Even if \( \E(X) \) does exist as a real number, it's possible that \( \var(X) = \infty \). For the remainder of our discussion of the basic theory, we will assume that expected values that are mentioned exist as real numbers.
The variance and standard deviation of \(X\) are both measures of the spread of the distribution about the mean. Variance (as we will see) has nicer mathematical properties, but its physical unit is the square of that of \( X \). Standard deviation, on the other hand, is not as nice mathematically, but has the advantage that its physical unit is the same as that of \( X \). When the random variable \(X\) is understood, the standard deviation is often denoted by \(\sigma\), so that the variance is \(\sigma^2\).
Recall that the second moment of \(X\) about \(a \in \R\) is \(\E\left[(X - a)^2\right]\). Thus, the variance is the second moment of \(X\) about the mean \(\mu = \E(X)\), or equivalently, the second central moment of \(X\). In general, the second moment of \( X \) about \( a \in \R\) can also be thought of as the mean square error if the constant \( a \) is used as an estimate of \( X \). In addition, second moments have a nice interpretation in physics. If we think of the distribution of \(X\) as a mass distribution in \(\R\), then the second moment of \(X\) about \(a \in \R\) is the moment of inertia of the mass distribution about \(a\). This is a measure of the resistance of the mass distribution to any change in its rotational motion about \(a\). In particular, the variance of \(X\) is the moment of inertia of the mass distribution about the center of mass \(\mu\).
The mean square error (or equivalently the moment of inertia) about \( a \) is minimized when \( a = \mu \):
Let \( \mse(a) = \E\left[(X - a)^2\right] \) for \( a \in \R \). Then \( \mse \) is minimized when \( a = \mu \), and the minimum value is \( \sigma^2 \).
Using the linearity of expected value, \[ \mse(a) = \E\left(X^2 - 2 a X + a^2\right) = \E(X^2) - 2 a \E(X) + a^2 \] Thus, the graph of \( \mse \) is a parabola opening upward, with vertex at \( a = \E(X) \).
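To make both the location of the minimum and the minimum value explicit, complete the square in the expression above: \[ \mse(a) = \E(X^2) - \mu^2 + (a - \mu)^2 = \sigma^2 + (a - \mu)^2, \quad a \in \R \] so \( \mse(a) \ge \sigma^2 \), with equality if and only if \( a = \mu \).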
The relationship between measures of center and measures of spread will be studied in more detail.
The following exercises give some basic properties of variance, which in turn rely on basic properties of expected value. As usual, be sure to try the proofs yourself before expanding the details. Our first results are computational formulas based on the change of variables formula for expected value.
Let \( \mu = \E(X) \). Then \( \var(X) = \sum_{x \in S} (x - \mu)^2 f(x) \) if \( X \) has a discrete distribution, and \( \var(X) = \int_S (x - \mu)^2 f(x) \, dx \) if \( X \) has a continuous distribution, where in both cases \( f \) is the probability density function of \( X \).
Our next result is a variance formula that is usually better than the definition for computational purposes.
\(\var(X) = \E(X^2) - [\E(X)]^2\).
Let \( \mu = \E(X) \). Using the linearity of expected value we have \[ \var(X) = \E[(X - \mu)^2] = \E(X^2 - 2 \mu X + \mu^2) = \E(X^2) - 2 \mu \E(X) + \mu^2 = \E(X^2) - 2 \mu^2 + \mu^2 = \E(X^2) - \mu^2 \]
Of course, by the change of variables formula, \( \E\left(X^2\right) = \sum_{x \in S} x^2 f(x) \) if \( X \) has a discrete distribution, and \( \E\left(X^2\right) = \int_S x^2 f(x) \, dx \) if \( X \) has a continuous distribution. In both cases, \( f \) is the probability density function of \( X \).
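As a quick illustration with a made-up distribution (not one of the named families studied below), suppose that \( X \) takes the values 0, 1, and 2 with probabilities \( \frac{1}{2} \), \( \frac{1}{4} \), and \( \frac{1}{4} \). Then \[ \E(X) = 0 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{4} + 2 \cdot \tfrac{1}{4} = \tfrac{3}{4}, \qquad \E(X^2) = 0 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{1}{4} = \tfrac{5}{4}, \qquad \var(X) = \tfrac{5}{4} - \left(\tfrac{3}{4}\right)^2 = \tfrac{11}{16} \]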
Variance is always nonnegative, since it's the expected value of a nonnegative random variable. Moreover, any random variable that really is random (not a constant) will have strictly positive variance.
The nonnegative property: \( \var(X) \ge 0 \), and \( \var(X) = 0 \) if and only if \( \P(X = \mu) = 1 \) where \( \mu = \E(X) \), so that \( X \) is essentially a constant.
These results follow from the basic positive property of expected value. Let \( \mu = \E(X) \). First \( (X - \mu)^2 \ge 0 \) with probability 1 so \( \E\left[(X - \mu)^2\right] \ge 0 \). In addition, \( \E\left[(X - \mu)^2\right] = 0 \) if and only if \( \P(X = \mu) = 1 \).
Our next result shows how the variance and standard deviation are changed by a linear transformation of the random variable. In particular, note that variance, unlike general expected value, is not a linear operation. This is not really surprising since the variance is the expected value of a nonlinear function of the variable: \( x \mapsto (x - \mu)^2 \).
If \(a, \, b \in \R\) then \( \var(a + b X) = b^2 \var(X) \) and \( \sd(a + b X) = \left|b\right| \sd(X) \).
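A short verification, using the linearity of expected value: since \( \E(a + b X) = a + b \E(X) \), the deviation of \( a + b X \) from its mean is \( b [X - \E(X)] \), so \[ \var(a + b X) = \E\left(b^2 [X - \E(X)]^2\right) = b^2 \var(X), \qquad \sd(a + b X) = \sqrt{b^2 \var(X)} = \left|b\right| \sd(X) \]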
Recall that when \( b \gt 0 \), the linear transformation \( x \mapsto a + b x \) is called a location-scale transformation and often corresponds to a change of location and change of scale in the physical units. For example, the change from inches to centimeters in a measurement of length is a scale transformation, and the change from Fahrenheit to Celsius in a measurement of temperature is both a location and scale transformation. The result above shows that when a location-scale transformation is applied to a random variable, the standard deviation does not depend on the location parameter, but is multiplied by the scale factor. There is a particularly important location-scale transformation.
Suppose that \( X \) is a random variable with mean \( \mu \) and variance \( \sigma^2 \). The random variable \( Z\) defined as follows is the standard score of \( X \). \[ Z = \frac{X - \mu}{\sigma} \]
Since \(X\) and its mean and standard deviation all have the same physical units, the standard score \(Z\) is dimensionless. It measures the directed distance from \(\E(X)\) to \(X\) in terms of standard deviations.
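In particular, combining the linearity of expected value with the scaling property of variance above shows that a standard score always has mean 0 and variance 1: \[ \E(Z) = \frac{\E(X) - \mu}{\sigma} = 0, \qquad \var(Z) = \frac{1}{\sigma^2} \var(X) = 1 \]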
Let \( Z \) denote the standard score of \( X \), and suppose that \( Y = a + b X \) where \( a, \, b \in \R \) and \( b \ne 0 \). If \( b \gt 0 \), then the standard score of \( Y \) is \( Z \); if \( b \lt 0 \), then the standard score of \( Y \) is \( -Z \).
\( \E(Y) = a + b \E(X) \) and \( \sd(Y) = \left|b\right| \, \sd(X) \). Hence \[ \frac{Y - \E(Y)}{\sd(Y)} = \frac{b}{\left|b\right|} \frac{X - \E(X)}{\sd(X)} \]
As just noted, when \( b \gt 0 \), the variable \(Y = a + b X \) is a location-scale transformation and often corresponds to a change of physical units. Since the standard score is dimensionless, it's reasonable that the standard scores of \( X \) and \( Y \) are the same. Here is another standardized measure of dispersion:
Suppose that \(X\) is a random variable with \(\E(X) \ne 0\). The coefficient of variation is the ratio of the standard deviation to the mean: \[ \text{cv}(X) = \frac{\sd(X)}{\E(X)} \]
The coefficient of variation is also dimensionless, and is sometimes used to compare variability for random variables with different means. The variance of the sum of two random variables can be computed using covariance.
Chebyshev's inequality (named after Pafnuty Chebyshev) gives an upper bound on the probability that a random variable will be more than a specified distance from its mean. This is often useful in applied problems where the distribution is unknown, but the mean and variance are known (at least approximately). In the following two results, suppose that \(X\) is a real-valued random variable with mean \(\mu = \E(X) \in \R\) and standard deviation \(\sigma = \sd(X) \in (0, \infty)\).
Chebyshev's inequality 1. \[ \P\left(\left|X - \mu\right| \ge t\right) \le \frac{\sigma^2}{t^2}, \quad t \gt 0 \]
From Markov's inequality, \(\P\left(\left|X - \mu\right| \ge t\right) = \P\left[(X - \mu)^2 \ge t^2\right] \le \E\left[(X - \mu)^2\right] \big/ t^2 = \sigma^2 \big/ t^2\) .
Here's an alternate version, with the distance in terms of standard deviation.
Chebyshev's inequality 2. \[\P\left(\left|X - \mu\right| \ge k \sigma\right) \le \frac{1}{k^2}, \quad k \gt 0 \]
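This version follows from the first by substituting \( t = k \sigma \): \[ \P\left(\left|X - \mu\right| \ge k \sigma\right) \le \frac{\sigma^2}{(k \sigma)^2} = \frac{1}{k^2} \]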
The usefulness of the Chebyshev inequality comes from the fact that it holds for any distribution (assuming only that the mean and variance exist). The tradeoff is that for many specific distributions, the Chebyshev bound is rather crude. Note in particular that the first inequality is useless when \(t \le \sigma\), and the second inequality is useless when \( k \le 1 \), since 1 is an upper bound for the probability of any event. On the other hand, it's easy to construct a distribution for which Chebyshev's inequality is sharp for a specified value of \( t \in (0, \infty) \). Such a distribution is given in an exercise below.
As always, be sure to try the problems yourself before looking at the details.
Suppose that \(X\) is an indicator variable with \(p = \P(X = 1)\), where \(p \in [0, 1]\). Then \( \E(X) = p \) and \( \var(X) = p (1 - p) \).
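For a quick derivation, note that \( X^2 = X \) since \( X \) takes only the values 0 and 1, so by the computational formula above, \[ \var(X) = \E(X^2) - [\E(X)]^2 = p - p^2 = p (1 - p) \]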
The graph of \(\var(X)\) as a function of \(p\) is a parabola, opening downward, with roots at 0 and 1. Thus the minimum value of \(\var(X)\) is 0, and occurs when \(p = 0\) or \(p = 1\) (when \( X \) is deterministic, of course). The maximum value is \(\frac{1}{4}\) and occurs when \(p = \frac{1}{2}\).
Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set. The mean and variance have simple forms for the discrete uniform distribution on a set of evenly spaced points (sometimes referred to as a discrete interval):
Suppose that \(X\) has the discrete uniform distribution on \(\{a, a + h, \ldots, a + (n - 1) h\}\) where \( a \in \R \), \( h \in (0, \infty) \), and \( n \in \N_+ \). Let \( b = a + (n - 1) h \), the right endpoint. Then \( \E(X) = \frac{a + b}{2} \) and \( \var(X) = \frac{(b - a)(b - a + 2 h)}{12} = \frac{h^2 (n^2 - 1)}{12} \).
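One way to verify these formulas is to write \( X = a + h U \), where \( U \) is uniformly distributed on \( \{0, 1, \ldots, n - 1\} \). Standard sums give \( \E(U) = \frac{n - 1}{2} \) and \( \E(U^2) = \frac{(n - 1)(2 n - 1)}{6} \), so \( \var(U) = \frac{n^2 - 1}{12} \). The location-scale properties above then give \[ \E(X) = a + h \frac{n - 1}{2} = \frac{a + b}{2}, \qquad \var(X) = h^2 \frac{n^2 - 1}{12} = \frac{(b - a)(b - a + 2 h)}{12} \]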
Note that the mean is simply the average of the endpoints, while the variance depends only on the difference between the endpoints and the step size.
Open the special distribution simulator, and select the discrete uniform distribution. Vary the parameters and note the location and size of the mean \(\pm\) standard deviation bar in relation to the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in various geometric probability models and in a variety of other applied problems.
Suppose that \(X\) has the continuous uniform distribution on the interval \([a, b]\) where \( a, \, b \in \R \) with \( a \lt b \). Then \( \E(X) = \frac{a + b}{2} \) and \( \var(X) = \frac{(b - a)^2}{12} \).
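The derivation is a routine pair of integrals against the density \( f(x) = \frac{1}{b - a} \) for \( x \in [a, b] \): \[ \E(X) = \int_a^b \frac{x}{b - a} \, dx = \frac{a + b}{2}, \qquad \E(X^2) = \int_a^b \frac{x^2}{b - a} \, dx = \frac{a^2 + a b + b^2}{3} \] so \( \var(X) = \frac{a^2 + a b + b^2}{3} - \left(\frac{a + b}{2}\right)^2 = \frac{(b - a)^2}{12} \).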
Note that the mean is the midpoint of the interval and the variance depends only on the length of the interval. Compare this with the results in the discrete case.
Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution on the interval \( [a, a + w] \). Vary the parameters and note the location and size of the mean \(\pm\) standard deviation bar in relation to the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice. Here are three: an ace-six flat die, in which faces 1 and 6 have probability \( \frac{1}{4} \) each while faces 2, 3, 4, and 5 have probability \( \frac{1}{8} \) each; a two-five flat die, in which faces 2 and 5 have probability \( \frac{1}{4} \) each while the other faces have probability \( \frac{1}{8} \) each; and a three-four flat die, in which faces 3 and 4 have probability \( \frac{1}{4} \) each while the other faces have probability \( \frac{1}{8} \) each.
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (\( \frac{1}{4} \) and \( \frac{1}{8} \)) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities than the other four faces. Flat dice are sometimes used by gamblers to cheat. In the following problems, you will compute the mean and variance for each of the various types of dice. Be sure to compare the results.
A standard, fair die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
An ace-six flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
A two-five flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
A three-four flat die is thrown and the score \(X\) is recorded. Sketch the graph of the probability density function and compute each of the following:
In the dice experiment, select one die. For each of the following cases, note the location and size of the mean \(\pm\) standard deviation bar in relation to the probability density function. Run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Recall that the Poisson distribution is a discrete distribution on \( \N \) with probability density function \( f \) given by
\[ f(n) = e^{-a} \, \frac{a^n}{n!}, \quad n \in \N\]
where \(a \in (0, \infty)\) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of random points
in a region of time or space; the parameter \(a\) is proportional to the size of the region.
Suppose that \(N\) has the Poisson distribution with parameter \(a\). Then \( \E(N) = a \) and \( \var(N) = a \).
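Both formulas follow from the exponential series. For the mean and the second factorial moment, \[ \E(N) = \sum_{n=1}^\infty n e^{-a} \frac{a^n}{n!} = a e^{-a} \sum_{n=1}^\infty \frac{a^{n-1}}{(n - 1)!} = a, \qquad \E\left[N(N - 1)\right] = \sum_{n=2}^\infty n (n - 1) e^{-a} \frac{a^n}{n!} = a^2 \] so \( \E(N^2) = a^2 + a \) and \( \var(N) = \E(N^2) - a^2 = a \).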
So the parameter of the Poisson distribution is both the mean and the variance of the distribution.
In the Poisson experiment, the parameter is \(a = r t\). Vary the parameter and note the size and location of the mean \(\pm\) standard deviation bar in relation to the probability density function. For selected values of the parameter, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Recall that Bernoulli trials, named for Jacob Bernoulli, are independent trials each with two outcomes, which in the language of reliability, are called success and failure. The probability of success on each trial is \( p \in [0, 1] \). If \( p \in (0, 1] \), the trial number \( N \) of the first success has the geometric distribution on \(\N_+\) with success parameter \(p\). The probability density function \( f \) of \( N \) is given by \[ f(n) = p (1 - p)^{n - 1}, \quad n \in \N_+ \]
Suppose that \(N\) has the geometric distribution on \(\N_+\) with success parameter \(p \in (0, 1]\). Then \( \E(N) = \frac{1}{p} \) and \( \var(N) = \frac{1 - p}{p^2} \).
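A standard derivation uses derivatives of the geometric series. With \( q = 1 - p \), \[ \E(N) = \sum_{n=1}^\infty n p q^{n-1} = \frac{p}{(1 - q)^2} = \frac{1}{p}, \qquad \E\left[N(N - 1)\right] = \sum_{n=2}^\infty n (n - 1) p q^{n-1} = \frac{2 p q}{(1 - q)^3} = \frac{2 q}{p^2} \] so \( \E(N^2) = \frac{2 q}{p^2} + \frac{1}{p} \) and \( \var(N) = \frac{2 q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2} = \frac{1 - p}{p^2} \).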
Note that the variance is 0 when \(p = 1\), not surprising since \( N \) is deterministic in this case.
In the negative binomial experiment, set \(k = 1\) to get the geometric distribution. Vary \(p\) with the scrollbar and note the size and location of the mean \(\pm\) standard deviation bar in relation to the probability density function. For selected values of \(p\), run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Suppose that \(N\) has the geometric distribution with parameter \(p = \frac{3}{4}\). Compute the true value and the Chebyshev bound for the probability that \(N\) is at least 2 standard deviations away from the mean.
Recall that the exponential distribution is a continuous distribution on \( [0, \infty) \) with probability density function \( f \) given by
\[ f(t) = r e^{-r t}, \quad t \in [0, \infty) \]
where \(r \in (0, \infty)\) is the rate parameter. This distribution is widely used to model failure times and other arrival times, particularly in the context of the Poisson model.
Suppose that \(T\) has the exponential distribution with rate parameter \(r\). Then \( \E(T) = \frac{1}{r} \) and \( \var(T) = \frac{1}{r^2} \), so \( \sd(T) = \frac{1}{r} \).
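Both moments can be computed by integration by parts: \[ \E(T) = \int_0^\infty t \, r e^{-r t} \, dt = \frac{1}{r}, \qquad \E(T^2) = \int_0^\infty t^2 \, r e^{-r t} \, dt = \frac{2}{r^2} \] so \( \var(T) = \frac{2}{r^2} - \frac{1}{r^2} = \frac{1}{r^2} \).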
Thus, for the exponential distribution, the mean and standard deviation are the same.
In the gamma experiment, set \(k = 1\) to get the exponential distribution. Vary \(r\) with the scrollbar and note the size and location of the mean \(\pm\) standard deviation bar in relation to the probability density function. For selected values of \(r\), run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Suppose that \(X\) has the exponential distribution with rate parameter \(r \gt 0\). Compute the true value and the Chebyshev bound for the probability that \(X\) is at least \(k\) standard deviations away from the mean.
Recall that the Pareto distribution is a continuous distribution on \( [1, \infty) \) with probability density function \( f \) given by \[ f(x) = \frac{a}{x^{a + 1}}, \quad x \in [1, \infty) \] where \(a \in (0, \infty)\) is a parameter. The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution that is widely used to model financial variables such as income.
Suppose that \(X\) has the Pareto distribution with shape parameter \(a\). Then \( \E(X) = \frac{a}{a - 1} \) if \( a \gt 1 \), while \( \E(X) = \infty \) if \( 0 \lt a \le 1 \). Similarly, \( \var(X) = \frac{a}{(a - 1)^2 (a - 2)} \) if \( a \gt 2 \), while \( \var(X) = \infty \) if \( 1 \lt a \le 2 \), and \( \var(X) \) is undefined if \( 0 \lt a \le 1 \).
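For \( a \gt 2 \), the relevant integrals converge and give \[ \E(X) = \int_1^\infty x \, \frac{a}{x^{a+1}} \, dx = \frac{a}{a - 1}, \qquad \E(X^2) = \int_1^\infty x^2 \, \frac{a}{x^{a+1}} \, dx = \frac{a}{a - 2} \] so \( \var(X) = \frac{a}{a - 2} - \left(\frac{a}{a - 1}\right)^2 = \frac{a}{(a - 1)^2 (a - 2)} \). For smaller values of \( a \), the corresponding integrals diverge.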
In the special distribution simulator, select the Pareto distribution. Vary \(a\) with the scrollbar and note the size and location of the mean \(\pm\) standard deviation bar. For each of the following values of \(a\), run the experiment 1000 times and note the behavior of the empirical mean and standard deviation.
Recall that the standard normal distribution is a continuous distribution on \( \R \) with probability density function \( \phi \) given by \[ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \R \] Normal distributions are widely used to model physical measurements subject to small, random errors.
Suppose that \(Z\) has the standard normal distribution. Then \( \E(Z) = 0 \) and \( \var(Z) = 1 \).
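The mean is 0 because \( z \mapsto z \phi(z) \) is an odd integrable function. For the variance, note that \( \phi'(z) = -z \phi(z) \), so integration by parts gives \[ \var(Z) = \E(Z^2) = \int_{-\infty}^\infty z^2 \phi(z) \, dz = \Big[ -z \phi(z) \Big]_{-\infty}^\infty + \int_{-\infty}^\infty \phi(z) \, dz = 0 + 1 = 1 \]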
More generally, for \(\mu \in \R\) and \(\sigma \in (0, \infty)\), recall that the normal distribution with location parameter \(\mu\) and scale parameter \(\sigma\) is a continuous distribution on \( \R \) with probability density function \( f \) given by \[ f(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \R \] Moreover, if \( Z \) has the standard normal distribution, then \( X = \mu + \sigma Z \) has the normal distribution with location parameter \( \mu \) and scale parameter \( \sigma \). As the notation suggests, the location parameter is the mean of the distribution and the scale parameter is the standard deviation.
Suppose that \( X \) has the normal distribution with location parameter \(\mu\) and scale parameter \(\sigma\). Then \( \E(X) = \mu \) and \( \var(X) = \sigma^2 \), so \( \sd(X) = \sigma \).
We could use the probability density function, of course, but it's much better to use the representation of \( X \) in terms of the standard normal variable \( Z \), and use properties of expected value and variance.
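Explicitly, using the representation \( X = \mu + \sigma Z \) and the results for \( Z \) above, \[ \E(X) = \mu + \sigma \E(Z) = \mu, \qquad \var(X) = \sigma^2 \var(Z) = \sigma^2 \]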
So to summarize, if \( X \) has a normal distribution, then its standard score \( Z \) has the standard normal distribution.
In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the mean \(\pm\) standard deviation bar in relation to the probability density function. For selected parameter values, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions and probabilities.
Suppose that \(X\) has a beta distribution with probability density function \(f\). In each case below, sketch the graph of \(f\) and compute the mean and variance.
In the special distribution simulator, select the beta distribution. The parameter values below give the distributions in the previous exercise. In each case, note the location and size of the mean \(\pm\) standard deviation bar. Run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Suppose that a sphere has a random radius \(R\) with probability density function \(f\) given by \(f(r) = 12 r ^2 (1 - r)\) for \(r \in [0, 1]\). Find the mean and standard deviation of each of the following:
Suppose that \(X\) has probability density function \(f\) given by \(f(x) = \frac{1}{\pi \sqrt{x (1 - x)}}\) for \(x \in (0, 1)\). Find
The beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that the standard Brownian motion process hits 0 during the time interval \( [0, 1] \).
Open the Brownian motion experiment and select the last zero. Note the location and size of the mean \( \pm \) standard deviation bar in relation to the probability density function. Run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Suppose that the grades on a test are described by the random variable \( Y = 100 X \) where \( X \) has the beta distribution with probability density function \( f \) given by \( f(x) = 12 x (1 - x)^2 \) for \( x \in [0, 1] \). The grades are generally low, so the teacher decides to curve
the grades using the transformation \( Z = 10 \sqrt{Y} = 100 \sqrt{X}\). Find the mean and standard deviation of each of the following variables:
Suppose that \(X\) is a real-valued random variable with \(\E(X) = 5\) and \(\var(X) = 4\). Find each of the following:
Suppose that \(X\) is a real-valued random variable with \(\E(X) = 2\) and \(\E\left[X(X - 1)\right] = 8\). Find each of the following:
The expected value \(\E\left[X(X - 1)\right]\) is an example of a factorial moment.
Suppose that \(X_1\) and \(X_2\) are independent, real-valued random variables with \(\E(X_i) = \mu_i\) and \(\var(X_i) = \sigma_i^2\) for \(i \in \{1, 2\}\). Then
Marilyn Vos Savant has an IQ of 228. Assuming that the distribution of IQ scores has mean 100 and standard deviation 15, find Marilyn's standard score.
\(z = \frac{228 - 100}{15} = \frac{128}{15} \approx 8.53\)
Fix \( t \in (0, \infty) \). Suppose that \( X \) is the discrete random variable with probability density function defined by \( \P(X = t) = \P(X = -t) = p \), \( \P(X = 0) = 1 - 2 p \), where \( p \in (0, \frac{1}{2}) \). Then equality holds in Chebyshev's inequality at \( t \).
Note that \( \E(X) = 0 \) and \( \var(X) = \E(X^2) = 2 p t^2 \). So \( \P(|X| \ge t) = 2 p \) and \( \sigma^2 / t^2 = 2 p \).