Consider again the basic statistical model, in which we have a random experiment that results in an observable random variable \(\bs{X}\) with values in a set \(S\). Once again, the experiment is typically to sample \(n\) objects from a population and record one or more measurements for each item. In this case, the observable random variable has the form \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th item.
Suppose that \(\theta\) is a real parameter of the distribution of \(\bs{X}\), with values in a parameter set \(T\). Let \(f_\theta\) denote the probability density function of \(\bs{X}\) for \(\theta \in T\). Note that the expected value, variance, and covariance operators also depend on \(\theta\), although we will sometimes suppress this to keep the notation from becoming too unwieldy.
Suppose now that \(\lambda = \lambda(\theta)\) is a parameter of interest that is derived from \(\theta\). (Of course, \(\lambda\) might be \(\theta\) itself, but more generally might be a function of \(\theta\).) In this section we will consider the general problem of finding the best estimator of \(\lambda\) among a given class of unbiased estimators. Recall that if \(U\) is an unbiased estimator of \(\lambda\), then \(\var_\theta(U)\) is the mean square error. Mean square error is our measure of the quality of unbiased estimators, so the following definitions are natural.
Suppose that \(U\) and \(V\) are unbiased estimators of \(\lambda\). It may be the case that \(U\) has smaller variance for some values of \(\theta\) while \(V\) has smaller variance for other values of \(\theta\), so that neither estimator is uniformly better than the other. Of course, a uniformly minimum variance unbiased estimator, one whose variance is smallest for every \(\theta \in T\), is the best we can hope for.
We will show that under mild conditions, there is a lower bound on the variance of any unbiased estimator of the parameter \(\lambda\). Thus, if we can find an estimator that achieves this lower bound for all \(\theta\), then the estimator must be an UMVUE of \(\lambda\). The derivative of the log likelihood function, sometimes called the score, will play a critical role in our analysis. A lesser but still important role is played by the negative of the second derivative of the log-likelihood function. Life will be much easier if we give these functions names.
For \(\bs{x} \in S\) and \(\theta \in T\), define \begin{align} L_1(\bs{x}, \theta) & = \frac{d}{d \theta} \ln\left[f_\theta(\bs{x})\right] \\ L_2(\bs{x}, \theta) & = -\frac{d}{d \theta} L_1(\bs{x}, \theta) = -\frac{d^2}{d \theta^2} \ln\left[f_\theta(\bs{x})\right] \end{align}
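For concreteness, here is a minimal computational sketch of these definitions, assuming (purely for illustration) a single observation from a normal distribution with unknown mean \(\theta\) and known variance 1:

```python
# A minimal sketch of the definitions above, assuming a normal density with
# unknown mean theta and known variance 1 (an illustrative choice only).
import sympy as sp

x, theta = sp.symbols('x theta', real=True)
f = sp.exp(-(x - theta)**2 / 2) / sp.sqrt(2 * sp.pi)  # density f_theta(x)

L1 = sp.diff(sp.log(f), theta)   # score: d/dtheta of ln f_theta(x)
L2 = -sp.diff(L1, theta)         # negative second derivative of ln f_theta(x)

print(sp.simplify(L1))  # x - theta
print(sp.simplify(L2))  # 1
```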
In the rest of this subsection, we consider statistics \(h(\bs{X})\) where \(h: S \to \R\) (and so in particular, \(h\) does not depend on \(\theta\)). We need a fundamental assumption:
We will consider only statistics \( h(\bs{X}) \) with \(\E_\theta\left[h^2(\bs{X})\right] \lt \infty\) for \(\theta \in T\). We also assume that \[ \frac{d}{d \theta} \E_\theta\left[h(\bs{X})\right] = \E_\theta\left[h(\bs{X}) L_1(\bs{X}, \theta)\right] \] This is equivalent to the assumption that the derivative operator \(d / d\theta\) can be interchanged with the expected value operator \(\E_\theta\).
Note first that \[\frac{d}{d \theta} \E_\theta\left[h(\bs{X})\right] = \frac{d}{d \theta} \int_S h(\bs{x}) f_\theta(\bs{x}) \, d \bs{x}\] On the other hand, \begin{align} \E_\theta\left[h(\bs{X}) L_1(\bs{X}, \theta)\right] & = \E_\theta\left[h(\bs{X}) \frac{d}{d \theta} \ln\left[f_\theta(\bs{X})\right] \right] = \int_S h(\bs{x}) \frac{d}{d \theta} \ln\left[f_\theta(\bs{x})\right] f_\theta(\bs{x}) \, d \bs{x} \\ & = \int_S h(\bs{x}) \frac{\frac{d}{d \theta} f_\theta(\bs{x})}{f_\theta(\bs{x})} f_\theta(\bs{x}) \, d \bs{x} = \int_S h(\bs{x}) \frac{d}{d \theta} f_\theta(\bs{x}) \, d \bs{x} = \int_S \frac{d}{d \theta} \left[h(\bs{x}) f_\theta(\bs{x})\right] \, d \bs{x} \end{align} Thus the two expressions are the same if and only if we can interchange the derivative and integral operators.
Generally speaking, the fundamental assumption will be satisfied if \(f_\theta(\bs{x})\) is differentiable as a function of \(\theta\), with a derivative that is jointly continuous in \(\bs{x}\) and \(\theta\), and if the support set \(\left\{\bs{x} \in S: f_\theta(\bs{x}) \gt 0 \right\}\) does not depend on \(\theta\).
\(\E_\theta\left[L_1(\bs{X}, \theta)\right] = 0\) for \(\theta \in T\). This follows from the fundamental assumption applied to the constant statistic \(h(\bs{X}) = 1\).
If \(h(\bs{X})\) is a statistic then, since \(L_1(\bs{X}, \theta)\) has mean 0, \begin{align} \cov_\theta\left[h(\bs{X}), L_1(\bs{X}, \theta)\right] & = \E_\theta\left[h(\bs{X}) L_1(\bs{X}, \theta)\right] = \frac{d}{d \theta} \E_\theta\left[h(\bs{X})\right] \\ \var_\theta\left[L_1(\bs{X}, \theta)\right] & = \E_\theta\left[L_1^2(\bs{X}, \theta)\right] \end{align}
The following theorem gives the general Cramér-Rao lower bound on the variance of a statistic. The lower bound is named for Harald Cramér and C. R. Rao:
If \(h(\bs{X})\) is a statistic then \[ \var_\theta\left[h(\bs{X})\right] \ge \frac{\left[\frac{d}{d \theta} \E_\theta\left[h(\bs{X})\right] \right]^2}{\E_\theta\left[L_1^2(\bs{X}, \theta)\right]} \]
From the correlation inequality, \[\cov_\theta^2\left[h(\bs{X}), L_1(\bs{X}, \theta)\right] \le \var_\theta\left[h(\bs{X})\right] \var_\theta\left[L_1(\bs{X}, \theta)\right]\] The result now follows from the previous two results.
We can now give the first version of the Cramér-Rao lower bound for unbiased estimators of a parameter.
Suppose now that \(\lambda(\theta)\) is a parameter of interest and \(h(\bs{X})\) is an unbiased estimator of \(\lambda\). Then \[ \var_\theta\left[h(\bs{X})\right] \ge \frac{\left[d\lambda / d\theta\right]^2}{\E_\theta\left[L_1^2(\bs{X}, \theta)\right]} \]
An estimator of \(\lambda\) that achieves the Cramér-Rao lower bound must be a uniformly minimum variance unbiased estimator (UMVUE) of \(\lambda\).
Equality holds in the Cramér-Rao inequality, and hence \(h(\bs{X})\) is an UMVUE, if and only if there exists a function \(u(\theta)\) such that (with probability 1) \[ h(\bs{X}) = \lambda(\theta) + u(\theta) L_1(\bs{X}, \theta) \]
Equality holds in the correlation inequality if and only if the random variables are linear transformations of each other. Recall also that \(L_1(\bs{X}, \theta)\) has mean 0.
The quantity \(\E_\theta\left[L_1^2(\bs{X}, \theta)\right]\) that occurs in the denominator of the lower bounds above is called the Fisher information number of \(\bs{X}\), named after Sir Ronald Fisher. The following theorem gives an alternate version of the Fisher information number that is usually computationally better.
If the appropriate derivatives exist and if the appropriate interchanges are permissible then \[ \E_\theta\left[L_1^2(\bs{X}, \theta)\right] = \E_\theta\left[L_2(\bs{X}, \theta)\right] \]
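A quick numerical illustration of this identity for a single observation, using the Poisson distribution that appears later in this section and an assumed parameter value \(\theta = 2.5\) (both quantities should come out to approximately \(1 / \theta = 0.4\)):

```python
# Numerical check of E[L_1^2] = E[L_2] for a single Poisson observation with
# assumed parameter theta = 2.5; for the Poisson pmf, l(x, theta) = x/theta - 1
# and l_2(x, theta) = x/theta^2.
import numpy as np
from scipy.stats import poisson

theta = 2.5
x = np.arange(0, 200)          # truncation; the tail mass is negligible here
p = poisson.pmf(x, theta)

l1 = x / theta - 1             # d/dtheta of ln g_theta(x)
l2 = x / theta**2              # -(d^2/dtheta^2) of ln g_theta(x)

print(np.sum(l1**2 * p))       # approx 0.4
print(np.sum(l2 * p))          # approx 0.4
```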
The following theorem gives the second version of the Cramér-Rao lower bound for unbiased estimators of a parameter.
If \(\lambda(\theta)\) is a parameter of interest and \(h(\bs{X})\) is an unbiased estimator of \(\lambda\) then
\[ \var_\theta\left[h(\bs{X})\right] \ge \frac{\left[d\lambda / d\theta\right]^2}{\E_\theta\left[L_2(\bs{X}, \theta)\right]} \]

Suppose now that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution of a random variable \(X\) having probability density function \(g_\theta\) and taking values in a set \(R\). Thus \(S = R^n\). We will use lower-case letters for the derivative of the log likelihood function of \(X\) and the negative of the second derivative of the log likelihood function of \(X\).
For \(x \in R\) and \(\theta \in T\) define \begin{align} l(x, \theta) & = \frac{d}{d\theta} \ln\left[g_\theta(x)\right] \\ l_2(x, \theta) & = -\frac{d^2}{d\theta^2} \ln\left[g_\theta(x)\right] \end{align}
\(L_1^2\) can be written in terms of \(l^2\) and \(L_2\) can be written in terms of \(l_2\): since \(f_\theta(\bs{x}) = g_\theta(x_1) g_\theta(x_2) \cdots g_\theta(x_n)\), we have \(L_1(\bs{X}, \theta) = \sum_{i=1}^n l(X_i, \theta)\) and \(L_2(\bs{X}, \theta) = \sum_{i=1}^n l_2(X_i, \theta)\). Since the terms in the first sum are independent with mean 0, it follows that \begin{align} \E_\theta\left[L_1^2(\bs{X}, \theta)\right] & = n \E_\theta\left[l^2(X, \theta)\right] \\ \E_\theta\left[L_2(\bs{X}, \theta)\right] & = n \E_\theta\left[l_2(X, \theta)\right] \end{align}
The following theorem gives the second version of the general Cramér-Rao lower bound on the variance of a statistic, specialized for random samples.
If \( h(\bs{X}) \) is a statistic then
\[ \var_\theta\left[h(\bs{X})\right] \ge \frac{\left[\frac{d}{d\theta} \E_\theta\left[h(\bs{X})\right] \right]^2}{n \E_\theta\left[l^2(X, \theta)\right]} \]

The following theorem gives the third version of the Cramér-Rao lower bound for unbiased estimators of a parameter, specialized for random samples.
Suppose now that \(\lambda(\theta)\) is a parameter of interest and \(h(\bs{X})\) is an unbiased estimator of \(\lambda\). Then \[ \var_\theta\left[h(\bs{X})\right] \ge \frac{(d\lambda / d\theta)^2}{n \E_\theta\left[l^2(X, \theta)\right]} \]
Note that the Cramér-Rao lower bound varies inversely with the sample size \(n\). The following theorem gives the fourth version of the Cramér-Rao lower bound for unbiased estimators of a parameter, again specialized for random samples.
If the appropriate derivatives exist and the appropriate interchanges are permissible then \[ \var_\theta\left[h(\bs{X})\right] \ge \frac{\left[d\lambda / d\theta\right]^2}{n \E_\theta\left[l_2(X, \theta)\right]} \]
To summarize, we have four versions of the Cramér-Rao lower bound for the variance of an unbiased estimate of \(\lambda\): two in the general case, and two in the special case that \(\bs{X}\) is a random sample from the distribution of \(X\). If an unbiased estimator of \(\lambda\) achieves the lower bound, then the estimator is an UMVUE.
We will apply the results above to several parametric families of distributions. First we need to recall some standard notation. Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the distribution of a real-valued random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\). The sample mean is \[ M = \frac{1}{n} \sum_{i=1}^n X_i \] Recall that \(\E(M) = \mu\) and \(\var(M) = \sigma^2 / n\). The special version of the sample variance, when \(\mu\) is known, and standard version of the sample variance are, respectively, \begin{align} W^2 & = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \\ S^2 & = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2 \end{align}
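To make the notation concrete, here is a minimal Python sketch that computes \(M\), \(W^2\), and \(S^2\) for a simulated sample; the normal distribution and the parameter values are assumed purely for illustration:

```python
# Computing the sample mean M, the special sample variance W^2 (mu known),
# and the standard sample variance S^2, for an assumed simulated sample.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 100
x = rng.normal(mu, sigma, size=n)

M = x.mean()                           # sample mean
W2 = np.mean((x - mu) ** 2)            # special sample variance (mu known)
S2 = np.sum((x - M) ** 2) / (n - 1)    # standard sample variance

print(M, W2, S2)
```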
Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with unknown success parameter \(p \in (0, 1)\). In the usual language of reliability, \(X_i = 1\) means success on trial \(i\) and \(X_i = 0\) means failure on trial \(i\); the distribution is named for Jacob Bernoulli. Recall that the Bernoulli distribution has probability density function \[ g_p(x) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\} \] The fundamental assumption is satisfied. Moreover, recall that the mean of the Bernoulli distribution is \(p\), while the variance is \(p (1 - p)\).
The sample mean \(M\) (which is the proportion of successes) attains the Cramér-Rao lower bound and hence is an UMVUE of \(p\).
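As a sketch of the computation behind this claim: for the Bernoulli pdf, \[ l(x, p) = \frac{d}{dp}\left[x \ln p + (1 - x) \ln(1 - p)\right] = \frac{x}{p} - \frac{1 - x}{1 - p} = \frac{x - p}{p (1 - p)} \] so \(\E_p\left[l^2(X, p)\right] = \var_p(X) \big/ \left[p (1 - p)\right]^2 = 1 \big/ \left[p (1 - p)\right]\). The Cramér-Rao lower bound for unbiased estimators of \(p\) is therefore \(p (1 - p) / n\), which is exactly \(\var_p(M)\).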
Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Poisson distribution with parameter \(\theta \in (0, \infty)\). Recall that this distribution is often used to model the number of random points
in a region of time or space, particularly in the context of the Poisson process. The Poisson distribution is named for Simeon Poisson and has probability density function
\[ g_\theta(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x \in \N \]
The fundamental assumption is satisfied. Recall also that the mean and variance of the distribution are both \(\theta\).
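A sketch of the corresponding computation, along the same lines as the Bernoulli case: \(l(x, \theta) = x / \theta - 1\), so \(\E_\theta\left[l^2(X, \theta)\right] = \var_\theta(X) / \theta^2 = 1 / \theta\), and the Cramér-Rao lower bound for unbiased estimators of \(\theta\) is \(\theta / n\). This is the variance of the sample mean \(M\), so \(M\) attains the bound and is an UMVUE of \(\theta\).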
Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the normal distribution with mean \(\mu \in \R\) and variance \(\sigma^2 \in (0, \infty)\). Recall that the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors, and has probability density function \[ g_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2 \, \pi} \sigma} \exp\left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2 \right], \quad x \in \R\]
The fundamental assumption is satisfied with respect to both of these parameters. Recall also that the fourth central moment is \(\E\left[(X - \mu)^4\right] = 3 \, \sigma^4\).
The Cramér-Rao lower bound for the variance of unbiased estimators of \(\sigma^2\) is \(\frac{2 \sigma^4}{n}\).
The sample variance \(S^2\) has variance \(\frac{2 \sigma^4}{n-1}\) and hence does not attain the Cramér-Rao lower bound.
If \(\mu\) is known, then the special sample variance \(W^2\) attains the Cramér-Rao lower bound and hence is an UMVUE of \(\sigma^2\).
If \(\mu\) is unknown, no unbiased estimator of \(\sigma^2\) attains the Cramér-Rao lower bound.
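A minimal Monte Carlo sketch of these comparisons, with assumed values \(\mu = 0\), \(\sigma = 2\), and \(n = 10\):

```python
# Monte Carlo comparison of var(S^2) and var(W^2) with the Cramér-Rao lower
# bound 2*sigma^4/n for unbiased estimators of sigma^2 (assumed values below).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

M = x.mean(axis=1)
S2 = np.sum((x - M[:, None]) ** 2, axis=1) / (n - 1)
W2 = np.mean((x - mu) ** 2, axis=1)

print(S2.var(), 2 * sigma**4 / (n - 1))   # both approx 3.56
print(W2.var(), 2 * sigma**4 / n)         # both approx 3.20
```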
Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the gamma distribution with known shape parameter \(k \gt 0\) and unknown scale parameter \(b \gt 0\). The gamma distribution is often used to model random times and certain other types of positive random variables. The probability density function is \[ g_b(x) = \frac{1}{\Gamma(k) b^k} x^{k-1} e^{-x/b}, \quad x \in (0, \infty) \] The fundamental assumption is satisfied with respect to \(b\). Moreover, the mean and variance of the gamma distribution are \(k b\) and \(k b^2\), respectively.
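As a sketch of what the general theory gives in this case: \(\ln\left[g_b(x)\right] = -\ln\left[\Gamma(k)\right] - k \ln b + (k - 1) \ln x - x / b\), so \(l(x, b) = -k / b + x / b^2 = (x - k b) / b^2\) and \(\E_b\left[l^2(X, b)\right] = \var_b(X) / b^4 = k / b^2\). The Cramér-Rao lower bound for unbiased estimators of \(b\) is therefore \(b^2 / (n k)\). The estimator \(M / k\) is unbiased for \(b\) with variance \(\var_b(M) / k^2 = b^2 / (n k)\), so it attains the bound and is an UMVUE of \(b\).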
Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the beta distribution with left parameter \(a \gt 0\) and right parameter \(b = 1\). Beta distributions are widely used to model random proportions and other random variables that take values in bounded intervals. In our specialized case, the probability density function of the sampling distribution is \[ g_a(x) = a \, x^{a-1}, \quad x \in (0, 1) \] The fundamental assumption is satisfied with respect to \(a\).
The mean and variance of the distribution are \[ \mu = \frac{a}{a + 1}, \quad \sigma^2 = \frac{a}{(a + 1)^2 (a + 2)} \]
The Cramér-Rao lower bound for the variance of unbiased estimators of \(\mu\) is \(\frac{a^2}{n \, (a + 1)^4}\).
The sample mean \(M\) does not achieve the Cramér-Rao lower bound, and hence is not an UMVUE of \(\mu\).
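As a quick check of this claim: \(\var_a(M) = \sigma^2 / n = a \big/ \left[n (a + 1)^2 (a + 2)\right]\), and the ratio of this variance to the lower bound \(a^2 \big/ \left[n (a + 1)^4\right]\) is \((a + 1)^2 \big/ \left[a (a + 2)\right] \gt 1\) for every \(a \gt 0\).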
Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the uniform distribution on \([0, a]\) where \(a \gt 0\) is the unknown parameter. Thus, the probability density function of the sampling distribution is \[ g_a(x) = \frac{1}{a}, \quad x \in [0, a] \]
The Cramér-Rao lower bound for the variance of unbiased estimators of \(a\) is \(\frac{a^2}{n}\). Of course, the Cramér-Rao theorem does not actually apply here, since the fundamental assumption is not satisfied.
From the section on maximum likelihood estimators, recall that \(V = \frac{n+1}{n} \max\{X_1, X_2, \ldots, X_n\}\) is unbiased and has variance \(\frac{a^2}{n (n + 2)}\). This variance is smaller than the formal Cramér-Rao bound above.
The reason that the fundamental assumption is not satisfied is that the support set \(\left\{x \in \R: g_a(x) \gt 0\right\}\) depends on the parameter \(a\).
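A minimal simulation sketch illustrating this, with assumed values \(a = 3\) and \(n = 10\):

```python
# Simulation check that V = (n+1)/n * max(X_1, ..., X_n) is unbiased for a,
# with variance a^2/(n(n+2)), below the formal bound a^2/n (assumed a, n).
import numpy as np

rng = np.random.default_rng(2)
a, n, reps = 3.0, 10, 200_000
x = rng.uniform(0, a, size=(reps, n))
V = (n + 1) / n * x.max(axis=1)

print(V.mean())                                   # approx a = 3
print(V.var(), a**2 / (n * (n + 2)), a**2 / n)    # approx 0.075, 0.075, 0.9
```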
We now consider a somewhat specialized problem, but one that fits the general theme of this section. Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a sequence of observable real-valued random variables that are uncorrelated and have the same unknown mean \(\mu \in \R\), but possibly different standard deviations. Let \(\bs{\sigma} = (\sigma_1, \sigma_2, \ldots, \sigma_n)\) where \(\sigma_i = \sd(X_i)\) for \(i \in \{1, 2, \ldots, n\}\).
We will consider estimators of \(\mu\) that are linear functions of the outcome variables. Specifically, we will consider estimators of the following form, where the vector of coefficients \(\bs{c} = (c_1, c_2, \ldots, c_n)\) is to be determined: \[ Y = \sum_{i=1}^n c_i X_i \]
\(Y\) is unbiased if and only if \(\sum_{i=1}^n c_i = 1\).
The variance of \(Y\) is \[ \var(Y) = \sum_{i=1}^n c_i^2 \sigma_i^2 \]
The variance is minimized, subject to the unbiased constraint, when \[ c_j = \frac{1 / \sigma_j^2}{\sum_{i=1}^n 1 / \sigma_i^2}, \quad j \in \{1, 2, \ldots, n\} \]
Use the method of Lagrange multipliers (named after Joseph-Louis Lagrange).
The previous exercise shows how to construct the Best Linear Unbiased Estimator (BLUE) of \(\mu\), assuming that the vector of standard deviations \(\bs{\sigma}\) is known.
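A minimal numerical sketch of the construction, with an assumed (hypothetical) vector of standard deviations:

```python
# Constructing the BLUE weights c_j = (1/sigma_j^2) / sum_i(1/sigma_i^2) and
# comparing the resulting variance with that of the equally weighted mean,
# for an assumed vector of standard deviations.
import numpy as np

sigma = np.array([1.0, 2.0, 4.0])            # assumed known standard deviations
n = len(sigma)

c = (1 / sigma**2) / np.sum(1 / sigma**2)    # BLUE weights; they sum to 1
var_blue = np.sum(c**2 * sigma**2)           # equals 1 / sum_i(1/sigma_i^2)
var_mean = np.sum(sigma**2) / n**2           # variance of the ordinary mean

print(c)                                     # approx [0.762, 0.190, 0.048]
print(var_blue, var_mean)                    # approx 0.762 vs 2.333
```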
Suppose now that \(\sigma_i = \sigma\) for \(i \in \{1, 2, \ldots, n\}\) so that the outcome variables have the same standard deviation. In particular, this would be the case if the outcome variables form a random sample of size \(n\) from a distribution with mean \(\mu\) and standard deviation \(\sigma\).
In this case the variance is minimized when \(c_i = 1 / n\) for each \(i\) and hence \(Y = M\), the sample mean.
The previous exercise shows that the sample mean \(M\) is the best linear unbiased estimator of \(\mu\) when the standard deviations are the same, and moreover, that we do not need to know the value of the standard deviation.