\(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\Z}{\mathbb{Z}}\) \(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\bs}{\boldsymbol}\)

5. Bayesian Set Estimation

Basic Theory

As usual, our starting point is a random experiment with an underlying sample space and a probability measure \(\P\). In the basic statistical model, we have an observable random variable \(\bs{X}\) taking values in a set \(S\). In general, \(\bs{X}\) can have quite a complicated structure. For example, if the experiment is to sample \(n\) objects from a population and record various measurements of interest, then \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th object.

Suppose also that the distribution of \(\bs{X}\) depends on a parameter \(\theta\) with values in a set \(T\). The parameter may also be vector valued, in which case \(T \subseteq \R^k\) for some \(k \in \N_+\) and the parameter has the form \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\).

The Bayesian Formulation

Recall that in Bayesian analysis, named for the famous Thomas Bayes, the unknown parameter \(\theta\) is treated as the observed value of a random variable \(\Theta\) with values in \(T\). Here is a brief review:

The Bayesian formulation

  1. The conditional probability density function of the data vector \(\bs{X}\) given \(\Theta = \theta \in T\) is denoted \( f(\bs{x} \mid \theta) \) for \( \bs{x} \in S \).
  2. The random parameter \(\Theta\) is given a prior distribution with probability density function \(h\) on \(T\).
  3. The joint probability density function of \((\bs X, \Theta)\) is \((\bs{x}, \theta) \mapsto h(\theta) f(\bs{x} \mid \theta)\) for \((\bs{x}, \theta) \in S \times T \).
  4. The (unconditional) probability density function of \(\bs{X}\) is the function \(f\) given by \(f(\bs{x}) = \sum_{\theta \in T} h(\theta) f(\bs{x} \mid \theta)\) for \(\bs{x} \in S\) if \(\Theta\) has a discrete distribution, or by \(f(\bs{x}) = \int_T h(\theta) f(\bs{x} \mid \theta) \, d\theta,\) for \(\bs{x} \in S\) if \(\Theta\) has a continuous distribution.
  5. By Bayes' theorem, the posterior probability density function of \(\Theta\) given \(\bs{X} = \bs{x} \in S\) is \[ h(\theta \mid \bs{x}) = \frac{h(\theta) f(\bs{x} \mid \theta)}{f(\bs{x})}, \quad \theta \in T \]

The prior distribution is often subjective, and is chosen to reflect our knowledge, if any, of the parameter. In some cases, we can recognize the posterior distribution from the functional form of \(\theta \mapsto h(\theta) f(\bs{x} \mid \theta)\) without having to actually compute the normalizing constant \(f(\bs{x})\), which reduces the computational burden significantly. In particular, this is often the case when we have a conjugate parametric family of distributions of \(\Theta\). Recall that this means that when the prior distribution of \(\Theta\) belongs to the family, so does the posterior distribution of \(\Theta\) given \(\bs{X} = \bs{x} \in S\).

The most important special case arises when we have a basic variable \(X\) with values in a set \(R\), and given \(\Theta = \theta \in T\), the data vector \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from \(X\). That is, given \(\Theta = \theta \in T\), \(\bs{X}\) is a sequence of independent, identically distributed variables, each with the same distribution as \(X\) given \(\Theta = \theta\). Thus \(S = R^n\) and if \(X\) has conditional probability density function \(g(x \mid \theta)\), then \[f(\bs{x} \mid \theta) = g(x_1 \mid \theta) g(x_2 \mid \theta) \cdots g(x_n \mid \theta), \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in S\]
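To make the recipe above concrete, here is a minimal numerical sketch (assuming NumPy is available) that approximates the posterior density on a grid. The uniform prior and Bernoulli likelihood are illustrative choices only, not part of the general theory.

```python
import numpy as np

# Grid over the parameter set T = (0, 1)
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)                # h(theta): uniform prior

# Observed i.i.d. Bernoulli sample x = (x_1, ..., x_n)
x = np.array([1, 0, 1, 1, 0])
# f(x | theta) = g(x_1 | theta) * ... * g(x_n | theta)
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

joint = prior * likelihood                 # numerator h(theta) f(x | theta)
f_x = joint.sum() * (theta[1] - theta[0])  # normalizing constant f(x)
posterior = joint / f_x                    # h(theta | x) by Bayes' theorem
```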

Confidence Sets

Now let \(C(\bs{X})\) be a confidence set (that is, a subset of the parameter set \(T\) that depends on the data variable \(\bs{X}\), but not on any unknown parameters).

One possible definition of a \(1 - \alpha\) level Bayesian confidence set requires that \[ \P\left[\Theta \in C(\bs{x}) \mid \bs{X} = \bs{x}\right] = 1 - \alpha \]

In the first definition, only \(\Theta\) is random and thus the probability above is computed using the posterior probability density function \(\theta \mapsto h(\theta \mid \bs{x})\).

Another possible definition requires that \[ \P\left[\Theta \in C(\bs{X})\right] = 1 - \alpha \]

In the second definition, \(\bs{X}\) and \(\Theta\) are both random, and so the probability above would be computed using the joint probability density function \((\bs{x}, \theta) \mapsto h(\theta) f(\bs{x} \mid \theta)\). Whatever the philosophical arguments may be, the first definition is certainly the easier one from a computational viewpoint, and hence is the one most commonly used.

Let us compare the classical and Bayesian approaches. In the classical approach, the parameter \(\theta\) is deterministic, but unknown. Before the data are collected, the confidence set \(C(\bs{X})\) (which is random by virtue of \(\bs{X}\)) will contain the parameter with probability \(1 - \alpha\). After the data are collected, the computed confidence set \(C(\bs{x})\) either contains \(\theta\) or does not, and typically we will never know which. By contrast, in the Bayesian approach, the random parameter \(\Theta\) falls in the computed, deterministic confidence set \(C(\bs{x})\) with probability \(1 - \alpha\).

Suppose that \(\Theta\) is real valued, so that \(T \subseteq \R\). For \(r \in (0, 1)\), a \(1 - \alpha\) level Bayesian confidence interval is \(\left[U_{(1 - r) \alpha}(\bs{x}), U_{1 - r \alpha}(\bs{x})\right]\) where \(U_p(\bs{x})\) is the quantile of order \(p\) for the posterior distribution of \(\Theta\) given \(\bs{X} = \bs{x}\).

As in past sections, \(r\) is the fraction of \(\alpha\) in the right tail of the posterior distribution and \(1 - r\) is the fraction of \(\alpha\) in the left tail of the posterior distribution. As usual, \(r = \frac{1}{2}\) gives the symmetric, two-sided confidence interval; letting \(r \to 0\) gives the confidence lower bound; and letting \(r \to 1\) gives the confidence upper bound.
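As a sketch of this construction (assuming SciPy is available; `credible_interval` is a hypothetical helper, not a library function), the interval comes directly from the posterior quantile function:

```python
from scipy import stats

def credible_interval(posterior, alpha=0.05, r=0.5):
    """1 - alpha level Bayesian confidence interval from posterior quantiles.

    posterior is any scipy.stats frozen distribution for Theta given X = x;
    r is the fraction of alpha placed in the right tail, as above.
    """
    return posterior.ppf((1 - r) * alpha), posterior.ppf(1 - r * alpha)

# Symmetric two-sided interval for a standard normal posterior (illustration)
print(credible_interval(stats.norm(0, 1)))  # approximately (-1.96, 1.96)
```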

Applications

The Bernoulli Distribution

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with unknown success parameter \(p \in (0, 1)\). In the usual language of reliability, \(X_i = 1\) means success on trial \(i\) and \(X_i = 0\) means failure on trial \(i\). The distribution is named for Jacob Bernoulli. Recall that the Bernoulli distribution has probability density function (given \(p\)) \[ g(x \mid p) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\} \] Note that the number of successes in the \(n\) trials is \(Y = \sum_{i=1}^n X_i\). Given \(p\), random variable \(Y\) has the binomial distribution with parameters \(n\) and \(p\).

In our previous discussion of Bayesian estimation, we modeled the parameter \(p\) with a random variable \(P\) that has a beta distribution. This family of distributions is conjugate for \(P\). Specifically, if the prior distribution of \(P\) is beta with left parameter \(a \gt 0\) and right parameter \(b \gt 0\), then the posterior distribution of \(P\) given \(\bs{X}\) is beta with left parameter \(a + Y\) and right parameter \(b + (n - Y)\); the left parameter is increased by the number of successes and the right parameter by the number of failures. It follows that a \(1 - \alpha\) level Bayesian confidence interval for \(p\) is \(\left[U_{\alpha/2}(y), U_{1-\alpha/2}(y)\right]\) where \(U_r(y)\) is the quantile of order \(r\) for the posterior beta distribution. In the special case \(a = b = 1\), the prior distribution is uniform on \((0, 1)\) and reflects a lack of previous knowledge about \(p\).

Suppose that we have a coin with an unknown probability \(p\) of heads, and that we give \(p\) the uniform prior, reflecting our lack of knowledge about \(p\). We then toss the coin 50 times, observing 30 heads.

  1. Find the posterior distribution of \(p\) given the data.
  2. Construct the 95% Bayesian confidence interval.
  3. Construct the classical Wald confidence interval at the 95% level.
Details:
  1. Beta with left parameter 31 and right parameter 21.
  2. \([0.461, 0.724]\)
  3. \([0.464, 0.736]\)
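A quick numerical check of this exercise, sketched with SciPy (an assumed dependency):

```python
from scipy import stats

# Uniform prior: a = b = 1; data: n = 50 tosses, y = 30 heads
a, b, n, y = 1, 1, 50, 30
posterior = stats.beta(a + y, b + (n - y))     # beta(31, 21)
print(posterior.ppf([0.025, 0.975]))           # approx [0.461, 0.724]

# Classical Wald interval for comparison
p_hat = y / n
half = stats.norm.ppf(0.975) * (p_hat * (1 - p_hat) / n) ** 0.5
print(p_hat - half, p_hat + half)              # approx 0.464, 0.736
```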

The Poisson Distribution

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Poisson distribution with parameter \(\lambda \in (0, \infty)\). Recall that the Poisson distribution is often used to model the number of random points in a region of time or space, particularly in the context of the Poisson process. The distribution is named for Simeon Poisson and given \(\lambda\), has probability density function \[ g(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \in \N \] As usual, we will denote the sum of the sample values by \(Y = \sum_{i=1}^n X_i\). Given \(\lambda\), random variable \(Y\) also has a Poisson distribution, but with parameter \(n \lambda\).

In our previous discussion of Bayesian estimation, we modeled \(\lambda\) with a random variable \(\Lambda\) that has a gamma distribution. This family of distributions is conjugate for \(\Lambda\). Specifically, if the prior distribution of \(\Lambda\) is gamma with shape parameter \(k \gt 0\) and rate parameter \(r \gt 0\) (so that the scale parameter is \(1 / r\)), then the posterior distribution of \(\Lambda\) given \(\bs{X}\) is gamma with shape parameter \(k + Y\) and rate parameter \(r + n\). It follows that a \(1 - \alpha\) level Bayesian confidence interval for \(\lambda\) is \(\left[U_{\alpha/2}(y), U_{1-\alpha/2}(y)\right]\) where \(U_p(y)\) is the quantile of order \(p\) for the posterior gamma distribution.

Consider the alpha emissions data, which we believe come from a Poisson distribution with unknown parameter \(\lambda\). Suppose that a priori, we believe that \(\lambda\) is about 5, so we give \(\lambda\) a prior gamma distribution with shape parameter \(5\) and rate parameter 1. (Thus the mean is 5 and the standard deviation \(\sqrt{5} \approx 2.236\).)

  1. Find the posterior distribution of \(\lambda\) given the data.
  2. Construct the 95% Bayesian confidence interval.
  3. Construct the classical \(t\) confidence interval at the 95% level.
Details:
  1. Gamma with shape parameter 10104 and rate parameter 1208.
  2. \((8.202, 8.528)\)
  3. \((8.324, 8.410)\)
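A numerical check with SciPy (assumed available); the sample size \(n = 1207\) and total count \(y = 10099\) are inferred from the posterior parameters given above:

```python
from scipy import stats

# Prior gamma(shape 5, rate 1); n and y inferred from the posterior above
k, r, n, y = 5, 1, 1207, 10099
posterior = stats.gamma(k + y, scale=1 / (r + n))  # shape 10104, rate 1208
print(posterior.ppf([0.025, 0.975]))               # approx [8.202, 8.528]
```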

The Normal Distribution

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the normal distribution with unknown mean \(\mu \in \R\) and known variance \(\sigma^2 \in (0, \infty)\). Of course, the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors. Recall that the normal probability density function (given \(\mu\)) is \[ g(x \mid \mu) = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2 \right], \quad x \in \R \] We denote the sum of the sample values by \(Y = \sum_{i=1}^n X_i\). Recall that \(Y\) also has a normal distribution (given \(\mu\)), but with mean \(n \mu\) and variance \(n \sigma^2\).

In our previous discussion of Bayesian estimation, we modeled \(\mu\) with a random variable \(\Psi\) that also has a normal distribution. This family is conjugate for \(\Psi\) (with \(\sigma\) known). Specifically, if the prior distribution of \(\Psi\) is normal with mean \(a \in \R\) and standard deviation \(b \in (0, \infty)\), then the posterior distribution of \(\Psi\) given \(\bs{X}\) is also normal, with \[\E(\Psi \mid \bs{X}) = \frac{Y b^2 + a \sigma^2}{\sigma^2 + n b^2}, \quad \var(\Psi \mid \bs{X}) = \frac{\sigma^2 b^2}{\sigma^2 + n b^2}\] It follows that a \(1 - \alpha\) level Bayesian confidence interval for \(\mu\) is \(\left[U_{\alpha/2}(y), U_{1-\alpha/2}(y)\right]\) where \(U_p(y)\) is the quantile of order \(p\) for the posterior normal distribution. An interesting special case is when \(b = \sigma\), so that the standard deviation of the prior distribution of \(\Psi\) is the same as the standard deviation of the sampling distribution. In this case, the posterior mean is \((Y + a) \big/ (n + 1)\) and the posterior variance is \(\sigma^2 \big/ (n + 1)\).

The length of a certain machined part is supposed to be 10 centimeters, but due to imperfections in the manufacturing process, the actual length is normally distributed with mean \(\mu\) and variance \(\sigma^2\). The variance is due to inherent factors in the process, which remain fairly stable over time. From historical data, it is known that \(\sigma = 0.3\). On the other hand, \(\mu\) may be set by adjusting various parameters in the process and hence may change to an unknown value fairly frequently. Thus, suppose that we give \(\mu\) a prior normal distribution with mean 10 and standard deviation 0.3. A sample of 100 parts has mean 10.2.

  1. Find the posterior distribution of \(\mu\) given the data.
  2. Construct the 95% Bayesian confidence interval.
  3. Construct the classical \(z\) confidence interval at the 95% level.
Details:
  1. Normal with mean 10.198 and standard deviation 0.0299.
  2. \((10.14, 10.26)\)
  3. \((10.14, 10.26)\)
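A numerical check with SciPy (assumed available), applying the posterior mean and variance formulas above:

```python
from scipy import stats

# sigma = 0.3 known; prior normal(a = 10, b = 0.3); n = 100, sample mean 10.2
sigma, a, b, n = 0.3, 10.0, 0.3, 100
y = n * 10.2                                     # sum of the sample values
post_mean = (y * b**2 + a * sigma**2) / (sigma**2 + n * b**2)  # 10.198
post_sd = (sigma**2 * b**2 / (sigma**2 + n * b**2)) ** 0.5     # 0.0299
posterior = stats.norm(post_mean, post_sd)
print(posterior.ppf([0.025, 0.975]))             # approx [10.14, 10.26]
```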

The Beta Distribution

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the beta distribution with unknown left shape parameter \(a \in (0, \infty)\) and right shape parameter \(b = 1\). The beta distribution is widely used to model random proportions and probabilities and other variables that take values in bounded intervals. Recall that the probability density function (given \(a\)) is \[ g(x \mid a) = a x^{a-1}, \quad x \in (0, 1) \] We denote the product of the sample values by \(W = X_1 X_2 \cdots X_n\).

In our previous discussion of Bayesian estimation, we modeled \(a\) with a random variable \(A\) that has a gamma distribution. This family of distributions is conjugate for \(A\). Specifically, if the prior distribution of \(A\) is gamma with shape parameter \(k \gt 0\) and rate parameter \(r \gt 0\), then the posterior distribution of \(A\) given \(\bs{X}\) is also gamma, with shape parameter \(k + n\) and rate parameter \(r - \ln(W)\). It follows that a \(1 - \alpha\) level Bayesian confidence interval for \(a\) is \(\left[U_{\alpha/2}(w), U_{1-\alpha/2}(w)\right]\) where \(U_p(w)\) is the quantile of order \(p\) for the posterior gamma distribution. In the special case that \(k = 1\), the prior distribution of \(A\) is exponential with rate parameter \(r\).

Suppose that the resistance of an electrical component (in Ohms) has the beta distribution with unknown left parameter \(a\) and right parameter \(b = 1\). We believe that \(a\) may be about 10, so we give \(a\) the prior gamma distribution with shape parameter 10 and rate parameter 1. We sample 20 components and observe the data \[0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97, 0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00\]

  1. Find the posterior distribution of \(a\).
  2. Construct the 95% Bayesian confidence interval for \(a\).
Details:
  1. Gamma with shape parameter 30 and rate parameter 2.424.
  2. \((8.349, 17.180)\)
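A numerical check with SciPy and NumPy (assumed available), computing \(\ln(W)\) from the data:

```python
import numpy as np
from scipy import stats

data = np.array([0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97,
                 0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00])
k, r, n = 10, 1, len(data)
ln_w = np.log(data).sum()                             # ln(W), about -1.424
posterior = stats.gamma(k + n, scale=1 / (r - ln_w))  # shape 30, rate 2.424
print(posterior.ppf([0.025, 0.975]))                  # approx [8.349, 17.180]
```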

The Pareto Distribution

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Pareto distribution with shape parameter \(a \in (0, \infty)\) and scale parameter \(b = 1\). The Pareto distribution is used to model certain financial variables and other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given \(a\)) is \[ g(x \mid a) = \frac{a}{x^{a+1}}, \quad x \in [1, \infty) \] We denote the product of the sample values by \(W = X_1 X_2 \cdots X_n\).

In our previous discussion of Bayesian estimation, we modeled \(a\) with a random variable \(A\) that has a gamma distribution. This family of distributions is conjugate for \(A\). Specifically, if the prior distribution of \(A\) is gamma with shape parameter \(k \gt 0\) and rate parameter \(r \gt 0\), then the posterior distribution of \(A\) given \(\bs{X}\) is also gamma, with shape parameter \(k + n\) and rate parameter \(r + \ln(W)\). It follows that a \(1 - \alpha\) level Bayesian confidence interval for \(a\) is \(\left[U_{\alpha/2}(w), U_{1-\alpha/2}(w)\right]\) where \(U_p(w)\) is the quantile of order \(p\) for the posterior gamma distribution. In the special case that \(k = 1\), the prior distribution of \(A\) is exponential with rate parameter \(r\).

Suppose that a financial variable has the Pareto distribution with unknown shape parameter \(a\) and scale parameter \(b = 1\). We believe that \(a\) may be about 4, so we give \(a\) the prior gamma distribution with shape parameter 4 and rate parameter 1. A random sample of size 20 from the variable gives the data \[1.09, 1.13, 2.00, 1.43, 1.26, 1.00, 1.36, 1.03, 1.46, 1.18, 2.16, 1.16, 1.22, 1.06, 1.28, 1.23, 1.11, 1.03, 1.04, 1.05\]

  1. Find the posterior distribution of \(a\).
  2. Construct the 95% Bayesian confidence interval for \(a\).
Details:
  1. Gamma with shape parameter 24 and rate parameter 5.223.
  2. \((2.944, 6.608)\)
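The same check for the Pareto model, again sketched with SciPy and NumPy (assumed available):

```python
import numpy as np
from scipy import stats

data = np.array([1.09, 1.13, 2.00, 1.43, 1.26, 1.00, 1.36, 1.03, 1.46, 1.18,
                 2.16, 1.16, 1.22, 1.06, 1.28, 1.23, 1.11, 1.03, 1.04, 1.05])
k, r, n = 4, 1, len(data)
ln_w = np.log(data).sum()                             # ln(W), about 4.223
posterior = stats.gamma(k + n, scale=1 / (r + ln_w))  # shape 24, rate 5.223
print(posterior.ppf([0.025, 0.975]))                  # approx [2.944, 6.608]
```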