\(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\)

11. Vector Spaces of Random Variables

Basic Theory

Many of the concepts in this chapter have elegant interpretations if we think of real-valued random variables as vectors in a vector space. In particular, variance is related to the concepts of norm and distance, while covariance is related to inner product. These connections can help unify and illuminate some of the ideas in the chapter from a different point of view. Of course, real-valued random variables are simply measurable, real-valued functions defined on the sample space, so much of this section is a special case of the more general theory of function spaces.

As usual, our starting point is a random experiment modeled by a probability space \( (\Omega, \mathscr{F}, \P) \), so that \( \Omega \) is the set of outcomes, \( \mathscr{F} \) is the \( \sigma \)-algebra of events, and \( \P \) is the probability measure on the sample space \( (\Omega, \mathscr F) \). Our basic vector space \(\mathscr V\) consists of all real-valued random variables defined on \((\Omega, \mathscr{F}, \P)\) (that is, defined for the experiment). Recall that random variables \( X_1 \) and \( X_2 \) are equivalent if \( \P(X_1 = X_2) = 1 \), in which case we write \( X_1 \equiv X_2 \). We consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number. These operations are compatible with the equivalence relation in the sense that if \( X_1 \equiv X_2 \) and \( Y_1 \equiv Y_2 \) then \( X_1 + Y_1 \equiv X_2 + Y_2 \) and \( c X_1 \equiv c X_2 \) for \( c \in \R \). In short, the vector space \( \mathscr V \) is well-defined.

Norm

Suppose that \( k \in [1, \infty) \). The \( k \) norm of \( X \in \mathscr V \) is defined by

\[ \|X\|_k = \left[\E\left(\left|X\right|^k\right)\right]^{1 / k} \]

So \(\|X\|_k\) is a measure of the size of \(X\) in a certain sense, and of course it's possible that \( \|X\|_k = \infty \). The following theorems establish the fundamental properties. The first is the positive property.
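
Numerically, the \(k\) norm can be approximated by replacing the expected value with a sample average. The following is a minimal simulation sketch in Python with NumPy (the uniform distribution and the sample size are arbitrary choices, not part of the formal development); the closed form \((k + 1)^{-1/k}\) for the uniform case is derived in an exercise below.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=1_000_000)   # sample from X uniformly distributed on [0, 1]

    def k_norm(sample, k):
        # Monte Carlo estimate of ||X||_k = [E(|X|^k)]^(1/k)
        return np.mean(np.abs(sample) ** k) ** (1.0 / k)

    for k in (1, 2, 3):
        print(k, k_norm(x, k), (k + 1) ** (-1.0 / k))   # estimate vs. exact value for this case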

Suppose again that \( k \in [1, \infty) \). For \( X \in \mathscr V \),

  1. \(\|X\|_k \ge 0\)
  2. \(\|X\|_k = 0\) if and only if \(\P(X = 0) = 1\) (so that \(X \equiv 0\)).
Details:

These results follow from the basic inequality properties of expected value. First \( \left|X\right|^k \ge 0 \) with probability 1, so \( \E\left(\left|X\right|^k\right) \ge 0 \). In addition, \( \E\left(\left|X\right|^k\right) = 0 \) if and only if \( \P(X = 0) = 1 \).

The next result is the scaling property.

Suppose again that \( k \in [1, \infty) \). Then \(\|c X\|_k = \left|c\right| \, \|X\|_k\) for \( X \in \mathscr V \) and \(c \in \R\).

Details: \[ \| c X \|_k = \left[\E\left(\left|c X\right|^k\right)\right]^{1 / k} = \left[\E\left(\left|c\right|^k \left|X\right|^k\right)\right]^{1/k} = \left[\left|c\right|^k \E\left(\left|X\right|^k\right)\right]^{1/k} = \left|c\right| \left[\E\left(\left|X\right|^k\right)\right]^{1/k} = \left|c\right| \|X\|_k \]

The next result is Minkowski's inequality, named for Hermann Minkowski, and also known as the triangle inequality.

Suppose again that \( k \in [1, \infty) \). Then \(\|X + Y\|_k \le \|X\|_k + \|Y\|_k\) for \( X, \, Y \in \mathscr V \).

Details:

The first quadrant \(S = \left\{(x, y) \in \R^2: x \ge 0, \; y \ge 0\right\}\) is a convex set and \(g(x, y) = \left(x^{1/k} + y^{1/k}\right)^k\) is concave on \(S\). From Jensen's inequality, if \(U\) and \(V\) are nonnegative random variables, then \[ \E\left[(U^{1/k} + V^{1/k})^k\right] \le \left(\left[\E(U)\right]^{1/k} + \left[\E(V)\right]^{1/k}\right)^k \] Letting \(U = \left|X\right|^k\) and \(V = \left|Y\right|^k\) and simplifying gives the result. To show that \( g \) really is concave on \( S \), we can compute the second partial derivatives. Let \( h(x, y) = x^{1/k} + y^{1/k} \) so that \( g = h^k \). Then \begin{align} g_{xx} & = \frac{k-1}{k} h^{k-2} x^{1/k - 2}\left(x^{1/k} - h\right) \\ g_{yy} & = \frac{k-1}{k} h^{k-2} y^{1/k - 2}\left(y^{1/k} - h\right) \\ g_{xy} & = \frac{k-1}{k} h^{k-2} x^{1/k - 1} y^{1/k - 1} \end{align} Clearly \( h(x, y) \ge x^{1/k} \) and \( h(x, y) \ge y^{1/k} \) for \( x \ge 0 \) and \( y \ge 0 \), so \( g_{xx} \) and \( g_{yy} \), the diagonal entries of the second derivative matrix, are nonpositive on \( S \). A little algebra shows that the determinant of the second derivative matrix \( g_{xx} g_{yy} - g_{xy}^2 = 0\) on \( S \). Thus, the second derivative matrix of \( g \) is negative semi-definite.
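
Minkowski's inequality is also easy to check numerically. Here is a rough Monte Carlo sketch (the exponential and normal distributions are arbitrary choices with finite norms of every order):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(1.0, size=500_000)
    y = rng.normal(0.0, 2.0, size=500_000)

    def k_norm(sample, k):
        return np.mean(np.abs(sample) ** k) ** (1.0 / k)

    for k in (1, 2, 3):
        lhs, rhs = k_norm(x + y, k), k_norm(x, k) + k_norm(y, k)
        print(k, lhs <= rhs, lhs, rhs)   # triangle inequality, up to simulation error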

It follows from the last three results that the set of random variables (again, modulo equivalence) with finite \(k\) norm forms a subspace of our parent vector space \(\mathscr V\), and that the \(k\) norm really is a norm on this vector space.

For \( k \in [1, \infty) \), \( \mathscr L_k \) denotes the vector space of \( X \in \mathscr V \) with \(\|X\|_k \lt \infty\), and with norm \( \| \cdot \|_k \).

In analysis, \( p \) is often used as the index rather than \( k \) as we have used here, but \( p \) seems too much like a probability, so we have broken with tradition on this point. The \( \mathscr L \) is in honor of Henri Lebesgue, who developed much of this theory. Sometimes, when we need to indicate the dependence on the underlying \( \sigma \)-algebra \( \mathscr{F} \), we write \( \mathscr L_k(\mathscr{F}) \). Our next result is Lyapunov's inequality, named for Aleksandr Lyapunov. This inequality shows that the \(k\)-norm of a random variable is increasing in \(k\).

Suppose that \( j, \, k \in [1, \infty) \) with \(j \le k\). Then \(\|X\|_j \le \|X\|_k\) for \(X \in \mathscr V\).

Details:

Note that \(S = \{x \in \R: x \ge 0\}\) is convex and \(g(x) = x^{k/j}\) is convex on \(S\). From Jensen's inequality if \(U\) is a nonnegative random variable then \(\left[\E(U)\right]^{k/j} \le \E\left(U^{k/j}\right)\). Letting \(U = \left|X\right|^j\) and simplifying gives the result.

Lyapunov's inequality shows that if \(1 \le j \le k\) and \( \|X\|_k \lt \infty \) then \( \|X\|_j \lt \infty \). Thus, \(\mathscr L_k\) is a subspace of \(\mathscr L_j\).
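
The monotonicity of the norms in \(k\) can also be seen in simulation (again a sketch, with an arbitrary choice of distribution):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(1.0, size=500_000)   # E(X^k) = k!, so ||X||_k = (k!)^(1/k)

    norms = [np.mean(np.abs(x) ** k) ** (1.0 / k) for k in (1, 2, 3, 4)]
    print(norms)                         # estimates of ||X||_1, ..., ||X||_4
    print(np.all(np.diff(norms) >= 0))   # nondecreasing in k, as Lyapunov's inequality asserts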

Metric

The \(k\) norm, like any norm on a vector space, can be used to define a metric; we simply compute the norm of the difference between two vectors.

For \( k \in [1, \infty) \), the \(k\) distance (or \(k\) metric) between \(X, \, Y \in \mathscr V\) is defined by \[ d_k(X, Y) = \|X - Y\|_k = \left[\E\left(\left|X - Y\right|^k\right)\right]^{1/k} \]

The following properties are analogous to the norm properties above (and thus very little additional work is required for the proofs). These properties show that the \(k\) metric really is a metric on \( \mathscr L_k \) (as always, modulo equivalence). The first is the positive property.

Suppose again that \( k \in [1, \infty) \) and \(X, \; Y \in \mathscr V\). Then

  1. \(d_k(X, Y) \ge 0\)
  2. \(d_k(X, Y) = 0\) if and only if \(\P(X = Y) = 1\) (so that \(X \equiv Y\)).
Details:

These results follow directly from the positive property of the \(k\) norm.

Next is the obvious symmetry property:

\( d_k(X, Y) = d_k(Y, X) \) for \( X, \; Y \in \mathscr V \).

Next is the distance version of the triangle inequality.

\(d_k(X, Z) \le d_k(X, Y) + d_k(Y, Z)\) for \(X, \; Y, \; Z \in \mathscr V\)

Details:

From Minkowski's inequality, \[ d_k(X, Z) = \|X - Z\|_k = \|(X - Y) + (Y - Z)\|_k \le \|X - Y\|_k + \|Y - Z\|_k = d_k(X, Y) + d_k(Y, Z) \]

The last three properties mean that \( d_k \) is indeed a metric on \( \mathscr L_k \) for \( k \ge 1 \). In particular, note that the standard deviation is simply the 2-distance from \(X\) to its mean \( \mu = \E(X) \): \[ \sd(X) = d_2(X, \mu) = \|X - \mu\|_2 = \sqrt{\E\left[(X - \mu)^2\right]} \] and the variance is the square of this. More generally, the \(k\)th moment of \(X\) about \(a\) is simply the \(k\)th power of the \(k\)-distance from \(X\) to \(a\). The 2-distance is especially important for reasons that will become clear in the discussion of inner product below. This distance is also called the root mean square distance.

Center and Spread Revisited

Measures of center and measures of spread are best thought of together, in the context of a measure of distance. For a real-valued random variable \(X\), we first try to find the constants \(t \in \R\) that are closest to \(X\), as measured by the given distance; any such \(t\) is a measure of center relative to the distance. The minimum distance itself is the corresponding measure of spread.

Let us apply this procedure to the 2-distance.

For \( X \in \mathscr L_2 \), define the root mean square error function by \[ d_2(X, t) = \|X - t\|_2 = \sqrt{\E\left[(X - t)^2\right]}, \quad t \in \R \]

For \( X \in \mathscr L_2 \), \(d_2(X, t)\) is minimized when \(t = \E(X)\) and the minimum value is \(\sd(X)\).

Details:

Note that the minimum value of \(d_2(X, t)\) occurs at the same points as the minimum value of \(d_2^2(X, t) = \E\left[(X - t)^2\right]\) (this is the mean square error function). Expanding and taking expected values term by term gives \[ \E\left[(X - t)^2\right] = \E\left(X^2\right) - 2 t \E(X) + t^2 \] This is a quadratic function of \( t \) and hence the graph is a parabola opening upward. The minimum occurs at \( t = \E(X) \), and the minimum value is \( \var(X) \). Hence the minimum value of \( t \mapsto d_2(X, t) \) also occurs at \( t = \E(X) \) and the minimum value is \( \sd(X) \).
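
The following simulation sketch (with an arbitrary gamma distribution) illustrates the result: minimizing the empirical root mean square error over a grid of values of \(t\) recovers the sample mean, and the minimum value is the sample standard deviation.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.gamma(2.0, 1.0, size=200_000)    # mean 2, standard deviation sqrt(2)

    ts = np.linspace(0.0, 6.0, 601)
    rmse = np.array([np.sqrt(np.mean((x - t) ** 2)) for t in ts])
    print(ts[np.argmin(rmse)], x.mean())     # minimizer is (approximately) the sample mean
    print(rmse.min(), x.std())               # minimum value is (approximately) the standard deviation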

We have seen this computation several times before. The best constant predictor of \( X \) is \( \E(X) \), with mean square error \( \var(X) \). The physical interpretation of this result is that the moment of inertia of the mass distribution of \(X\) about \(t\) is minimized when \(t = \mu\), the center of mass. Next, let us apply our procedure to the 1-distance.

For \( X \in \mathscr L_1 \), define the mean absolute error function by \[ d_1(X, t) = \|X - t\|_1 = \E\left[\left|X - t\right|\right], \quad t \in \R \]

We will show that \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\). (Recall that the set of medians of \( X \) forms a closed, bounded interval.) We start with a discrete case, because it's easier and has special interest.

Suppose that \(X \in \mathscr L_1\) has a discrete distribution with values in a finite set \(S \subseteq \R\). Then \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\).

Details:

Note first that \(\E\left(\left|X - t\right|\right) = \E(t - X, \, X \le t) + \E(X - t, \, X \gt t)\). Hence \(\E\left(\left|X - t\right|\right) = a_t \, t + b_t\), where \(a_t = 2 \, \P(X \le t) - 1\) and where \(b_t = \E(X) - 2 \, \E(X, \, X \le t)\). Note that \(\E\left(\left|X - t\right|\right)\) is a continuous, piecewise linear function of \(t\), with corners at the values in \(S\). That is, the function is a linear spline. Let \(m\) be the smallest median of \(X\). If \(t \lt m\) and \(t \notin S\), then the slope of the linear piece at \(t\) is negative. Let \(M\) be the largest median of \(X\). If \(t \gt M\) and \(t \notin S\), then the slope of the linear piece at \(t\) is positive. If \(t \in (m, M)\) then the slope of the linear piece at \(t\) is 0. Thus \(\E\left(\left|X - t\right|\right)\) is minimized for every \(t\) in the median interval \([m, M]\).

The theorem above shows that mean absolute error has a couple of basic deficiencies as a measure of error: the error function is not smooth (it is only piecewise linear), and the minimizing value of \(t\) need not be unique. Indeed, when \(X\) does not have a unique median, there is no compelling reason to choose one value in the median interval over any other value in the interval as the measure of center.

Suppose now that \(X \in \mathscr L_1 \) has a general distribution on \(\R\). Then \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\).

Details:

Let \( s, \, t \in \R \). Suppose first that \(s \lt t\). Computing the expected value over the events \(X \le s\), \(s \lt X \le t\), and \(X \ge t\), and simplifying gives \[ \E\left(\left|X - t\right|\right) = \E\left(\left|X - s\right|\right) + (t - s) \, \left[2 \, \P(X \le s) - 1\right] + 2 \, \E(t - X, \, s \lt X \le t) \] Suppose next that \(t \lt s\). Using similar methods gives \[ \E\left(\left|X - t\right|\right) = \E\left(\left|X - s\right|\right) + (t - s) \, \left[2 \, \P(X \lt s) - 1\right] + 2 \, \E(X - t, \, t \le X \lt s) \] Note that the last terms on the right in these equations are nonnegative. If we take \(s\) to be a median of \(X\), then the middle terms on the right in the equations are also nonnegative. Hence if \(s\) is a median of \(X\) and \(t\) is any other number then \(\E\left(\left|X - t\right|\right) \ge \E\left(\left|X - s\right|\right)\).
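
Here is the corresponding simulation sketch for the 1-distance (with an arbitrary exponential distribution, whose median is \(\ln 2\)): the empirical mean absolute error is minimized near the sample median.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(1.0, size=200_000)   # median is ln 2

    ts = np.linspace(0.0, 4.0, 401)
    mae = np.array([np.mean(np.abs(x - t)) for t in ts])
    print(ts[np.argmin(mae)], np.median(x), np.log(2.0))   # all three approximately equal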

Convergence

Whenever we have a measure of distance, we automatically have a criterion for convergence.

Suppose that \( X_n \in \mathscr L_k \) for \( n \in \N_+ \) and that \( X \in \mathscr L_k \), where \( k \in [1, \infty) \). Then \(X_n \to X\) as \(n \to \infty\) in \(k\)th mean if \( X_n \to X \) as \( n \to \infty \) in the vector space \( \mathscr L_k \). That is, \[ d_k(X_n, X) = \|X_n - X\|_k \to 0 \text{ as } n \to \infty \] or equivalently \( \E\left(\left|X_n - X\right|^k\right) \to 0\) as \(n \to \infty \).

When \(k = 1\), we simply say that \(X_n \to X\) as \(n \to \infty\) in mean; when \(k = 2\), we say that \(X_n \to X\) as \(n \to \infty\) in mean square. These are the most important special cases.
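
For a concrete (hypothetical) example, let \( X_n = X + Z_n / n \), where \( X \in \mathscr L_2 \) and the \( Z_n \) are standard normal and independent of \( X \); then \( \E\left(\left|X_n - X\right|^2\right) = 1/n^2 \to 0 \), so \( X_n \to X \) as \( n \to \infty \) in mean square. A quick simulation sketch:

    import numpy as np

    rng = np.random.default_rng(5)
    m = 200_000
    x = rng.uniform(0.0, 1.0, size=m)

    for n in (1, 10, 100):
        x_n = x + rng.normal(0.0, 1.0, size=m) / n          # X_n = X + Z_n / n
        print(n, np.mean((x_n - x) ** 2), 1.0 / n ** 2)     # estimate of E|X_n - X|^2 vs 1/n^2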

Suppose that \(1 \le j \le k\). If \(X_n \to X\) as \(n \to \infty\) in \(k\)th mean then \(X_n \to X\) as \(n \to \infty\) in \(j\)th mean.

Details:

This follows from Lyapunov's inequality: note that \( 0 \le d_j(X_n, X) \le d_k(X_n, X) \to 0 \) as \( n \to \infty \).

Convergence in \( k \)th mean implies that the \( k \) norms converge.

Suppose that \( X_n \in \mathscr L_k \) for \( n \in \N_+ \) and that \( X \in \mathscr L_k \), where \( k \in [1, \infty) \). If \( X_n \to X \) as \( n \to \infty \) in \( k \)th mean then \( \|X_n\|_k \to \|X\|_k \) as \( n \to \infty \). Equivalently, if \( \E(|X_n - X|^k) \to 0 \) as \( n \to \infty \) then \( \E(|X_n|^k) \to \E(|X|^k) \) as \( n \to \infty \).

Details:

This is a simple consequence of the reverse triangle inequality, which holds in any normed vector space: if a sequence of vectors in a normed vector space converges, then the norms converge. In our notation here, \[ \left|\|X_n\|_k - \|X\|_k\right| \le \|X_n - X\|_k \] so if the right side converges to 0 as \( n \to \infty \), then so does the left side.

The converse is not true; a counterexample is given in the counterexamples below. Our next result shows that convergence in mean is stronger than convergence in probability.

Suppose that \( X_n \in \mathscr L_1 \) for \( n \in \N_+ \) and that \( X \in \mathscr L_1 \). If \(X_n \to X\) as \(n \to \infty\) in mean, then \(X_n \to X\) as \(n \to \infty\) in probability.

Details:

This follows from Markov's inequality. For \( \epsilon \gt 0 \), \(0 \le \P\left(\left|X_n - X\right| \gt \epsilon\right) \le \E\left(\left|X_n - X\right|\right) \big/ \epsilon \to 0 \) as \( n \to \infty \).

The converse is not true. In fact, convergence with probability 1, which is stronger than convergence in probability, does not imply convergence in \(k\)th mean; a counterexample is given below. Also, convergence in \(k\)th mean does not imply convergence with probability 1; again, a counterexample is given below. In summary, convergence in \(k\)th mean implies convergence in \(j\)th mean for \(1 \le j \le k\), and convergence in mean and convergence with probability 1 each imply convergence in probability; no other implications hold in general.

However, for uniformly integrable variables, convergence in probability implies convergence in mean.

Inner Product

The vector space \( \mathscr L_2 \) of real-valued random variables on \( (\Omega, \mathscr{F}, \P) \) (modulo equivalence of course) with finite second moment is special, because it's the only one in which the norm corresponds to an inner product.

The inner product of \( X, \, Y \in \mathscr L_2 \) is defined by \[ \langle X, Y \rangle = \E(X Y) \]

The following results are analogous to the basic properties of covariance, and show that this definition really does give an inner product on the vector space \( \mathscr L_2 \).

For \( X, \, Y, \, Z \in \mathscr L_2 \) and \( a \in \R \),

  1. \(\langle X, Y \rangle = \langle Y, X \rangle\), the symmetric property.
  2. \(\langle X, X \rangle \ge 0\) and \(\langle X, X \rangle = 0\) if and only if \(\P(X = 0) = 1\) (so that \(X \equiv 0\)), the positive property.
  3. \(\langle a X, Y \rangle = a \langle X, Y \rangle\), the scaling property.
  4. \(\langle X + Y, Z \rangle = \langle X, Z \rangle + \langle Y, Z \rangle\), the additive property.
Details:
  1. This property is trivial from the definition.
  2. Note that \( \E(X^2) \ge 0 \) and \( \E(X^2) = 0 \) if and only if \( \P(X = 0) = 1 \).
  3. This follows from the scaling property of expected value: \( \E(a X Y) = a \E(X Y) \)
  4. This follows from the additive property of expected value: \( \E[(X + Y) Z] = \E(X Z) + \E(Y Z) \).

From parts (a), (c), and (d) it follows that the inner product is bi-linear, that is, linear in each variable with the other fixed. Of course bi-linearity holds for any inner product on a vector space. Covariance and correlation can easily be expressed in terms of this inner product. The covariance of two random variables is the inner product of the corresponding centered variables. The correlation of the two variables is the inner product of the corresponding standard scores.

For \( X, \, Y \in \mathscr L_2 \),

  1. \(\cov(X, Y) = \langle X - \E(X), Y - \E(Y) \rangle\)
  2. \(\cor(X, Y) = \left \langle [X - \E(X)] \big/ \sd(X), [Y - \E(Y)] / \sd(Y) \right \rangle\)
Details:
  1. This is simply a restatement of the definition of covariance.
  2. This is a restatement of the fact that the correlation of two variables is the covariance of their corresponding standard scores.

Thus, real-valued random variables \( X \) and \( Y \) are uncorrelated if and only if the centered variables \( X - \E(X) \) and \( Y - \E(Y) \) are perpendicular or orthogonal as elements of \( \mathscr L_2 \).
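
The following sketch (with an arbitrary correlated pair) checks these identities numerically, using sample averages in place of expected values:

    import numpy as np

    rng = np.random.default_rng(6)
    m = 500_000
    x = rng.normal(0.0, 1.0, size=m)
    y = 0.5 * x + rng.normal(0.0, 1.0, size=m)     # correlated with x by construction

    def inner(u, v):
        # sample version of <U, V> = E(U V)
        return np.mean(u * v)

    x0, y0 = x - x.mean(), y - y.mean()            # centered variables
    print(inner(x0, y0), np.cov(x, y, bias=True)[0, 1])                 # covariance as inner product
    print(inner(x0 / x.std(), y0 / y.std()), np.corrcoef(x, y)[0, 1])   # correlation likewise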

For \( X \in \mathscr L_2 \), \(\langle X, X \rangle = \|X\|_2^2 = \E\left(X^2\right)\).

So the norm associated with the inner product is the 2-norm studied above, and corresponds to the root mean square operation on a random variable. This fact is a fundamental reason why the 2-norm plays such a special, honored role; of all the \(k\)-norms, only the 2-norm corresponds to an inner product. In turn, this is one of the reasons that root mean square difference is of fundamental importance in probability and statistics. Technically, the vector space \( \mathscr L_2 \) is a Hilbert space, named for David Hilbert.

The next result is Hölder's inequality, named for Otto Hölder.

Suppose that \(j, \, k \in [1, \infty)\) and \(\frac{1}{j} + \frac{1}{k} = 1\). For \( X \in \mathscr L_j \) and \( Y \in \mathscr L_k \), \[\langle \left|X\right|, \left|Y\right| \rangle \le \|X\|_j \|Y\|_k \]

Details:

Note that \(S = \left\{(x, y) \in \R^2: x \ge 0, \; y \ge 0\right\}\) is a convex set and \(g(x, y) = x^{1/j} y^{1/k}\) is concave on \(S\). From Jensen's inequality, if \(U\) and \(V\) are nonnegative random variables then \(\E\left(U^{1/j} V^{1/k}\right) \le \left[\E(U)\right]^{1/j} \left[\E(V)\right]^{1/k}\). Substituting \(U = \left|X\right|^j\) and \(V = \left|Y\right|^k\) gives the result.

To show that \( g \) really is concave on \( S \), we compute the second derivative matrix: \[ \left[ \begin{matrix} (1 / j)(1 / j - 1) x^{1 / j - 2} y^{1 / k} & (1 / j)(1 / k) x^{1 / j - 1} y^{1 / k - 1} \\ (1 / j)(1 / k) x^{1 / j - 1} y^{1 / k - 1} & (1 / k)(1 / k - 1) x^{1 / j} y^{1 / k - 2} \end{matrix} \right] \] Since \( 1 / j \lt 1 \) and \( 1 / k \lt 1 \), the diagonal entries are negative on \( S \). The determinant simplifies to \[ (1 / j)(1 / k) x^{2 / j - 2} y^{2 / k - 2} [1 - (1 / j + 1 / k)] = 0 \]

In the context of the last theorem, \(j\) and \(k\) are called conjugate exponents. If we let \(j = k = 2\) in Hölder's inequality, then we get the Cauchy-Schwarz inequality, named for Augustin Cauchy and Karl Schwarz: For \( X, \, Y \in \mathscr L_2 \), \[ \E\left(\left|X\right| \left|Y\right|\right) \le \sqrt{\E\left(X^2\right)} \sqrt{\E\left(Y^2\right)} \] In turn, the Cauchy-Schwarz inequality is equivalent to the basic inequalities for covariance and correlations: For \( X, \, Y \in \mathscr L_2 \), \[ \left| \cov(X, Y) \right| \le \sd(X) \sd(Y), \quad \left|\cor(X, Y)\right| \le 1 \]
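
Hölder's inequality (and the Cauchy-Schwarz special case) can be checked by simulation as well; here is a sketch with an arbitrary independent pair:

    import numpy as np

    rng = np.random.default_rng(7)
    m = 500_000
    x = rng.exponential(1.0, size=m)
    y = rng.normal(0.0, 1.0, size=m)

    def k_norm(sample, k):
        return np.mean(np.abs(sample) ** k) ** (1.0 / k)

    for j, k in ((2.0, 2.0), (3.0, 1.5)):          # conjugate exponents: 1/j + 1/k = 1
        lhs = np.mean(np.abs(x) * np.abs(y))       # <|X|, |Y|> = E(|X| |Y|)
        rhs = k_norm(x, j) * k_norm(y, k)
        print(j, k, lhs <= rhs, lhs, rhs)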

If \(j, \, k \in [1, \infty)\) are conjugate exponents then

  1. \(k = \frac{j}{j - 1}\).
  2. \(k \downarrow 1\) as \(j \uparrow \infty\).

The next theorem is equivalent to the identity \( \var(X + Y) + \var(X - Y) = 2\left[\var(X) + \var(Y)\right] \) that we saw in the study of covariance. In the context of vector spaces, the result is known as the parallelogram rule:

If \(X, \, Y \in \mathscr L_2\) then \[ \|X + Y\|_2^2 + \|X - Y\|_2^2 = 2 \|X\|_2^2 + 2 \|Y\|_2^2\]

Details:

This result follows from the bi-linearity of inner product: \begin{align} \|X + Y\|_2^2 + \|X - Y\|_2^2 & = \langle X + Y, X + Y \rangle + \langle X - Y, X - Y\rangle \\ & = \left(\langle X, X \rangle + 2 \langle X, Y \rangle + \langle Y, Y \rangle\right) + \left(\langle X, X \rangle - 2 \langle X, Y \rangle + \langle Y, Y \rangle\right) = 2 \|X\|_2^2 + 2 \|Y\|_2^2 \end{align}

The next theorem is equivalent to the statement that the variance of the sum of uncorrelated variables is the sum of the variances, which again we proved in our study of covariance. In the context of vector spaces, the result is the famous Pythagorean theorem, named of course for Pythagoras.

If \((X_1, X_2, \ldots, X_n)\) is a sequence of random variables in \(\mathscr L_2\) with \(\langle X_i, X_j \rangle = 0\) for \(i \ne j\) then \[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \sum_{i=1}^n \|X_i\|_2^2 \]

Details:

Again, this follows from the bi-linearity of inner product: \[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \left\langle \sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right\rangle = \sum_{i=1}^n \sum_{j=1}^n \langle X_i, X_j \rangle \] The terms with \( i \ne j \) are 0 by the orthogonality assumption, so \[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \sum_{i=1}^n \langle X_i, X_i \rangle = \sum_{i=1}^n \|X_i\|_2^2 \]
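
A simulation sketch of the Pythagorean theorem, using independent variables with mean 0 (so that the inner products \( \E(X_i X_j) \) vanish for \( i \ne j \)):

    import numpy as np

    rng = np.random.default_rng(8)
    m = 1_000_000
    xs = [rng.normal(0.0, s, size=m) for s in (1.0, 2.0, 3.0)]   # independent, mean 0, hence orthogonal

    lhs = np.mean(np.sum(xs, axis=0) ** 2)        # ||X_1 + X_2 + X_3||_2^2
    rhs = sum(np.mean(x ** 2) for x in xs)        # ||X_1||_2^2 + ||X_2||_2^2 + ||X_3||_2^2
    print(lhs, rhs)                               # both approximately 1 + 4 + 9 = 14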

Projections

The best linear predictor has a nice interpretation in terms of projections onto subspaces of \( \mathscr L_2 \). First let's review the concepts. Recall that \( \mathscr U \) is a subspace of \( \mathscr L_2 \) if \( \mathscr U \subseteq \mathscr L_2 \) and \( \mathscr U \) is also a vector space (under the same operations of addition and scalar multiplication). To show that \( \mathscr U \subseteq \mathscr L_2 \) is a subspace, we just need to show the closure properties (the other axioms of a vector space are inherited).

Suppose now that \( \mathscr U \) is a subspace of \( \mathscr L_2 \) and that \( X \in \mathscr L_2 \). Then the projection of \( X \) onto \( \mathscr U \) (if it exists) is the vector \( V \in \mathscr U \) with the property that \( X - V \) is perpendicular to \( \mathscr U \): \[ \langle X - V, U \rangle = 0, \quad U \in \mathscr U \]

The projection has two critical properties: It is unique (if it exists) and it is the vector in \( \mathscr U \) closest to \( X \). If you look at the proofs of these results, you will see that they are essentially the same as the ones used for the best predictors of \( X \) mentioned at the beginning of this subsection. Moreover, the proofs use only vector space concepts—the fact that our vectors are random variables on a probability space plays no special role.

The projection of \( X \) onto \( \mathscr U \) (if it exists) is unique.

Details:

Suppose that \( V_1 \) and \( V_2 \) satisfy the definition. Then \[ \left\|V_1 - V_2\right\|_2^2 = \langle V_1 - V_2, V_1 - V_2 \rangle = \langle V_1 - X + X - V_2, V_1 - V_2 \rangle = \langle V_1 - X, V_1 - V_2 \rangle + \langle X - V_2, V_1 - V_2 \rangle = 0 \] Hence \( V_1 \equiv V_2 \). The last equality in the displayed equation holds by assumption and the fact that \( V_1 - V_2 \in \mathscr U \).

Suppose that \( V \) is the projection of \( X \) onto \( \mathscr U \). Then

  1. \( \left\|X - V\right\|_2^2 \le \left\|X - U\right\|_2^2\) for all \( U \in \mathscr U \).
  2. Equality holds in (a) if and only if \( U \equiv V \).
Details:
  1. If \( U \in \mathscr U \) then \[ \left\| X - U \right\|_2^2 = \left\| X - V + V - U \right\|_2^2 = \left\| X - V \right\|_2^2 + 2 \langle X - V, V - U \rangle + \left\| V - U \right\|_2^2\] But the middle term is 0 so \[ \left\| X - U \right\|_2^2 = \left\| X - V \right\|_2^2 + \left\| V - U \right\|_2^2 \ge \left\| X - V \right\|_2^2\]
  2. Equality holds if and only if \( \left\| V - U \right\|_2^2 = 0\), if and only if \( V \equiv U \).

Now let's return to our study of best predictors of a random variable.

If \( X \in \mathscr L_2 \) then the set \( \mathscr W_X = \{a + b X: a \in \R, \; b \in \R\} \) is a subspace of \(\mathscr L_2\). In fact, it is the subspace generated by \(X\) and 1.

Details:

Note that \( \mathscr W_X \) is the set of all linear combinations of the vectors \( 1 \) and \( X \). If \( U, \, V \in \mathscr W_X \) then \( U + V \in \mathscr W_X \). If \( U \in \mathscr W_X \) and \( c \in \R \) then \( c U \in \mathscr W_X \).

Recall that for \( X, \, Y \in \mathscr L_2 \), the best linear predictor of \( Y \) based on \( X \) is \[ L(Y \mid X) = \E(Y) + \frac{\cov(X, Y)}{\var(X)} \left[X - \E(X)\right] \] Here is the meaning of the predictor in the context of our vector spaces.

If \( X, \, Y \in \mathscr L_2 \) then \( L(Y \mid X) \) is the projection of \(Y\) onto \(\mathscr W_X\).

Details:

Note first that \(L(Y \mid X) \in \mathscr W_X \). Thus, we just need to show that \( Y - L(Y \mid X) \) is perpendicular to \( \mathscr W_X \). For this, it suffices to show

  1. \(\left\langle Y - L(Y \mid X), X \right\rangle = 0\)
  2. \(\left\langle Y - L(Y \mid X), 1 \right\rangle = 0\)

We have already done this in the earlier sections, but for completeness, we do it again. Note that \( \E\left(X \left[X - \E(X)\right]\right) = \var(X) \). Hence \( \E\left[X L(Y \mid X)\right] = \E(X) \E(Y) + \cov(X, Y) = \E(X Y) \). This gives (a). By linearity, \( \E\left[L(Y \mid X)\right] = \E(Y) \) so (b) holds as well.

The previous result is actually just the random variable version of the standard formula for the projection of a vector onto a space spanned by two other vectors. Note that \( 1 \) is a unit vector and that \( X_0 = X - \E(X) = X - \langle X, 1 \rangle 1 \) is perpendicular to \( 1 \). Thus, \( L(Y \mid X) \) is just the sum of the projections of \( Y \) onto \( 1 \) and \( X_0 \): \[ L(Y \mid X) = \langle Y, 1 \rangle 1 + \frac{\langle Y, X_0 \rangle}{\langle X_0, X_0\rangle} X_0 \]
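
The projection interpretation can be checked directly in a simulation sketch: compute the sample version of \( L(Y \mid X) \) and verify that the residual \( Y - L(Y \mid X) \) is (essentially) orthogonal to both \( 1 \) and \( X \). The particular joint distribution below is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(9)
    m = 500_000
    x = rng.uniform(0.0, 1.0, size=m)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=m)

    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # cov(X, Y) / var(X)
    L = y.mean() + b * (x - x.mean())               # sample version of L(Y | X)
    resid = y - L

    print(np.mean(resid))        # <Y - L(Y|X), 1>, essentially 0
    print(np.mean(resid * x))    # <Y - L(Y|X), X>, essentially 0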

Suppose now that \( \mathscr{G} \) is a sub \( \sigma \)-algebra of \( \mathscr{F} \). Of course if \( X: \Omega \to \R \) is \( \mathscr{G} \)-measurable then \( X \) is \( \mathscr{F} \)-measurable, so \( \mathscr L_2(\mathscr{G}) \) is a subspace of \( \mathscr L_2(\mathscr{F}) \).

If \( X \in \mathscr L_2(\mathscr{F}) \) then \( \E(X \mid \mathscr{G}) \) is the projection of \( X \) onto \( \mathscr L_2(\mathscr{G}) \).

Details:

This is essentially the definition of \( \E(X \mid \mathscr{G}) \) as the only (up to equivalence) random variable in \( \mathscr L_2(\mathscr{G}) \) with \( \E\left[\E(X \mid \mathscr{G}) U\right] = \E(X U) \) for every \( U \in \mathscr L_2(\mathscr{G}) \).
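
A small exact example may help make the projection property concrete. In the sketch below (a hypothetical setup), the sample space is \(\{0, 1, \ldots, 5\}\) with the uniform measure, \( \mathscr G \) is generated by the partition \(\{0, 1, 2\}\), \(\{3, 4, 5\}\), and the defining identity \( \E\left[\E(X \mid \mathscr{G}) U\right] = \E(X U) \) is verified for several \( \mathscr G \)-measurable \( U \).

    import numpy as np

    # Sample space {0, ..., 5} with the uniform measure; G is generated by the
    # partition {0, 1, 2}, {3, 4, 5}.
    omega = np.arange(6)
    p = np.full(6, 1.0 / 6.0)
    cell = (omega >= 3).astype(int)           # cell label of each outcome
    X = omega.astype(float)                   # X(w) = w

    # E(X | G) is constant on each cell, equal to the cell average of X
    cell_means = np.array([X[cell == c].mean() for c in (0, 1)])
    EXG = cell_means[cell]

    # Projection property: E[E(X|G) U] = E(X U) for every G-measurable U
    for u_vals in ([1.0, 0.0], [0.0, 1.0], [2.0, -3.0]):
        U = np.array(u_vals)[cell]            # U constant on the cells, hence G-measurable
        print(np.sum(EXG * U * p), np.sum(X * U * p))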

But remember that \( \E(X \mid \mathscr{G}) \) is defined more generally for \( X \in \mathscr L_1(\mathscr{F}) \). Our final result in this discussion concerns convergence.

Suppose that \( k \in [1, \infty) \) and that \( \mathscr{G} \) is a sub \( \sigma \)-algebra of \( \mathscr{F} \).

  1. If \( X \in \mathscr L_k(\mathscr{F}) \) then \( \E(X \mid \mathscr{G}) \in \mathscr L_k(\mathscr{G}) \)
  2. If \( X_n \in \mathscr L_k(\mathscr{F}) \) for \( n \in \N_+ \), \( X \in \mathscr L_k(\mathscr{F}) \), and \( X_n \to X \) as \( n \to \infty \) in \( \mathscr L_k(\mathscr{F}) \) then \( \E(X_n \mid \mathscr{G}) \to \E(X \mid \mathscr{G}) \) as \( n \to \infty \) in \( \mathscr L_k(\mathscr{G}) \)
Details:
  1. Note that \( |\E(X \mid \mathscr{G})| \le \E(|X| \mid \mathscr{G}) \). Since \( t \mapsto t^k \) is increasing and convex on \( [0, \infty) \) we have \[ |\E(X \mid \mathscr{G})|^k \le [\E(|X| \mid \mathscr{G})]^k \le \E\left(|X|^k \mid \mathscr{G}\right) \] The last step uses Jensen's inequality. Taking expected values gives \[ \E[|\E(X \mid \mathscr{G})|^k] \le \E(|X|^k) \lt \infty \]
  2. Using the same ideas, \[ \E\left[\left|\E(X_n \mid \mathscr{G}) - \E(X \mid \mathscr{G})\right|^k\right] = \E\left[\left|\E(X_n - X \mid \mathscr{G})\right|^k\right] \le \E\left[|X_n - X|^k\right] \] By assumption, the right side converges to 0 as \( n \to \infty \) and hence so does the left side.

Examples and Applications

App Exercises

In the error function app, select the root mean square error function. Click on the \( x \)-axis to generate an empirical distribution, and note the shape and location of the graph of the error function.

In the error function app, select the mean absolute error function. Click on the \( x \)-axis to generate an empirical distribution, and note the shape and location of the graph of the error function.

Computational Exercises

Suppose that \(X\) is uniformly distributed on the interval \([0, 1]\).

  1. Find \(\|X\|_k\) for \( k \in [1, \infty) \).
  2. Graph \(\|X\|_k\) as a function of \(k \in [1, \infty)\).
  3. Find \(\lim_{k \to \infty} \|X\|_k\).
Details:
  1. \(\|X\|_k = \frac{1}{(k + 1)^{1/k}}\)
  3. \(\lim_{k \to \infty} \|X\|_k = 1\)

Suppose that \(X\) has probability density function \(f(x) = \frac{a}{x^{a+1}}\) for \(1 \le x \lt \infty\), where \(a \gt 0\) is a parameter. Thus, \(X\) has the Pareto distribution with shape parameter \(a\).

  1. Find \(\|X\|_k\) for \( k \in [1, \infty) \).
  2. Graph \(\|X\|_k\) as a function of \(k \in (1, a)\).
  3. Find \(\lim_{k \uparrow a} \|X\|_k\).
Details:
  1. \(\|X\|_k = \left(\frac{a}{a - k}\right)^{1/k}\) if \(k \lt a\), \(\infty\) if \(k \ge a\)
  3. \(\lim_{k \uparrow a} \|X\|_k = \infty\)

Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Verify Minkowski's inequality.

Details:
  1. \(\|X + Y\|_k = \left(\frac{2^{k+3} - 2}{(k + 2)(k + 3)}\right)^{1/k}\)
  2. \(\|X\|_k + \|Y\|_k = 2 \left(\frac{1}{k + 2} + \frac{1}{2(k + 1)}\right)^{1/k}\)
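
These answers can be confirmed by simulation. The sketch below draws from the density \( f(x, y) = x + y \) by rejection sampling (propose a uniform point on the square and accept with probability \( (x + y)/2 \)) and compares Monte Carlo estimates with the closed forms above.

    import numpy as np

    rng = np.random.default_rng(10)
    m = 2_000_000
    # Rejection sampling from f(x, y) = x + y on the unit square
    xp, yp = rng.uniform(size=m), rng.uniform(size=m)
    keep = rng.uniform(size=m) < (xp + yp) / 2.0
    x, y = xp[keep], yp[keep]

    def k_norm(sample, k):
        return np.mean(np.abs(sample) ** k) ** (1.0 / k)

    for k in (1, 2, 3):
        lhs = k_norm(x + y, k)
        exact = ((2.0 ** (k + 3) - 2.0) / ((k + 2) * (k + 3))) ** (1.0 / k)
        rhs = k_norm(x, k) + k_norm(y, k)
        print(k, lhs, exact, rhs)   # lhs matches the closed form, and lhs <= rhs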

Let \(X\) be an indicator random variable with \(\P(X = 1) = p\), where \(0 \le p \le 1\). Graph \(\E\left(\left|X - t\right|\right)\) as a function of \(t \in \R\) in each of the cases below. In each case, find the minimum value of the function and the values of \(t\) where the minimum occurs.

  1. \(p \lt \frac{1}{2}\)
  2. \(p = \frac{1}{2}\)
  3. \(p \gt \frac{1}{2}\)
Details:
  1. The minimum is \(p\) and occurs at \(t = 0\).
  2. The minimum is \(\frac{1}{2}\) and occurs for \(t \in [0, 1]\)
  3. The minimum is \(1 - p\) and occurs at \(t = 1\)

Suppose that \(X\) is uniformly distributed on the interval \([0, 1]\). Find \(d_1(X, t) = \E\left(\left|X - t\right|\right)\) as a function of \(t\) and sketch the graph. Find the minimum value of the function and the value of \(t\) where the minimum occurs.

Suppose that \(X\) is uniformly distributed on the set \([0, 1] \cup [2, 3]\). Find \(d_1(X, t) = \E\left(\left|X - t\right|\right)\) as a function of \(t\) and sketch the graph. Find the minimum value of the function and the values of \(t\) where the minimum occurs.

Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Verify Hölder's inequality in the following cases:

  1. \(j = k = 2\)
  2. \(j = 3\), \(k = \frac{3}{2}\)
Details:
  In both cases the left side is \(\langle \left|X\right|, \left|Y\right| \rangle = \E(X Y) = \frac{1}{3}\), since \(X\) and \(Y\) are nonnegative. For the right side,
  1. \(\|X\|_2 \|Y\|_2 = \frac{5}{12}\)
  2. \(\|X\|_3 \|Y\|_{3/2} \approx 0.4248\)

Counterexamples

The following exercise shows that convergence with probability 1 does not imply convergence in mean.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent random variables with \[ \P\left(X_n = n^3\right) = \frac{1}{n^2}, \; \P(X_n = 0) = 1 - \frac{1}{n^2}; \quad n \in \N_+ \]

  1. \(X_n \to 0\) as \(n \to \infty\) with probability 1.
  2. \(X_n \to 0\) as \(n \to \infty\) in probability.
  3. \(\E(X_n) \to \infty\) as \(n \to \infty\).
Details:
  1. This follows from the basic characterization of convergence with probability 1: \( \sum_{n=1}^\infty \P(X_n \gt \epsilon) = \sum_{n=1}^\infty 1 / n^2 \lt \infty \) for \( 0 \lt \epsilon \lt 1 \).
  2. This follows since convergence with probability 1 implies convergence in probability.
  3. Note that \( \E(X_n) = n^3 / n^2 = n \) for \( n \in \N_+ \).

The following exercise shows that convergence in mean does not imply convergence with probability 1.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent indicator random variables with \[ \P(X_n = 1) = \frac{1}{n}, \; \P(X_n = 0) = 1 - \frac{1}{n}; \quad n \in \N_+ \]

  1. \(\P(X_n = 0 \text{ for infinitely many } n) = 1\).
  2. \(\P(X_n = 1 \text{ for infinitely many } n) = 1\).
  3. \(\P(X_n \text{ does not converge as } n \to \infty) = 1\).
  4. \(X_n \to 0\) as \(n \to \infty\) in \(k\)th mean for every \(k \ge 1\).
Details:
  1. This follows from the second Borel-Cantelli lemma in Section 1.6, since the variables are independent and \( \sum_{n=1}^\infty \P(X_n = 0) = \sum_{n=1}^\infty (1 - 1 / n) = \infty \).
  2. This also follows from the second Borel-Cantelli lemma, since \( \sum_{n=1}^\infty \P(X_n = 1) = \sum_{n=1}^\infty 1 / n = \infty \).
  3. This follows from parts (a) and (b).
  4. Note that \( \E(X_n) = 1 / n \to 0 \) as \( n \to \infty \).
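
A single simulated path illustrates the phenomenon (a rough sketch; the truncation point is arbitrary): the sequence keeps returning to 1 far out along the path, even though \( \E(X_n) = 1/n \to 0 \).

    import numpy as np

    rng = np.random.default_rng(11)
    N = 100_000
    n = np.arange(1, N + 1)
    path = (rng.uniform(size=N) < 1.0 / n).astype(int)   # one realization of X_1, X_2, ..., X_N

    ones = np.nonzero(path)[0] + 1     # indices n with X_n = 1
    print(ones[-5:])                   # values of 1 keep occurring late in the sequence
    print(1.0 / n[-5:])                # yet E(X_n) = 1/n is already tiny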

The following exercise shows that convergence of the \( k \)th means does not imply convergence in \( k \)th mean.

Suppose that \( U \) has the Bernoulli distribution with parameter \( \frac{1}{2} \), so that \( \P(U = 1) = \P(U = 0) = \frac{1}{2} \). Let \( X_n = U \) for \( n \in \N_+ \) and let \( X = 1 - U \). Let \( k \in [1, \infty) \). Then

  1. \( \E(X_n^k) = \E(X^k) = \frac{1}{2} \) for \( n \in \N_+ \), so \( \E(X_n^k) \to \E(X^k) \) as \( n \to \infty \)
  2. \( \E(|X_n - X|^k) = 1 \) for \( n \in \N_+ \), so \( X_n \) does not converge to \( X \) as \( n \to \infty \) in \( \mathscr L_k \).
Details:
  1. Note that \( X_n^k = U^k = U \) for \( n \in \N_+ \), since \( U \) just takes values 0 and 1. Also, \( U \) and \( 1 - U \) have the same distribution so \( \E(U) = \E(1 - U) = \frac{1}{2} \).
  2. Note that \( X_n - X = U - (1 - U) = 2 U - 1 \) for \( n \in \N_+ \). Again, \( U \) just takes values 0 and 1, so \( |2 U - 1| = 1 \).