9. Expected Value as an Integral

\(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\)

Initially, we defined expected value separately for discrete distributions, continuous distributions, and mixed distributions, in each case using density functions. Later we showed how these definitions can be unified, by first defining expected value for nonnegative random variables in terms of the right-tail distribution function. However, by far the best and most elegant definition of expected value is as an integral with respect to the underlying probability measure. This definition and a review of the properties of expected value are the goals of this section. No proofs are necessary (you will be happy to know), since all of the results follow from the general theory of integration. If you are a new student of probability, or are not interested in the measure-theoretic detail of the subject, you can safely skip this section.

Definitions

As usual, our starting point is a random experiment modeled by a probability space \( (\Omega, \mathscr{F}, \P) \). So \( \Omega \) is the set of outcomes, \( \mathscr{F} \) is the \( \sigma \)-algebra of events, and \( \P \) is the probability measure on the sample space \((\Omega, \mathscr F)\).

Recall that a random variable \( X \) for the experiment is simply a measurable function from \( (\Omega, \mathscr{F}) \) into another measurable space \( (S, \mathscr{S}) \). When \( S \subseteq \R^n \), we assume that \( S \) is Lebesgue measurable, and we take \( \mathscr{S} \) to be the \( \sigma \)-algebra of Lebesgue measurable subsets of \( S \). As noted above, here is the measure-theoretic definition:

If \( X \) is a real-valued random variable on the probability space, the expected value of \( X \) is defined as the integral of \( X \) with respect to \( \P \), assuming that the integral exists: \[ \E(X) = \int_\Omega X \, d\P \]

Let's review how the integral is defined in stages, but now using the notation of probability theory.

Let \( S \) denote the support set of \( X \), so that \( S \) is a measurable subset of \( \R \).

If \( S \) is finite, then \( \E(X) = \sum_{x \in S} x \, \P(X = x) \).
If \( S \subseteq [0, \infty) \), then \( \E(X) = \sup\left\{\E(Y): Y \text{ has finite range and } 0 \le Y \le X\right\} \)
For general \( S \subseteq \R \), \( \E(X) = \E\left(X^+\right) - \E\left(X^-\right) \) as long as the right side is not of the form \( \infty - \infty \), and where \( X^+ \) and \( X^- \) denote the positive and negative parts of \( X \).
If \( A \in \mathscr{F} \), then \( \E(X; A) = \E\left(X \bs{1}_A \right) \), assuming that the expected value on the right exists.

Thus, as with integrals generally, an expected value can exist as a number in \( \R \) (in which case \( X \) is integrable), can exist as \( \infty \) or \( -\infty \), or can fail to exist. In reference to part (a), a random variable with a finite set of values in \( \R \) is a simple function in the terminology of general integration. In reference to part (b), note that the expected value of a nonnegative random variable always exists in \( [0, \infty] \). In reference to part (c), \( \E(X) \) exists if and only if either \( \E\left(X^+\right) \lt \infty \) or \( \E\left(X^-\right) \lt \infty \).
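To make the staged definition concrete, here is a minimal Python sketch on a finite probability space. The choice of two fair dice and the names `omega`, `prob`, `expect`, and `X` are assumptions made purely for illustration. The sketch computes the integral over \( \Omega \) directly, compares it with the sum over the support in part (a), and checks the decomposition into positive and negative parts in part (c) and the restriction to an event in part (d).

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two fair dice, 36 equally likely outcomes.
omega = list(product(range(1, 7), repeat=2))
prob = {w: Fraction(1, 36) for w in omega}

def expect(X, event=None):
    """Integral of X with respect to prob over the event (default: all of omega)."""
    return sum(X(w) * p for w, p in prob.items() if event is None or w in event)

# A real-valued random variable that takes both signs: difference of the scores.
X = lambda w: w[0] - w[1]

# Part (a): sum of x * P(X = x) over the finite support of X.
support = {X(w) for w in omega}
by_support = sum(x * sum(p for w, p in prob.items() if X(w) == x) for x in support)

# Part (c): positive and negative parts of X.
X_plus = lambda w: max(X(w), 0)
X_minus = lambda w: max(-X(w), 0)

# Part (d): expected value over the event A that the first die shows 6.
A = {w for w in omega if w[0] == 6}

print(expect(X), by_support, expect(X_plus) - expect(X_minus), expect(X, A))
```

Exact rational arithmetic is used so that the different expressions for \( \E(X) \) can be compared with `==` rather than up to rounding error.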

Our next goal is to restate the basic theorems and properties of integrals, but in the notation of probability. Unless otherwise noted, all random variables are assumed to be real-valued.

Basic Properties

The Linear Properties

Perhaps the most important and basic properties are the linear properties. Part (a) is the additive property and part (b) is the scaling property.

Suppose that \( X \) and \( Y \) are random variables whose expected values exist, and that \( c \in \R \). Then

\( \E(X + Y) = \E(X) + \E(Y) \) as long as the right side is not of the form \( \infty - \infty \).
\( \E(c X) = c \E(X) \)

Thus, part (a) holds if at least one of the expected values on the right is finite, or if both are \( \infty \), or if both are \( -\infty \). What is ruled out are the two cases where one expected value is \( \infty \) and the other is \( -\infty \), and this is what is meant by the indeterminate form \( \infty - \infty \).
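The linear properties can be checked by hand on a small example. The following sketch assumes a finite probability space (two fair dice) with the hypothetical variables `X` (the larger score) and `Y` (the smaller score). The two variables are strongly dependent, which is worth noting because additivity places no independence requirement on \( X \) and \( Y \).

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two fair dice.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)  # probability of each outcome

def expect(X):
    """Expected value of X as a sum (integral) over omega."""
    return sum(X(w) * p for w in omega)

X = lambda w: max(w)  # larger of the two scores
Y = lambda w: min(w)  # smaller of the two scores (dependent on X)
c = 3

e_x, e_y = expect(X), expect(Y)
e_sum = expect(lambda w: X(w) + Y(w))  # left side of the additive property
e_scaled = expect(lambda w: c * X(w))  # left side of the scaling property

print(e_x, e_y, e_sum, e_scaled)
```

Since the larger and smaller scores add up to the sum of the two dice, `e_sum` is the familiar mean 7 of that sum.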

Equality and Order

Our next set of properties deals with equality and order. First, the expected value of a random variable over a null set is 0.

If \( X \) is a random variable and \( A \) is an event with \( \P(A) = 0 \), then \( \E(X; A) = 0 \).

If \( X \) is a random variable whose expected value exists, and \( Y \) is a random variable with \( \P(X = Y) = 1 \), then \( \E(X) = \E(Y) \).

Suppose that \( X \) is a random variable and \( \P(X \ge 0) = 1 \). Then

\( \E(X) \ge 0 \)
\( \E(X) = 0 \) if and only if \( \P(X = 0) = 1 \).

So, if \( X \) is a nonnegative random variable then \( \E(X) \gt 0 \) if and only if \( \P(X \gt 0) \gt 0 \). The next result is the increasing property of expected value, perhaps the most important property after linearity.

Suppose that \( X, Y \) are random variables whose expected values exist, and that \( \P(X \le Y) = 1 \). Then

\( \E(X) \le \E(Y) \)
Except in the case that both expected values are \( \infty \) or both \( -\infty \), \( \E(X) = \E(Y) \) if and only if \( \P(X = Y) = 1 \).

So if \( X \le Y \) with probability 1 then, except in the two cases mentioned, \( \E(X) \lt \E(Y) \) if and only if \( \P(X \lt Y) \gt 0 \). The next result is the absolute value inequality.

Suppose that \( X \) is a random variable whose expected value exists. Then

\( \left| \E(X) \right| \le \E \left(\left| X \right| \right) \)
If \( \E(X) \) is finite, then equality holds in (a) if and only if \( \P(X \ge 0) = 1 \) or \( \P(X \le 0) = 1 \).

Change of Variables and Density Functions

The Change of Variables Theorem

Suppose now that \( X \) is a general random variable on the probability space \((\Omega, \mathscr F, \P)\), taking values in a measurable space \( (S, \mathscr{S}) \). Recall that the probability distribution of \( X \) is the probability measure \( P \) on \( (S, \mathscr{S}) \) given by \( P(A) = \P(X \in A) \) for \( A \in \mathscr{S} \). This is a special case of a new positive measure induced by a given positive measure and a measurable function. If \( g: S \to \R \) is measurable, then \( g(X) \) is a real-valued random variable. The following result shows how to compute the expected value of \( g(X) \) as an integral with respect to the distribution of \( X \), and is known as the change of variables theorem.

If \( g: S \to \R \) is measurable then, assuming that the expected value exists, \[\E\left[g(X)\right] = \int_S g(x) \, dP(x) \]

So, using the original definition and the change of variables theorem, and giving the variables explicitly for emphasis, we have \[ \E\left[g(X)\right] = \int_\Omega g\left[X(\omega)\right] \, d\P(\omega) = \int_S g(x) \, dP(x)\]
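The two integrals in the change of variables theorem can be compared directly when \( \Omega \) is finite. In the sketch below, the probability space (two fair dice), the variable `X` (the sum of the scores), and the function `g` are all hypothetical choices made for illustration. The first computation integrates \( g[X(\omega)] \) over \( \Omega \), and the second integrates \( g(x) \) over \( S \) with respect to the distribution of \( X \).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two fair dice.
omega = list(product(range(1, 7), repeat=2))
p_omega = Fraction(1, 36)  # probability of each outcome

X = lambda w: w[0] + w[1]   # random variable with values in S = {2, ..., 12}
g = lambda x: (x - 7) ** 2  # measurable function from S into R

# Integral of g(X(omega)) over Omega with respect to the probability measure.
over_omega = sum(g(X(w)) * p_omega for w in omega)

# Distribution of X on S: P(A) = P(X in A), stored through its point masses.
counts = Counter(X(w) for w in omega)
dist = {x: Fraction(n, 36) for x, n in counts.items()}

# Integral of g over S with respect to the distribution of X.
over_S = sum(g(x) * mass for x, mass in dist.items())

print(over_omega, over_S)
```

The second form only needs the distribution of \( X \), not the underlying probability space, which is why it is the one used in most computations.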

The Radon-Nikodym Theorem

Suppose now that \( \mu \) is a \( \sigma \)-finite positive measure on \( (S, \mathscr{S}) \), and that the distribution of \( X \) is absolutely continuous with respect to \( \mu \). Recall that this means that \( \mu(A) = 0 \) implies \( P(A) = \P(X \in A) = 0 \) for \( A \in \mathscr{S} \). By the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym, \( X \) has a probability density function \( f \) with respect to \( \mu \). That is, \[ P(A) = \P(X \in A) = \int_A f \, d\mu, \quad A \in \mathscr{S} \] In this case, we can write the expected value of \( g(X) \) as an integral with respect to the probability density function.

If \( g: S \to \R \) is measurable then, assuming that the expected value exists, \[ \E\left[g(X)\right] = \int_S g f \, d\mu \]

Again, giving the variables explicitly for emphasis, we have the following chain of integrals: \[ \E\left[g(X)\right] = \int_\Omega g\left[X(\omega)\right] \, d\P(\omega) = \int_S g(x) \, dP(x) = \int_S g(x) f(x) \, d\mu(x)\]

Discrete Distributions

Suppose first that \((S, \mathscr S, \#)\) is a discrete measure space, so that \( S \) is countable, \( \mathscr{S} = \mathscr{P}(S) \) is the collection of all subsets of \( S \), and \( \# \) is counting measure on \( (S, \mathscr S) \). Thus, \( X \) has a discrete distribution on \( S \), and this distribution is always absolutely continuous with respect to \( \# \). Specifically, \( \#(A) = 0 \) if and only if \( A = \emptyset \) and of course \( \P(X \in \emptyset) = 0 \). The probability density function \( f \) of \( X \) with respect to \( \# \), as we know, is simply \( f(x) = \P(X = x) \) for \( x \in S \). Moreover, integrals with respect to \( \# \) are sums, so \[ \E\left[g(X)\right] = \sum_{x \in S} g(x) f(x) \] assuming that the expected value exists. Existence in this case means that either the sum of the positive terms is finite or the sum of the negative terms is finite, so that the sum makes sense (and in particular does not depend on the order in which the terms are added). Specializing further, if \( X \) itself is real-valued and \( g \) is the identity function, \( g(x) = x \), we have \[ \E(X) = \sum_{x \in S} x f(x) \] which was our original definition of expected value in the discrete case.
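When \( S \) is countably infinite, the sum above is an infinite series. As a rough sketch, assume that \( X \) has the geometric distribution on \( \N_+ \) with success parameter \( p \), so that \( f(x) = p (1 - p)^{x - 1} \) and \( \E(X) = 1 / p \). The distribution, the value `p = 0.25`, and the truncation point are assumptions chosen for illustration. Since the terms are nonnegative, the partial sums increase to the expected value.

```python
# Hypothetical example: geometric distribution on {1, 2, ...} with parameter p.
p = 0.25

def f(x):
    """Density of X with respect to counting measure: f(x) = P(X = x)."""
    return p * (1 - p) ** (x - 1)

def g(x):
    """Identity function, so that E[g(X)] = E(X)."""
    return x

# Partial sums of the series sum_x g(x) f(x); the terms are nonnegative,
# so the partial sums increase to the expected value.
for n_terms in (10, 50, 2000):
    partial = sum(g(x) * f(x) for x in range(1, n_terms + 1))
    print(n_terms, partial)

print("closed form 1/p:", 1 / p)
```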

Continuous Distributions

For the second special case, suppose that \((S, \mathscr S, \lambda^n)\) is a Euclidean measure space, so that \( S \) is a Lebesgue measurable subset of \( \R^n \) for some \( n \in \N_+ \), \(\mathscr S\) is the \(\sigma\)-algebra of Lebesgue measurable subsets of \(S\), and \( \lambda^n \) is Lebesgue measure on \( (S, \mathscr S) \). The distribution of \( X \) is absolutely continuous with respect to \( \lambda^n \) if \( \lambda^n(A) = 0 \) implies \( \P(X \in A) = 0 \) for \( A \in \mathscr{S} \). If this is the case, then a probability density function \( f \) of \( X \) has its usual meaning. Thus, \[ \E\left[g(X)\right] = \int_S g(x) f(x) \, d\lambda^n(x)\] assuming that the expected value exists. When \( g \) is a sufficiently nice function, this integral reduces to an ordinary \( n \)-dimensional Riemann integral of calculus. Specializing further, if \( X \) is itself real-valued and \( g \) is the identity function, \( g(x) = x \), then \[ \E(X) = \int_S x f(x) \, dx \] which was our original definition of expected value in the continuous case.
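For a nice density and a nice function \( g \), the integral can be approximated numerically. The sketch below assumes that \( X \) has the standard normal distribution (a hypothetical choice, not one made in the text) and uses a simple midpoint rule on the truncated interval \( [-10, 10] \). The helper names `normal_pdf` and `expect` are illustrative, and the truncation is an approximation that is reasonable only because the normal tails are extremely light.

```python
import math

def normal_pdf(x):
    """Standard normal density with respect to Lebesgue measure on R."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def expect(g, f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of g(x) f(x) over [a, b]."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += g(x) * f(x)
    return total * h

mean = expect(lambda x: x, normal_pdf, -10, 10)               # g is the identity
second_moment = expect(lambda x: x * x, normal_pdf, -10, 10)  # g(x) = x^2

print(mean, second_moment)
```

The results should be close to the known values 0 and 1 for the mean and second moment of the standard normal distribution.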

Interchange Properties

In this subsection, we review properties that allow the interchange of expected value and other operations: limits of sequences, infinite sums, and integrals. We assume again that the random variables are real-valued unless otherwise specified.

Limits

Our first set of convergence results deals with the interchange of expected value and limits. We start with the expected value version of Fatou's lemma, named in honor of Pierre Fatou. Its usefulness stems from the fact that no assumptions are placed on the random variables, except that they be nonnegative.

Suppose that \( X_n \) is a nonnegative random variable for \( n \in \N_+ \). Then \[ \E\left( \liminf_{n \to \infty} X_n \right) \le \liminf_{n \to \infty} \E(X_n) \]

Our next set of results gives conditions for the interchange of expected value and limits.

Suppose that \( X_n \) is a random variable for each \( n \in \N_+ \). Then \[ \E\left(\lim_{n \to \infty} X_n\right) = \lim_{n \to \infty} \E\left(X_n\right) \] in each of the following cases:

\( X_n \) is nonnegative for each \( n \in \N_+ \) and \( X_n \) is increasing in \( n \).
\( \E(X_n) \) exists for each \( n \in \N_+ \), \( \E(X_1) \gt -\infty \), and \( X_n \) is increasing in \( n \).
\( \E(X_n) \) exists for each \( n \in \N_+ \), \( \E(X_1) \lt \infty \), and \( X_n \) is decreasing in \( n \).
\( \lim_{n \to \infty} X_n \) exists, and \( \left|X_n\right| \le Y \) for \( n \in \N_+ \) where \( Y \) is a nonnegative random variable with \( \E(Y) \lt \infty \).
\( \lim_{n \to \infty} X_n \) exists, and \( \left|X_n\right| \le c \) for \( n \in \N_+ \) where \( c \) is a positive constant.

Statements about the random variables in the theorem above (nonnegative, increasing, existence of limit, etc.) need only hold with probability 1. Part (a) is the monotone convergence theorem, one of the most important convergence results and in a sense, essential to the definition of the integral in the first place. Parts (b) and (c) are slight generalizations of the monotone convergence theorem. In parts (a), (b), and (c), note that \( \lim_{n \to \infty} X_n \) exists (with probability 1), although the limit may be \( \infty \) in parts (a) and (b) and \( -\infty \) in part (c) (with positive probability). Part (d) is the dominated convergence theorem, another of the most important convergence results. It's sometimes also known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue. Part (e) is a corollary of the dominated convergence theorem, and is known as the bounded convergence theorem.
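A standard example shows why some hypothesis is needed for the interchange, and why the inequality in Fatou's lemma can be strict. Assume (hypothetically) that \( U \) is uniformly distributed on \( (0, 1) \) and let \( X_n = n \bs{1}(U \lt 1 / n) \). Then \( \E(X_n) = 1 \) for every \( n \), but \( X_n \to 0 \) with probability 1, so the expected value of the limit is 0. The sequence is neither monotone nor dominated by an integrable variable, since \( \sup_n X_n \) is roughly \( 1 / U \), which has infinite expected value. The sketch below checks both facts numerically. The function names and the grid approximation are illustrative only.

```python
def X(n, u):
    """X_n at the outcome u in (0, 1): equal to n on the event {u < 1/n}, else 0."""
    return n if u < 1 / n else 0

def grid_mean(n, m=100_000):
    """Approximate E(X_n) by averaging X_n over the midpoints of m equal cells of (0, 1)."""
    return sum(X(n, (i + 0.5) / m) for i in range(m)) / m

# The expected values stay at 1 for every n ...
means = [grid_mean(n) for n in (1, 10, 100, 1000)]

# ... but for a fixed outcome u, X_n(u) = 0 as soon as n >= 1/u.
u = 0.013
tail_values = [X(n, u) for n in range(77, 1000)]

print(means, set(tail_values))
```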

Infinite Series

Our next results involve the interchange of expected value and an infinite sum, so these results generalize the basic additivity property of expected value.

Suppose that \( X_n \) is a random variable for \( n \in \N_+ \). Then \[ \E\left( \sum_{n=1}^\infty X_n\right) = \sum_{n=1}^\infty \E\left(X_n\right) \] in each of the following cases:

\( X_n \) is nonnegative for each \( n \in \N_+ \).
\(\E\left(\sum_{n=1}^\infty \left| X_n \right|\right) \lt \infty \)

Part (a) is a consequence of the monotone convergence theorem, and part (b) is a consequence of the dominated convergence theorem. In (b), note that \( \sum_{n=1}^\infty \left| X_n \right| \lt \infty \) and hence \( \sum_{n=1}^\infty X_n \) is absolutely convergent with probability 1. Our next result is the additivity of the expected value over a countably infinite collection of disjoint events.

Suppose that \( X \) is a random variable whose expected value exists, and that \( \{A_n: n \in \N_+\} \) is a disjoint collection of events. Let \( A = \bigcup_{n=1}^\infty A_n \). Then \[ \E(X; A) = \sum_{n=1}^\infty \E(X; A_n) \]

Of course, the previous theorem applies in particular if \( X \) is nonnegative.

Integrals

Suppose that \( (T, \mathscr{T}, \mu) \) is a \( \sigma \)-finite measure space, and that \( X_t \) is a real-valued random variable for each \( t \in T \). Thus we can think of \( \left\{X_t: t \in T\right\} \) as a stochastic process indexed by \( T \). We assume that \( (\omega, t) \mapsto X_t(\omega) \) is measurable, as a function from the product space \( (\Omega \times T, \mathscr{F} \otimes \mathscr{T}) \) into \( \R \). Our next result involves the interchange of expected value and integral, and is a consequence of Fubini's theorem, named for Guido Fubini.

Under the assumptions above, \[ \E\left[\int_T X_t \, d\mu(t)\right] = \int_T \E\left(X_t\right) \, d\mu(t) \] in each of the following cases:

\( X_t \) is nonnegative for each \( t \in T \).
\(\int_T \E\left(\left|X_t\right|\right) \, d\mu(t) \lt \infty \)

Fubini's theorem actually states that the two iterated integrals above equal the joint integral \[ \int_{\Omega \times T} X_t(\omega) \, d(\P \otimes \mu)(\omega, t) \] where of course, \( \P \otimes \mu \) is the product measure on \( (\Omega \times T, \mathscr{F} \otimes \mathscr{T}) \). However, our interest is usually in evaluating the iterated integral above on the left in terms of the iterated integral on the right. Part (a) is the expected value version of Tonelli's theorem, named for Leonida Tonelli.

Examples and Exercises

You may have worked some of the computational exercises before, but try to see them in a new light, in terms of the general theory of integration.

The Cauchy Distribution

Recall that the Cauchy distribution, named for Augustin Cauchy, is a continuous distribution with probability density function \( f \) given by \[ f(x) = \frac{1}{\pi \left(1 + x^2\right)}, \quad x \in \R \]

Suppose that \( X \) has the Cauchy distribution.

Show that \( \E(X) \) does not exist.
Find \( \E\left(X^2\right) \)

Details:

\( \E\left(X^+\right) = \E\left(X^-\right) = \infty \)
\( \infty \)

Open the Cauchy Experiment and keep the default parameters. Run the experiment 1000 times and note the behavior of the sample mean.
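Both the computation and the simulation can be sketched in a few lines of Python. The first part approximates \( \int_0^b x f(x) \, dx \) by a midpoint rule and compares it with the closed form \( \ln\left(1 + b^2\right) / (2 \pi) \), which grows without bound as \( b \to \infty \). This is why \( \E\left(X^+\right) = \infty \). The second part simulates the running sample mean, generating Cauchy variables as \( \tan[\pi (U - 1/2)] \) with \( U \) uniform on \( (0, 1) \). The helper names, the seed, and the sample sizes are arbitrary choices for illustration, and this is not the code behind the Cauchy Experiment app.

```python
import math
import random

def cauchy_pdf(x):
    """Probability density function of the standard Cauchy distribution."""
    return 1 / (math.pi * (1 + x * x))

def truncated_positive_part(b, steps=100_000):
    """Midpoint-rule approximation of the integral of x f(x) over [0, b]."""
    h = b / steps
    return h * sum((i + 0.5) * h * cauchy_pdf((i + 0.5) * h) for i in range(steps))

# The truncated integral matches ln(1 + b^2) / (2 pi) and keeps growing with b.
for b in (10, 100, 1000):
    closed_form = math.log(1 + b * b) / (2 * math.pi)
    print(b, truncated_positive_part(b), closed_form)

random.seed(0)  # fixed seed only so that reruns give the same numbers

def cauchy_sample():
    """Inverse-transform sample: tan(pi (U - 1/2)) with U uniform on [0, 1)."""
    return math.tan(math.pi * (random.random() - 0.5))

# Running sample mean at a few checkpoints; it should not be expected to settle down.
total = 0.0
for k in range(1, 100_001):
    total += cauchy_sample()
    if k in (10, 100, 1000, 10_000, 100_000):
        print(k, total / k)
```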

The Pareto Distribution

Recall that the Pareto distribution, named for Vilfredo Pareto, is a continuous distribution with probability density function \( f \) given by \[ f(x) = \frac{a}{x ^{a+1}}, \quad x \in [1, \infty) \] where \( a \gt 0 \) is the shape parameter.

Suppose that \( X \) has the Pareto distribution with shape parameter \( a \). Find \( \E(X) \) in the following cases:

\(0 \lt a \le 1\)
\( a \gt 1 \)

Answer

\( \infty \)
\( \frac{a}{a - 1} \)

Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the probability density function and the location of the mean. For various values of the parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.
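If the simulator is not at hand, a comparable experiment can be sketched in Python. The sketch assumes inverse-transform sampling: the Pareto distribution function is \( F(x) = 1 - x^{-a} \) for \( x \ge 1 \), so \( (1 - U)^{-1/a} \) has the Pareto distribution when \( U \) is uniform on \( [0, 1) \). The function names, seed, and sample sizes are hypothetical, and this is not the code behind the simulator.

```python
import random

def pareto_sample(a, rng):
    """Inverse-transform sample from the Pareto distribution with shape parameter a."""
    # 1 - U lies in (0, 1], so the negative power is always defined.
    return (1.0 - rng.random()) ** (-1.0 / a)

def sample_mean(a, n, seed=0):
    """Sample mean of n simulated Pareto variables (seeded for reproducibility)."""
    rng = random.Random(seed)
    return sum(pareto_sample(a, rng) for _ in range(n)) / n

for a in (0.5, 1.0, 3.0):
    dist_mean = a / (a - 1) if a > 1 else float("inf")
    print(a, sample_mean(a, 100_000), dist_mean)
```

For \( a \gt 1 \) the sample mean should be near \( a / (a - 1) \), although convergence is slow when \( a \) is close to 1. For \( a \le 1 \) the sample mean tends to be dominated by a few very large observations and does not stabilize.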

Suppose that \( X \) has the Pareto distribution with shape parameter \( a \). Find \( \E\left(1 / X^n \right) \) for \( n \in \N_+ \).

Details:

\( \frac{a}{a + n} \)
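For reference, the answer follows from the change of variables theorem with \( g(x) = x^{-n} \) and the Pareto density. A short worked computation, written with the plain symbol \( \mathbb{E} \) so that the block is self-contained:

```latex
\begin{align*}
\mathbb{E}\left(X^{-n}\right)
  &= \int_1^\infty x^{-n} \, \frac{a}{x^{a+1}} \, dx
   = a \int_1^\infty x^{-(a + n + 1)} \, dx \\
  &= a \left[ -\frac{x^{-(a + n)}}{a + n} \right]_1^\infty
   = \frac{a}{a + n}
\end{align*}
```

The integral is finite for every \( a \gt 0 \) and \( n \in \N_+ \) because \( a + n \gt 0 \), in contrast to \( \E(X) \), which is finite only when \( a \gt 1 \).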

Special Results for Nonnegative Variables

For a nonnegative variable, the moments can be obtained from integrals of the right-tail distribution function.

If \(X\) is a nonnegative random variable then \[ \E\left(X^n\right) = \int_0^\infty n x^{n-1} \P(X \gt x) \, dx \]

Details:

By Fubini's theorem we can interchange an expected value and integral when the integrand is nonnegative. Hence \[ \int_0^\infty n x^{n-1} \P(X \gt x) \, dx = \int_0^\infty n x^{n-1} \E\left[\bs{1}(X \gt x)\right] \, dx = \E \left(\int_0^\infty n x^{n-1} \bs{1}(X \gt x) \, dx \right) = \E\left( \int_0^X n x^{n-1} \, dx \right) = \E\left(X^n\right) \]

When \( n = 1 \) we have \( \E(X) = \int_0^\infty \P(X \gt x) \, dx \). We saw this result before, but now we can understand the proof in terms of Fubini's theorem.
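The tail-integral formula is easy to check numerically when the right-tail distribution function is known in closed form. The sketch below assumes that \( X \) has the exponential distribution with rate \( r \) (a hypothetical choice), so that \( \P(X \gt x) = e^{-r x} \) and \( \E\left(X^n\right) = n! / r^n \). The helper name `tail_integral_moment`, the value of `r`, and the truncation point are illustrative.

```python
import math

def tail_integral_moment(n, tail, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of n x^(n-1) P(X > x) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += n * x ** (n - 1) * tail(x)
    return total * h

r = 1.5                            # rate of the exponential distribution
tail = lambda x: math.exp(-r * x)  # right-tail distribution function P(X > x)

# Compare the tail integral with the known moments n! / r^n.
for n in (1, 2, 3):
    approx = tail_integral_moment(n, tail, upper=60 / r)
    exact = math.factorial(n) / r ** n
    print(n, approx, exact)
```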

For a random variable taking nonnegative integer values, the moments can be computed from sums involving the right-tail distribution function.

Suppose that \(X\) has a discrete distribution, taking values in \(\N\). Then \[ \E\left(X^n\right) = \sum_{k=1}^\infty \left[k^n - (k - 1)^n\right] \P(X \ge k) \]

Details:

By the interchange theorem for infinite series above, we can interchange expected value and infinite series when the terms are nonnegative. Hence \[ \sum_{k=1}^\infty \left[k^n - (k - 1)^n\right] \P(X \ge k) = \sum_{k=1}^\infty \left[k^n - (k - 1)^n\right] \E\left[\bs{1}(X \ge k)\right] = \E\left(\sum_{k=1}^\infty \left[k^n - (k - 1)^n\right] \bs{1}(X \ge k) \right) = \E\left(\sum_{k=1}^X \left[k^n - (k - 1)^n\right] \right) = \E\left(X^n\right) \]

When \( n = 1 \) we have \( \E(X) = \sum_{k=1}^\infty \P(X \ge k) \). We also saw this result before, but now we can understand the proof in terms of the interchange of sum and expected value.
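The discrete formula can be verified exactly on a small example. The sketch below assumes that \( X \) is the number of heads in 4 fair coin tosses (a hypothetical choice), and compares the moments computed directly from the density with the moments computed from the right-tail probabilities \( \P(X \ge k) \). With \( n = 1 \) the weights \( k^n - (k - 1)^n \) are all 1, so the formula reduces to the sum of \( \P(X \ge k) \) over \( k \ge 1 \).

```python
from fractions import Fraction
from math import comb

# Hypothetical example: X = number of heads in 4 fair coin tosses.
pmf = {k: Fraction(comb(4, k), 16) for k in range(5)}

def direct_moment(n):
    """E(X^n) computed directly as the sum of k^n P(X = k)."""
    return sum(k ** n * p for k, p in pmf.items())

def tail(k):
    """Right-tail probability P(X >= k)."""
    return sum(p for j, p in pmf.items() if j >= k)

def tail_sum_moment(n):
    """E(X^n) computed as the sum of [k^n - (k-1)^n] P(X >= k) over k >= 1."""
    # Terms with k > 4 vanish because P(X >= k) = 0 there.
    return sum((k ** n - (k - 1) ** n) * tail(k) for k in range(1, 5))

for n in (1, 2, 3):
    print(n, direct_moment(n), tail_sum_moment(n))
```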