4 Expectation
4.1 The Expectation of a Random Variable
The distribution of a random variable \(X\) contains all of the probabilistic information about \(X\). The entire distribution of \(X\), however, is usually too cumbersome for presenting this information. Summaries of the distribution, such as the average value, or expected value, can be useful for giving people an idea of where we expect \(X\) to be without trying to describe the entire distribution. The expected value also plays an important role in the approximation methods that arise in ?sec-6.
4.1.1 Expectation for a Discrete Distribution
Example 4.1 (Example 4.1.1: Fair Price for a Stock) An investor is considering whether or not to invest $18 per share in a stock for one year. The value of the stock after one year, in dollars, will be \(18 + X\), where \(X\) is the amount by which the price changes over the year. At present \(X\) is unknown, and the investor would like to compute an “average value” for \(X\) in order to compare the return she expects from the investment to what she would get by putting the $18 in the bank at 5% interest.
The idea of finding an average value as in Example 4.1 arises in many applications that involve a random variable. One popular choice is what we call the mean or expected value or expectation.
The intuitive idea of the mean of a random variable is that it is the weighted average of the possible values of the random variable with the weights equal to the probabilities.
Example 4.2 (Example 4.1.2: Stock Price Change) Suppose that the change in price of the stock in Example 4.1 is a random variable \(X\) that can assume only the four different values \(−2\), \(0\), \(1\), and \(4\), and that \(\Pr(X = −2) = 0.1\), \(\Pr(X = 0) = 0.4\), \(\Pr(X = 1) = 0.3\), and \(\Pr(X = 4) = 0.2\). Then the weighted average of these values is
\[ −2(0.1) + 0(0.4) + 1(0.3) + 4(0.2) = 0.9. \]
The investor now compares this with the interest that would be earned on $18 at 5% for one year, which is \(18 \times 0.05 = 0.9\) dollars. From this point of view, the price of $18 seems fair.
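The weighted average in Example 4.2 is easy to verify numerically. The short sketch below (Python; the dictionary representation of the pmf and the variable names are our own, not part of the example) computes the expectation from the pmf and compares it with the interest on $18 at 5%.

```python
# Expected change in the stock price from Example 4.2.
pmf = {-2: 0.1, 0: 0.4, 1: 0.3, 4: 0.2}   # value -> probability

expected_change = sum(x * p for x, p in pmf.items())
bank_interest = 18 * 0.05                  # interest on $18 at 5% for one year

print(expected_change)   # 0.9
print(bank_interest)     # 0.9
```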
The calculation in Example 4.2 generalizes easily to every random variable that assumes only finitely many values. Possible problems arise with random variables that assume more than finitely many values, especially when the collection of possible values is unbounded.
Definition 4.1 (Definition 4.1.1: Mean of Bounded Discrete Random Variable.) Let \(X\) be a bounded discrete random variable whose pmf is \(f\). The expectation of \(X\), denoted by \(\mathbb{E}[X]\), is a number defined as follows:
\[ \mathbb{E}(X) = \sum_{\text{All }x}xf(x). \tag{4.1}\]
The expectation of \(X\) is also referred to as the mean of \(X\) or the expected value of \(X\).
In Example 4.2, \(\mathbb{E}(X) = 0.9\). Notice that \(0.9\) is not one of the possible values of \(X\) in that example. This is typically the case with discrete random variables.
Example 4.3 (Example 4.1.3: Bernoulli Random Variable.) Let \(X\) have the Bernoulli distribution with parameter \(p\), that is, assume that \(X\) takes only the two values 0 and 1 with \(\Pr(X = 1) = p\). Then the mean of X is
\[ \mathbb{E}[X] = 0 \times (1 − p) + 1 \times p = p. \]
If \(X\) is unbounded, it might still be possible to define \(\mathbb{E}[X]\) as the weighted average of its possible values. However, some care is needed.
Definition 4.2 (Definition 4.1.2: Mean of General Discrete Random Variable.) Let \(X\) be a discrete random variable whose pmf is \(f\). Suppose that at least one of the following sums is finite: \[ \sum_{\text{Positive }x}xf(x), \; \sum_{\text{Negative }x}xf(x). \tag{4.2}\]
Then the mean, expectation, or expected value of \(X\) is said to exist and is defined to be
\[ \mathbb{E}[X] = \sum_{\text{All }x}xf(x). \tag{4.3}\]
If both of the sums in Equation 4.2 are infinite, then \(\mathbb{E}[X]\) does not exist.
The reason that the expectation fails to exist if both of the sums in Equation 4.2 are infinite is that, in such cases, the sum in Equation 4.3 is not well-defined. It is known from calculus that the sum of an infinite series whose positive and negative terms both add to infinity either fails to converge or can be made to converge to many different values by rearranging the terms in different orders. We don’t want the meaning of expected value to depend on arbitrary choices about what order to add numbers. If only one of the two sums in Equation 4.2 is infinite, then the expected value is also infinite with the same sign as that of the sum that is infinite. If both sums are finite, then the sum in Equation 4.3 converges and doesn’t depend on the order in which the terms are added.
Example 4.4 (Example 4.1.4: The Mean of \(X\) Does Not Exist.) Let \(X\) be a random variable whose pmf is
\[ f(x) = \begin{cases} \frac{1}{2|x|(|x|+1)} &\text{if }x = \pm 1, \pm 2, \pm 3, \ldots, \\ 0 &\text{otherwise.} \end{cases} \]
It can be verified that this function satisfies the conditions required to be a pmf. The two sums in Equation 4.2 are
\[ \sum_{x=-1}^{-\infty}x\frac{1}{2|x|(|x|+1)} = -\infty \text{ and }\sum_{x=1}^{\infty}x\frac{1}{2x(x+1)} = \infty; \]
hence, \(\mathbb{E}[X]\) does not exist.
Example 4.5 (Example 4.1.5: An Infinite Mean) Let \(X\) be a random variable whose pmf is
\[ f(x) = \begin{cases} \frac{1}{x(x+1)} &\text{if }x = 1, 2, 3, \ldots, \\ 0 &\text{otherwise.} \end{cases} \]
The sum over negative values in Equation 4.2 is 0, so the mean of \(X\) exists and is
\[ \mathbb{E}[X] = \sum_{x=1}^{\infty}x\frac{1}{x(x+1)} = \infty. \]
We say that the mean of \(X\) is infinite in this case.
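The divergence in Example 4.5 can be made concrete with a quick numerical check: the partial sums \(\sum_{x=1}^{N} x f(x) = \sum_{x=1}^{N} 1/(x+1)\) behave like the harmonic series. The sketch below (Python; the cutoff values are arbitrary) shows the partial sums continuing to grow as more terms are added.

```python
# Partial sums of sum_x x*f(x) for f(x) = 1/(x*(x+1)), x = 1, 2, 3, ...
# Each term equals 1/(x+1), so the partial sums grow roughly like log(N).
def partial_sum(n_terms):
    return sum(x * (1.0 / (x * (x + 1))) for x in range(1, n_terms + 1))

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, round(partial_sum(n), 3))
# The printed sums keep increasing (by about log(10) ~ 2.3 per factor of 10),
# illustrating that E[X] is infinite.
```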
Note: The Expectation of \(X\) Depends Only on the Distribution of \(X\): Although \(\mathbb{E}[X]\) is called the expectation of \(X\), it depends only on the distribution of \(X\). Every two random variables that have the same distribution will have the same expectation even if they have nothing to do with each other. For this reason, we shall often refer to the expectation of a distribution even if we do not have in mind a random variable with that distribution.
4.1.2 Expectation for a Continuous Distribution
The idea of computing a weighted average of the possible values can be generalized to continuous random variables by using integrals instead of sums. The distinction between bounded and unbounded random variables arises in this case for the same reasons.
Definition 4.3 (Definition 4.1.3: Mean of Bounded Continuous Random Variable) Let \(X\) be a bounded continuous random variable whose pdf is \(f\). The expectation of \(X\), denoted \(\mathbb{E}[X]\), is defined as follows:
\[ \mathbb{E}[X] = \int_{-\infty}^{\infty}xf(x)dx \tag{4.4}\]
Once again, the expectation is also called the mean or the expected value.
Example 4.6 (Example 4.1.6: Expected Failure Time) An appliance has a maximum lifetime of one year. The time \(X\) until it fails is a random variable with a continuous distribution having pdf
\[ f(x) = \begin{cases} 2x &\text{for }0 < x < 1, \\ 0 &\text{otherwise.} \end{cases} \]
Then
\[ \mathbb{E}[X] = \int_0^{1}x(2x)dx = \int_{0}^{1}2x^2dx = \frac{2}{3}. \]
We can also say that the expectation of the distribution with pdf \(f\) is \(2/3\).
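The integral in Example 4.6 can also be checked by simulation. The following sketch (Python; the sample size is an arbitrary choice) draws from the pdf \(f(x) = 2x\) on \((0, 1)\) using the inverse-cdf method, since the cdf is \(F(x) = x^2\) and so \(F^{-1}(u) = \sqrt{u}\), and compares the sample mean with \(2/3\).

```python
import random

# Simulate X with pdf f(x) = 2x on (0, 1): F(x) = x^2, so X = sqrt(U) for U ~ Uniform(0, 1).
random.seed(0)
n = 1_000_000
sample_mean = sum(random.random() ** 0.5 for _ in range(n)) / n

print(sample_mean)   # close to 2/3 = 0.6667
```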
For general continuous random variables, we modify Definition 4.2.
Definition 4.4 (Definition 4.1.4: Mean of General Continuous Random Variable) Let \(X\) be a continuous random variable whose pdf is \(f\). Suppose that at least one of the following integrals is finite:
\[ \int_{0}^{\infty}xf(x)dx, \; \int_{-\infty}^{0}xf(x)dx. \tag{4.5}\]
Then the mean, expectation, or expected value of \(X\) is said to exist and is defined to be
\[ \mathbb{E}[X] = \int_{-\infty}^{\infty}xf(x)dx. \tag{4.6}\]
If both of the integrals in Equation 4.5 are infinite, then \(\mathbb{E}[X]\) does not exist.
Example 4.7 (Example 4.1.7: Failure after Warranty) A product has a warranty of one year. Let \(X\) be the time at which the product fails. Suppose that \(X\) has a continuous distribution with the pdf

\[ f(x) = \begin{cases} 0 &\text{for }x < 1, \\ \frac{2}{x^3} &\text{for }x \geq 1. \end{cases} \]

The expected time to failure is then

\[ \mathbb{E}[X] = \int_{1}^{\infty}x\frac{2}{x^3}dx = \int_{1}^{\infty}\frac{2}{x^2}dx = 2. \]
Example 4.8 (Example 4.1.8: A Mean That Does Not Exist) Suppose that a random variable \(X\) has a continuous distribution for which the pdf is as follows:

\[ f(x) = \frac{1}{\pi(1+x^2)} \quad\text{for }-\infty < x < \infty. \tag{4.7}\]

This distribution is called the Cauchy distribution. We can verify the fact that \(\int_{-\infty}^{\infty}f(x)\,dx = 1\) by using the following standard result from elementary calculus:

\[ \frac{d}{dx}\tan^{-1}x = \frac{1}{1+x^2} \quad\text{for }-\infty < x < \infty. \]

The two integrals in Equation 4.5 are

\[ \int_{0}^{\infty}\frac{x}{\pi(1+x^2)}dx = \infty \text{ and }\int_{-\infty}^{0}\frac{x}{\pi(1+x^2)}dx = -\infty; \]

hence, the mean of \(X\) does not exist.
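Because the Cauchy mean does not exist, sample averages of Cauchy draws do not settle down as the sample size grows. The sketch below (Python; the seed and sample sizes are arbitrary) generates Cauchy variates by the inverse-cdf method, \(X = \tan(\pi(U − 1/2))\), and prints sample means, which keep jumping around rather than converging.

```python
import math
import random

# Cauchy draws via the inverse cdf: F^{-1}(u) = tan(pi * (u - 1/2)).
random.seed(1)

def cauchy():
    return math.tan(math.pi * (random.random() - 0.5))

for n in (100, 10_000, 1_000_000):
    xs = [cauchy() for _ in range(n)]
    print(n, sum(xs) / n)
# Unlike distributions with a finite mean, these averages do not stabilize
# as n increases; occasional huge draws keep shifting them.
```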
4.1.3 Interpretation of the Expectation
Relation of the Mean to the Center of Gravity: The expectation of a random variable or, equivalently, the mean of its distribution can be regarded as being the center of gravity of that distribution. To illustrate this concept, consider, for example, the pmf sketched in ?fig-4-1. The \(x\)-axis may be regarded as a long weightless rod to which weights are attached. If a weight equal to \(f(x_j)\) is attached to this rod at each point \(x_j\), then the rod will be balanced if it is supported at the point \(\mathbb{E}[X]\).
Now consider the pdf sketched in ?fig-4-2. In this case, the \(x\)-axis may be regarded as a long rod over which the mass varies continuously. If the density of the rod at each point \(x\) is equal to \(f(x)\), then the center of gravity of the rod will be located at the point \(\mathbb{E}[X]\), and the rod will be balanced if it is supported at that point.
It can be seen from this discussion that the mean of a distribution can be affected greatly by even a very small change in the amount of probability that is assigned to a large value of \(x\). For example, the mean of the distribution represented by the pmf in ?fig-4-1 can be moved to any specified point on the \(x\)-axis, no matter how far from the origin that point may be, by removing an arbitrarily small but positive amount of probability from one of the points \(x_j\) and adding this amount of probability at a point far enough from the origin.
Suppose now that the pmf or pdf \(f\) of some distribution is symmetric with respect to a given point \(x_0\) on the \(x\)-axis. In other words, suppose that \(f(x_0 + \delta) = f(x_0 − \delta)\) for all values of \(\delta\). Also assume that the mean \(\mathbb{E}[X]\) of this distribution exists. In accordance with the interpretation that the mean is at the center of gravity, it follows that \(\mathbb{E}[X]\) must be equal to \(x_0\), which is the point of symmetry. The following example emphasizes the fact that it is necessary to make certain that the mean \(\mathbb{E}[X]\) exists before it can be concluded that \(\mathbb{E}[X] = x_0\).
Example 4.9 (Example 4.1.9: The Cauchy Distribution) Consider again the pdf specified by Equation 4.7, which is sketched in ?fig-4-3. This pdf is symmetric with respect to the point \(x = 0\). Therefore, if the mean of the Cauchy distribution existed, its value would have to be 0. However, we saw in Example 4.8 that the mean of \(X\) does not exist.
The reason for the nonexistence of the mean of the Cauchy distribution is as follows: When the curve \(y = f(x)\) is sketched as in ?fig-4-3, its tails approach the \(x\)-axis rapidly enough to permit the total area under the curve to be equal to 1. On the other hand, if each value of \(f(x)\) is multiplied by \(x\) and the curve \(y = xf(x)\) is sketched, as in ?fig-4-4, the tails of this curve approach the \(x\)-axis so slowly that the total area between the \(x\)-axis and each part of the curve is infinite.
4.1.4 The Expectation of a Function
Example 4.10 (Example 4.1.10: Failure Rate and Time to Failure) Suppose that appliances manufactured by a particular company fail at a rate of \(X\) per year, where \(X\) is currently unknown and hence is a random variable. If we are interested in predicting how long such an appliance will last before failure, we might use the mean of \(1/X\). How can we calculate the mean of \(Y = 1/X\)?
Functions of a Single Random Variable: If \(X\) is a random variable for which the pdf is \(f\), then the expectation of each real-valued function \(r(X)\) can be found by applying the definition of expectation to the distribution of \(r(X)\) as follows: Let \(Y = r(X)\), determine the probability distribution of \(Y\), and then determine \(\mathbb{E}[Y]\) by applying either Equation 4.1 or Equation 4.4. For example, suppose that \(Y\) has a continuous distribution with the pdf \(g\). Then

\[ \mathbb{E}[r(X)] = \mathbb{E}[Y] = \int_{-\infty}^{\infty}yg(y)dy, \tag{4.8}\]

if the expectation exists.
Example 4.11 (Example 4.1.11: Failure Rate and Time to Failure) In Example 4.10, suppose that the pdf of \(X\) is

\[ f(x) = \begin{cases} 3x^2 &\text{if }0 < x < 1, \\ 0 &\text{otherwise.} \end{cases} \]

Let \(r(x) = 1/x\). Using the methods of Sec. 3.8, we can find the pdf of \(Y = r(X)\) as

\[ g(y) = \begin{cases} 3y^{-4} &\text{if }y > 1, \\ 0 &\text{otherwise.} \end{cases} \]

The mean of \(Y\) is then

\[ \mathbb{E}[Y] = \int_{1}^{\infty}y\,3y^{-4}dy = \frac{3}{2}. \]

Although the method of Example 4.11 can be used to find the mean of a continuous random variable, it is not actually necessary to determine the pdf of \(r(X)\) in order to calculate the expectation \(\mathbb{E}[r(X)]\). In fact, it can be shown that the value of \(\mathbb{E}[r(X)]\) can always be calculated directly using the following result.

Theorem 4.1.1 (Law of the Unconscious Statistician) Let \(X\) be a random variable, and let \(r\) be a real-valued function of a real variable. If \(X\) has a continuous distribution, then

\[ \mathbb{E}[r(X)] = \int_{-\infty}^{\infty}r(x)f(x)dx, \tag{4.9}\]

if the mean exists. If \(X\) has a discrete distribution, then

\[ \mathbb{E}[r(X)] = \sum_{\text{All }x}r(x)f(x), \tag{4.10}\]

if the mean exists.

Proof. A general proof will not be given here. However, we shall provide a proof for two special cases. First, suppose that the distribution of \(X\) is discrete. Then the distribution of \(Y\) must also be discrete. Let \(g\) be the pmf of \(Y\). For this case,

\[ \sum_{y}yg(y) = \sum_{y}y\Pr[r(X) = y] = \sum_{y}y\sum_{x:r(x)=y}f(x) = \sum_{y}\sum_{x:r(x)=y}r(x)f(x) = \sum_{x}r(x)f(x). \]

Hence, Equation 4.10 yields the same value as one would obtain from Definition 4.1 applied to \(Y\). Second, suppose that the distribution of \(X\) is continuous. Suppose also, as in Sec. 3.8, that \(r(x)\) is either strictly increasing or strictly decreasing with differentiable inverse \(s(y)\). Then, if we change variables in Equation 4.9 from \(x\) to \(y = r(x)\),

\[ \int_{-\infty}^{\infty}r(x)f(x)dx = \int_{-\infty}^{\infty}yf[s(y)]\left|\frac{ds(y)}{dy}\right|dy. \]

It now follows from Eq. (3.8.3) that the right side of this equation is equal to \(\int_{-\infty}^{\infty}yg(y)dy\). Hence, Equations 4.8 and 4.9 yield the same value.

Theorem 4.1.1 is called the law of the unconscious statistician because many people treat Equations 4.9 and 4.10 as the definition of \(\mathbb{E}[r(X)]\) and forget that the definition of the mean of \(Y = r(X)\) is given in Definition 4.2 and Definition 4.4.

Example 4.12 (Example 4.1.12: Failure Rate and Time to Failure) In Example 4.11, we can apply Theorem 4.1.1 to find

\[ \mathbb{E}[Y] = \int_{0}^{1}\frac{1}{x}3x^2dx = \frac{3}{2}, \]

the same result we got in Example 4.11.

Example 4.13 (Example 4.1.13: Determining the Expectation of \(X^{1/2}\)) Suppose that the pdf of \(X\) is as given in Example 4.6 and that \(Y = X^{1/2}\). Then, by Equation 4.9,

\[ \mathbb{E}[Y] = \int_{0}^{1}x^{1/2}(2x)dx = 2\int_{0}^{1}x^{3/2}dx = \frac{4}{5}. \]

Note: In General, \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\): In Example 4.13, the mean of \(X^{1/2}\) is \(4/5\). The mean of \(X\) was computed in Example 4.6 as \(2/3\). Note that \(4/5 \neq (2/3)^{1/2}\). In fact, unless \(g\) is a linear function, it is generally the case that \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\). A linear function \(g\) does satisfy \(\mathbb{E}[g(X)] = g(\mathbb{E}[X])\), as we shall see in Theorem 4.2.1.
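Theorem 4.1.1 is easy to check numerically for Examples 4.12 and 4.13: we can integrate \(r(x)f(x)\) directly, without ever writing down the pdf of \(Y = r(X)\). A small sketch is shown below (Python; the midpoint-rule approximation and the helper name `expect` are our own choices).

```python
# Check E[r(X)] = integral of r(x) f(x) dx for f(x) = 3x^2 (Example 4.12)
# and f(x) = 2x (Example 4.13) on (0, 1), using a simple midpoint rule.
def expect(r, f, a, b, n=100_000):
    h = (b - a) / n
    return sum(r(a + (i + 0.5) * h) * f(a + (i + 0.5) * h) for i in range(n)) * h

print(expect(lambda x: 1 / x,    lambda x: 3 * x ** 2, 0.0, 1.0))  # about 1.5
print(expect(lambda x: x ** 0.5, lambda x: 2 * x,      0.0, 1.0))  # about 0.8
print(expect(lambda x: x,        lambda x: 2 * x,      0.0, 1.0))  # about 2/3 = E[X]
# Note that 0.8 != (2/3) ** 0.5, illustrating E[g(X)] != g(E[X]) for nonlinear g.
```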
Example 4.14 (Example 4.1.14: Option Pricing) Suppose that common stock in the up-and-coming company A is currently priced at $200 per share. As an incentive to get you to work for company A, you might be offered an option to buy a certain number of shares of the stock, one year from now, at a price of $200. This could be quite valuable if you believed that the stock was very likely to rise in price over the next year. For simplicity, suppose that the price \(X\) of the stock one year from now is a discrete random variable that can take only two values (in dollars): 260 and 180. Let \(p\) be the probability that \(X = 260\).

You want to calculate the value of these stock options, either because you contemplate the possibility of selling them or because you want to compare company A’s offer to what other companies are offering. Let \(Y\) be the value of the option for one share when it expires in one year. Since nobody would pay $200 for the stock if the price \(X\) is less than $200, the value of the stock option is 0 if \(X = 180\). If \(X = 260\), one could buy the stock for $200 per share and then immediately sell it for $260. This brings in a profit of $60 per share. (For simplicity, we shall ignore dividends and the transaction costs of buying and selling stocks.) Then \(Y = h(X)\), where

\[ h(x) = \begin{cases} 0 &\text{if }x = 180, \\ 60 &\text{if }x = 260. \end{cases} \]

Assume that an investor could earn 4% risk-free on any money invested for this same year. (Assume that the 4% includes any compounding.) If no other investment options were available, a fair cost of the option would then be what is called the present value of \(\mathbb{E}[Y]\) in one year. This equals the value \(c\) such that \(\mathbb{E}[Y] = 1.04c\). That is, the expected value of the option equals the amount of money the investor would have after one year without buying the option. We can find \(\mathbb{E}[Y]\) easily:

\[ \mathbb{E}[Y] = 0 \times (1 − p) + 60 \times p = 60p. \]

So, the fair price of an option to buy one share would be \(c = 60p/1.04 = 57.69p\).

How should one determine the probability \(p\)? There is a standard method used in the finance industry for choosing \(p\) in this example. That method is to assume that the present value of the mean of \(X\) (the stock price in one year) is equal to the current value of the stock price. That is, assume that the expected value of buying one share of stock and waiting one year to sell is the same as the result of investing the current cost of the stock risk-free for one year (multiplying by 1.04 in this example). In our example, this means \(\mathbb{E}[X] = 200 \times 1.04\). Since \(\mathbb{E}[X] = 260p + 180(1 − p)\), we set \(200 \times 1.04 = 260p + 180(1 − p)\) and obtain \(p = 0.35\). The resulting price of an option to buy one share for $200 in one year would be \(\$57.69 \times 0.35 = \$20.19\). This price is called the risk-neutral price of the option. One can prove (see Exercise 14 in this section) that any price other than $20.19 for the option would lead to unpleasant consequences in the market.
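The risk-neutral computation in Example 4.14 reduces to two lines of arithmetic: solve for \(p\) from \(\mathbb{E}[X] = 200 \times 1.04\), then discount \(\mathbb{E}[Y] = 60p\). A sketch of that arithmetic follows (Python; the numbers are the ones used in the example).

```python
# Risk-neutral option price from Example 4.14.
current_price, rate = 200.0, 0.04
up, down, strike = 260.0, 180.0, 200.0

# Choose p so that E[X] equals the risk-free growth of the current price.
p = (current_price * (1 + rate) - down) / (up - down)        # 0.35

expected_payoff = p * max(up - strike, 0) + (1 - p) * max(down - strike, 0)  # 60p = 21
option_price = expected_payoff / (1 + rate)                   # about 20.19

print(p, expected_payoff, option_price)
```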
Functions of Several Random Variables

Example 4.15 (Example 4.1.15: The Expectation of a Function of Two Variables) Let \(X\) and \(Y\) have a joint pdf, and suppose that we want the mean of \(X^2 + Y^2\). The most straightforward but most difficult way to do this would be to use the methods of Sec. 3.9 to find the distribution of \(Z = X^2 + Y^2\) and then apply the definition of mean to \(Z\).

There is a version of Theorem 4.1.1 for functions of more than one random variable. Its proof is not given here.

Theorem 4.1.2 (Law of the Unconscious Statistician) Suppose that \(X_1, \ldots, X_n\) are random variables with the joint pdf \(f(x_1, \ldots, x_n)\). Let \(r\) be a real-valued function of \(n\) real variables, and suppose that \(Y = r(X_1, \ldots, X_n)\). Then \(\mathbb{E}[Y]\) can be determined directly from the relation

\[ \mathbb{E}[Y] = \int\cdots\int_{\mathbb{R}^n}r(x_1, \ldots, x_n)f(x_1, \ldots, x_n)dx_1\cdots dx_n, \]

if the mean exists. Similarly, if \(X_1, \ldots, X_n\) have a discrete joint distribution with pmf \(f(x_1, \ldots, x_n)\), the mean of \(Y = r(X_1, \ldots, X_n)\) is

\[ \mathbb{E}[Y] = \sum_{\text{All }x_1, \ldots, x_n}r(x_1, \ldots, x_n)f(x_1, \ldots, x_n), \]

if the mean exists.

Example 4.16 (Example 4.1.16: Determining the Expectation of a Function of Two Variables) Suppose that a point \((X, Y)\) is chosen at random from the square \(S\) containing all points \((x, y)\) such that \(0 \leq x \leq 1\) and \(0 \leq y \leq 1\). We shall determine the expected value of \(X^2 + Y^2\). Since \(X\) and \(Y\) have the uniform distribution over the square \(S\), and since the area of \(S\) is 1, the joint pdf of \(X\) and \(Y\) is

\[ f(x, y) = \begin{cases} 1 &\text{for }(x, y) \in S, \\ 0 &\text{otherwise.} \end{cases} \]

Therefore,

\[ \mathbb{E}[X^2 + Y^2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x^2 + y^2)f(x, y)dx\,dy = \int_{0}^{1}\int_{0}^{1}(x^2 + y^2)dx\,dy = \frac{2}{3}. \]

Note: More General Distributions: In Example 3.2.7, we introduced a type of distribution that was neither discrete nor continuous. It is possible to define expectations for such distributions also. The definition is rather cumbersome, and we shall not pursue it here.

Summary: The expectation, expected value, or mean of a random variable is a summary of its distribution. If the probability distribution is thought of as a distribution of mass along the real line, then the mean is the center of mass. The mean of a function \(r\) of a random variable \(X\) can be calculated directly from the distribution of \(X\) without first finding the distribution of \(r(X)\). Similarly, the mean of a function of a random vector \(X\) can be calculated directly from the distribution of \(X\).
4.1.5 Exercises
1. Suppose that \(X\) has the uniform distribution on the interval \([a, b]\). Find the mean of \(X\).
2. If an integer between 1 and 100 is to be chosen at random, what is the expected value?
3. In a class of 50 students, the number of students \(n_i\) of each age \(i\) is shown in the following table:

    | Age \(i\) | \(n_i\) |
    |---|---|
    | 18 | 20 |
    | 19 | 22 |
    | 20 | 4 |
    | 21 | 3 |
    | 25 | 1 |

    If a student is to be selected at random from the class, what is the expected value of his age?
4. Suppose that one word is to be selected at random from the sentence the girl put on her beautiful red hat. If \(X\) denotes the number of letters in the word that is selected, what is the value of \(\mathbb{E}[X]\)?
5. Suppose that one letter is to be selected at random from the 30 letters in the sentence given in Exercise 4. If \(Y\) denotes the number of letters in the word in which the selected letter appears, what is the value of \(\mathbb{E}[Y]\)?
6. Suppose that a random variable \(X\) has a continuous distribution with the pdf \(f\) given in Example 4.6. Find the expectation of \(1/X\).
7. Suppose that a random variable \(X\) has the uniform distribution on the interval \([0, 1]\). Show that the expectation of \(1/X\) is infinite.
8. Suppose that \(X\) and \(Y\) have a continuous joint distribution for which the joint pdf is as follows:

    \[ f(x, y) = \begin{cases} 12y^2 &\text{for }0 \leq y \leq x \leq 1, \\ 0 &\text{otherwise.} \end{cases} \]

    Find the value of \(\mathbb{E}[XY]\).
9. Suppose that a point is chosen at random on a stick of unit length and that the stick is broken into two pieces at that point. Find the expected value of the length of the longer piece.
10. Suppose that a particle is released at the origin of the \(xy\)-plane and travels into the half-plane where \(x > 0\). Suppose that the particle travels in a straight line and that the angle between the positive half of the \(x\)-axis and this line is \(\alpha\), which can be either positive or negative. Suppose, finally, that the angle \(\alpha\) has the uniform distribution on the interval \([−\pi/2, \pi/2]\). Let \(Y\) be the ordinate of the point at which the particle hits the vertical line \(x = 1\). Show that the distribution of \(Y\) is a Cauchy distribution.
11. Suppose that the random variables \(X_1, \ldots, X_n\) form a random sample of size \(n\) from the uniform distribution on the interval \([0, 1]\). Let \(Y_1 = \min\{X_1, \ldots, X_n\}\), and let \(Y_n = \max\{X_1, \ldots, X_n\}\). Find \(\mathbb{E}[Y_1]\) and \(\mathbb{E}[Y_n]\).
12. Suppose that the random variables \(X_1, \ldots, X_n\) form a random sample of size \(n\) from a continuous distribution for which the cdf is \(F\), and let the random variables \(Y_1\) and \(Y_n\) be defined as in Exercise 11. Find \(\mathbb{E}[F(Y_1)]\) and \(\mathbb{E}[F(Y_n)]\).
13. A stock currently sells for $110 per share. Let the price of the stock at the end of a one-year period be \(X\), which will take one of the values $100 or $300. Suppose that you have the option to buy shares of this stock at $150 per share at the end of that one-year period. Suppose that money could earn 5.8% risk-free over that one-year period. Find the risk-neutral price for the option to buy one share.
14. Consider the situation of pricing a stock option as in Example 4.14. We want to prove that a price other than $20.19 for the option to buy one share in one year for $200 would be unfair in some way.

    a. Suppose that an investor (who has several shares of the stock already) makes the following transactions. She buys three more shares of the stock at $200 per share and sells four options for $20.19 each. The investor must borrow the extra $519.24 necessary to make these transactions at 4% for the year. At the end of the year, our investor might have to sell four shares for $200 each to the person who bought the options. In any event, she sells enough stock to pay back the amount borrowed plus the 4 percent interest. Prove that the investor has the same net worth (within rounding error) at the end of the year as she would have had without making these transactions, no matter what happens to the stock price. (A combination of stocks and options that produces no change in net worth is called a risk-free portfolio.)
    b. Consider the same transactions as in part (a), but this time suppose that the option price is \(\$x\) where \(x < 20.19\). Prove that our investor loses \(|4.16x − 84|\) dollars of net worth no matter what happens to the stock price.
    c. Consider the same transactions as in part (a), but this time suppose that the option price is \(\$x\) where \(x > 20.19\). Prove that our investor gains \(4.16x − 84\) dollars of net worth no matter what happens to the stock price.

    The situations in parts (b) and (c) are called arbitrage opportunities. Such opportunities rarely exist for any length of time in financial markets. Imagine what would happen if the three shares and four options were changed to three million shares and four million options.
15. In Example 4.14, we showed how to price an option to buy one share of a stock at a particular price at a particular time in the future. This type of option is called a call option. A put option is an option to sell a share of a stock at a particular price \(\$y\) at a particular time in the future. (If you don’t own any shares when you wish to exercise the option, you can always buy one at the market price and then sell it for \(\$y\).) The same sort of reasoning as in Example 4.14 could be used to price a put option. Consider the same stock as in Example 4.14 whose price in one year is \(X\) with the same distribution as in the example and the same risk-free interest rate. Find the risk-neutral price for an option to sell one share of that stock in one year at a price of $220.
16. Let \(Y\) be a discrete random variable whose pmf is the function \(f\) in Example 4.4. Let \(X = |Y|\). Prove that the distribution of \(X\) has the pmf in Example 4.5.
4.2 Properties of Expectations
In this section, we present some results that simplify the calculation of expectations for some common functions of random variables.
4.2.1 Basic Theorems
Suppose that \(X\) is a random variable for which the expectation \(\mathbb{E}[X]\) exists. We shall present several results pertaining to the basic properties of expectations.
Theorem 4.2.1 (Linear Function) If \(Y = aX + b\), where \(a\) and \(b\) are finite constants, then
\[ \mathbb{E}[Y] = a\mathbb{E}[X] + b. \]

Proof. We first shall assume, for convenience, that \(X\) has a continuous distribution for which the pdf is \(f\). Then
\[ \mathbb{E}[Y] = \mathbb{E}[aX + b] = \int_{-\infty}^{\infty}(ax + b)f(x)dx = a\int_{-\infty}^{\infty}xf(x)dx + b\int_{-\infty}^{\infty}f(x)dx = a\mathbb{E}[X] + b. \]
A similar proof can be given for a discrete distribution.

Example 4.17 (Example 4.2.1: Calculating the Expectation of a Linear Function) Suppose that \(\mathbb{E}[X] = 5\). Then
\[ \mathbb{E}[3X − 5] = 3\mathbb{E}[X] − 5 = 10 \]
and
\[ \mathbb{E}[−3X + 15] = −3\mathbb{E}[X] + 15 = 0. \]

The following result follows from Theorem 4.2.1 with \(a = 0\).

Corollary 4.2.1 If \(X = c\) with probability 1, then \(\mathbb{E}[X] = c\).

Example 4.18 (Example 4.2.2: Investment) An investor is trying to choose between two possible stocks to buy for a three-month investment. One stock costs $50 per share and has a rate of return of \(R_1\) dollars per share for the three-month period, where \(R_1\) is a random variable. The second stock costs $30 per share and has a rate of return of \(R_2\) per share for the same three-month period. The investor has a total of $6000 to invest. For this example, suppose that the investor will buy shares of only one stock. (In Example 4.19, we shall consider strategies in which the investor buys more than one stock.) Suppose that \(R_1\) has the uniform distribution on the interval \([−10, 20]\) and that \(R_2\) has the uniform distribution on the interval \([−4.5, 10]\).

We shall first compute the expected dollar value of investing in each of the two stocks. For the first stock, the $6000 will purchase 120 shares, so the return will be \(120R_1\), whose mean is \(120\mathbb{E}[R_1] = 600\). (Solve Exercise 1 in Section 4.1.5 to see why \(\mathbb{E}[R_1] = 5\).) For the second stock, the $6000 will purchase 200 shares, so the return will be \(200R_2\), whose mean is \(200\mathbb{E}[R_2] = 550\). The first stock has a higher expected return.

In addition to calculating expected return, we should also ask which of the two investments is riskier. We shall now compute the value at risk (VaR) at probability level 0.97 for each investment. (See Example 3.3.7.) VaR will be the negative of the \(1 − 0.97 = 0.03\) quantile for the return on each investment. For the first stock, the return \(120R_1\) has the uniform distribution on the interval \([−1200, 2400]\) (see Exercise 14 in Sec. 3.8), whose 0.03 quantile is (according to Example 3.3.8) \(0.03 \times 2400 + 0.97 \times (−1200) = −1092\). So VaR \(= 1092\). For the second stock, the return \(200R_2\) has the uniform distribution on the interval \([−900, 2000]\), whose 0.03 quantile is \(0.03 \times 2000 + 0.97 \times (−900) = −813\). So VaR \(= 813\). Even though the first stock has higher expected return, the second stock seems to be slightly less risky in terms of VaR. How should we balance risk and expected return to choose between the two purchases? One way to answer this question is illustrated in Example 4.8.10, after we learn about utility.
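The mean returns and VaR values in Example 4.18 can be reproduced directly from the uniform distributions involved. The sketch below (Python; the helper functions are our own and use only the closed-form mean and quantile of a uniform distribution) repeats those calculations.

```python
# Mean return and value at risk (VaR) at level 0.97 for the two stocks in Example 4.18.
def uniform_mean(a, b):
    return (a + b) / 2

def uniform_quantile(a, b, p):
    return a + p * (b - a)

budget = 6000
for cost, (a, b) in [(50, (-10, 20)), (30, (-4.5, 10))]:
    shares = budget / cost
    mean_return = shares * uniform_mean(a, b)
    var_97 = -uniform_quantile(shares * a, shares * b, 0.03)  # return ~ Uniform[shares*a, shares*b]
    print(shares, mean_return, var_97)
# Prints 120 shares with mean 600 and VaR 1092, then 200 shares with mean 550 and VaR 813.
```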
Theorem 4.2.2 If there exists a constant \(a\) such that \(\Pr(X \geq a) = 1\), then \(\mathbb{E}[X] \geq a\). If there exists a constant \(b\) such that \(\Pr(X \leq b) = 1\), then \(\mathbb{E}[X] \leq b\).

Proof. We shall assume again, for convenience, that \(X\) has a continuous distribution for which the pdf is \(f\), and we shall suppose first that \(\Pr(X \geq a) = 1\). Because \(X\) is bounded below, the second integral in Equation 4.5 is finite. Then
\[ \mathbb{E}[X] = \int_{-\infty}^{\infty}xf(x)dx = \int_{a}^{\infty}xf(x)dx \geq \int_{a}^{\infty}af(x)dx = a\Pr(X \geq a) = a. \]
The proof of the other part of the theorem and the proof for a discrete distribution are similar.

It follows from Theorem 4.2.2 that if \(\Pr(a \leq X \leq b) = 1\), then \(a \leq \mathbb{E}[X] \leq b\).

Theorem 4.2.3 Suppose that \(\mathbb{E}[X] = a\) and that either \(\Pr(X \geq a) = 1\) or \(\Pr(X \leq a) = 1\). Then \(\Pr(X = a) = 1\).

Proof. We shall provide a proof for the case in which \(X\) has a discrete distribution and \(\Pr(X \geq a) = 1\). The other cases are similar. Let \(x_1, x_2, \ldots\) include every value \(x > a\) such that \(\Pr(X = x) > 0\), if any. Let \(p_0 = \Pr(X = a)\). Then,
\[ \mathbb{E}[X] = p_0a + \sum_{j=1}^{\infty}x_j\Pr(X = x_j). \tag{4.11}\]
Each \(x_j\) in the sum on the right side of Equation 4.11 is greater than \(a\). If we replace all of the \(x_j\)’s by \(a\), the sum can’t get larger, and hence
\[ \mathbb{E}[X] \geq p_0a + \sum_{j=1}^{\infty}a\Pr(X = x_j) = a. \tag{4.12}\]
Furthermore, the inequality in Equation 4.12 will be strict if there is even one \(x > a\) with \(\Pr(X = x) > 0\). This contradicts \(\mathbb{E}[X] = a\). Hence, there can be no \(x > a\) such that \(\Pr(X = x) > 0\).

Theorem 4.2.4 If \(X_1, \ldots, X_n\) are \(n\) random variables such that each expectation \(\mathbb{E}[X_i]\) is finite (\(i = 1, \ldots, n\)), then
\[ \mathbb{E}[X_1 + \ldots + X_n] = \mathbb{E}[X_1] + \ldots + \mathbb{E}[X_n]. \]

Proof. We shall first assume that \(n = 2\) and also, for convenience, that \(X_1\) and \(X_2\) have a continuous joint distribution for which the joint pdf is \(f\). Then
\[ \begin{aligned} \mathbb{E}[X_1 + X_2] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x_1 + x_2)f(x_1, x_2)dx_1dx_2 \\ &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}x_1f(x_1, x_2)dx_1dx_2 + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}x_2f(x_1, x_2)dx_1dx_2 \\ &= \int_{-\infty}^{\infty}x_1f_1(x_1)dx_1 + \int_{-\infty}^{\infty}x_2f_2(x_2)dx_2 = \mathbb{E}[X_1] + \mathbb{E}[X_2], \end{aligned} \]
where \(f_1\) and \(f_2\) are the marginal pdfs of \(X_1\) and \(X_2\). The proof for a discrete distribution is similar. Finally, the theorem can be established for each positive integer \(n\) by an induction argument.

It should be emphasized that, in accordance with Theorem 4.2.4, the expectation of the sum of several random variables always equals the sum of their individual expectations, regardless of what their joint distribution is. Even though the joint pdf of \(X_1\) and \(X_2\) appeared in the proof of Theorem 4.2.4, only the marginal pdfs figured into the calculation of \(\mathbb{E}[X_1 + X_2]\). The next result follows easily from Theorems 4.2.1 and 4.2.4.

Corollary 4.2.2 Assume that \(\mathbb{E}[X_i]\) is finite for \(i = 1, \ldots, n\). For all constants \(a_1, \ldots, a_n\) and \(b\),
\[ \mathbb{E}[a_1X_1 + \ldots + a_nX_n + b] = a_1\mathbb{E}[X_1] + \ldots + a_n\mathbb{E}[X_n] + b. \]

Example 4.19 (Example 4.2.3: Investment Portfolio) Suppose that the investor with $6000 in Example 4.18 can buy shares of both of the two stocks. Suppose that the investor buys \(s_1\) shares of the first stock at $50 per share and \(s_2\) shares of the second stock at $30 per share. Such a combination of investments is called a portfolio. Ignoring possible problems with fractional shares, the values of \(s_1\) and \(s_2\) must satisfy \(50s_1 + 30s_2 = 6000\) in order to invest the entire $6000. The return on this portfolio will be \(s_1R_1 + s_2R_2\). The mean return will be \(s_1\mathbb{E}[R_1] + s_2\mathbb{E}[R_2] = 5s_1 + 2.75s_2\). For example, if \(s_1 = 54\) and \(s_2 = 110\), then the mean return is 572.5.
Example 4.20 (Example 4.2.4: Sampling without Replacement) Suppose that a box contains red balls and blue balls and that the proportion of red balls in the box is \(p\) (\(0 \leq p \leq 1\)). Suppose that \(n\) balls are selected from the box at random without replacement, and let \(X\) denote the number of red balls that are selected. We shall determine the value of \(\mathbb{E}[X]\).

We shall begin by defining \(n\) random variables \(X_1, \ldots, X_n\) as follows: For \(i = 1, \ldots, n\), let \(X_i = 1\) if the \(i\)th ball that is selected is red, and let \(X_i = 0\) if the \(i\)th ball is blue. Since the \(n\) balls are selected without replacement, the random variables \(X_1, \ldots, X_n\) are dependent. However, the marginal distribution of each \(X_i\) can be derived easily (see Exercise 10 of Sec. 1.7). We can imagine that all the balls are arranged in the box in some random order, and that the first \(n\) balls in this arrangement are selected. Because of randomness, the probability that the \(i\)th ball in the arrangement will be red is simply \(p\). Hence, for \(i = 1, \ldots, n\),
\[ \Pr(X_i = 1) = p \text{ and } \Pr(X_i = 0) = 1 − p. \tag{4.13}\]
Therefore, \(\mathbb{E}[X_i] = 1(p) + 0(1 − p) = p\). From the definition of \(X_1, \ldots, X_n\), it follows that \(X_1 + \ldots + X_n\) is equal to the total number of red balls that are selected. Therefore, \(X = X_1 + \ldots + X_n\) and, by Theorem 4.2.4,
\[ \mathbb{E}[X] = \mathbb{E}[X_1] + \ldots + \mathbb{E}[X_n] = np. \tag{4.14}\]
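Even though the draws in Example 4.20 are dependent, the expectation of the sum is still the sum of the expectations. The simulation sketch below (Python; the box composition, sample size, and number of repetitions are arbitrary choices) estimates \(\mathbb{E}[X]\) for sampling without replacement and compares it with \(np\).

```python
import random

# Sampling without replacement from a box with 6 red and 4 blue balls (p = 0.6), n = 5 draws.
random.seed(2)
box = [1] * 6 + [0] * 4     # 1 = red, 0 = blue
n, trials = 5, 100_000

total_red = sum(sum(random.sample(box, n)) for _ in range(trials))
print(total_red / trials)    # close to n * p = 5 * 0.6 = 3.0
```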
Note: In General, \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\): Theorems 4.2.1 and 4.2.4 imply that if \(g\) is a linear function of a random vector \(X\), then \(\mathbb{E}[g(X)] = g(\mathbb{E}[X])\). For a nonlinear function \(g\), we have already seen Example 4.13, in which \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\). Jensen’s inequality (Theorem 4.2.5) gives a relationship between \(\mathbb{E}[g(X)]\) and \(g(\mathbb{E}[X])\) for another special class of functions.

Definition 4.5 (Definition 4.2.1: Convex Functions) A function \(g\) of a vector argument is convex if, for every \(\alpha \in (0, 1)\) and every \(x\) and \(y\),
\[ g[\alpha x + (1 − \alpha)y] \leq \alpha g(x) + (1 − \alpha)g(y). \]

The proof of Theorem 4.2.5 is not given, but one special case is left to the reader in Exercise 13.

Theorem 4.2.5 (Jensen’s Inequality) Let \(g\) be a convex function, and let \(X\) be a random vector with finite mean. Then \(\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])\).

Example 4.21 (Example 4.2.5: Sampling with Replacement) Suppose again that in a box containing red balls and blue balls, the proportion of red balls is \(p\) (\(0 \leq p \leq 1\)). Suppose now, however, that a random sample of \(n\) balls is selected from the box with replacement. If \(X\) denotes the number of red balls in the sample, then \(X\) has the binomial distribution with parameters \(n\) and \(p\), as described in Sec. 3.1. We shall now determine the value of \(\mathbb{E}[X]\).

As before, for \(i = 1, \ldots, n\), let \(X_i = 1\) if the \(i\)th ball that is selected is red, and let \(X_i = 0\) otherwise. Then, as before, \(X = X_1 + \ldots + X_n\). In this problem, the random variables \(X_1, \ldots, X_n\) are independent, and the marginal distribution of each \(X_i\) is again given by Equation 4.13. Therefore, \(\mathbb{E}[X_i] = p\) for \(i = 1, \ldots, n\), and it follows from Theorem 4.2.4 that
\[ \mathbb{E}[X] = np. \tag{4.15}\]
Thus, the mean of the binomial distribution with parameters \(n\) and \(p\) is \(np\). The pmf \(f(x)\) of this binomial distribution is given by Eq. (3.1.4), and the mean can be computed directly from the pmf as follows:
\[ \mathbb{E}[X] = \sum_{x=0}^{n}x\binom{n}{x}p^xq^{n−x}. \tag{4.16}\]
Hence, by Equation 4.15, the value of the sum in Equation 4.16 must be \(np\).

It is seen from Equations 4.14 and 4.15 that the expected number of red balls in a sample of \(n\) balls is \(np\), regardless of whether the sample is selected with or without replacement. However, the distribution of the number of red balls is different depending on whether sampling is done with or without replacement (for \(n > 1\)). For example, \(\Pr(X = n)\) is always smaller in Example 4.20, where sampling is done without replacement, than in Example 4.21, where sampling is done with replacement, if \(n > 1\). (See Exercise 27 in Sec. 4.9.)

Example 4.22 (Example 4.2.6: Expected Number of Matches) Suppose that a person types \(n\) letters, types the addresses on \(n\) envelopes, and then places each letter in an envelope in a random manner. Let \(X\) be the number of letters that are placed in the correct envelopes. We shall find the mean of \(X\). (In Sec. 1.10, we did a more difficult calculation with this same example.) For \(i = 1, \ldots, n\), let \(X_i = 1\) if the \(i\)th letter is placed in the correct envelope, and let \(X_i = 0\) otherwise. Then, for \(i = 1, \ldots, n\),
\[ \Pr(X_i = 1) = \frac{1}{n} \text{ and } \Pr(X_i = 0) = 1 − \frac{1}{n}. \]
Therefore, \(\mathbb{E}[X_i] = 1/n\) for \(i = 1, \ldots, n\). Since \(X = X_1 + \ldots + X_n\), it follows that
\[ \mathbb{E}[X] = \mathbb{E}[X_1] + \ldots + \mathbb{E}[X_n] = \frac{1}{n} + \ldots + \frac{1}{n} = 1. \]
Thus, the expected value of the number of correct matches of letters and envelopes is 1, regardless of the value of \(n\).
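The surprising conclusion of Example 4.22, that the expected number of matches is 1 for every \(n\), is easy to check by simulation. Below is a short sketch (Python; the values of \(n\) and the number of repetitions are arbitrary).

```python
import random

# Expected number of letters placed in their correct envelopes (Example 4.22).
random.seed(3)

def average_matches(n, trials=20_000):
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        total += sum(1 for i, j in enumerate(perm) if i == j)
    return total / trials

for n in (2, 5, 20, 100):
    print(n, average_matches(n))   # each average is close to 1
```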
Expectation of a Product of Independent Random Variables

Theorem 4.2.6 If \(X_1, \ldots, X_n\) are \(n\) independent random variables such that each expectation \(\mathbb{E}[X_i]\) is finite (\(i = 1, \ldots, n\)), then
\[ \mathbb{E}\left[\prod_{i=1}^{n}X_i\right] = \prod_{i=1}^{n}\mathbb{E}[X_i]. \]

Proof. We shall again assume, for convenience, that \(X_1, \ldots, X_n\) have a continuous joint distribution for which the joint pdf is \(f\). Also, we shall let \(f_i\) denote the marginal pdf of \(X_i\) (\(i = 1, \ldots, n\)). Then, since the variables \(X_1, \ldots, X_n\) are independent, it follows that at every point \((x_1, \ldots, x_n) \in \mathbb{R}^n\),
\[ f(x_1, \ldots, x_n) = \prod_{i=1}^{n}f_i(x_i). \]
Therefore,
\[ \begin{aligned} \mathbb{E}\left[\prod_{i=1}^{n}X_i\right] &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\left(\prod_{i=1}^{n}x_i\right)f(x_1, \ldots, x_n)dx_1\cdots dx_n \\ &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\prod_{i=1}^{n}x_if_i(x_i)dx_1\cdots dx_n = \prod_{i=1}^{n}\int_{-\infty}^{\infty}x_if_i(x_i)dx_i = \prod_{i=1}^{n}\mathbb{E}[X_i]. \end{aligned} \]
The proof for a discrete distribution is similar.

The difference between Theorem 4.2.4 and Theorem 4.2.6 should be emphasized. If it is assumed that each expectation is finite, the expectation of the sum of a group of random variables is always equal to the sum of their individual expectations. However, the expectation of the product of a group of random variables is not always equal to the product of their individual expectations. If the random variables are independent, then this equality will also hold.

Example 4.23 (Example 4.2.7: Calculating the Expectation of a Combination of Random Variables) Suppose that \(X_1\), \(X_2\), and \(X_3\) are independent random variables such that \(\mathbb{E}[X_i] = 0\) and \(\mathbb{E}[X_i^2] = 1\) for \(i = 1, 2, 3\). We shall determine the value of \(\mathbb{E}[X_1^2(X_2 − 4X_3)^2]\). Since \(X_1\), \(X_2\), and \(X_3\) are independent, it follows that the two random variables \(X_1^2\) and \((X_2 − 4X_3)^2\) are also independent. Therefore,
\[ \begin{aligned} \mathbb{E}[X_1^2(X_2 − 4X_3)^2] &= \mathbb{E}[X_1^2]\,\mathbb{E}[(X_2 − 4X_3)^2] = \mathbb{E}[X_2^2 − 8X_2X_3 + 16X_3^2] \\ &= \mathbb{E}[X_2^2] − 8\mathbb{E}[X_2X_3] + 16\mathbb{E}[X_3^2] = 1 − 8\mathbb{E}[X_2]\mathbb{E}[X_3] + 16 = 17. \end{aligned} \]

Example 4.24 (Example 4.2.8: Repeated Filtering) A filtration process removes a random proportion of particulates in water to which it is applied. Suppose that a sample of water is subjected to this process twice. Let \(X_1\) be the proportion of the particulates that are removed by the first pass. Let \(X_2\) be the proportion of what remains after the first pass that is removed by the second pass. Assume that \(X_1\) and \(X_2\) are independent random variables with common pdf \(f(x) = 4x^3\) for \(0 < x < 1\) and \(f(x) = 0\) otherwise. Let \(Y\) be the proportion of the original particulates that remain in the sample after two passes. Then \(Y = (1 − X_1)(1 − X_2)\). Because \(X_1\) and \(X_2\) are independent, so too are \(1 − X_1\) and \(1 − X_2\). Since \(1 − X_1\) and \(1 − X_2\) have the same distribution, they have the same mean, call it \(\mu\). It follows that \(Y\) has mean \(\mu^2\). We can find \(\mu\) as
\[ \mu = \mathbb{E}[1 − X_1] = \int_{0}^{1}(1 − x_1)4x_1^3dx_1 = 1 − \frac{4}{5} = 0.2. \]
It follows that \(\mathbb{E}[Y] = 0.2^2 = 0.04\).

Expectation for Nonnegative Distributions

Theorem 4.2.7 (Integer-Valued Random Variables) Let \(X\) be a random variable that can take only the values \(0, 1, 2, \ldots\). Then
\[ \mathbb{E}[X] = \sum_{n=1}^{\infty}\Pr(X \geq n). \tag{4.17}\]

Proof. First, we can write
\[ \mathbb{E}[X] = \sum_{n=0}^{\infty}n\Pr(X = n) = \sum_{n=1}^{\infty}n\Pr(X = n). \tag{4.18}\]
Next, consider the following triangular array of probabilities:
\[ \begin{array}{cccc} \Pr(X = 1) & \Pr(X = 2) & \Pr(X = 3) & \cdots \\ & \Pr(X = 2) & \Pr(X = 3) & \cdots \\ & & \Pr(X = 3) & \cdots \\ & & & \ddots \end{array} \]
We can compute the sum of all the elements in this array in two different ways because all of the summands are nonnegative. First, we can add the elements in each column of the array and then add these column totals. Thus, we obtain the value \(\sum_{n=1}^{\infty}n\Pr(X = n)\). Second, we can add the elements in each row of the array and then add these row totals. In this way we obtain the value \(\sum_{n=1}^{\infty}\Pr(X \geq n)\). Therefore,
\[ \sum_{n=1}^{\infty}n\Pr(X = n) = \sum_{n=1}^{\infty}\Pr(X \geq n). \]
Equation 4.17 now follows from Equation 4.18.

Example 4.25 (Example 4.2.9: Expected Number of Trials) Suppose that a person repeatedly tries to perform a certain task until he is successful. Suppose also that the probability of success on each given trial is \(p\) (\(0 < p < 1\)) and that all trials are independent. If \(X\) denotes the number of the trial on which the first success is obtained, then \(\mathbb{E}[X]\) can be determined as follows. Since at least one trial is always required, \(\Pr(X \geq 1) = 1\). Also, for \(n = 2, 3, \ldots\), at least \(n\) trials will be required if and only if none of the first \(n − 1\) trials results in success. Therefore,
\[ \Pr(X \geq n) = (1 − p)^{n−1}. \]
By Equation 4.17, it follows that
\[ \mathbb{E}[X] = 1 + (1 − p) + (1 − p)^2 + \ldots = \frac{1}{1 − (1 − p)} = \frac{1}{p}. \]
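Theorem 4.2.7 can be checked directly for the distribution in Example 4.25: truncating the tail sum \(\sum_{n \geq 1}\Pr(X \geq n) = \sum_{n \geq 1}(1 − p)^{n−1}\) gives a number close to \(1/p\). A sketch follows (Python; the value of \(p\) and the truncation point are arbitrary).

```python
# Tail-sum formula for the number of trials until the first success (Example 4.25).
p = 0.3
tail_sum = sum((1 - p) ** (n - 1) for n in range(1, 1000))   # Pr(X >= n) = (1-p)^(n-1)

print(tail_sum)      # about 3.333
print(1 / p)         # exactly 1/p = 3.333...
```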
Theorem 4.2.7 has a more general version that applies to all nonnegative random variables.

Theorem 4.2.8 (General Nonnegative Random Variable) Let \(X\) be a nonnegative random variable with cdf \(F\). Then
\[ \mathbb{E}[X] = \int_{0}^{\infty}[1 − F(x)]dx. \tag{4.19}\]
The proof of Theorem 4.2.8 is left to the reader in Exercises 1 and 2 in Sec. 4.9.

Example 4.26 (Example 4.2.10: Expected Waiting Time) Let \(X\) be the time that a customer spends waiting for service in a queue. Suppose that the cdf of \(X\) is
\[ F(x) = \begin{cases} 0 &\text{if }x \leq 0, \\ 1 − e^{−2x} &\text{if }x > 0. \end{cases} \]
Then the mean of \(X\) is
\[ \mathbb{E}[X] = \int_{0}^{\infty}e^{−2x}dx = \frac{1}{2}. \]

Summary: The mean of a linear function of a random vector is the linear function of the mean. In particular, the mean of a sum is the sum of the means. As an example, the mean of the binomial distribution with parameters \(n\) and \(p\) is \(np\). No such relationship holds in general for nonlinear functions. For independent random variables, the mean of the product is the product of the means.

4.2.2 Exercises

1. Suppose that the return \(R\) (in dollars per share) of a stock has the uniform distribution on the interval \([−3, 7]\). Suppose also that each share of the stock costs $1.50. Let \(Y\) be the net return (total return minus cost) on an investment of 10 shares of the stock. Compute \(\mathbb{E}[Y]\).
2. Suppose that three random variables \(X_1, X_2, X_3\) form a random sample from a distribution for which the mean is 5. Determine the value of \(\mathbb{E}[2X_1 − 3X_2 + X_3 − 4]\).
3. Suppose that three random variables \(X_1, X_2, X_3\) form a random sample from the uniform distribution on the interval \([0, 1]\). Determine the value of \(\mathbb{E}[(X_1 − 2X_2 + X_3)^2]\).
4. Suppose that the random variable \(X\) has the uniform distribution on the interval \([0, 1]\), that the random variable \(Y\) has the uniform distribution on the interval \([5, 9]\), and that \(X\) and \(Y\) are independent. Suppose also that a rectangle is to be constructed for which the lengths of two adjacent sides are \(X\) and \(Y\). Determine the expected value of the area of the rectangle.
5. Suppose that the variables \(X_1, \ldots, X_n\) form a random sample of size \(n\) from a given continuous distribution on the real line for which the pdf is \(f\). Find the expectation of the number of observations in the sample that fall within a specified interval \(a \leq x \leq b\).
6. Suppose that a particle starts at the origin of the real line and moves along the line in jumps of one unit. For each jump, the probability is \(p\) (\(0 \leq p \leq 1\)) that the particle will jump one unit to the left and the probability is \(1 − p\) that the particle will jump one unit to the right. Find the expected value of the position of the particle after \(n\) jumps.
7. Suppose that on each play of a certain game a gambler is equally likely to win or to lose. Suppose that when he wins, his fortune is doubled, and that when he loses, his fortune is cut in half. If he begins playing with a given fortune \(c\), what is the expected value of his fortune after \(n\) independent plays of the game?
8. Suppose that a class contains 10 boys and 15 girls, and suppose that eight students are to be selected at random from the class without replacement. Let \(X\) denote the number of boys that are selected, and let \(Y\) denote the number of girls that are selected. Find \(\mathbb{E}[X − Y]\).
9. Suppose that the proportion of defective items in a large lot is \(p\), and suppose that a random sample of \(n\) items is selected from the lot. Let \(X\) denote the number of defective items in the sample, and let \(Y\) denote the number of nondefective items. Find \(\mathbb{E}[X − Y]\).
10. Suppose that a fair coin is tossed repeatedly until a head is obtained for the first time. (a) What is the expected number of tosses that will be required? (b) What is the expected number of tails that will be obtained before the first head is obtained?
11. Suppose that a fair coin is tossed repeatedly until exactly \(k\) heads have been obtained. Determine the expected number of tosses that will be required. Hint: Represent the total number of tosses \(X\) in the form \(X = X_1 + \ldots + X_k\), where \(X_i\) is the number of tosses required to obtain the \(i\)th head after \(i − 1\) heads have been obtained.
12. Suppose that the two return random variables \(R_1\) and \(R_2\) in Examples 4.18 and 4.19 are independent. Consider the portfolio at the end of Example 4.19 with \(s_1 = 54\) shares of the first stock and \(s_2 = 110\) shares of the second stock.

    a. Prove that the change in value \(X\) of the portfolio has the pdf

        \[ f(x) = \begin{cases} 3.87 \times 10^{-7}(x + 1035) &\text{if }-1035 < x < 560, \\ 6.1728 \times 10^{-4} &\text{if }560 \leq x \leq 585, \\ 3.87 \times 10^{-7}(2180 − x) &\text{if }585 < x < 2180, \\ 0 &\text{otherwise.} \end{cases} \]

        Hint: Look at Example 3.9.5.
    b. Find the value at risk (VaR) at probability level 0.97 for the portfolio.
13. Prove the special case of Theorem 4.2.5 in which the function \(g\) is twice continuously differentiable and \(X\) is one-dimensional. You may assume that a twice continuously differentiable convex function has nonnegative second derivative. Hint: Expand \(g(X)\) around its mean using Taylor’s theorem with remainder. Taylor’s theorem with remainder says that if \(g(x)\) has two continuous derivatives \(g'\) and \(g''\) at \(x = x_0\), then there exists \(y\) between \(x_0\) and \(x\) such that

    \[ g(x) = g(x_0) + (x − x_0)g'(x_0) + \frac{(x − x_0)^2}{2}g''(y). \]
4.3 Variance
Although the mean of a distribution is a useful summary, it does not convey very much information about the distribution. For example, a random variable \(X\) with mean 2 has the same mean as the constant random variable \(Y\) such that \(\Pr(Y = 2) = 1\) even if \(X\) is not constant. To distinguish the distribution of \(X\) from the distribution of \(Y\) in this case, it might be useful to give some measure of how spread out the distribution of \(X\) is. The variance of \(X\) is one such measure. The standard deviation of \(X\) is the square root of the variance. The variance also plays an important role in the approximation methods that arise in ?sec-6.
Example 4.27 (Example 4.3.1: Stock Price Changes) Consider the prices \(A\) and \(B\) of two stocks at a time one month in the future. Assume that \(A\) has the uniform distribution on the interval \([25, 35]\) and \(B\) has the uniform distribution on the interval \([15, 45]\). It is easy to see (from Exercise 1 in Section 4.1.5) that both stocks have a mean price of 30. But the distributions are very different. For example, \(A\) will surely be worth at least 25 while \(\Pr(B < 25) = 1/3\). But \(B\) has more upside potential also. The pdfs of these two random variables are plotted in Figure 4.5.

Figure 4.5: The pdfs of two uniform distributions in Example 4.27. Both distributions have mean equal to 30, but they are spread out differently.

Definitions of the Variance and the Standard Deviation

Although the two random prices in Example 4.27 have the same mean, price \(B\) is more spread out than price \(A\), and it would be good to have a summary of the distribution that makes this easy to see.

Definition 4.6 (Definition 4.3.1: Variance/Standard Deviation) Let \(X\) be a random variable with finite mean \(\mu = \mathbb{E}[X]\). The variance of \(X\), denoted by \(\operatorname{Var}(X)\), is defined as follows:
\[ \operatorname{Var}(X) = \mathbb{E}[(X − \mu)^2]. \tag{4.20}\]
If \(X\) has infinite mean or if the mean of \(X\) does not exist, we say that \(\operatorname{Var}(X)\) does not exist. The standard deviation of \(X\) is the nonnegative square root of \(\operatorname{Var}(X)\) if the variance exists.

If the expectation in Equation 4.20 is infinite, we say that \(\operatorname{Var}(X)\) and the standard deviation of \(X\) are infinite. When only one random variable is being discussed, it is common to denote its standard deviation by the symbol \(\sigma\), and the variance is denoted by \(\sigma^2\). When more than one random variable is being discussed, the name of the random variable is included as a subscript to the symbol \(\sigma\), e.g., \(\sigma_X\) would be the standard deviation of \(X\) while \(\sigma_Y^2\) would be the variance of \(Y\).

Example 4.28 (Example 4.3.2: Stock Price Changes) Return to the two random variables \(A\) and \(B\) in Example 4.27. Using Theorem 4.1.1, we can compute
\[ \operatorname{Var}(A) = \int_{25}^{35}(a − 30)^2\frac{1}{10}da = \frac{1}{10}\int_{-5}^{5}x^2dx = \frac{1}{10}\left.\frac{x^3}{3}\right|_{x=-5}^{5} = \frac{25}{3}, \]
\[ \operatorname{Var}(B) = \int_{15}^{45}(b − 30)^2\frac{1}{30}db = \frac{1}{30}\int_{-15}^{15}y^2dy = \frac{1}{30}\left.\frac{y^3}{3}\right|_{y=-15}^{15} = 75. \]
So, \(\operatorname{Var}(B)\) is nine times as large as \(\operatorname{Var}(A)\). The standard deviations of \(A\) and \(B\) are \(\sigma_A = 2.87\) and \(\sigma_B = 8.66\).

Note: Variance Depends Only on the Distribution: The variance and standard deviation of a random variable \(X\) depend only on the distribution of \(X\), just as the expectation of \(X\) depends only on the distribution. Indeed, everything that can be computed from the pmf or pdf depends only on the distribution. Two random variables with the same distribution will have the same variance, even if they have nothing to do with each other.

Example 4.29 (Example 4.3.3: Variance and Standard Deviation of a Discrete Distribution) Suppose that a random variable \(X\) can take each of the five values \(−2\), \(0\), \(1\), \(3\), and \(4\) with equal probability. We shall determine the variance and standard deviation of \(X\). In this example,
\[ \mathbb{E}[X] = \frac{1}{5}(−2 + 0 + 1 + 3 + 4) = 1.2. \]
Let \(\mu = \mathbb{E}[X] = 1.2\), and define \(W = (X − \mu)^2\). Then \(\operatorname{Var}(X) = \mathbb{E}[W]\). We can easily compute the pmf \(f\) of \(W\):

| \(x\) | \(−2\) | \(0\) | \(1\) | \(3\) | \(4\) |
|---|---|---|---|---|---|
| \(w\) | 10.24 | 1.44 | 0.04 | 3.24 | 7.84 |
| \(f(w)\) | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |

It follows that
\[ \operatorname{Var}(X) = \mathbb{E}[W] = \frac{1}{5}[10.24 + 1.44 + 0.04 + 3.24 + 7.84] = 4.56. \]
The standard deviation of \(X\) is the square root of the variance, namely, 2.135.

There is an alternative method for calculating the variance of a distribution, which is often easier to use.
Theorem 4.3.1 (Alternative Method for Calculating the Variance) For every random variable \(X\),
\[ \operatorname{Var}(X) = \mathbb{E}[X^2] − [\mathbb{E}[X]]^2. \]

Proof. Let \(\mathbb{E}[X] = \mu\). Then
\[ \operatorname{Var}(X) = \mathbb{E}[(X − \mu)^2] = \mathbb{E}[X^2 − 2\mu X + \mu^2] = \mathbb{E}[X^2] − 2\mu\mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] − \mu^2. \]

Example 4.30 (Example 4.3.4: Variance of a Discrete Distribution) Once again, consider the random variable \(X\) in Example 4.29, which takes each of the five values \(−2\), \(0\), \(1\), \(3\), and \(4\) with equal probability. We shall use Theorem 4.3.1 to compute \(\operatorname{Var}(X)\). In Example 4.29, we computed the mean of \(X\) as \(\mu = 1.2\). To use Theorem 4.3.1, we need
\[ \mathbb{E}[X^2] = \frac{1}{5}[(−2)^2 + 0^2 + 1^2 + 3^2 + 4^2] = 6. \]
Because \(\mathbb{E}[X] = 1.2\), Theorem 4.3.1 says that \(\operatorname{Var}(X) = 6 − (1.2)^2 = 4.56\), which agrees with the calculation in Example 4.29.
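Both routes to the variance in Examples 4.29 and 4.30, the definition \(\mathbb{E}[(X − \mu)^2]\) and the shortcut \(\mathbb{E}[X^2] − \mu^2\), can be checked with a few lines of code. The sketch below (Python; the lists simply encode the pmf from the examples) computes both.

```python
# Variance of the distribution that puts probability 1/5 on each of -2, 0, 1, 3, 4.
values = [-2, 0, 1, 3, 4]
probs = [0.2] * 5

mu = sum(x * p for x, p in zip(values, probs))                      # 1.2
var_def = sum((x - mu) ** 2 * p for x, p in zip(values, probs))     # E[(X - mu)^2]
var_alt = sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2  # E[X^2] - mu^2

print(mu, var_def, var_alt)   # 1.2 4.56 4.56
```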
The variance (as well as the standard deviation) of a distribution provides a measure of the spread or dispersion of the distribution around its mean \(\mu\). A small value of the variance indicates that the probability distribution is tightly concentrated around \(\mu\); a large value of the variance typically indicates that the probability distribution has a wide spread around \(\mu\). However, the variance of a distribution, as well as its mean, can be made arbitrarily large by placing even a very small but positive amount of probability far enough from the origin on the real line.

Example 4.31 (Example 4.3.5: Slight Modification of a Bernoulli Distribution) Let \(X\) be a discrete random variable with the following pmf:
\[ f(x) = \begin{cases} 0.5 &\text{if }x = 0, \\ 0.499 &\text{if }x = 1, \\ 0.001 &\text{if }x = 10{,}000, \\ 0 &\text{otherwise.} \end{cases} \]
There is a sense in which the distribution of \(X\) differs very little from the Bernoulli distribution with parameter 0.5. However, the mean and variance of \(X\) are quite different from the mean and variance of the Bernoulli distribution with parameter 0.5. Let \(Y\) have the Bernoulli distribution with parameter 0.5. In Example 4.3, we computed the mean of \(Y\) as \(\mathbb{E}[Y] = 0.5\). Since \(Y^2 = Y\), \(\mathbb{E}[Y^2] = \mathbb{E}[Y] = 0.5\), so \(\operatorname{Var}(Y) = 0.5 − 0.5^2 = 0.25\). The means of \(X\) and \(X^2\) are also straightforward calculations:
\[ \mathbb{E}[X] = 0.5 \times 0 + 0.499 \times 1 + 0.001 \times 10{,}000 = 10.499, \]
\[ \mathbb{E}[X^2] = 0.5 \times 0^2 + 0.499 \times 1^2 + 0.001 \times 10{,}000^2 = 100{,}000.499. \]
So \(\operatorname{Var}(X) = 99{,}890.27\). The mean and variance of \(X\) are much larger than the mean and variance of \(Y\).

Properties of the Variance

We shall now present several theorems that state basic properties of the variance. In these theorems we shall assume that the variances of all the random variables exist. The first theorem concerns the possible values of the variance.

Theorem 4.3.2 For each \(X\), \(\operatorname{Var}(X) \geq 0\). If \(X\) is a bounded random variable, then \(\operatorname{Var}(X)\) must exist and be finite.

Proof. Because \(\operatorname{Var}(X)\) is the mean of a nonnegative random variable \((X − \mu)^2\), it must be nonnegative according to Theorem 4.2.2. If \(X\) is bounded, then the mean exists, and hence the variance exists. Furthermore, if \(X\) is bounded then so too is \((X − \mu)^2\), so the variance must be finite.

The next theorem shows that the variance of a random variable \(X\) cannot be 0 unless the entire probability distribution of \(X\) is concentrated at a single point.

Theorem 4.3.3 \(\operatorname{Var}(X) = 0\) if and only if there exists a constant \(c\) such that \(\Pr(X = c) = 1\).

Proof. Suppose first that there exists a constant \(c\) such that \(\Pr(X = c) = 1\). Then \(\mathbb{E}[X] = c\), and \(\Pr[(X − c)^2 = 0] = 1\). Therefore, \(\operatorname{Var}(X) = \mathbb{E}[(X − c)^2] = 0\). Conversely, suppose that \(\operatorname{Var}(X) = 0\). Then \(\Pr[(X − \mu)^2 \geq 0] = 1\) but \(\mathbb{E}[(X − \mu)^2] = 0\). It follows from Theorem 4.2.3 that \(\Pr[(X − \mu)^2 = 0] = 1\). Hence, \(\Pr(X = \mu) = 1\).

Figure 4.6: The pdf of a random variable \(X\) together with the pdfs of \(X + 3\) and \(−X\). Note that the spreads of all three distributions appear the same.

Theorem 4.3.4 For constants \(a\) and \(b\), let \(Y = aX + b\). Then \(\operatorname{Var}(Y) = a^2\operatorname{Var}(X)\), and \(\sigma_Y = |a|\sigma_X\).

Proof. If \(\mathbb{E}[X] = \mu\), then \(\mathbb{E}[Y] = a\mu + b\) by Theorem 4.2.1. Therefore,
\[ \operatorname{Var}(Y) = \mathbb{E}[(aX + b − a\mu − b)^2] = \mathbb{E}[(aX − a\mu)^2] = a^2\mathbb{E}[(X − \mu)^2] = a^2\operatorname{Var}(X). \]
Taking the square root of \(\operatorname{Var}(Y)\) yields \(|a|\sigma_X\).

It follows from Theorem 4.3.4 that \(\operatorname{Var}(X + b) = \operatorname{Var}(X)\) for every constant \(b\). This result is intuitively plausible, since shifting the entire distribution of \(X\) a distance of \(b\) units along the real line will change the mean of the distribution by \(b\) units but the shift will not affect the dispersion of the distribution around its mean. Figure 4.6 shows the pdf of a random variable \(X\) together with the pdf of \(X + 3\) to illustrate how a shift of the distribution does not affect the spread. Similarly, it follows from Theorem 4.3.4 that \(\operatorname{Var}(−X) = \operatorname{Var}(X)\). This result also is intuitively plausible, since reflecting the entire distribution of \(X\) with respect to the origin of the real line will result in a new distribution that is the mirror image of the original one. The mean will be changed from \(\mu\) to \(−\mu\), but the total dispersion of the distribution around its mean will not be affected. Figure 4.6 shows the pdf of a random variable \(X\) together with the pdf of \(−X\) to illustrate how a reflection of the distribution does not affect the spread.

Example 4.32 (Example 4.3.6: Calculating the Variance and Standard Deviation of a Linear Function) Consider the same random variable \(X\) as in Example 4.29, which takes each of the five values \(−2\), \(0\), \(1\), \(3\), and \(4\) with equal probability. We shall determine the variance and standard deviation of \(Y = 4X − 7\). In Example 4.29, we computed the mean of \(X\) as \(\mu = 1.2\) and the variance as 4.56. By Theorem 4.3.4,
\[ \operatorname{Var}(Y) = 16\operatorname{Var}(X) = 72.96. \]
Also, the standard deviation of \(Y\) is
\[ \sigma_Y = 4\sigma_X = 4(4.56)^{1/2} = 8.54. \]

The next theorem provides an alternative method for calculating the variance of a sum of independent random variables.
Suppose that R1 has a distribution with mean 6 and variance 55, while R2 has mean 4 and variance 28. Suppose that the first stock costs $60 per share and the second costs \(48 per share. Suppose that money can also be invested at a fixed rate of 3.6 percent per year. The portfolio will consist of s1 shares of the first stock, s2 shares of the second stock, and all remaining money (\)s3) invested at the fixed rate. The return on this portfolio will be s1R1 + s2R2 + 0.036s3, where the coefficients are constrained by 60s1 + 48s2 + s3 = 100,000, (4.3.2) 4.3 Variance 231 Figure 4.7 The set of all means and variances of investment portfolios in Example 4.3.7. The solid vertical line shows the range of possible variances for portfoloios with a mean of 7000. 0 4000 5000 6000 7000 8000 9000 10,000 Efficient portfolio with mean 7000 Efficient portfolios Range of variances Mean of portfolio return Variance of portfolio return 1.5 108 1 108 5 107 2.55 107 as well as s1, s2, s3 ≥ 0. For now, we shall assume that R1 and R2 are independent. The mean and the variance of the return on the portfolio will be E(s1R1 + s2R2 + 0.036s3) = 6s1 + 4s2 + 0.036s3, Var(s1R1 + s2R2 + 0.036s3) = 55s2 1 + 28s2 2. One method for comparing a class of portfolios is to say that portfolio A is at least as good as portfolio B if the mean return for A is at least as large as the mean return for B and if the variance for A is no larger than the variance of B. (See Markowitz, 1987, for a classic treatment of such methods.) The reason for preferring smaller variance is that large variance is associated with large deviations from the mean, and for portfolios with a common mean, some of the large deviations are going to have to be below the mean, leading to the risk of large losses. Figure 4.7 is a plot of the pairs (mean, variance) for all of the possible portfolios in this example. That is, for each (s1, s2, s3) that satisfy (4.3.2), there is a point in the outlined region of Fig. 4.7. The points to the right and toward the bottom are those that have the largest mean return for a fixed variance, and the ones that have the smallest variance for a fixed mean return. These portfolios are called efficient. For example, suppose that the investor would like a mean return of 7000. The vertical line segment above 7000 on the horizontal axis in Fig. 4.7 indicates the possible variances of all portfolios with mean return of 7000.Amongthese, the portfolio with the smallest variance is efficient and is indicated in Fig. 4.7. This portfolio has s1 = 524.7, s2 = 609.7, s3 = 39,250, and variance 2.55× 107. So, every portfolio with mean return greater than 7000 must have variance larger than 2.55× 107, and every portfolio with variance less than 2.55× 107 must have mean return smaller than 7000. The Variance of a Binomial Distribution We shall now consider again the method of generating a binomial distribution presented in Sec. 4.2. Suppose that a box contains red balls and blue balls, and that the proportion of red balls is p (0 ≤ p ≤ 1). Suppose also that a random sample of n balls is selected from the box with replacement. For i = 1, . . . , n, let Xi = 1 if the ith ball that is selected is red, and let Xi = 0 otherwise. If X denotes the total number of red balls in the sample, then X = X1 + . . . + Xn and X will have the binomial distribution with parameters n and p. 232 Chapter 4 Expectation Figure 4.8 Two binomial distributions with the same mean (2.5) but different variances. 0.05 0.10 0.15 0.20 0.25 0.30 x p.f. 
0 2 4 6 8 10 n 5, p 0.5 n 10, p 0.25 Since X1, . . . , Xn are independent, it follows from Theorem 4.3.5 that Var(X) = n i=1 Var(Xi). According to Example 4.1.3, E(Xi) = p for i = 1, . . . , n. Since X2 i = Xi for each i, E(X2 i ) = E(Xi) = p. Therefore, by Theorem 4.3.1, Var(Xi) = E(X2 i ) − [E(Xi)]2 = p − p2 = p(1− p). It now follows that Var(X) = np(1− p). (4.3.3) Figure 4.8 compares two different binomial distributions with the same mean (2.5) but different variances (1.25 and 1.875). One can see how the p.f. of the distribution with the larger variance (n = 10, p = 0.25) is higher at more extreme values and lower at more central values than is the p.f. of the distribution with the smaller variance (n = 5, p = 0.5). Similarly, Fig. 4.5 compares two uniform distributions with the same mean (30) and different variances (8.33 and 75). The same pattern appears, namely that the distribution with larger variance has higher p.d.f. at more extreme values and lower p.d.f. at more central values. Interquartile Range Example 4.3.8 The Cauchy Distribution. In Example 4.1.8, we saw a distribution (the Cauchy distribution) whose mean did not exist, and hence its variance does not exist. But, we might still want to describe how spread out such a distribution is. For example, if X has the Cauchy distribution and Y = 2X, it stands to reason that Y is twice as spread out as X is, but how do we quantify this? There is a measure of spread that exists for every distribution, regardless of whether or not the distribution has a mean or variance. Recall from Definition 3.3.2 that the quantile function for a random variable is the inverse of the c.d.f., and it is defined for every random variable. 4.3 Variance 233 Definition 4.3.2 Interquartile Range (IQR). Let X be a random variable with quantile function F −1(p) for 0<p <1. The interquartile range (IQR) is defined to be F −1(0.75) − F −1(0.25). In words, the IQR is the length of the interval that contains the middle half of the distribution. Example 4.3.9 The Cauchy Distribution. Let X have the Cauchy distribution. The c.d.f. F of X can be found using a trigonometric substitution in the following integral: F(x) = x −∞ dy π(1+ y2) = 1 2 + tan−1(x) π , where tan−1(x) is the principal inverse of the tangent function, taking values from −π/2 to π/2 as x runs from −∞ to ∞. The quantile function of X is then F −1(p) = tan[π(p − 1/2)] for 0<p <1. The IQR is F −1(0.75) − F −1(0.25) = tan(π/4) − tan(−π/4) = 2. It is not difficult to show that, if Y = 2X, then the IQR of Y is 4. (See Exercise 14.) Summary The variance ofX, denoted byVar(X), is the mean of [X − E(X)]2 and measures how spread out the distribution of X is. The variance also equals E(X2) − [E(X)]2. The standard deviation is the square root of the variance. The variance of aX + b, where a and b are constants, is a2 Var(X). The variance of the sum of independent random variables is the sum of the variances. As an example, the variance of the binomial distribution with parameters n and p is np(1− p). The interquartile range (IQR) is the difference between the 0.75 and 0.25 quantiles. The IQR is a measure of spread that exists for every distribution. Exercises 1. Suppose that X has the uniform distribution on the interval [0, 1]. Compute the variance of X. 2. Suppose that one word is selected at random from the sentence the girl put on her beautiful red hat. If X denotes the number of letters in the word that is selected, what is the value of Var(X)? 3. 
For all numbers a and b such that a <b, find the variance of the uniform distribution on the interval [a, b]. 4. Suppose that X is a random variable for which E(X) = μ and Var(X) = σ2. Show that E[X(X − 1)]= μ(μ − 1) + σ2. 5. Let X be a random variable for which E(X) = μ and Var(X) = σ2, and let c be an arbitrary constant. Show that E[(X − c)2]= (μ − c)2 + σ2. 6. Suppose that X and Y are independent random variables whose variances exist and such that E(X) = E(Y). Show that E[(X − Y)2]= Var(X) + Var(Y ). 7. Suppose that X and Y are independent random variables for which Var(X) = Var(Y ) = 3. Find the values of (a) Var(X − Y) and (b) Var(2X − 3Y + 1). 8. Construct an example of a distribution for which the mean is finite but the variance is infinite. 9. Let X have the discrete uniform distribution on the integers 1, . . . , n. Compute the variance of X. Hint: You may wish to use the formula n k=1 k2 = n(n + 1) . (2n + 1)/6. 234 Chapter 4 Expectation 10. Consider the example efficient portfolio at the end of Example 4.3.7. Suppose that Ri has the uniform distribution on the interval [ai, bi] for i = 1, 2. a. Find the two intervals [a1, b1] and [a2, b2]. Hint: The intervals are determined by the means and variances. b. Find the value at risk (VaR) for the example portfolio at probability level 0.97. Hint: Review Example 3.9.5 to see how to find the p.d.f. of the sum of two uniform random variables. 11. Let X have the uniform distribution on the interval [0, 1]. Find the IQR of X. 12. Let X have the p.d.f. f (x) = exp(−x) for x ≥ 0, and f (x) = 0 for x <0. Find the IQR of X. 13. Let X have the binomial distribution with parameters 5 and 0.3. Find the IQR of X. Hint: Return to Example 3.3.9 and Table 3.1.
14. Let \(X\) be a random variable whose interquartile range is \(\eta\). Let \(Y = 2X\). Prove that the interquartile range of \(Y\) is \(2\eta\).
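The identities from this section are easy to spot-check numerically. The sketch below is illustrative only (it assumes NumPy and SciPy are available, and the variable names are ours): it verifies Theorem 4.3.1 and Theorem 4.3.4 for the distribution of Examples 4.3.4 and 4.3.6, Eq. (4.3.3) for a binomial distribution, and the Cauchy IQR from Example 4.3.9.

```python
import numpy as np
from scipy import stats

# Distribution of Examples 4.3.3, 4.3.4, and 4.3.6:
# X takes the values -2, 0, 1, 3, 4, each with probability 1/5.
values = np.array([-2.0, 0.0, 1.0, 3.0, 4.0])
probs = np.full(5, 0.2)

mu = np.sum(values * probs)                       # E(X) = 1.2
var_def = np.sum((values - mu) ** 2 * probs)      # E[(X - mu)^2]
var_alt = np.sum(values ** 2 * probs) - mu ** 2   # E(X^2) - [E(X)]^2, Theorem 4.3.1
print(var_def, var_alt)                           # 4.56 4.56

# Theorem 4.3.4 with Y = 4X - 7: Var(Y) = 16 Var(X) = 72.96.
y = 4 * values - 7
print(np.sum((y - np.sum(y * probs)) ** 2 * probs), 16 * var_def)

# Eq. (4.3.3): the binomial(n, p) variance is n p (1 - p).
n, p = 10, 0.25
print(stats.binom(n, p).var(), n * p * (1 - p))   # 1.875 1.875

# Example 4.3.9: the IQR of the standard Cauchy distribution is 2.
print(stats.cauchy.ppf(0.75) - stats.cauchy.ppf(0.25))  # 2.0
```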
4.4 Moments
For a random variable \(X\), the means of powers \(X^k\) (called moments) for \(k > 2\) have useful theoretical properties, and some of them are used for additional summaries of a distribution. The moment generating function is a related tool that aids in deriving distributions of sums of independent random variables and limiting properties of distributions.
4.4.1 Existence of Moments
For each random variable X and every positive integer k, the expectation E(Xk) is called the kth moment of X. In particular, in accordance with this terminology, the mean of X is the first moment of X. It is said that the kth moment exists if and only if E(|X|k) <∞. If the random variable X is bounded, that is, if there are finite numbers a and b such that Pr(a ≤ X ≤ b) = 1, then all moments of X must necessarily exist. It is possible, however, that all moments ofX exist even thoughX is not bounded. It is shown in the next theorem that if the kth moment of X exists, then all moments of lower order must also exist. Theorem 4.4.1 If E(|X|k) <∞ for some positive integer k, then E(|X|j) <∞ for every positive integer j such that j <k. Proof We shall assume, for convenience, that the distribution of X is continuous and the p.d.f. is f . Then E(|X|j ) = ∞ −∞ |x|jf (x) dx = |x|≤1 |x|jf (x) dx + |x|>1 |x|jf (x) dx ≤ |x|≤1 1 . f (x) dx + |x|>1 |x|kf (x) dx ≤ Pr(|X| ≤ 1) + E(|X|k). By hypothesis, E(|X|k) <∞. It therefore follows that E(|X|j) <∞. A similar proof holds for a discrete or a more general type of distribution. In particular, it follows from Theorem 4.4.1 that if E(X2) <∞, then both the mean of X and the variance of X exist. Theorem 4.4.1 extends to the case in which 4.4 Moments 235 j and k are arbitrary positive numbers rather than just integers. (See Exercise 15 in this section.)We will not make use of such a result in this text, however. Central Moments Suppose that X is a random variable for which E(X) = μ. For every positive integer k, the expectation E[(X − μ)k] is called the kth central moment of X or the kth moment of X about the mean. In particular, in accordance with this terminology, the variance of X is the second central moment of X. For every distribution, the first central moment must be 0 because E(X − μ) = μ − μ = 0. Furthermore, if the distribution of X is symmetric with respect to its mean μ, and if the central moment E[(X − μ)k] exists for a given odd integer k, then the value of E[(X − μ)k] will be 0 because the positive and negative terms in this expectation will cancel one another. Example 4.4.1 A Symmetric p.d.f. Suppose that X has a continuous distribution for which the p.d.f. has the following form: f (x) = ce −(x−3)2/2 for −∞< x <∞. We shall determine the mean of X and all the central moments. It can be shown that for every positive integer k, ∞ −∞ |x|ke −(x−3)2/2 dx <∞. Hence, all the moments ofX exist. Furthermore, since f (x) is symmetric with respect to the point x = 3, then E(X) = 3. Because of this symmetry, it also follows that E[(X − 3)k]= 0 for every odd positive integer k. For even k = 2n, we can find a recursive formula for the sequence of central moments. First, let y = x − μ in all the integral fomulas. Then, for n ≥ 1, the 2nth central moment is m2n = ∞ −∞ y2nce −y2/2dy. Use integration by parts with u = y2n−1 and dv = ye −y2/2dy. It follows that du = (2n − 1)y2n−2dy and v =−e −y2/2. So, m2n = ∞ −∞ udv = uv|∞ y=−∞ − ∞ −∞ vdu = −y2n−1e −y2/2 ∞ y=−∞ + (2n − 1) ∞ −∞ y2n−2ce −y2/2dy = (2n − 1)m2(n−1). Because y0 = 1, m0 is just the integral of the p.d.f.; hence, m0 = 1. It follows that m2n =“n i=1(2i − 1) for n = 1, 2, . . .. So, for example, m2 = 1, m4 = 3, m6 = 15, and so on. Skewness In Example 4.4.1, we saw that the odd central moments are all 0 for a distribution that is symmetric. This leads to the following distributional summary that is used to measure lack of symmetry. Definition 4.4.1 Skewness. 
Let \(X\) be a random variable with mean \(\mu\), standard deviation \(\sigma\), and finite third moment. The skewness of \(X\) is defined to be \(E[(X - \mu)^3]/\sigma^3\).
The reason for dividing the third central moment by \(\sigma^3\) is to make the skewness measure only the lack of symmetry rather than the spread of the distribution.
Example 4.4.2 Skewness of Binomial Distributions. Let \(X\) have the binomial distribution with parameters 10 and 0.25. The p.f. of this distribution appears in Fig. 4.8. It is not difficult to see that the p.f. is not symmetric. The skewness can be computed as follows: First, note that the mean is \(\mu = 10 \times 0.25 = 2.5\) and that the standard deviation is \(\sigma = (10 \times 0.25 \times 0.75)^{1/2} = 1.369\). Second, compute \(E[(X - 2.5)^3] = \sum_{x=0}^{10} (x - 2.5)^3 \binom{10}{x} (0.25)^x (0.75)^{10-x}\), that is, \((0 - 2.5)^3 \binom{10}{0} (0.25)^0 (0.75)^{10} + \cdots + (10 - 2.5)^3\)
10 10 0.2500 0.750 = 0.9375. Finally, the skewness is 0.9375 1.3693 = 0.3652. For comparison, the skewness of the binomial distribution with parameters 10 and 0.2 is 0.4743, and the skewness of the binomial distribution with parameters 10 and 0.3 is 0.2761. The absolute value of the skewness increases as the probability of success moves away from 0.5. It is straightforward to show that the skewness of the binomial distribution with parameters n and p is the negative of the skewness of the binomial distribution with parameters n and 1− p. (See Exercise 16 in this section.) Moment Generating Functions We shall now consider a different way to characterize the distribution of a random variable that is more closely related to its moments than to where its probability is distributed. Definition 4.4.2 Moment Generating Function. Let X be a random variable. For each real number t , define ψ(t) = E(etX). (4.4.1) The function ψ(t) is called the moment generating function (abbreviated m.g.f.) of X. Note: The Moment Generating Function of X Depends Only on the Distribution of X. Since the m.g.f. is the expected value of a function of X, it must depend only on the distribution of X. If X and Y have the same distribution, they must have the same m.g.f. If the random variable X is bounded, then the expectation in Eq. (4.4.1) must be finite for all values of t . In this case, therefore, the m.g.f. of X will be finite for all values of t . On the other hand, if X is not bounded, then the m.g.f. might be finite for some values of t and might not be finite for others. It can be seen from Eq. (4.4.1), however, that for every random variable X, the m.g.f. ψ(t) must be finite at the point t = 0 and at that point its value must be ψ(0) = E(1) = 1. The next result explains how the name “moment generating function” arose. Theorem 4.4.2 LetX be a random variables whose m.g.f. ψ(t) is finite for all values of t in some open interval around the point t = 0. Then, for each integer n > 0, the nth moment of X, 4.4 Moments 237 E(Xn), is finite and equals the nth derivative ψ(n)(t) at t = 0. That is, E(Xn) = ψ(n)(0) for n = 1, 2, . . . . We sketch the proof at the end of this section. Example 4.4.3 Calculating an m.g.f. Suppose that X is a random variable for which the p.d.f. is as follows: f (x) = e −x for x >0, 0 otherwise. We shall determine the m.g.f. of X and also Var(X). For each real number t , ψ(t) = E(etX) = ∞ 0 etxe −x dx = ∞ 0 e(t−1)x dx. The final integral in this equation will be finite if and only if t < 1. Therefore, ψ(t) is finite only for t < 1. For each such value of t , ψ(t) = 1 1− t . Since ψ(t) is finite for all values of t in an open interval around the point t = 0, all moments of X exist. The first two derivatives of ψ are ψ (t) = 1 (1− t)2 and ψ (t) = 2 (1− t)3 . Therefore, E(X) = ψ (0) = 1 and E(X2) = ψ (0) = 2. It now follows that Var(X) = ψ (0) − [ψ (0)]2 = 1. Properties of Moment Generating Functions We shall now present three basic theorems pertaining to moment generating functions. Theorem 4.4.3 Let X be a random variable for which the m.g.f. is ψ1; let Y = aX + b, where a and b are given constants; and let ψ2 denote the m.g.f. of Y . Then for every value of t such that ψ1(at) is finite, ψ2(t) = ebtψ1(at). (4.4.2) Proof By the definition of an m.g.f., ψ2(t) = E(etY ) = E[et (aX+b)]= ebtE(eatX) = ebtψ1(at). Example 4.4.4 Calculating the m.g.f. of a Linear Function. Suppose that the distribution of X is as specified in Example 4.4.3. We saw that the m.g.f. of X for t < 1 is ψ1(t) = 1 1− t . 
If Y = 3 − 2X, then the m.g.f. of Y is finite for t > −1/2 and will have the value ψ2(t) = e3tψ1(−2t) = e3t 1+ 2t . 238 Chapter 4 Expectation The next theorem shows that the m.g.f. of the sum of an arbitrary number of independent random variables has a very simple form. Because of this property, the m.g.f. is an important tool in the study of such sums. Theorem 4.4.4 Suppose that X1, . . . , Xn are n independent random variables; and for i = 1, . . . , n, letψi denote the m.g.f. ofXi . Let Y = X1+ . . . + Xn, and let the m.g.f. of Y be denoted by ψ. Then for every value of t such that ψi(t) is finite for i = 1, . . . , n, ψ(t) = !n i=1 ψi(t ). (4.4.3) Proof By definition, ψ(t) = E(etY ) = E[et (X1+…+Xn)]= E !n i=1 etXi . Since the random variables X1, . . . ,Xn are independent, it follows from Theorem 4.2.6 that E !n i=1 etXi = !n i=1 E(etXi ). Hence, ψ(t) = !n i=1 ψi(t ). The Moment Generating Function for the Binomial Distribution Suppose that a random variable X has the binomial distribution with parameters n and p. In Sections 4.2 and 4.3, the mean and the variance ofX were determined by representing X as the sum of n independent random variables X1, . . . , Xn. In this representation, the distribution of each variable Xi is as follows: Pr(Xi = 1) = p and Pr(Xi = 0) = 1− p. We shall now use this representation to determine the m.g.f. of X = X1 + . . . + Xn. Since each of the random variables X1, . . . , Xn has the same distribution, the m.g.f. of each variable will be the same. For i = 1, . . . , n, the m.g.f. of Xi is ψi(t) = E(etXi) = (et) Pr(Xi = 1) + (1) Pr(Xi = 0) = pet + 1− p. It follows from Theorem 4.4.4 that the m.g.f. of X in this case is ψ(t) = (pet + 1− p)n. (4.4.4) Uniqueness of Moment Generating Functions We shall now state one more important property of the m.g.f. The proof of this property is beyond the scope of this book and is omitted. Theorem 4.4.5 If the m.g.f.’s of two random variables X1 and X2 are finite and identical for all values of t in an open interval around the point t = 0, then the probability distributions of X1 and X2 must be identical. 4.4 Moments 239 Theorem 4.4.5 is the justification for the claim made at the start of this discussion, namely, that the m.g.f. is another way to characterize the distribution of a random variable. The Additive Property of the Binomial Distribution Moment generating functions provide a simple way to derive the distribution of the sum of two independent binomial random variables with the same second parameter. Theorem 4.4.6 If X1 and X2 are independent random variables, and if Xi has the binomial distribution with parameters ni and p (i = 1, 2), then X1 + X2 has the binomial distribution with parameters n1 + n2 and p. Proof L et ψi denote the m.g.f. of Xi for i = 1, 2. It follows from Eq. (4.4.4) that ψi(t) = (pet + 1− p)ni . Let ψ denote the m.g.f. of X1 + X2. Then, by Theorem 4.4.4, ψ(t) = (pet + 1− p)n1+n2. It can be seen from Eq. (4.4.4) that this function ψ is the m.g.f. of the binomial distribution with parameters n1+ n2 and p. Hence, byTheorem 4.4.5, the distribution of X1 + X2 must be that binomial distribution. Sketch of the Proof of Theorem 4.4.2 First, we indicate why all moments of X are finite. Let t > 0 be such that both ψ(t) and ψ(−t) are finite. Define g(x) = etx + e −tx. Notice that E[g(X)]= ψ(t) + ψ(−t) <∞. (4.4.5) On every bounded interval of x values, g(x) is bounded. For each integer n > 0, as |x|→∞, g(x) is eventually larger than |x|n. It follows from these facts and (4.4.5) that E|Xn| <∞. 
Although it is beyond the scope of this book, it can be shown that the derivative \(\psi'(t)\) exists at the point \(t = 0\), and that at \(t = 0\) the derivative of the expectation in Eq. (4.4.1) must equal the expectation of the derivative. Thus,
\[ \psi'(0) = \left.\frac{d}{dt}E(e^{tX})\right|_{t=0} = E\!\left[\left.\frac{d}{dt}e^{tX}\right|_{t=0}\right]. \]
But
\[ \left.\frac{d}{dt}e^{tX}\right|_{t=0} = \left.(Xe^{tX})\right|_{t=0} = X. \]
It follows that \(\psi'(0) = E(X)\). In other words, the derivative of the m.g.f. \(\psi(t)\) at \(t = 0\) is the mean of \(X\). Furthermore, it can be shown that it is possible to differentiate \(\psi(t)\) an arbitrary number of times at the point \(t = 0\). For \(n = 1, 2, \ldots\), the \(n\)th derivative \(\psi^{(n)}(0)\) at \(t = 0\) will satisfy the following relation: \(\psi^{(n)}(0) = \)
dn dtn etX t=0 = E[(XnetX)t=0]= E(Xn). 240 Chapter 4 Expectation Thus, ψ (0) = E(X), ψ (0) = E(X2), ψ (0) = E(X3), and so on. Hence, we see that the m.g.f., if it is finite in an open interval around t = 0, can be used to generate all of the moments of the distribution by taking derivatives at t = 0. Summary If the kth moment of a random variable exists, then so does the jth moment for every j <k. The moment generating function of X, ψ(t) = E(etX), if it is finite for t in a neighborhood of 0, can be used to find moments of X. The kth derivative of ψ(t) at t = 0 is E(Xk). The m.g.f. characterizes the distribution in the sense that all random variables that have the same m.g.f. have the same distribution. Exercises 1. If X has the uniform distribution on the interval [a, b], what is the value of the fifth central moment of X? 2. If X has the uniform distribution on the interval [a, b], write a formula for every even central moment of X. 3. Suppose that X is a random variable for which E(X) = 1, E(X2) = 2, and E(X3) = 5. Find the value of the third central moment of X. 4. Suppose that X is a random variable such that E(X2) is finite. (a) Show that E(X2) ≥ [E(X)]2. (b) Show that E(X2) = [E(X)]2 if and only if there exists a constant c such that Pr(X = c) = 1. Hint: Var(X) ≥ 0. 5. Suppose that X is a random variable with mean μ and variance σ2, and that the fourth moment of X is finite. Show that E[(X − μ)4]≥ σ4. 6. Suppose that X has the uniform distribution on the interval [a, b]. Determine the m.g.f. of X. 7. Suppose thatXis a random variable for which the m.g.f. is as follows: ψ(t) = 1 4 (3et + e −t ) for −∞< t <∞. Find the mean and the variance of X. 8. Suppose thatXis a random variable for which the m.g.f. is as follows: ψ(t) = et2+3t for −∞< t <∞. Find the mean and the variance of X. 9. Let X be a random variable with mean μ and variance σ2, and let ψ1(t) denote the m.g.f. of X for −∞ < t <∞. Let c be a given positive constant, and let Y be a random variable for which the m.g.f. is ψ2(t) = ec[ψ1(t)−1] for −∞< t <∞. Find expressions for the mean and the variance of Y in terms of the mean and the variance of X. 10. Suppose that the random variables X and Y are i.i.d. and that the m.g.f. of each is ψ(t) = et2+3t for −∞< t <∞. Find the m.g.f. of Z = 2X − 3Y + 4. 11. Suppose that X is a random variable for which the m.g.f. is as follows: ψ(t) = 1 5 et + 2 5 e4t + 2 5 e8t for −∞< t <∞. Find the probability distribution of X. Hint: It is a simple discrete distribution. 12. Suppose that X is a random variable for which the m.g.f. is as follows: ψ(t) = 1 6 (4 + et + e −t ) for −∞< t <∞. Find the probability distribution of X. 13. Let X have the Cauchy distribution (see Example 4.1.8). Prove that the m.g.f. ψ(t) is finite only for t = 0. 14. Let X have p.d.f. f (x) = x −2 if x >1, 0 otherwise. Prove that the m.g.f. ψ(t) is finite for all t ≤ 0 but for no t > 0.
15. Prove the following extension of Theorem 4.4.1: If \(E(|X|^a) < \infty\) for some positive number \(a\), then \(E(|X|^b) < \infty\) for every positive number \(b < a\). Give the proof for the case in which \(X\) has a discrete distribution.
16. Let \(X\) have the binomial distribution with parameters \(n\) and \(p\). Let \(Y\) have the binomial distribution with parameters \(n\) and \(1 - p\). Prove that the skewness of \(Y\) is the negative of the skewness of \(X\). Hint: Let \(Z = n - X\) and show that \(Z\) has the same distribution as \(Y\).
17. Find the skewness of the distribution in Example 4.4.3.
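The moment and m.g.f. calculations above can also be checked numerically. The following sketch is illustrative only (it assumes NumPy, SciPy, and SymPy are available, and the names are ours): it recovers the moments of the distribution in Example 4.4.3 by differentiating its m.g.f., recomputes the skewness from Example 4.4.2, and checks the additive property of Theorem 4.4.6 by simulation.

```python
import numpy as np
import sympy as sp
from scipy import stats

# Example 4.4.3: X has p.d.f. e^{-x} for x > 0, so psi(t) = 1/(1 - t) for t < 1.
t = sp.symbols('t')
psi = 1 / (1 - t)
m1 = sp.diff(psi, t, 1).subs(t, 0)    # E(X)   = psi'(0)  = 1
m2 = sp.diff(psi, t, 2).subs(t, 0)    # E(X^2) = psi''(0) = 2
print(m1, m2, m2 - m1**2)             # 1 2 1, so Var(X) = 1

# Example 4.4.2: skewness of the binomial distribution with n = 10, p = 0.25.
n, p = 10, 0.25
x = np.arange(n + 1)
pf = stats.binom.pmf(x, n, p)
mu, sigma = n * p, np.sqrt(n * p * (1 - p))
print(np.sum((x - mu) ** 3 * pf) / sigma ** 3)   # about 0.365

# Theorem 4.4.6: a sum of independent binomials with a common p is binomial.
rng = np.random.default_rng(0)
s = rng.binomial(6, 0.3, 100_000) + rng.binomial(4, 0.3, 100_000)
print(s.mean(), 10 * 0.3)             # sample mean close to 3.0
print(s.var(), 10 * 0.3 * 0.7)        # sample variance close to 2.1
```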
4.5 The Mean and the Median
The mean of a distribution is one measure of its central location, but the median (see Definition 3.13) is another. This section presents some comparisons and contrasts between these two location summaries of a distribution.
4.5.1 The Median
It was mentioned in Section 4.1 that the mean of a probability distribution on the real line will be at the center of gravity of that distribution. In this sense, the mean of a distribution can be regarded as the center of the distribution. There is another point on the line that might also be regarded as the center of the distribution. Suppose that there is a point m0 that divides the total probability into two equal parts, that is, the probability to the left of m0 is 1/2, and the probability to the right of m0 is also 1/2. For a continuous distribution, the median of the distribution introduced in Definition 3.3.3 is such a number. If there is such an m0, it could legitimately be called a center of the distribution. It should be noted, however, that for some discrete distributions there will not be any point at which the total probability is divided into two parts that are exactly equal. Moreover, for other distributions, which may be either discrete or continuous, there will be more than one such point. Therefore, the formal definition of a median, which will now be given, must be general enough to include these possibilities. Definition 4.5.1 Median. Let X be a random variable. Every number m with the following property is called a median of the distribution of X: Pr(X ≤ m) ≥ 1/2 and Pr(X ≥ m) ≥ 1/2. Another way to understand this definition is that a median is a point m that satisfies the following two requirements: First, if m is included with the values of X to the left of m, then Pr(X ≤ m) ≥ Pr(X > m). Second, if m is included with the values of X to the right of m, then Pr(X ≥ m) ≥ Pr(X < m). If there is a number m such that Pr(X < m) = Pr(X > m), that is, if the number m does actually divide the total probability into two equal parts, then m will of course be a median of the distribution of X (see Exercise 16). Note: Multiple Medians. One can prove that every distribution must have at least one median. Indeed, the 1/2 quantile from Definition 3.3.2 is a median. (See Exercise 1.) For some distributions, every number in some interval is a median. In such 242 Chapter 4 Expectation cases, the 1/2 quantile is the minimum of the set of all medians.When a whole interval of numbers are medians of a distribution, some writers refer to the midpoint of the interval as the median. Example 4.5.1 The Median of a Discrete Distribution. Suppose that X has the following discrete distribution: Pr(X = 1) = 0.1, Pr(X = 2) = 0.2, Pr(X = 3) = 0.3, Pr(X = 4) = 0.4. The value 3 is a median of this distribution because Pr(X ≤ 3) = 0.6, which is greater than 1/2, and Pr(X ≥ 3) = 0.7, which is also greater than 1/2. Furthermore, 3 is the unique median of this distribution. Example 4.5.2 A Discrete Distribution for Which the Median Is Not Unique. Suppose that X has the following discrete distribution: Pr(X = 1) = 0.1, Pr(X = 2) = 0.4, Pr(X = 3) = 0.3, Pr(X = 4) = 0.2. Here, Pr(X ≤ 2) = 1/2, and Pr(X ≥ 3) = 1/2. Therefore, every value ofmin the closed interval 2 ≤ m ≤ 3 will be a median of this distribution. The most popular choice of median of this distribution would be the midpoint 2.5. Example 4.5.3 The Median of a Continuous Distribution. Suppose thatXhas a continuous distribution for which the p.d.f. is as follows: f (x) = 4x3 for 0 < x <1, 0 otherwise. The unique median of this distribution will be the number m such that m 0 4x3 dx = 1 m 4x3 dx = 1 2 . This number is m = 1/21/4. Example 4.5.4 A Continuous Distribution for Which the Median Is Not Unique. 
Suppose that X has a continuous distribution for which the p.d.f. is as follows: f (x) = ⎧⎨ ⎩ 1/2 for 0 ≤ x ≤ 1, 1 for 2.5 ≤ x ≤ 3, 0 otherwise. Here, for every value of m in the closed interval 1≤ m ≤ 2.5, Pr(X ≤ m) = Pr(X ≥ m) = 1/2. Therefore, every value of m in the interval 1≤ m ≤ 2.5 is a median of this distribution. Comparison of the Mean and the Median Example 4.5.5 Last Lottery Number. In a state lottery game, a three-digit number from 000 to 999 is drawn each day. After several years, all but one of the 1000 possible numbers has been drawn. A lottery official would like to predict how much longer it will be until that missing number is finally drawn. Let X be the number of days (X = 1 being tomorrow) until that number appears. It is not difficult to determine the distribution of X, assuming that all 1000 numbers are equally likely to be drawn each day and 4.5 The Mean and the Median 243 that the draws are independent. Let Ax stand for the event that the missing number is drawn on day x for x = 1, 2, . . . . Then {X = 1} = A1, and for x >1, {X = x} = Ac 1 ∩ . . . ∩ Ac x−1 ∩ Ax. Since the Ax events are independent and all have probability 0.001, it is easy to see that the p.f. of X is f (x) = 0.001(0.999)x−1 for x = 1, 2, . . . 0 otherwise. But, the lottery official wants to give a single-number prediction for when the number will be drawn. What summary of the distribution would be appropriate for this prediction? The lottery official in Example 4.5.5 wants some sort of “average” or “middle” number to summarize the distribution of the number of days until the last number appears. Presumably she wants a prediction that is neither excessively large nor too small. Either the mean or a median of X can be used as such a summary of the distribution. Some important properties of the mean have already been described in this chapter, and several more properties will be given later in the book. However, for many purposes the median is a more useful measure of the middle of the distribution than is the mean. For example, every distribution has a median, but not every distribution has a mean. As illustrated in Example 4.3.5, the mean of a distribution can be made very large by removing a small but positive amount of probability from any part of the distribution and assigning this amount to a sufficiently large value of x. Onthe other hand, the median may be unaffected by a similar change in probabilities. If any amount of probability is removed from a value of x larger than the median and assigned to an arbitrarily large value of x, the median of the new distribution will be the same as that of the original distribution. In Example 4.3.5, all numbers in the interval [0, 1] are medians of both random variables X and Y despite the large difference in their means. Example 4.5.6 Annual Incomes. Suppose that the mean annual income among the families in a certain community is $30,000. It is possible that only a few families in the community actually have an income as large as $30,000, but those few families have incomes that are very much larger than $30,000. As an extreme example, suppose that there are 100 families and 99 of them have income of $1,000 while the other one has income of $2,901,000. If, however, the median annual income among the families is $30,000, then at least one-half of the families must have incomes of $30,000 or more. The median has one convenient property that the mean does not have. Theorem 4.5.1 One-to-One Function. 
Let X be a random variable that takes values in an interval I of real numbers. Let r be a one-to-one function defined on the interval I. If m is a median of X, then r(m) is a median of r(X). Proof Let Y = r(X). We need to show that Pr(Y ≥ r(m)) ≥ 1/2 and Pr(Y ≤ r(m)) ≥ 1/2. Since r is one-to-one on the interval I , it must be either increasing or decreasing over the interval I. If r is increasing, then Y ≥ r(m) if and only if X ≥ m, so Pr(Y ≥ r(m)) = Pr(X ≥ m) ≥ 1/2. Similarly, Y ≤ r(m) if and only ifX ≤ mand Pr(Y ≤ r(m)) ≥ 1/2 also. If r is decreasing, then Y ≥ r(m) if and only if X ≤ m. The remainder of the proof is then similar to the preceding. 244 Chapter 4 Expectation We shall now consider two specific criteria by which the prediction of a random variable X can be judged. By the first criterion, the optimal prediction that can be made is the mean. By the second criterion, the optimal prediction is the median. Minimizing the Mean Squared Error Suppose that X is a random variable with mean μ and variance σ2. Suppose also that the value of X is to be observed in some experiment, but this value must be predicted before the observation can be made. One basis for making the prediction is to select some number d for which the expected value of the square of the error X − d will be a minimum. Definition 4.5.2 Mean Squared Error/M.S.E.. The number E[(X − d)2] is called the mean squared error (M.S.E.) of the prediction d. The next result shows that the number d for which the M.S.E. is minimized is E(X). Theorem 4.5.2 Let X be a random variable with finite variance σ2, and let μ = E(X). For every number d, E[(X − μ)2]≤ E[(X − d)2]. (4.5.1) Furthermore, there will be equality in the relation (4.5.1) if and only if d = μ. Proof For every value of d, E[(X − d)2]= E(X2 − 2 dX + d2) = E(X2) − 2 dμ + d2. (4.5.2) The final expression in Eq. (4.5.2) is simply a quadratic function of d. By elementary differentiation it will be found that the minimum value of this function is attained when d = μ. Hence, in order to minimize the M.S.E., the predicted value of X should be its mean μ. Furthermore, when this prediction is used, the M.S.E. is simply E[(X − μ)2]= σ2. Example 4.5.7 Last Lottery Number. In Example 4.5.5, we discussed a state lottery in which one number had never yet been drawn. Let X stand for the number of days until that last number is eventually drawn. The p.f. of X was computed in Example 4.5.5 as f (x) = 0.001(0.999)x−1 for x = 1, 2, . . . 0 otherwise. We can compute the mean of X as E(X) = ∞ x=1 x0.001(0.999)x−1 = 0.001 ∞ x=1 x(0.999)x−1. (4.5.3) At first, this sum does not look like one that is easy to compute. However, it is closely related to the general sum g(y) = ∞ x=0 yx = 1 1− y , 4.5 The Mean and the Median 245 if 0 <y <1. Using properties of power series from calculus, we know that the derivative of g(y) can be found by differentiating the individual terms of the power series. That is, g (y) = ∞ x=0 xyx−1 = ∞ x=1 xyx−1, for 0 < y <1. But we also know that g (y) = 1/(1− y)2. The last sum in Eq. (4.5.3) is g (0.999) = 1/(0.001)2. It follows that E(X) = 0.001 1 (0.001)2 = 1000. Minimizing the Mean Absolute Error Another possible basis for predicting the value of a random variable X is to choose some number d for which E(|X − d|) will be a minimum. Definition 4.5.3 Mean Absolute Error/M.A.E. The number E(|X − d|) is called the mean absolute error (M.A.E.) of the prediction d. We shall now show that the M.A.E. 
is minimized when the chosen value of d is a median of the distribution of X. Theorem 4.5.3 LetX be a random variable with finite mean, and letmbe a median of the distribution of X. For every number d, E(|X − m|) ≤ E(|X − d|). (4.5.4) Furthermore, there will be equality in the relation (4.5.4) if and only if d is also a median of the distribution of X. Proof For convenience, we shall assume that X has a continuous distribution for which the p.d.f. is f . The proof for any other type of distribution is similar. Suppose first that d >m. Then E(|X − d|) − E(|X − m|) = ∞ −∞ (|x − d| − |x − m|)f (x) dx = m −∞ (d − m)f (x) dx + d m (d + m − 2x)f (x) dx + ∞ d (m − d)f (x) dx ≥ m −∞ (d − m)f (x) dx + d m (m − d)f (x) dx + ∞ d (m − d)f (x) dx = (d − m)[Pr(X ≤ m) − Pr(X > m)]. (4.5.5) Since m is a median of the distribution of X, it follows that Pr(X ≤ m) ≥ 1/2 ≥ Pr(X > m). (4.5.6) The final difference in the relation (4.5.5) is therefore nonnegative. Hence, E(|X − d|) ≥ E(|X − m|). (4.5.7) Furthermore, there can be equality in the relation (4.5.7) only if the inequalities in relations (4.5.5) and (4.5.6) are actually equalities.Acareful analysis shows that these inequalities will be equalities only if d is also a median of the distribution of X. The proof for every value of d such that d <mis similar. 246 Chapter 4 Expectation Example 4.5.8 Last Lottery Number. In Example 4.5.5, in order to compute the median ofX, we must find the smallest number x such that the c.d.f. F(x) ≥ 0.5. For integer x, we have F(x) = x n=1 0.001(0.999)n−1. We can use the popular formula x n=0 yn = 1− yx+1 1− y to see that, for integer x ≥ 1, F(x) = 0.001 1− (0.999)x 1− 0.999 = 1− (0.999)x. Setting this equal to 0.5 and solving for x gives x = 692.8; hence, the median of X is 693.The median is unique becauseF(x) never takes the exact value 0.5 for any integer x. The median of X is much smaller than the mean of 1000 found in Example 4.5.7. The reason that the mean is so much larger than the median in Examples 4.5.7 and 4.5.8 is that the distribution has probability at arbitrarily large values but is bounded below. The probability at these large values pulls the mean up because there is no probability at equally small values to balance. The median is not affected by how the upper half of the probability is distributed. The following example involves a symmetric distribution. Here, the mean and median(s) are more similar. Example 4.5.9 Predicting a Discrete Uniform Random Variable. Suppose that the probability is 1/6 that a random variable X will take each of the following six values: 1, 2, 3, 4, 5, 6.We shall determine the prediction for which the M.S.E. is minimum and the prediction for which the M.A.E. is minimum. In this example, E(X) = 1 6 (1+ 2 + 3 + 4 + 5 + 6) = 3.5. Therefore, the M.S.E. will be minimized by the unique value d = 3.5. Also, every number m in the closed interval 3 ≤ m ≤ 4 is a median of the given distribution. Therefore, the M.A.E. will be minimized by every value of d such that 3 ≤ d ≤ 4 and only by such a value of d. Because the distribution of X is symmetric, the mean of X is also a median of X. Note: When the M.A.E. and M.S.E. Are Finite. We noted that the median exists for every distribution, but the M.A.E. is finite if and only if the distribution has a finite mean. Similarly, the M.S.E. is finite if and only if the distribution has a finite variance. Summary A median of X is any number m such that Pr(X ≤ m) ≥ 1/2 and Pr(X ≥ m) ≥ 1/2. 
To minimize E(|X − d|) by choice of d, one must choose d to be a median of X. To minimize E[(X − d)2] by choice of d, one must choose d = E(X). 4.5 The Mean and the Median 247 Exercises 1. Prove that the 1/2 quantile as defined in Definition 3.3.2 is a median as defined in Definition 4.5.1. 2. Suppose that a random variable X has a discrete distribution for which the p.f. is as follows: f (x) = cx for x = 1, 2, 3, 4, 5, 6, 0 otherwise. Determine all the medians of this distribution. 3. Suppose that a random variable X has a continuous distribution for which the p.d.f. is as follows: f (x) = e −x for x >0, 0 otherwise. Determine all the medians of this distribution. 4. In a small community consisting of 153 families, the number of families that have k children (k = 0, 1, 2, . . .) is given in the following table: Number of Number of children families 0 21 1 40 2 42 3 27 4 or more 23 Determine the mean and the median of the number of children per family. (For the mean, assume that all families with four or more children have only four children. Why doesn’t this point matter for the median?) 5. Suppose that an observed value ofX is equally likely to come from a continuous distribution for which the p.d.f. is f or from one for which the p.d.f. is g. Suppose that f (x)>0 for 0 < x <1 and f (x) = 0 otherwise, and suppose also that g(x) > 0 for 2 < x <4 and g(x) = 0 otherwise. Determine: (a) the mean and (b) the median of the distribution of X. 6. Suppose that a random variable X has a continuous distribution for which the p.d.f. f is as follows: f (x) = 2x for 0 < x <1, 0 otherwise. Determine the value of d that minimizes (a) E[(X − d)2] and (b) E(|X − d|). 7. Suppose that a person’s score X on a certain examination will be a number in the interval 0 ≤ X ≤ 1 and that X has a continuous distribution for which the p.d.f. is as follows: f (x) = x + 1 2 for 0 ≤ x ≤ 1, 0 otherwise. Determine the prediction of X that minimizes (a) the M.S.E. and (b) the M.A.E. 8. Suppose that the distribution of a random variable X is symmetric with respect to the point x = 0 and that E(X4) <∞. Show that E[(X − d)4] is minimized by the value d = 0. 9. Suppose that a fire can occur at any one of five points along a road. These points are located at −3, −1, 0, 1, and 2 in Fig. 4.9. Suppose also that the probability that each of these points will be the location of the next fire that occurs along the road is as specified in Fig. 4.9. 3 0.2 0.1 0.1 0.4 0.2 1 0 1 2 Road Figure 4.9 Probabilities for Exercise 9. a. At what point along the road should a fire engine wait in order to minimize the expected value of the square of the distance that it must travel to the next fire? b. Where should the fire engine wait to minimize the expected value of the distance that it must travel to the next fire? 10. If n houses are located at various points along a straight road, at what point along the road should a store be located in order to minimize the sum of the distances from the n houses to the store? 11. Let X be a random variable having the binomial distribution with parameters n = 7 and p = 1/4, and let Y be a random variable having the binomial distribution with parameters n = 5 and p = 1/2.Which of these two random variables can be predicted with the smaller M.S.E.? 12. Consider a coin for which the probability of obtaining a head on each given toss is 0.3. Suppose that the coin is to be tossed 15 times, and let X denote the number of heads that will be obtained. a. What prediction of X has the smallest M.S.E.? b. 
What prediction of X has the smallest M.A.E.? 13. Suppose that the distribution of X is symmetric around a point m. Prove that m is a median of X. 248 Chapter 4 Expectation 14. Find the median of the Cauchy distribution defined in Example 4.1.8. 15. LetX be a random variable with c.d.f. F. Suppose that a <b are numbers such that both a and b are medians of X. a. Prove that F(a) = 1/2. b. Prove that there exist a smallest c ≤ a and a largest d ≥ b such that every number in the closed interval [c, d] is a median of X. c. If X has a discrete distribution, prove that F(d) > 1/2. 16. Let X be a random variable. Suppose that there exists a number m such that Pr(X < m) = Pr(X > m). Prove that m is a median of the distribution of X. 17. Let X be a random variable. Suppose that there exists a number m such that Pr(X <m) < 1/2 and Pr(X >m) < 1/2. Prove that m is the unique median of the distribution of X. 18. Prove the following extension of Theorem 4.5.1. Let m be the p quantile of the random variable X. (See Definition 3.3.2.) If r is a strictly increasing function, then r(m) is the p quantile of r(X).
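The contrast between the mean and the median in the lottery example, and the optimality results for the M.S.E. and the M.A.E., can be illustrated numerically. The sketch below is only an illustration (NumPy and SciPy assumed; the grid search is ours, not part of the text):

```python
import numpy as np
from scipy import stats

# Examples 4.5.5, 4.5.7, 4.5.8: the waiting time X has p.f.
# f(x) = 0.001 * 0.999**(x - 1) for x = 1, 2, ..., i.e., geometric with p = 0.001.
lottery = stats.geom(0.001)
print(lottery.mean())                          # 1000.0, the mean from Example 4.5.7
print(np.ceil(np.log(0.5) / np.log(0.999)))    # 693.0, the median from Example 4.5.8

# Example 4.5.9: X uniform on {1, ..., 6}.  Scan predictions d and locate the
# minimizers of the M.S.E. and the M.A.E.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
d_grid = np.linspace(0, 7, 7001)
mse = np.array([np.sum((values - d) ** 2 * probs) for d in d_grid])
mae = np.array([np.sum(np.abs(values - d) * probs) for d in d_grid])
print(d_grid[np.argmin(mse)])                  # 3.5, the mean (Theorem 4.5.2)
minimizers = d_grid[np.abs(mae - mae.min()) < 1e-12]
print(minimizers.min(), minimizers.max())      # 3.0 4.0, the medians (Theorem 4.5.3)
```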
4.6 Covariance and Correlation
When we are interested in the joint distribution of two random variables, it is useful to have a summary of how much the two random variables depend on each other. The covariance and correlation are attempts to measure that dependence, but they only capture a particular type of dependence, namely linear dependence.
4.6.1 Covariance
Example 4.6.1 Test Scores. When applying for college, high school students often take a number of standardized tests. Consider a particular student who will take both a verbal and a quantitative test. Let X be this student’s score on the verbal test, and let Y be the same student’s score on the quantitative test. Although there are students who do much better on one test than the other, it might still be reasonable to expect that a student who does very well on one test to do at least a little better than average on the other. We would like to find a numerical summary of the joint distribution of X and Y that reflects the degree to which we believe a high or low score on one test will be accompanied by a high or low score on the other test. When we consider the joint distribution of two random variables, the means, the medians, and the variances of the variables provide useful information about their marginal distributions. However, these values do not provide any information about the relationship between the two variables or about their tendency to vary together rather than independently. In this section and the next one, we shall introduce summaries of a joint distribution that enable us to measure the association between two random variables, determine the variance of the sum of an arbitrary number of dependent random variables, and predict the value of one random variable by using the observed value of some other related variable. Definition 4.6.1 Covariance. Let X and Y be random variables having finite means. Let E(X) = μX and E(Y) = μY The covariance of X and Y , which is denoted by Cov(X, Y ), is defined as Cov(X, Y ) = E[(X − μX)(Y − μY )], (4.6.1) if the expectation in Eq. (4.6.1) exists. 4.6 Covariance and Correlation 249 It can be shown (see Exercise 2 at the end of this section) that if both X and Y have finite variance, then the expectation in Eq. (4.6.1) will exist and Cov(X, Y ) will be finite. However, the value of Cov(X, Y ) can be positive, negative, or zero. Example 4.6.2 Test Scores. Let X and Y be the test scores in Example 4.6.1, and suppose that they have the joint p.d.f. f (x, y) = 2xy + 0.5 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, 0 otherwise. We shall compute the covariance Cov(X, Y ). First, we shall compute the means μX and μY of X and Y , respectively. The symmetry in the joint p.d.f. means that X and Y have the same marginal distribution; hence, μX = μY . We see that μX = 1 0 1 0 [2x2y + 0.5x]dydx = 1 0 [x2 + 0.5x]dx = 1 3 + 1 4 = 7 12 , so that μY = 7/12 as well. The covariance can be computed using Theorem 4.1.2. Specifically, we must evaluate the integral 1 0 1 0
\(\left(x - \frac{7}{12}\right)\)
y − 7 12 (2xy + 0.5) dy dx. This integral is straightforward, albeit tedious, to compute, and the result is Cov(X, Y ) = 1/144. The following result often simplifies the calculation of a covariance. Theorem 4.6.1 For all random variables X and Y such that σ2 X <∞and σ2 Y <∞, Cov(X, Y ) = E(XY) − E(X)E(Y). (4.6.2) Proof It follows from Eq. (4.6.1) that Cov(X, Y ) = E(XY − μXY − μYX + μXμY ) = E(XY) − μXE(Y) − μYE(X) + μXμY . Since E(X) = μX and E(Y) = μY , Eq. (4.6.2) is obtained. The covariance between X and Y is intended to measure the degree to which X and Y tend to be large at the same time or the degree to which one tends to be large while the other is small. Some intution about this interpretation can be gathered from a careful look at Eq. (4.6.1). For example, suppose that Cov(X, Y ) is positive. ThenX>μX andY >μY must occur together and/orX<μX andY <μY must occur together to a larger extent thanX<μX occurs withY >μY andX>μX occurs with Y <μY . Otherwise, the mean would be negative. Similarly, if Cov(X, Y ) is negative, thenX>μX andY <μY must occur together and/orX<μX andY >μY must occur together to larger extent than the other two inequalities. If Cov(X, Y ) = 0, then the extent to which X and Y are on the same sides of their respective means exactly balances the extent to which they are on opposite sides of their means. 250 Chapter 4 Expectation Correlation Although Cov(X, Y ) gives a numerical measure of the degree to which X and Y vary together, the magnitude of Cov(X, Y ) is also influenced by the overall magnitudes of X and Y . For example, in Exercise 5 in this section, you can prove that Cov(2X, Y ) = 2 Cov(X, Y ). In order to obtain a measure of association between X and Y that is not driven by arbitrary changes in the scales of one or the other random variable, we define a slightly different quantity next. Definition 4.6.2 Correlation. Let X and Y be random variables with finite variances σ2 X and σ2 Y , respectively. Then the correlation of X and Y , which is denoted by ρ(X, Y), is defined as follows: ρ(X, Y) = Cov(X, Y ) σXσY . (4.6.3) In order to determine the range of possible values of the correlation ρ(X, Y), we shall need the following result. Theorem 4.6.2 Schwarz Inequality. For all random variables U and V such that E(UV ) exists, [E(UV )]2 ≤ E(U2)E(V 2). (4.6.4) If, in addition, the right-hand side of Eq. (4.6.4) is finite, then the two sides of Eq. (4.6.4) equal the same value if and only if there are nonzero constants a and b such that aU + bV = 0 with probability 1. Proof If E(U2) = 0, then Pr(U = 0) = 1. Therefore, it must also be true that Pr(UV = 0) = 1. Hence, E(UV ) = 0, and the relation (4.6.4) is satisfied. Similarly, if E(V 2) = 0, then the relation (4.6.4) will be satisfied. Moreover, if either E(U2) or E(V 2) is infinite, then the right side of the relation (4.6.4) will be infinite. In this case, the relation (4.6.4) will surely be satisfied. For the rest of the proof, assume that 0<E(U2) <∞ and 0<E(V2) <∞. For all numbers a and b, 0 ≤ E[(aU + bV )2]= a2E(U2) + b2E(V 2) + 2abE(UV ) (4.6.5) and 0 ≤ E[(aU − bV )2]= a2E(U2) + b2E(V 2) − 2abE(UV ). (4.6.6) If we let a = [E(V 2)]1/2 and b = [E(U2)]1/2, then it follows from the relation (4.6.5) that E(UV )≥−[E(U2)E(V 2)]1/2. (4.6.7) It also follows from the relation (4.6.6) that E(UV ) ≤ [E(U2)E(V 2)]1/2. (4.6.8) These two relations together imply that the relation (4.6.4) is satisfied. Finally, suppose that the right-hand side of Eq. (4.6.4) is finite. 
Both sides of (4.6.4) equal the same value if and only if the same is true for either (4.6.7) or (4.6.8). Both sides of (4.6.7) equal the same value if and only if the rightmost expression in (4.6.5) is 0. This, in turn, is true if and only if E[(aU + bV )2]= 0, which occurs if and only if aU + bV = 0 with probability 1. The reader can easily check that both sides of (4.6.8) equal the same value if and only if aU − bV = 0 with probability 1. 4.6 Covariance and Correlation 251 A slight variant on Theorem 4.6.2 is the result we want. Theorem 4.6.3 Cauchy-Schwarz Inequality. Let X and Y be random variables with finite variance. Then [Cov(X, Y )]2 ≤ σ2 Xσ2 Y , (4.6.9) and −1≤ ρ(X, Y) ≤ 1. (4.6.10) Furthermore, the inequality in Eq. (4.6.9) is an equality if and only if there are nonzero constants a and b and a constant c such that aX + bY = c with probability 1. Proof Let U = X − μX and V = Y − μY . Eq. (4.6.9) now follows directly from Theorem 4.6.2. In turn, it follows from Eq. (4.6.3) that [ρ(X, Y)]2 ≤ 1 or, equivalently, that Eq. (4.6.10) holds. The final claim follows easily from the similar claim at the end of Theorem 4.6.2. Definition 4.6.3 Positively/Negatively Correlated/Uncorrelated. It is said that X and Y are positively correlated if ρ(X, Y) > 0, that X and Y are negatively correlated if ρ(X, Y) < 0, and that X and Y are uncorrelated if ρ(X, Y) = 0. It can be seen from Eq. (4.6.3) that Cov(X, Y ) and ρ(X, Y) must have the same sign; that is, both are positive, or both are negative, or both are zero. Example 4.6.3 Test Scores. For the two test scores in Example 4.6.2, we can compute the correlation ρ(X, Y). The variances of X and Y are both equal to 11/144, so the correlation is ρ(X, Y) = 1/11. Properties of Covariance and Correlation We shall now present four theorems pertaining to the basic properties of covariance and correlation. The first theorem shows that independent random variables must be uncorrelated. Theorem 4.6.4 If X and Y are independent random variables with 0<σ2 X <∞and 0<σ2 Y <∞, then Cov(X, Y ) = ρ(X, Y) = 0. Proof If X and Y are independent, then E(XY) = E(X)E(Y). Therefore, by Eq. (4.6.2), Cov(X, Y ) = 0. Also, it follows that ρ(X, Y) = 0. The converse of Theorem 4.6.4 is not true as a general rule. Two dependent random variables can be uncorrelated. Indeed, even though Y is an explicit function of X, it is possible that ρ(X, Y) = 0, as in the following examples. Example 4.6.4 Dependent but Uncorrelated Random Variables. Suppose that the random variable X can take only the three values−1, 0, and 1, and that each of these three values has the same probability. Also, let the random variable Y be defined by the relation Y = X2. We shall show that X and Y are dependent but uncorrelated. 252 Chapter 4 Expectation Figure 4.10 The shaded region is where the joint p.d.f. of (X, Y ) is constant and nonzero in Example 4.6.5. The vertical line indicates the values of Y that are possible when X = 0.5. 1.0 0.5 0.5 0.5 1.0 0.5 1.0 y x In this example, X and Y are clearly dependent, since Y is not constant and the value of Y is completely determined by the value of X. However, E(XY) = E(X3) = E(X) = 0, because X3 is the same random variable as X. Since E(XY) = 0 and E(X)E(Y) = 0, it follows from Theorem 4.6.1 that Cov(X, Y ) = 0 and that X and Y are uncorrelated. Example 4.6.5 Uniform Distribution Inside a Circle. Let (X, Y ) have joint p.d.f. that is constant on the interior of the unit circle, the shaded region in Fig. 4.10. The constant value of the p.d.f. 
is one over the area of the circle, that is, 1/(2π). It is clear that X and Y are dependent since the region where the joint p.d.f. is nonzero is not a rectangle. In particular, notice that the set of possible values for Y is the interval (−1, 1), but when X = 0.5, the set of possible values for Y is the smaller interval (−0.866, 0.866). The symmetry of the circle makes it clear that both X and Y have mean 0. Also, it is not difficult to see that E(XY) = xyf (x, y)dxdy = 0. To see this, notice that the integral of xy over the top half of the circle is exactly the negative of the integral of xy over the bottom half. Hence, Cov(X, Y ) = 0, but the random variables are dependent. The next result shows that if Y is a linear function of X, then X and Y must be correlated and, in fact, |ρ(X, Y)| = 1. Theorem 4.6.5 Suppose that X is a random variable such that 0<σ2 X <∞, and Y = aX + b for some constants a and b, where a = 0. Ifa >0, then ρ(X, Y) = 1. Ifa <0, then ρ(X, Y)=−1. Proof If Y = aX + b, then μY = aμX + b and Y − μY = a(X − μX). Therefore, by Eq. (4.6.1), Cov(X, Y ) = aE[(X − μX)2]= aσ2 X. Since σY = |a|σX, the theorem follows from Eq. (4.6.3). There is a converse to Theorem 4.6.5. That is, |ρ(X, Y)| = 1 implies that X and Y are linearly related. (See Exercise 17.) In general, the value of ρ(X, Y) provides a measure of the extent to which two random variables X and Y are linearly related. If 4.6 Covariance and Correlation 253 the joint distribution of X and Y is relatively concentrated around a straight line in the xy-plane that has a positive slope, then ρ(X, Y) will typically be close to 1. If the joint distribution is relatively concentrated around a straight line that has a negative slope, then ρ(X, Y) will typically be close to −1.We shall not discuss these concepts further here, but we shall consider them again when the bivariate normal distribution is introduced and studied in Sec. 5.10. Note: Correlation Measures Only Linear Relationship. A large value of |ρ(X, Y)| means that X and Y are close to being linearly related and hence are closely related. But a small value of |ρ(X, Y)| does not mean that X and Y are not close to being related. Indeed, Example 4.6.4 illustrates random variables that are functionally related but have 0 correlation. We shall now determine the variance of the sum of random variables that are not necessarily independent. Theorem 4.6.6 If X and Y are random variables such that Var(X) <∞and Var(Y ) <∞, then Var(X + Y) = Var(X) + Var(Y ) + 2 Cov(X, Y ). (4.6.11) Proof Since E(X + Y) = μX + μY , then Var(X + Y) = E[(X + Y − μX − μY )2] = E[(X − μX)2 + (Y − μY )2 + 2(X − μX)(Y − μY )] = Var(X) + Var(Y ) + 2 Cov(X, Y ). For all constants a and b, it can be shown that Cov(aX, bY ) = ab Cov(X, Y ) (see Exercise 5 at the end of this section). The following then follows easily from Theorem 4.6.6. Corollary 4.6.1 Let a, b, and c be constants. Under the conditions of Theorem 4.6.6, Var(aX + bY + c) = a2 Var(X) + b2 Var(Y ) + 2ab Cov(X, Y ). (4.6.12) A particularly useful special case of Corollary 4.6.1 is Var(X − Y) = Var(X) + Var(Y ) − 2 Cov(X, Y ). (4.6.13) Example 4.6.6 Investment Portfolio. Consider, once again, the investor in Example 4.3.7 on page 230 trying to choose a portfolio with $100,000 to invest.We shall make the same assumptions about the returns on the two stocks, except that now we will suppose that the correlation between the two returns R1 and R2 is −0.3, reflecting a belief that the two stocks tend to react in opposite ways to common market forces. 
Example 4.6.6 (Investment Portfolio.) Consider, once again, the investor in Example 4.3.7 trying to choose a portfolio with $100,000 to invest. We shall make the same assumptions about the returns on the two stocks, except that now we will suppose that the correlation between the two returns \(R_1\) and \(R_2\) is \(-0.3\), reflecting a belief that the two stocks tend to react in opposite ways to common market forces. The variance of a portfolio of \(s_1\) shares of the first stock, \(s_2\) shares of the second stock, and \(s_3\) dollars invested at 3.6% is now
\[ \text{Var}(s_1R_1 + s_2R_2 + 0.036s_3) = 55s_1^2 + 28s_2^2 + 2(-0.3)\sqrt{55 \times 28}\,s_1s_2. \]
We continue to assume that (4.3.2) holds. Figure 4.11 shows the relationship between the mean and variance of the efficient portfolios in this example and in Example 4.3.7. Notice how the variances are smaller in this example than in Example 4.3.7. This is due to the fact that the negative correlation lowers the variance of a linear combination with positive coefficients.

[Figure 4.11. Mean and variance of efficient investment portfolios: variance of the portfolio return plotted against the mean portfolio return, for correlation \(-0.3\) and correlation 0.]

Theorem 4.6.6 can also be extended easily to the variance of the sum of \(n\) random variables, as follows.

Theorem 4.6.7 If \(X_1, \ldots, X_n\) are random variables such that \(\text{Var}(X_i) < \infty\) for \(i = 1, \ldots, n\), then
\[ \text{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\text{Var}(X_i) + 2\sum_{i<j}\text{Cov}(X_i, X_j). \tag{4.6.14}\]

Proof: For every random variable \(Y\), \(\text{Cov}(Y, Y) = \text{Var}(Y)\). Therefore, by using the result in Exercise 8 at the end of this section, we can obtain the following relation:
\[ \text{Var}\left(\sum_{i=1}^{n}X_i\right) = \text{Cov}\left(\sum_{i=1}^{n}X_i, \sum_{j=1}^{n}X_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n}\text{Cov}(X_i, X_j). \]
We shall separate the final sum in this relation into two sums: (i) the sum of those terms for which \(i = j\) and (ii) the sum of those terms for which \(i \neq j\). Then, if we use the fact that \(\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)\), we obtain the relation
\[ \text{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\text{Var}(X_i) + \sum_{i \neq j}\text{Cov}(X_i, X_j) = \sum_{i=1}^{n}\text{Var}(X_i) + 2\sum_{i<j}\text{Cov}(X_i, X_j). \]

The following is a simple corollary to Theorem 4.6.7.

Corollary 4.6.2 If \(X_1, \ldots, X_n\) are uncorrelated random variables (that is, if \(X_i\) and \(X_j\) are uncorrelated whenever \(i \neq j\)), then
\[ \text{Var}\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}\text{Var}(X_i). \tag{4.6.15}\]

Corollary 4.6.2 extends Theorem 4.3.5, which states that (4.6.15) holds if \(X_1, \ldots, X_n\) are independent random variables.

Note: In General, Variances Add Only for Uncorrelated Random Variables. The variance of a sum of random variables should be calculated using Theorem 4.6.7 in general. Corollary 4.6.2 applies only for uncorrelated random variables.
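The variance of any linear combination of random variables can be computed directly from Theorem 4.6.7. The sketch below uses only the per-share return variances 55 and 28 and the correlation \(-0.3\) stated in Example 4.6.6; the share counts are hypothetical illustrations, since the budget constraint (4.3.2) is not reproduced here.

```python
import numpy as np

def var_linear_combination(coeffs, cov):
    """Var(sum_i a_i X_i) = a' Sigma a, as in Theorem 4.6.7."""
    a = np.asarray(coeffs, dtype=float)
    return a @ np.asarray(cov, dtype=float) @ a

# Per-share variances 55 and 28 with correlation -0.3 (Example 4.6.6).
cov_r1_r2 = -0.3 * np.sqrt(55 * 28)
sigma = np.array([[55.0, cov_r1_r2],
                  [cov_r1_r2, 28.0]])

s1, s2 = 500.0, 1000.0   # hypothetical share counts, for illustration only
print(var_linear_combination([s1, s2], sigma))
# Same value from the displayed formula:
print(55 * s1**2 + 28 * s2**2 + 2 * (-0.3) * np.sqrt(55 * 28) * s1 * s2)
```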
Summary

The covariance of \(X\) and \(Y\) is \(\text{Cov}(X, Y) = \mathbb{E}\{[X - \mathbb{E}(X)][Y - \mathbb{E}(Y)]\}\). The correlation is \(\rho(X, Y) = \text{Cov}(X, Y)/[\text{Var}(X)\,\text{Var}(Y)]^{1/2}\), and it measures the extent to which \(X\) and \(Y\) are linearly related. Indeed, \(X\) and \(Y\) are precisely linearly related if and only if \(|\rho(X, Y)| = 1\). The variance of a sum of random variables can be expressed as the sum of the variances plus two times the sum of the covariances. The variance of a linear function is \(\text{Var}(aX + bY + c) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y) + 2ab\,\text{Cov}(X, Y)\).

Exercises

1. Suppose that the pair \((X, Y)\) is uniformly distributed on the interior of a circle of radius 1. Compute \(\rho(X, Y)\).
2. Prove that if \(\text{Var}(X) < \infty\) and \(\text{Var}(Y) < \infty\), then \(\text{Cov}(X, Y)\) is finite. Hint: By considering the relation \([(X - \mu_X) \pm (Y - \mu_Y)]^2 \geq 0\), show that \(|(X - \mu_X)(Y - \mu_Y)| \leq \frac{1}{2}[(X - \mu_X)^2 + (Y - \mu_Y)^2]\).
3. Suppose that \(X\) has the uniform distribution on the interval \([-2, 2]\) and \(Y = X^6\). Show that \(X\) and \(Y\) are uncorrelated.
4. Suppose that the distribution of a random variable \(X\) is symmetric with respect to the point \(x = 0\), that \(0 < \mathbb{E}(X^4) < \infty\), and that \(Y = X^2\). Show that \(X\) and \(Y\) are uncorrelated.
5. For all random variables \(X\) and \(Y\) and all constants \(a\), \(b\), \(c\), and \(d\), show that \(\text{Cov}(aX + b, cY + d) = ac\,\text{Cov}(X, Y)\).
6. Let \(X\) and \(Y\) be random variables such that \(0 < \sigma_X^2 < \infty\) and \(0 < \sigma_Y^2 < \infty\). Suppose that \(U = aX + b\) and \(V = cY + d\), where \(a \neq 0\) and \(c \neq 0\). Show that \(\rho(U, V) = \rho(X, Y)\) if \(ac > 0\), and \(\rho(U, V) = -\rho(X, Y)\) if \(ac < 0\).
7. Let \(X\), \(Y\), and \(Z\) be three random variables such that \(\text{Cov}(X, Z)\) and \(\text{Cov}(Y, Z)\) exist, and let \(a\), \(b\), and \(c\) be arbitrary given constants. Show that \(\text{Cov}(aX + bY + c, Z) = a\,\text{Cov}(X, Z) + b\,\text{Cov}(Y, Z)\).
8. Suppose that \(X_1, \ldots, X_m\) and \(Y_1, \ldots, Y_n\) are random variables such that \(\text{Cov}(X_i, Y_j)\) exists for \(i = 1, \ldots, m\) and \(j = 1, \ldots, n\), and suppose that \(a_1, \ldots, a_m\) and \(b_1, \ldots, b_n\) are constants. Show that \(\text{Cov}\bigl(\sum_{i=1}^{m}a_iX_i, \sum_{j=1}^{n}b_jY_j\bigr) = \sum_{i=1}^{m}\sum_{j=1}^{n}a_ib_j\,\text{Cov}(X_i, Y_j)\).
9. Suppose that \(X\) and \(Y\) are two random variables, which may be dependent, and \(\text{Var}(X) = \text{Var}(Y)\). Assuming that \(0 < \text{Var}(X + Y) < \infty\) and \(0 < \text{Var}(X - Y) < \infty\), show that the random variables \(X + Y\) and \(X - Y\) are uncorrelated.
10. Suppose that \(X\) and \(Y\) are negatively correlated. Is \(\text{Var}(X + Y)\) larger or smaller than \(\text{Var}(X - Y)\)?
11. Show that two random variables \(X\) and \(Y\) cannot possibly have the following properties: \(\mathbb{E}(X) = 3\), \(\mathbb{E}(Y) = 2\), \(\mathbb{E}(X^2) = 10\), \(\mathbb{E}(Y^2) = 29\), and \(\mathbb{E}(XY) = 0\).
12. Suppose that \(X\) and \(Y\) have a continuous joint distribution for which the joint pdf is \(f(x, y) = \frac{1}{3}(x + y)\) for \(0 \leq x \leq 1\) and \(0 \leq y \leq 2\), and \(f(x, y) = 0\) otherwise. Determine the value of \(\text{Var}(2X - 3Y + 8)\).
13. Suppose that \(X\) and \(Y\) are random variables such that \(\text{Var}(X) = 9\), \(\text{Var}(Y) = 4\), and \(\rho(X, Y) = -1/6\). Determine (a) \(\text{Var}(X + Y)\) and (b) \(\text{Var}(X - 3Y + 4)\).
14. Suppose that \(X\), \(Y\), and \(Z\) are three random variables such that \(\text{Var}(X) = 1\), \(\text{Var}(Y) = 4\), \(\text{Var}(Z) = 8\), \(\text{Cov}(X, Y) = 1\), \(\text{Cov}(X, Z) = -1\), and \(\text{Cov}(Y, Z) = 2\). Determine (a) \(\text{Var}(X + Y + Z)\) and (b) \(\text{Var}(3X - Y - 2Z + 1)\).
15. Suppose that \(X_1, \ldots, X_n\) are random variables such that the variance of each variable is 1 and the correlation between each pair of different variables is 1/4. Determine \(\text{Var}(X_1 + \cdots + X_n)\).
16. Consider the investor in Example 4.2.3. Suppose that the returns \(R_1\) and \(R_2\) on the two stocks have correlation \(-1\). A portfolio will consist of \(s_1\) shares of the first stock and \(s_2\) shares of the second stock, where \(s_1, s_2 \geq 0\). Find a portfolio such that the total cost of the portfolio is $6000 and the variance of the return is 0. Why is this situation unrealistic?
17. Let \(X\) and \(Y\) be random variables with finite variance. Prove that \(|\rho(X, Y)| = 1\) implies that there exist constants \(a\), \(b\), and \(c\) such that \(aX + bY = c\) with probability 1. Hint: Use Theorem 4.6.2 with \(U = X - \mu_X\) and \(V = Y - \mu_Y\).
18. Let \(X\) and \(Y\) have a continuous joint distribution with joint pdf \(f(x, y) = x + y\) for \(0 \leq x \leq 1\) and \(0 \leq y \leq 1\), and \(f(x, y) = 0\) otherwise. Compute the covariance \(\text{Cov}(X, Y)\).
4.7 Conditional Expectation
Since expectations (including variances and covariances) are properties of distributions, there will exist conditional versions of all such distributional summaries as well as conditional versions of all theorems that we have proven or will later prove about expectations. In particular, suppose that we wish to predict one random variable \(Y\) using a function \(d(X)\) of another random variable \(X\) so as to minimize \(\mathbb{E}\{[Y - d(X)]^2\}\). Then \(d(X)\) should be the conditional mean of \(Y\) given \(X\). There is also a very useful theorem that is an extension to expectations of the law of total probability.
4.7.1 Definition and Basic Properties
Example 4.7.1 (Household Survey.) A collection of households were surveyed, and each household reported the number of members and the number of automobiles owned. The reported numbers are in Table 4.1. Suppose that we were to sample a household at random from those households in the survey and learn the number of members. What would then be the expected number of automobiles that they own?

Table 4.1: Reported numbers of household members and automobiles in Example 4.7.1.

| Number of automobiles | Members: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 7 | 3 | 2 | 2 | 1 | 0 | 0 |
| 1 | 12 | 21 | 25 | 30 | 25 | 15 | 5 | 1 |
| 2 | 1 | 5 | 10 | 15 | 20 | 11 | 5 | 3 |
| 3 | 0 | 2 | 3 | 5 | 5 | 3 | 2 | 1 |

The question at the end of Example 4.7.1 is closely related to the conditional distribution of one random variable given the other, as defined in Sec. 3.6.

Definition 4.7.1 (Conditional Expectation/Mean.) Let \(X\) and \(Y\) be random variables such that the mean of \(Y\) exists and is finite. The conditional expectation (or conditional mean) of \(Y\) given \(X = x\) is denoted by \(\mathbb{E}(Y \mid x)\) and is defined to be the expectation of the conditional distribution of \(Y\) given \(X = x\). For example, if \(Y\) has a continuous conditional distribution given \(X = x\) with conditional pdf \(g_2(y \mid x)\), then
\[ \mathbb{E}(Y \mid x) = \int_{-\infty}^{\infty}yg_2(y \mid x)\,dy. \tag{4.7.1}\]
Similarly, if \(Y\) has a discrete conditional distribution given \(X = x\) with conditional pmf \(g_2(y \mid x)\), then
\[ \mathbb{E}(Y \mid x) = \sum_{\text{All }y}yg_2(y \mid x). \tag{4.7.2}\]

The value of \(\mathbb{E}(Y \mid x)\) will not be uniquely defined for those values of \(x\) such that the marginal pmf or pdf of \(X\) satisfies \(f_1(x) = 0\). However, since these values of \(x\) form a set of points whose probability is 0, the definition of \(\mathbb{E}(Y \mid x)\) at such a point is irrelevant. (See Exercise 11 in Sec. 3.6.) It is also possible that there will be some values of \(x\) such that the mean of the conditional distribution of \(Y\) given \(X = x\) is undefined. When the mean of \(Y\) exists and is finite, the set of \(x\) values for which the conditional mean is undefined has probability 0.

The expressions in Eqs. (4.7.1) and (4.7.2) are functions of \(x\). These functions of \(x\) can be computed before \(X\) is observed, and this idea leads to the following useful concept.

Definition 4.7.2 (Conditional Means as Random Variables.) Let \(h(x)\) stand for the function of \(x\) that is denoted \(\mathbb{E}(Y \mid x)\) in either (4.7.1) or (4.7.2). Define the symbol \(\mathbb{E}(Y \mid X)\) to mean \(h(X)\) and call it the conditional mean of \(Y\) given \(X\). In other words, \(\mathbb{E}(Y \mid X)\) is a random variable (a function of \(X\)) whose value when \(X = x\) is \(\mathbb{E}(Y \mid x)\). Obviously, we could define \(\mathbb{E}(X \mid Y)\) and \(\mathbb{E}(X \mid y)\) analogously.

Example 4.7.2 (Household Survey.) Consider the household survey in Example 4.7.1. Let \(X\) be the number of members in a randomly selected household from the survey, and let \(Y\) be the number of cars owned by that household. The 250 surveyed households are all equally likely to be selected, so \(\Pr(X = x, Y = y)\) is the number of households with \(x\) members and \(y\) cars, divided by 250. Those probabilities are reported in Table 4.2. Suppose that the sampled household has \(X = 4\) members. The conditional pmf of \(Y\) given \(X = 4\) is \(g_2(y \mid 4) = f(4, y)/f_1(4)\), which is the \(x = 4\) column of Table 4.2 divided by \(f_1(4) = 0.208\), namely,
\[ g_2(0 \mid 4) = 0.0385, \quad g_2(1 \mid 4) = 0.5769, \quad g_2(2 \mid 4) = 0.2885, \quad g_2(3 \mid 4) = 0.0962. \]
The conditional mean of \(Y\) given \(X = 4\) is then
\[ \mathbb{E}(Y \mid 4) = 0 \times 0.0385 + 1 \times 0.5769 + 2 \times 0.2885 + 3 \times 0.0962 = 1.442. \]
Similarly, we can compute \(\mathbb{E}(Y \mid x)\) for all eight values of \(x\). They are

| \(x\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| \(\mathbb{E}(Y \mid x)\) | 0.609 | 1.057 | 1.317 | 1.442 | 1.538 | 1.533 | 1.75 | 2 |
Table 4.2: Joint pmf \(f(x, y)\) of \(X\) and \(Y\) in Example 4.7.2, together with the marginal pmf's \(f_1(x)\) and \(f_2(y)\).

| \(y\) | \(x = 1\) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | \(f_2(y)\) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.040 | 0.028 | 0.012 | 0.008 | 0.008 | 0.004 | 0 | 0 | 0.100 |
| 1 | 0.048 | 0.084 | 0.100 | 0.120 | 0.100 | 0.060 | 0.020 | 0.004 | 0.536 |
| 2 | 0.004 | 0.020 | 0.040 | 0.060 | 0.080 | 0.044 | 0.020 | 0.012 | 0.280 |
| 3 | 0 | 0.008 | 0.012 | 0.020 | 0.020 | 0.012 | 0.008 | 0.004 | 0.084 |
| \(f_1(x)\) | 0.092 | 0.140 | 0.164 | 0.208 | 0.208 | 0.120 | 0.048 | 0.020 | |

The random variable that takes the value 0.609 when the sampled household has one member, takes the value 1.057 when the sampled household has two members, and so on, is the random variable \(\mathbb{E}(Y \mid X)\).

Example 4.7.3 (A Clinical Trial.) Consider a clinical trial in which a number of patients will be treated and each patient will have one of two possible outcomes: success or failure. Let \(P\) be the proportion of successes in a very large collection of patients, and let \(X_i = 1\) if the \(i\)th patient is a success and \(X_i = 0\) if not. Assume that the random variables \(X_1, X_2, \ldots\) are conditionally independent given \(P = p\) with \(\Pr(X_i = 1 \mid P = p) = p\). Let \(X = X_1 + \cdots + X_n\), which is the number of patients out of the first \(n\) who are successes. We now compute the conditional mean of \(X\) given \(P\). The patients are independent and identically distributed conditional on \(P = p\). Hence, the conditional distribution of \(X\) given \(P = p\) is the binomial distribution with parameters \(n\) and \(p\). As we saw in Sec. 4.2, the mean of this binomial distribution is \(np\), so \(\mathbb{E}(X \mid p) = np\) and \(\mathbb{E}(X \mid P) = nP\). Later, we will show how to compute the conditional mean of \(P\) given \(X\). This can be used to predict \(P\) after observing \(X\).

Note: The Conditional Mean of \(Y\) Given \(X\) Is a Random Variable. Because \(\mathbb{E}(Y \mid X)\) is a function of the random variable \(X\), it is itself a random variable with its own probability distribution, which can be derived from the distribution of \(X\). On the other hand, \(h(x) = \mathbb{E}(Y \mid x)\) is a function of \(x\) that can be manipulated like any other function. The connection between the two is that when one substitutes the random variable \(X\) for \(x\) in \(h(x)\), the result is \(h(X) = \mathbb{E}(Y \mid X)\).

We shall now show that the mean of the random variable \(\mathbb{E}(Y \mid X)\) must be \(\mathbb{E}(Y)\). A similar calculation shows that the mean of \(\mathbb{E}(X \mid Y)\) must be \(\mathbb{E}(X)\).

Theorem 4.7.1 (Law of Total Probability for Expectations.) Let \(X\) and \(Y\) be random variables such that \(Y\) has finite mean. Then
\[ \mathbb{E}[\mathbb{E}(Y \mid X)] = \mathbb{E}(Y). \tag{4.7.3}\]

Proof: We shall assume, for convenience, that \(X\) and \(Y\) have a continuous joint distribution. Then
\[ \mathbb{E}[\mathbb{E}(Y \mid X)] = \int_{-\infty}^{\infty}\mathbb{E}(Y \mid x)f_1(x)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}yg_2(y \mid x)f_1(x)\,dy\,dx. \]
Since \(g_2(y \mid x) = f(x, y)/f_1(x)\), it follows that
\[ \mathbb{E}[\mathbb{E}(Y \mid X)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}yf(x, y)\,dy\,dx = \mathbb{E}(Y). \]
The proof for a discrete distribution or a more general type of distribution is similar.

Example 4.7.4 (Household Survey.) At the end of Example 4.7.2, we described the random variable \(\mathbb{E}(Y \mid X)\). Its distribution can be constructed from that description. It has a discrete distribution that takes the eight values of \(\mathbb{E}(Y \mid x)\) listed near the end of that example with corresponding probabilities \(f_1(x)\) for \(x = 1, \ldots, 8\). To be specific, let \(Z = \mathbb{E}(Y \mid X)\); then \(\Pr[Z = \mathbb{E}(Y \mid x)] = f_1(x)\) for \(x = 1, \ldots, 8\). The specific values are

| \(z\) | 0.609 | 1.057 | 1.317 | 1.442 | 1.538 | 1.533 | 1.75 | 2 |
|---|---|---|---|---|---|---|---|---|
| \(\Pr(Z = z)\) | 0.092 | 0.140 | 0.164 | 0.208 | 0.208 | 0.120 | 0.048 | 0.020 |

We can compute \(\mathbb{E}(Z) = 0.609 \times 0.092 + \cdots + 2 \times 0.020 = 1.348\). The reader can verify that \(\mathbb{E}(Y) = 1.348\) by using the values of \(f_2(y)\) in Table 4.2.
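The law of total probability for expectations is easy to verify numerically for the household survey. The following Python sketch encodes Table 4.2 and reproduces the conditional means \(\mathbb{E}(Y \mid x)\) as well as the value \(\mathbb{E}[\mathbb{E}(Y \mid X)] = \mathbb{E}(Y) = 1.348\).

```python
import numpy as np

# Joint pmf from Table 4.2: rows are y = 0,...,3, columns are x = 1,...,8.
f = np.array([
    [0.040, 0.028, 0.012, 0.008, 0.008, 0.004, 0.000, 0.000],
    [0.048, 0.084, 0.100, 0.120, 0.100, 0.060, 0.020, 0.004],
    [0.004, 0.020, 0.040, 0.060, 0.080, 0.044, 0.020, 0.012],
    [0.000, 0.008, 0.012, 0.020, 0.020, 0.012, 0.008, 0.004],
])
y_vals = np.arange(4)      # 0, 1, 2, 3 automobiles
f1 = f.sum(axis=0)         # marginal pmf of X (members)
f2 = f.sum(axis=1)         # marginal pmf of Y (automobiles)

cond_mean = (y_vals[:, None] * f).sum(axis=0) / f1   # E(Y | x) for x = 1,...,8
print(np.round(cond_mean, 3))                        # 0.609, 1.057, ..., 1.75, 2.0

# Theorem 4.7.1: E[E(Y|X)] = E(Y) = 1.348.
print((cond_mean * f1).sum(), (y_vals * f2).sum())
```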
Example 4.7.5 (A Clinical Trial.) In Example 4.7.3, we let \(X\) be the number of patients out of the first \(n\) who are successes. The conditional mean of \(X\) given \(P = p\) was computed as \(\mathbb{E}(X \mid p) = np\), where \(P\) is the proportion of successes in a large population of patients. If the distribution of \(P\) is uniform on the interval \([0, 1]\), then the marginal expected value of \(X\) is \(\mathbb{E}[\mathbb{E}(X \mid P)] = \mathbb{E}(nP) = n/2\). We will see how to calculate \(\mathbb{E}(P \mid X)\) in Example 4.7.8.

Example 4.7.6 (Choosing Points from Uniform Distributions.) Suppose that a point \(X\) is chosen in accordance with the uniform distribution on the interval \([0, 1]\). Also, suppose that after the value \(X = x\) has been observed (\(0 < x < 1\)), a point \(Y\) is chosen in accordance with a uniform distribution on the interval \([x, 1]\). We shall determine the value of \(\mathbb{E}(Y)\). For each given value of \(x\) (\(0 < x < 1\)), \(\mathbb{E}(Y \mid x)\) will be equal to the midpoint \(\frac{1}{2}(x + 1)\) of the interval \([x, 1]\). Therefore, \(\mathbb{E}(Y \mid X) = \frac{1}{2}(X + 1)\) and
\[ \mathbb{E}(Y) = \mathbb{E}[\mathbb{E}(Y \mid X)] = \frac{1}{2}[\mathbb{E}(X) + 1] = \frac{1}{2}\left(\frac{1}{2} + 1\right) = \frac{3}{4}. \]

When manipulating the conditional distribution given \(X = x\), it is safe to act as if \(X\) is the constant \(x\). This fact, which can simplify the calculation of certain conditional means, is now stated without proof.

Theorem 4.7.2 Let \(X\) and \(Y\) be random variables, and let \(Z = r(X, Y)\) for some function \(r\). The conditional distribution of \(Z\) given \(X = x\) is the same as the conditional distribution of \(r(x, Y)\) given \(X = x\).

One consequence of Theorem 4.7.2 when \(X\) and \(Y\) have a continuous joint distribution is that
\[ \mathbb{E}(Z \mid x) = \mathbb{E}(r(x, Y) \mid x) = \int_{-\infty}^{\infty}r(x, y)g_2(y \mid x)\,dy. \]
Theorem 4.7.1 also implies that for two arbitrary random variables \(X\) and \(Y\),
\[ \mathbb{E}\{\mathbb{E}[r(X, Y) \mid X]\} = \mathbb{E}[r(X, Y)], \tag{4.7.4}\]
by letting \(Z = r(X, Y)\) and noting that \(\mathbb{E}\{\mathbb{E}(Z \mid X)\} = \mathbb{E}(Z)\). We can define, in a similar manner, the conditional expectation of \(r(X, Y)\) given \(Y\) and the conditional expectation of a function \(r(X_1, \ldots, X_n)\) of several random variables given one or more of the variables \(X_1, \ldots, X_n\).

Example 4.7.7 (Linear Conditional Expectation.) Suppose that \(\mathbb{E}(Y \mid X) = aX + b\) for some constants \(a\) and \(b\). We shall determine the value of \(\mathbb{E}(XY)\) in terms of \(\mathbb{E}(X)\) and \(\mathbb{E}(X^2)\). By Eq. (4.7.4), \(\mathbb{E}(XY) = \mathbb{E}[\mathbb{E}(XY \mid X)]\). Furthermore, since \(X\) is considered to be given and fixed in the conditional expectation,
\[ \mathbb{E}(XY \mid X) = X\mathbb{E}(Y \mid X) = X(aX + b) = aX^2 + bX. \]
Therefore, \(\mathbb{E}(XY) = \mathbb{E}(aX^2 + bX) = a\mathbb{E}(X^2) + b\mathbb{E}(X)\).

The mean is not the only feature of a conditional distribution that is important enough to get its own name.

Definition 4.7.3 (Conditional Variance.) For every given value \(x\), let \(\text{Var}(Y \mid x)\) denote the variance of the conditional distribution of \(Y\) given that \(X = x\). That is,
\[ \text{Var}(Y \mid x) = \mathbb{E}\{[Y - \mathbb{E}(Y \mid x)]^2 \mid x\}. \tag{4.7.5}\]
We call \(\text{Var}(Y \mid x)\) the conditional variance of \(Y\) given \(X = x\).

The expression in Eq. (4.7.5) is once again a function \(v(x)\). We shall define \(\text{Var}(Y \mid X)\) to be \(v(X)\) and call it the conditional variance of \(Y\) given \(X\).

Note: Other Conditional Quantities. In much the same way as in Definitions 4.7.1 and 4.7.3, we could define any conditional summary of a distribution that we wish. For example, conditional quantiles of \(Y\) given \(X = x\) are the quantiles of the conditional distribution of \(Y\) given \(X = x\). The conditional m.g.f. of \(Y\) given \(X = x\) is the m.g.f. of the conditional distribution of \(Y\) given \(X = x\), etc.
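Example 4.7.7 can also be checked by simulation. The joint distribution in the sketch below is a constructed one (it is not from the text): \(Y = 2X + 3 + \varepsilon\) with \(\varepsilon\) independent of \(X\) and mean 0, so that \(\mathbb{E}(Y \mid X) = 2X + 3\) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Constructed example (assumption, not from the text) with E(Y | X) = aX + b:
# Y = 2X + 3 + e, where e is independent of X and has mean 0.
a, b = 2.0, 3.0
x = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)  # any X with finite E(X^2)
y = a * x + b + rng.normal(0.0, 1.0, size=x.size)

# Example 4.7.7 says E(XY) = a E(X^2) + b E(X); compare the two estimates.
print(np.mean(x * y))
print(a * np.mean(x**2) + b * np.mean(x))
```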
Prediction

At the end of Example 4.7.3, we considered the problem of predicting the proportion \(P\) of successes in a large population of patients given the observed number \(X\) of successes in a sample of size \(n\). In general, consider two arbitrary random variables \(X\) and \(Y\) that have a specified joint distribution, and suppose that after the value of \(X\) has been observed, the value of \(Y\) must be predicted. In other words, the predicted value of \(Y\) can depend on the value of \(X\). We shall assume that this predicted value \(d(X)\) must be chosen so as to minimize the mean squared error \(\mathbb{E}\{[Y - d(X)]^2\}\).

Theorem 4.7.3 The prediction \(d(X)\) that minimizes \(\mathbb{E}\{[Y - d(X)]^2\}\) is \(d(X) = \mathbb{E}(Y \mid X)\).

Proof: We shall prove the theorem in the case in which \(X\) has a continuous distribution, but the proof in the discrete case is virtually identical. Let \(d(X) = \mathbb{E}(Y \mid X)\), and let \(d^*(X)\) be an arbitrary predictor. We need only prove that \(\mathbb{E}\{[Y - d(X)]^2\} \leq \mathbb{E}\{[Y - d^*(X)]^2\}\). It follows from Eq. (4.7.4) that
\[ \mathbb{E}\{[Y - d(X)]^2\} = \mathbb{E}\bigl(\mathbb{E}\{[Y - d(X)]^2 \mid X\}\bigr). \tag{4.7.6}\]
A similar equation holds for \(d^*\). Let \(Z = [Y - d(X)]^2\), and let \(h(x) = \mathbb{E}(Z \mid x)\). Similarly, let \(Z^* = [Y - d^*(X)]^2\) and \(h^*(x) = \mathbb{E}(Z^* \mid x)\). The right-hand side of (4.7.6) is \(\int h(x)f_1(x)\,dx\), and the corresponding expression using \(d^*\) is \(\int h^*(x)f_1(x)\,dx\). So, the proof will be complete if we can prove that
\[ \int h(x)f_1(x)\,dx \leq \int h^*(x)f_1(x)\,dx. \tag{4.7.7}\]
Clearly, Eq. (4.7.7) holds if we can show that \(h(x) \leq h^*(x)\) for all \(x\). That is, the proof is complete if we can show that \(\mathbb{E}\{[Y - d(X)]^2 \mid x\} \leq \mathbb{E}\{[Y - d^*(X)]^2 \mid x\}\). When we condition on \(X = x\), we are allowed to treat \(X\) as if it were the constant \(x\), so we need to show that \(\mathbb{E}\{[Y - d(x)]^2 \mid x\} \leq \mathbb{E}\{[Y - d^*(x)]^2 \mid x\}\). These last expressions are nothing more than the M.S.E.'s for two different predictions \(d(x)\) and \(d^*(x)\) of \(Y\), calculated using the conditional distribution of \(Y\) given \(X = x\). As discussed in Sec. 4.5, the M.S.E. of such a prediction is smallest if the prediction is the mean of the distribution of \(Y\). In this case, that mean is the mean of the conditional distribution of \(Y\) given \(X = x\). Since \(d(x)\) is the mean of the conditional distribution of \(Y\) given \(X = x\), it must have smaller M.S.E. than every other prediction \(d^*(x)\). Hence, \(h(x) \leq h^*(x)\) for all \(x\).

If the value \(X = x\) is observed and the value \(\mathbb{E}(Y \mid x)\) is predicted for \(Y\), then the M.S.E. of this predicted value will be \(\text{Var}(Y \mid x)\), from Definition 4.7.3. It follows from Eq. (4.7.6) that if the prediction is to be made by using the function \(d(X) = \mathbb{E}(Y \mid X)\), then the overall M.S.E., averaged over all the possible values of \(X\), will be \(\mathbb{E}[\text{Var}(Y \mid X)]\). If the value of \(Y\) must be predicted without any information about the value of \(X\), then, as shown in Sec. 4.5, the best prediction is the mean \(\mathbb{E}(Y)\) and the M.S.E. is \(\text{Var}(Y)\). However, if \(X\) can be observed before the prediction is made, the best prediction is \(d(X) = \mathbb{E}(Y \mid X)\) and the M.S.E. is \(\mathbb{E}[\text{Var}(Y \mid X)]\). Thus, the reduction in the M.S.E. that can be achieved by using the observation \(X\) is
\[ \text{Var}(Y) - \mathbb{E}[\text{Var}(Y \mid X)]. \tag{4.7.8}\]
This reduction provides a measure of the usefulness of \(X\) in predicting \(Y\). It is shown in Exercise 11 at the end of this section that this reduction can also be expressed as \(\text{Var}[\mathbb{E}(Y \mid X)]\).

It is important to distinguish carefully between the overall M.S.E., which is \(\mathbb{E}[\text{Var}(Y \mid X)]\), and the M.S.E. of the particular prediction to be made when \(X = x\), which is \(\text{Var}(Y \mid x)\). Before the value of \(X\) has been observed, the appropriate value for the M.S.E. of the complete process of observing \(X\) and then predicting \(Y\) is \(\mathbb{E}[\text{Var}(Y \mid X)]\). After a particular value \(x\) of \(X\) has been observed and the prediction \(\mathbb{E}(Y \mid x)\) has been made, the appropriate measure of the M.S.E. of this prediction is \(\text{Var}(Y \mid x)\). A useful relationship between these values is given in the following result, whose proof is left to Exercise 11.

Theorem 4.7.4 (Law of Total Probability for Variances.) If \(X\) and \(Y\) are arbitrary random variables for which the necessary expectations and variances exist, then
\[ \text{Var}(Y) = \mathbb{E}[\text{Var}(Y \mid X)] + \text{Var}[\mathbb{E}(Y \mid X)]. \]
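Theorems 4.7.3 and 4.7.4 can be illustrated with the distributions of Example 4.7.6, where \(X\) is uniform on \([0, 1]\) and, given \(X = x\), \(Y\) is uniform on \([x, 1]\). The following Monte Carlo sketch (sample size and seed are arbitrary choices) compares the M.S.E. of the predictor \(\mathbb{E}(Y \mid X)\) with that of the constant predictor \(\mathbb{E}(Y) = 3/4\), and checks the law of total probability for variances.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000

# Example 4.7.6: X ~ Uniform(0, 1); given X = x, Y ~ Uniform(x, 1).
x = rng.uniform(0.0, 1.0, n)
y = rng.uniform(x, 1.0)                 # low end varies with x, high end is 1

cond_mean = (x + 1.0) / 2.0             # E(Y | X) = (X + 1)/2
cond_var = (1.0 - x) ** 2 / 12.0        # Var(Y | X) = (1 - X)^2 / 12

# Theorem 4.7.4: Var(Y) = E[Var(Y|X)] + Var[E(Y|X)] = 1/36 + 1/48 = 7/144.
print(np.var(y), cond_var.mean() + np.var(cond_mean))

# Theorem 4.7.3: E(Y|X) has smaller M.S.E. than the constant predictor E(Y) = 3/4.
print(np.mean((y - cond_mean) ** 2))    # about E[Var(Y|X)] = 1/36
print(np.mean((y - 0.75) ** 2))         # about Var(Y) = 7/144
```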
Example 4.7.8 (A Clinical Trial.) In Example 4.7.3, let \(X\) be the number of patients out of the first 40 in a clinical trial who have success as their outcome. Let \(P\) be the probability that an individual patient is a success. Suppose that \(P\) has the uniform distribution on the interval \([0, 1]\) before the trial begins, and suppose that the outcomes of the patients are conditionally independent given \(P = p\). As we saw in Example 4.7.3, \(X\) has the binomial distribution with parameters 40 and \(p\) given \(P = p\). If we needed to minimize the M.S.E. in predicting \(P\) before observing \(X\), we would use the mean of \(P\), namely, 1/2. The M.S.E. would be \(\text{Var}(P) = 1/12\).

However, we shall soon observe the value of \(X\) and then predict \(P\). To do this, we shall need the conditional distribution of \(P\) given \(X = x\). Bayes' theorem for random variables (3.6.13) tells us that the conditional pdf of \(P\) given \(X = x\) is
\[ g_2(p \mid x) = \frac{g_1(x \mid p)f_2(p)}{f_1(x)}, \tag{4.7.9}\]
where \(g_1(x \mid p)\) is the conditional pmf of \(X\) given \(P = p\), namely, the binomial pmf
\[ g_1(x \mid p) = \binom{40}{x}p^x(1 - p)^{40 - x} \quad\text{for }x = 0, \ldots, 40, \]
\(f_2(p) = 1\) for \(0 < p < 1\) is the marginal pdf of \(P\), and \(f_1(x)\) is the marginal pmf of \(X\) obtained from the law of total probability for random variables (3.6.12):
\[ f_1(x) = \int_0^1\binom{40}{x}p^x(1 - p)^{40 - x}\,dp. \tag{4.7.10}\]
This last integral looks difficult to compute. However, there is a simple formula for integrals of this form, namely,
\[ \int_0^1 p^k(1 - p)^{\ell}\,dp = \frac{k!\,\ell!}{(k + \ell + 1)!}. \tag{4.7.11}\]
A proof of Eq. (4.7.11) is given in Sec. 5.8. Substituting (4.7.11) into (4.7.10) yields
\[ f_1(x) = \frac{40!}{x!(40 - x)!}\cdot\frac{x!(40 - x)!}{41!} = \frac{1}{41}, \quad\text{for }x = 0, \ldots, 40. \]
Substituting this into Eq. (4.7.9) yields
\[ g_2(p \mid x) = \frac{41!}{x!(40 - x)!}p^x(1 - p)^{40 - x}, \quad\text{for }0 < p < 1. \]
For example, with \(x = 18\), the observed number of successes in Table 2.1, a graph of \(g_2(p \mid 18)\) is shown in Fig. 4.12.

[Figure 4.12. The conditional pdf of \(P\) given \(X = 18\) in Example 4.7.8, together with the marginal pdf of \(P\) (prior to observing \(X\)).]

If we want to minimize the M.S.E. when predicting \(P\), we should use \(\mathbb{E}(P \mid x)\), the conditional mean. We can compute \(\mathbb{E}(P \mid x)\) using the conditional pdf and Eq. (4.7.11):
\[ \mathbb{E}(P \mid x) = \int_0^1 p\,\frac{41!}{x!(40 - x)!}p^x(1 - p)^{40 - x}\,dp = \frac{41!}{x!(40 - x)!}\cdot\frac{(x + 1)!(40 - x)!}{42!} = \frac{x + 1}{42}. \tag{4.7.12}\]
So, after \(X = x\) is observed, we will predict \(P\) to be \((x + 1)/42\), which is very close to the proportion of the first 40 patients who are successes. The M.S.E. after observing \(X = x\) is the conditional variance \(\text{Var}(P \mid x)\). We can compute this using (4.7.12) and
\[ \mathbb{E}(P^2 \mid x) = \int_0^1 p^2\,\frac{41!}{x!(40 - x)!}p^x(1 - p)^{40 - x}\,dp = \frac{41!}{x!(40 - x)!}\cdot\frac{(x + 2)!(40 - x)!}{43!} = \frac{(x + 1)(x + 2)}{42 \times 43}. \]
Using the fact that \(\text{Var}(P \mid x) = \mathbb{E}(P^2 \mid x) - [\mathbb{E}(P \mid x)]^2\), we see that
\[ \text{Var}(P \mid x) = \frac{(x + 1)(41 - x)}{42^2 \times 43}. \]
The overall M.S.E. of predicting \(P\) from \(X\) is the mean of the conditional M.S.E.:
\[ \begin{aligned} \mathbb{E}[\text{Var}(P \mid X)] &= \mathbb{E}\left[\frac{(X + 1)(41 - X)}{42^2 \times 43}\right] = \frac{1}{75{,}852}\mathbb{E}(-X^2 + 40X + 41) \\ &= \frac{1}{75{,}852}\left(-\frac{1}{41}\sum_{x=0}^{40}x^2 + \frac{40}{41}\sum_{x=0}^{40}x + 41\right) \\ &= \frac{1}{75{,}852}\left(-\frac{1}{41}\cdot\frac{40 \times 41 \times 81}{6} + \frac{40}{41}\cdot\frac{40 \times 41}{2} + 41\right) = \frac{301}{75{,}852} = 0.003968. \end{aligned} \]
In this calculation, we used two popular formulas,
\[ \sum_{k=0}^{n}k = \frac{n(n + 1)}{2}, \tag{4.7.13}\]
\[ \sum_{k=0}^{n}k^2 = \frac{n(n + 1)(2n + 1)}{6}. \tag{4.7.14}\]
The overall M.S.E. is quite a bit smaller than the value \(1/12 = 0.0833\), which we would have obtained before observing \(X\). As an illustration, Fig. 4.12 shows how much more spread out the marginal distribution of \(P\) is compared to the conditional distribution of \(P\) after observing \(X = 18\).

It should be emphasized that for the conditions of Example 4.7.8, 0.003968 is the appropriate value of the overall M.S.E. when it is known that the value of \(X\) will be available for predicting \(P\) but before the explicit value of \(X\) has been determined. After the value of \(X = x\) has been determined, the appropriate value of the M.S.E. is \(\text{Var}(P \mid x) = (x + 1)(41 - x)/75{,}852\). Notice that the largest possible value of \(\text{Var}(P \mid x)\) is 0.005814 when \(x = 20\) and is still much less than 1/12.

A result similar to Theorem 4.7.3 holds if we are trying to minimize the M.A.E. (mean absolute error) of our prediction rather than the M.S.E. In Exercise 16, you can prove that the predictor that minimizes M.A.E. is \(d(X)\) equal to the median of the conditional distribution of \(Y\) given \(X\).
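The closed-form answers in Example 4.7.8 can be checked with exact rational arithmetic:

```python
from fractions import Fraction

n = 40
# Uniform prior on P and X | P = p ~ Binomial(40, p), as in Example 4.7.8.
post_mean = [Fraction(x + 1, 42) for x in range(n + 1)]                # E(P | x)
post_var = [Fraction((x + 1) * (41 - x), 42**2 * 43) for x in range(n + 1)]

print(float(post_mean[18]))        # E(P | X = 18) = 19/42, about 0.452
print(float(max(post_var)))        # largest Var(P | x), about 0.005814 (at x = 20)

# Overall M.S.E.: average of Var(P | x) under the marginal pmf f1(x) = 1/41.
overall_mse = sum(v * Fraction(1, 41) for v in post_var)
print(overall_mse, float(overall_mse))   # 301/75852 = 1/252, about 0.003968
```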
Summary

The conditional mean \(\mathbb{E}(Y \mid x)\) of \(Y\) given \(X = x\) is the mean of the conditional distribution of \(Y\) given \(X = x\). This conditional distribution was defined in Chapter 3. Likewise, the conditional variance \(\text{Var}(Y \mid x)\) of \(Y\) given \(X = x\) is the variance of the conditional distribution. The law of total probability for expectations says that \(\mathbb{E}[\mathbb{E}(Y \mid X)] = \mathbb{E}(Y)\). If we will observe \(X\) and then need to predict \(Y\), the predictor that leads to the smallest M.S.E. is the conditional mean \(\mathbb{E}(Y \mid X)\).

Exercises

1. Consider again the situation described in Example 4.7.8. Compute the M.S.E. when using \(\mathbb{E}(P \mid x)\) to predict \(P\) after observing \(X = 18\). How much smaller is this than the marginal M.S.E. 1/12?
2. Suppose that 20 percent of the students who took a certain test were from school A and that the arithmetic average of their scores on the test was 80. Suppose also that 30 percent of the students were from school B and that the arithmetic average of their scores was 76. Suppose, finally, that the other 50 percent of the students were from school C and that the arithmetic average of their scores was 84. If a student is selected at random from the entire group that took the test, what is the expected value of her score?
3. Suppose that \(0 < \text{Var}(X) < \infty\) and \(0 < \text{Var}(Y) < \infty\). Show that if \(\mathbb{E}(X \mid Y)\) is constant for all values of \(Y\), then \(X\) and \(Y\) are uncorrelated.
4. Suppose that the distribution of \(X\) is symmetric with respect to the point \(x = 0\), that all moments of \(X\) exist, and that \(\mathbb{E}(Y \mid X) = aX + b\), where \(a\) and \(b\) are given constants. Show that \(X^{2m}\) and \(Y\) are uncorrelated for \(m = 1, 2, \ldots\).
5. Suppose that a point \(X_1\) is chosen from the uniform distribution on the interval \([0, 1]\), and that after the value \(X_1 = x_1\) is observed, a point \(X_2\) is chosen from a uniform distribution on the interval \([x_1, 1]\). Suppose further that additional variables \(X_3, X_4, \ldots\) are generated in the same way. In general, for \(j = 1, 2, \ldots\), after the value \(X_j = x_j\) has been observed, \(X_{j+1}\) is chosen from a uniform distribution on the interval \([x_j, 1]\). Find the value of \(\mathbb{E}(X_n)\).
6. Suppose that the joint distribution of \(X\) and \(Y\) is the uniform distribution on the circle \(x^2 + y^2 < 1\). Find \(\mathbb{E}(X \mid Y)\).
7. Suppose that \(X\) and \(Y\) have a continuous joint distribution for which the joint pdf is \(f(x, y) = x + y\) for \(0 \leq x \leq 1\) and \(0 \leq y \leq 1\), and \(f(x, y) = 0\) otherwise. Find \(\mathbb{E}(Y \mid X)\) and \(\text{Var}(Y \mid X)\).
8. Consider again the conditions of Exercise 7. (a) If it is observed that \(X = 1/2\), what predicted value of \(Y\) will have the smallest M.S.E.? (b) What will be the value of this M.S.E.?
9. Consider again the conditions of Exercise 7. If the value of \(Y\) is to be predicted from the value of \(X\), what will be the minimum value of the overall M.S.E.?
10. Suppose that, for the conditions in Exercises 7 and 9, a person either can pay a cost \(c\) for the opportunity of observing the value of \(X\) before predicting the value of \(Y\) or can simply predict the value of \(Y\) without first observing the value of \(X\). If the person considers her total loss to be the cost \(c\) plus the M.S.E. of her predicted value, what is the maximum value of \(c\) that she should be willing to pay?
11. Prove Theorem 4.7.4.
12. Suppose that \(X\) and \(Y\) are random variables such that \(\mathbb{E}(Y \mid X) = aX + b\). Assuming that \(\text{Cov}(X, Y)\) exists and that \(0 < \text{Var}(X) < \infty\), determine expressions for \(a\) and \(b\) in terms of \(\mathbb{E}(X)\), \(\mathbb{E}(Y)\), \(\text{Var}(X)\), and \(\text{Cov}(X, Y)\).
13. Suppose that a person's score \(X\) on a mathematics aptitude test is a number in the interval \((0, 1)\) and that his score \(Y\) on a music aptitude test is also a number in the interval \((0, 1)\). Suppose also that in the population of all college students in the United States, the scores \(X\) and \(Y\) are distributed in accordance with the joint pdf \(f(x, y) = \frac{2}{5}(2x + 3y)\) for \(0 \leq x \leq 1\) and \(0 \leq y \leq 1\), and \(f(x, y) = 0\) otherwise. (a) If a college student is selected at random, what predicted value of his score on the music test has the smallest M.S.E.? (b) What predicted value of his score on the mathematics test has the smallest M.A.E.?
14. Consider again the conditions of Exercise 13. Are the scores of college students on the mathematics test and the music test positively correlated, negatively correlated, or uncorrelated?
15. Consider again the conditions of Exercise 13. (a) If a student's score on the mathematics test is 0.8, what predicted value of his score on the music test has the smallest M.S.E.? (b) If a student's score on the music test is 1/3, what predicted value of his score on the mathematics test has the smallest M.A.E.?
16. Define a conditional median of \(Y\) given \(X = x\) to be any median of the conditional distribution of \(Y\) given \(X = x\). Suppose that we will get to observe \(X\) and then we will need to predict \(Y\). Suppose that we wish to choose our prediction \(d(X)\) so as to minimize the mean absolute error, \(\mathbb{E}(|Y - d(X)|)\). Prove that \(d(x)\) should be chosen to be a conditional median of \(Y\) given \(X = x\). Hint: You can modify the proof of Theorem 4.7.3 to handle this case.
17. Prove Theorem 4.7.2 for the case in which \(X\) and \(Y\) have a discrete joint distribution. The key to the proof is to write all of the necessary conditional pmf's in terms of the joint pmf of \(X\) and \(Y\) and the marginal pmf of \(X\). To facilitate this, for each \(x\) and \(z\), give a name to the set of \(y\) values such that \(r(x, y) = z\).
4.8 Utility
Much of statistical inference consists of choosing between several available actions. Generally, we do not know for certain which choice will be best, because some important random variable has not yet been observed. For some values of that random variable one choice is best, and for other values some other choice is best. We can try to weigh the costs and benefits of the various choices against the probabilities that the various choices turn out to be best. Utility is one tool for assigning values to the costs and benefits of our choices. The expected value of the utility then balances the costs and benefits according to how likely the uncertain possibilities are.
4.8.1 Utility Functions
Example 4.11 (Example 4.8.1: Choice of Gambles.) Consider two gambles between which a gambler must choose. Each gamble will be expressed as a random variable for which positive values mean a gain to the gambler and negative values mean a loss to the gambler. The numerical values of each random variable tell the number of dollars that the gambler gains or loses. Let \(X\) have the pf
\[ f(x) = \begin{cases} 0.5 &\text{if }x = 500\text{ or }x = -350, \\ 0 &\text{otherwise,} \end{cases} \]
and let \(Y\) have the pf
\[ g(y) = \begin{cases} 1/3 &\text{if }y = 40,\ y = 50,\text{ or }y = 60, \\ 0 &\text{otherwise.} \end{cases} \]
It is simple to compute that \(\mathbb{E}(X) = 75\) and \(\mathbb{E}(Y) = 50\). How might a gambler choose between these two gambles? Is \(X\) better than \(Y\) simply because it has a higher expected value?
In Example 4.8.1, a gambler who does not desire to risk losing 350 dollars for the chance of winning 500 dollars might prefer \(Y\), which yields a certain gain of at least 40 dollars. The theory of utility was developed during the 1930s and 1940s to describe a person's preference among gambles like those in Example 4.8.1. According to that theory, a person will prefer a gamble \(X\) for which the expectation of a certain function \(U(X)\) is a maximum, rather than a gamble for which simply the expected gain \(\mathbb{E}(X)\) is a maximum.

Definition 4.8.1 (Utility Function.) A person's utility function \(U\) is a function that assigns to each possible amount \(x\) (\(-\infty < x < \infty\)) a number \(U(x)\) representing the actual worth to the person of gaining the amount \(x\).

Example 4.8.2 (Choice of Gambles.) Suppose that a person's utility function is \(U\) and that she must choose between the gambles \(X\) and \(Y\) in Example 4.8.1. Then
\[ \mathbb{E}[U(X)] = \frac{1}{2}U(500) + \frac{1}{2}U(-350) \tag{4.8.1}\]
and
\[ \mathbb{E}[U(Y)] = \frac{1}{3}U(60) + \frac{1}{3}U(50) + \frac{1}{3}U(40). \tag{4.8.2}\]
The person would prefer the gamble for which the expected utility of the gain, as specified by Eq. (4.8.1) or Eq. (4.8.2), is larger. As a specific example, consider the following utility function that penalizes losses to a much greater extent than it rewards gains:
\[ U(x) = \begin{cases} 100\log(x + 100) - 461 &\text{if }x \geq 0, \\ x &\text{if }x < 0. \end{cases} \tag{4.8.3}\]
This function was chosen to be differentiable at \(x = 0\), continuous everywhere, increasing, concave for \(x > 0\), and linear for \(x < 0\). A graph of \(U(x)\) is given in Fig. 4.13.

[Figure 4.13. The utility function \(U(x)\) of Eq. (4.8.3) in Example 4.8.2.]

Using this specific \(U\), we compute
\[ \begin{aligned} \mathbb{E}[U(X)] &= \frac{1}{2}[100\log(600) - 461] + \frac{1}{2}(-350) = -85.4, \\ \mathbb{E}[U(Y)] &= \frac{1}{3}[100\log(160) - 461] + \frac{1}{3}[100\log(150) - 461] + \frac{1}{3}[100\log(140) - 461] = 40.4. \end{aligned} \]
We see that a person with the utility function in Eq. (4.8.3) would prefer \(Y\) to \(X\).

Here, we formalize the principle that underlies the choice between gambles illustrated in Example 4.8.1.

Definition 4.8.2 (Maximizing Expected Utility.) We say that a person chooses between gambles by maximizing expected utility if the following conditions hold. There is a utility function \(U\), and when the person must choose between any two gambles \(X\) and \(Y\), he will prefer \(X\) to \(Y\) if \(\mathbb{E}[U(X)] > \mathbb{E}[U(Y)]\) and will be indifferent between \(X\) and \(Y\) if \(\mathbb{E}[U(X)] = \mathbb{E}[U(Y)]\).

In words, Definition 4.8.2 says that a person chooses between gambles by maximizing expected utility if he will choose a gamble \(X\) for which \(\mathbb{E}[U(X)]\) is a maximum. If one adopts a utility function, then one can (at least in principle) make choices between gambles by maximizing expected utility. The computational algorithms necessary to perform the maximization often provide a practical challenge. Conversely, if one makes choices between gambles in such a way that certain reasonable criteria apply, then one can prove that there exists a utility function such that the choices correspond to maximizing expected utility. We shall not consider this latter problem in detail here; however, it is discussed by DeGroot (1970) and Schervish (1995, chapter 3) along with other aspects of the theory of utility.

Examples of Utility Functions

Since it is reasonable to assume that every person prefers a larger gain to a smaller gain, we shall assume that every utility function \(U(x)\) is an increasing function of the gain \(x\). However, the shape of the function \(U(x)\) will vary from person to person and will depend on each person's willingness to risk losses of various amounts in attempting to increase his gains.
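The two expected utilities in Example 4.8.2 are easy to reproduce. In the sketch below, the constant 461 in Eq. (4.8.3) is replaced by its exact value \(100\log(100) \approx 460.52\) (the text uses the rounded value), so that \(U(0) = 0\) exactly; with the rounded constant the results change only slightly.

```python
import math

def u(x):
    # Eq. (4.8.3); the exact constant 100*log(100) is used in place of the
    # rounded 461 so that U(0) = 0.
    return 100 * math.log(x + 100) - 100 * math.log(100) if x >= 0 else x

def expected_utility(gamble):
    return sum(p * u(v) for v, p in gamble.items())

gamble_x = {500: 0.5, -350: 0.5}
gamble_y = {40: 1 / 3, 50: 1 / 3, 60: 1 / 3}

print(expected_utility(gamble_x))   # about -85.4
print(expected_utility(gamble_y))   # about  40.4, so Y is preferred
```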
For example, consider two gambles \(X\) and \(Y\) for which the gains have the following probability distributions:
\[ \Pr(X = -3) = 0.5, \quad \Pr(X = 2.5) = 0.4, \quad \Pr(X = 6) = 0.1 \tag{4.8.4}\]
and
\[ \Pr(Y = -2) = 0.3, \quad \Pr(Y = 1) = 0.4, \quad \Pr(Y = 3) = 0.3. \tag{4.8.5}\]
We shall assume that a person must choose one of the following three decisions: (i) accept gamble \(X\), (ii) accept gamble \(Y\), or (iii) do not accept either gamble. We shall now determine the decision that a person would choose for three different utility functions.

Example 4.8.3 (Linear Utility Function.) Suppose that \(U(x) = ax + b\) for some constants \(a\) and \(b\), where \(a > 0\). In this case, for every gamble \(X\), \(\mathbb{E}[U(X)] = a\mathbb{E}(X) + b\). Hence, for every two gambles \(X\) and \(Y\), \(\mathbb{E}[U(X)] > \mathbb{E}[U(Y)]\) if and only if \(\mathbb{E}(X) > \mathbb{E}(Y)\). In other words, a person who has a linear utility function will always choose a gamble for which the expected gain is a maximum. When the gambles \(X\) and \(Y\) are defined by Eqs. (4.8.4) and (4.8.5), \(\mathbb{E}(X) = (0.5)(-3) + (0.4)(2.5) + (0.1)(6) = 0.1\) and \(\mathbb{E}(Y) = (0.3)(-2) + (0.4)(1) + (0.3)(3) = 0.7\). Furthermore, since the gain from not accepting either of these gambles is 0, the expected gain from choosing not to accept either gamble is clearly 0. Since \(\mathbb{E}(Y) > \mathbb{E}(X) > 0\), it follows that a person who has a linear utility function would choose to accept gamble \(Y\). If gamble \(Y\) were not available, then the person would prefer to accept gamble \(X\) rather than not to gamble at all.

Example 4.8.4 (Cubic Utility Function.) Suppose that a person's utility function is \(U(x) = x^3\) for \(-\infty < x < \infty\). Then for the gambles defined by Eqs. (4.8.4) and (4.8.5), \(\mathbb{E}[U(X)] = (0.5)(-3)^3 + (0.4)(2.5)^3 + (0.1)(6)^3 = 14.35\) and \(\mathbb{E}[U(Y)] = (0.3)(-2)^3 + (0.4)(1)^3 + (0.3)(3)^3 = 6.1\). Furthermore, the utility of not accepting either gamble is \(U(0) = 0^3 = 0\). Since \(\mathbb{E}[U(X)] > \mathbb{E}[U(Y)] > 0\), it follows that the person would choose to accept gamble \(X\). If gamble \(X\) were not available, the person would prefer to accept gamble \(Y\) rather than not to gamble at all.

Example 4.8.5 (Logarithmic Utility Function.) Suppose that a person's utility function is \(U(x) = \log(x + 4)\) for \(x > -4\). Since \(\lim_{x \to -4}\log(x + 4) = -\infty\), a person who has this utility function cannot choose a gamble in which there is any possibility of her gain being \(-4\) or less. For the gambles \(X\) and \(Y\) defined by Eqs. (4.8.4) and (4.8.5), \(\mathbb{E}[U(X)] = (0.5)(\log 1) + (0.4)(\log 6.5) + (0.1)(\log 10) = 0.9790\) and \(\mathbb{E}[U(Y)] = (0.3)(\log 2) + (0.4)(\log 5) + (0.3)(\log 7) = 1.4355\). Furthermore, the utility of not accepting either gamble is \(U(0) = \log 4 = 1.3863\). Since \(\mathbb{E}[U(Y)] > U(0) > \mathbb{E}[U(X)]\), it follows that the person would choose to accept gamble \(Y\). If gamble \(Y\) were not available, the person would prefer not to gamble at all rather than to accept gamble \(X\).

Selling a Lottery Ticket

Suppose that a person has a lottery ticket from which she will receive a random gain of \(X\) dollars, where \(X\) has a specified probability distribution. We shall determine the number of dollars for which the person would be willing to sell this lottery ticket. Let \(U\) denote the person's utility function. Then the expected utility of her gain from the lottery ticket is \(\mathbb{E}[U(X)]\). If she sells the lottery ticket for \(x_0\) dollars, then her gain is \(x_0\) dollars, and the utility of this gain is \(U(x_0)\). The person would prefer to accept \(x_0\) dollars as a certain gain rather than accept the random gain \(X\) from the lottery ticket if and only if \(U(x_0) > \mathbb{E}[U(X)]\). Hence, the person would be willing to sell the lottery ticket for any amount \(x_0\) such that \(U(x_0) > \mathbb{E}[U(X)]\).
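The three comparisons in Examples 4.8.3–4.8.5 amount to computing an expected utility for each gamble under each utility function, which the following sketch reproduces:

```python
import math

x_gamble = {-3: 0.5, 2.5: 0.4, 6: 0.1}   # Eq. (4.8.4)
y_gamble = {-2: 0.3, 1: 0.4, 3: 0.3}     # Eq. (4.8.5)

utilities = {
    "linear (Example 4.8.3)": lambda v: v,
    "cubic (Example 4.8.4)": lambda v: v**3,
    "logarithmic (Example 4.8.5)": lambda v: math.log(v + 4),
}

for name, u in utilities.items():
    eu_x = sum(p * u(v) for v, p in x_gamble.items())
    eu_y = sum(p * u(v) for v, p in y_gamble.items())
    print(f"{name}: E[U(X)] = {eu_x:.4f}, E[U(Y)] = {eu_y:.4f}, U(0) = {u(0):.4f}")
# linear:      0.1000, 0.7000, 0.0000  -> accept Y
# cubic:       14.3500, 6.1000, 0.0000 -> accept X
# logarithmic: 0.9790, 1.4355, 1.3863  -> accept Y; X is worse than not gambling
```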
If \(U(x_0) = \mathbb{E}[U(X)]\), she would be equally willing to either sell the lottery ticket or accept the random gain \(X\).

Example 4.8.6 (Quadratic Utility Function.) Suppose that \(U(x) = x^2\) for \(x \geq 0\), and suppose that the person has a lottery ticket from which she will win either 36 dollars with probability 1/4 or 0 dollars with probability 3/4. For how many dollars \(x_0\) would she be willing to sell this lottery ticket? The expected utility of the gain from the lottery ticket is
\[ \mathbb{E}[U(X)] = \frac{1}{4}U(36) + \frac{3}{4}U(0) = \frac{1}{4}(36^2) + \frac{3}{4}(0) = 324. \]
Therefore, the person would be willing to sell the lottery ticket for any amount \(x_0\) such that \(U(x_0) = x_0^2 > 324\). Hence, \(x_0 > 18\). In other words, although the expected gain from the lottery ticket in this example is only 9 dollars, the person would not sell the ticket for less than 18 dollars.

Example 4.8.7 (Square Root Utility Function.) Suppose now that \(U(x) = x^{1/2}\) for \(x \geq 0\), and consider again the lottery ticket described in Example 4.8.6. The expected utility of the gain from the lottery ticket in this case is
\[ \mathbb{E}[U(X)] = \frac{1}{4}U(36) + \frac{3}{4}U(0) = \frac{1}{4}(6) + \frac{3}{4}(0) = 1.5. \]
Therefore, the person would be willing to sell the lottery ticket for any amount \(x_0\) such that \(U(x_0) = x_0^{1/2} > 1.5\). Hence, \(x_0 > 2.25\). In other words, although the expected gain from the lottery ticket in this example is 9 dollars, the person would be willing to sell the ticket for as little as 2.25 dollars.

Some Statistical Decision Problems

Much of the theory of statistical inference (the subject of Chapters 7–11 of this text) deals with problems in which one has to make one of several available choices. Generally, which choice is best depends on some random variable that has not yet been observed. One example was already discussed in Sec. 4.5, where we introduced the mean squared error (M.S.E.) and mean absolute error (M.A.E.) criteria for predicting a random variable. In these cases, we have to choose a number \(d\) for our prediction of a random variable \(Y\). Which prediction will be best depends on the value of \(Y\) that we do not yet know. Random variables like \(-|Y - d|\) and \(-(Y - d)^2\) are gambles, and the choice of gamble that minimizes M.A.E. or M.S.E. is the choice that maximizes an expected utility.

Example 4.8.8 (Predicting a Random Variable.) Suppose that \(Y\) is a random variable that we need to predict. For each possible prediction \(d\), there is a gamble \(X_d = -|Y - d|\) that specifies our gain when we are being judged by absolute error. Alternatively, if we are being judged by squared error, the appropriate gamble to consider would be \(Z_d = -(Y - d)^2\). Notice that these gambles are always negative, meaning that our gain is negative because we lose according to how far \(Y\) is from the prediction \(d\). If our utility \(U\) is linear, then maximizing \(\mathbb{E}[U(X_d)]\) by choice of \(d\) is the same as minimizing M.A.E. Also, maximizing \(\mathbb{E}[U(Z_d)]\) by choice of \(d\) is the same as minimizing M.S.E. The equivalence between maximizing expected utility and minimizing the mean error would continue to hold if the prediction were allowed to depend on another random variable \(W\) that we could observe before predicting. That is, our prediction would be a function \(d(W)\), and \(X_d = -|Y - d(W)|\) or \(Z_d = -[Y - d(W)]^2\) would be the gamble whose expected utility we would want to compute.

Example 4.8.9 (Bounding a Random Variable.) Suppose that \(Y\) is a random variable and that we are interested in whether or not \(Y \leq c\) for some constant \(c\).
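The selling prices in Examples 4.8.6 and 4.8.7 follow from computing \(\mathbb{E}[U(X)]\) and then inverting the utility function:

```python
# Examples 4.8.6 and 4.8.7: the ticket pays 36 with probability 1/4 and 0 otherwise.
ticket = {36: 0.25, 0: 0.75}

cases = [
    ("U(x) = x^2", lambda x: x**2, lambda v: v**0.5),       # Example 4.8.6
    ("U(x) = x^(1/2)", lambda x: x**0.5, lambda v: v**2),   # Example 4.8.7
]
for name, u, u_inverse in cases:
    eu = sum(p * u(v) for v, p in ticket.items())
    print(f"{name}: E[U(X)] = {eu}, willing to sell for x0 > {u_inverse(eu)}")
# x^2:     E[U(X)] = 324.0 -> x0 > 18
# x^(1/2): E[U(X)] = 1.5   -> x0 > 2.25
```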
For example, \(Y\) could be the random variable \(P\) in our clinical trial Example 4.7.3. We might be interested in whether or not \(P \leq p_0\), where \(p_0\) is the probability that a patient will be a success without any help from the treatment being studied. Suppose that we have to make one of two available decisions: (t) continue to promote the treatment, or (a) abandon the treatment. If we choose t, suppose that we stand to gain
\[ X_t = \begin{cases} 10^6 &\text{if }P > p_0, \\ -10^6 &\text{if }P \leq p_0. \end{cases} \]
If we choose a, our gain will be \(X_a = 0\). If our utility function is \(U\), then the expected utility for choosing t is \(\mathbb{E}[U(X_t)]\), and t would be the better choice if this value is greater than \(U(0)\). For example, suppose that our utility is
\[ U(x) = \begin{cases} x^{0.8} &\text{if }x \geq 0, \\ x &\text{if }x < 0. \end{cases} \tag{4.8.6}\]
Then \(U(0) = 0\) and
\[ \mathbb{E}[U(X_t)] = -10^6\Pr(P \leq p_0) + (10^6)^{0.8}\Pr(P > p_0) = 10^{4.8} - (10^6 + 10^{4.8})\Pr(P \leq p_0). \]
So, \(\mathbb{E}[U(X_t)] > 0\) if \(\Pr(P \leq p_0) < 10^{4.8}/(10^6 + 10^{4.8}) = 0.0594\). It makes sense that t is better than a if \(\Pr(P \leq p_0)\) is small. The reason is that the utility of choosing t over a is only positive when \(P > p_0\). This example is in the spirit of hypothesis testing, which will be the subject of Chapter 9.

Example 4.8.10 (Investment.) In Example 4.2.2, we compared two possible stock purchases based on their expected returns and value at risk, VaR. Suppose that the investor has a nonlinear utility function for dollars. To be specific, suppose that the utility of a return of \(x\) would equal \(U(x)\) given in Eq. (4.8.6). We can calculate the expected utility of the return from each of the two possible stock purchases in Example 4.2.2 to decide which is more favorable. If \(R\) is the return per share and we buy \(s\) shares, then the return is \(X = sR\), and the expected utility of the return is
\[ \mathbb{E}[U(sR)] = \int_{-\infty}^{0}srf(r)\,dr + \int_{0}^{\infty}(sr)^{0.8}f(r)\,dr, \tag{4.8.7}\]
where \(f\) is the pdf of \(R\). For the first stock, the return per share is \(R_1\) distributed uniformly on the interval \([-10, 20]\), and the number of shares would be \(s_1 = 120\). This makes (4.8.7) equal to
\[ \mathbb{E}[U(120R_1)] = \int_{-10}^{0}\frac{120r}{30}\,dr + \int_{0}^{20}\frac{(120r)^{0.8}}{30}\,dr = -12.6. \]
For the second stock, the return per share is \(R_2\) distributed uniformly on the interval \([-4.5, 10]\), and the number of shares would be \(s_2 = 200\). This makes (4.8.7) equal to
\[ \mathbb{E}[U(200R_2)] = \int_{-4.5}^{0}\frac{200r}{14.5}\,dr + \int_{0}^{10}\frac{(200r)^{0.8}}{14.5}\,dr = 27.9. \]
With this utility function, the expected utility of the first stock purchase is actually negative because the big gains (up to \(120 \times 20 = 2400\)) add less to the utility (\(2400^{0.8} = 506\)) than the big losses (up to \(120 \times (-10) = -1200\)) take away from the utility. The second stock purchase has positive expected utility, so it would be the preferred choice in this example.

Summary

When we have to make choices in the face of uncertainty, we need to assess what our gains and losses will be under each of the uncertain possibilities. Utility is the value to us of those gains and losses. For example, if \(X\) represents the random gain from a possible choice, then \(U(X)\) is the value to us of the random gain we would receive if we were to make that choice. We should make the choice such that \(\mathbb{E}[U(X)]\) is as large as possible.
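The two integrals in Example 4.8.10 have simple closed forms for a uniformly distributed return; the following sketch uses them to reproduce the values \(-12.6\) and \(27.9\):

```python
def expected_utility_uniform_return(s, lo, hi):
    """E[U(sR)] from Eq. (4.8.7) for R ~ Uniform(lo, hi) with lo < 0 < hi and
    U(x) = x**0.8 for x >= 0, U(x) = x for x < 0, via closed-form integrals."""
    width = hi - lo
    loss_part = -s * lo**2 / (2 * width)           # integral of s*r/width over [lo, 0]
    gain_part = s**0.8 * hi**1.8 / (1.8 * width)   # integral of (s*r)**0.8/width over [0, hi]
    return loss_part + gain_part

print(expected_utility_uniform_return(120, -10.0, 20.0))   # about -12.6 (first stock)
print(expected_utility_uniform_return(200, -4.5, 10.0))    # about  27.9 (second stock)
```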
Exercises

1. Let \(\alpha > 0\). A decision maker has a utility function for money of the form \(U(x) = x^\alpha\) if \(x > 0\), and \(U(x) = x\) if \(x \leq 0\). Suppose that this decision maker is trying to decide whether or not to buy a lottery ticket for $1. The lottery ticket pays $500 with probability 0.001, and it pays $0 with probability 0.999. What would the values of \(\alpha\) have to be in order for this decision maker to prefer buying the ticket to not buying it?
2. Consider three gambles \(X\), \(Y\), and \(Z\) for which the probability distributions of the gains are as follows: \(\Pr(X = 5) = \Pr(X = 25) = 1/2\); \(\Pr(Y = 10) = \Pr(Y = 20) = 1/2\); \(\Pr(Z = 15) = 1\). Suppose that a person's utility function has the form \(U(x) = x^2\) for \(x > 0\). Which of the three gambles would she prefer?
3. Determine which of the three gambles in Exercise 2 would be preferred by a person whose utility function is \(U(x) = x^{1/2}\) for \(x > 0\).
4. Determine which of the three gambles in Exercise 2 would be preferred by a person whose utility function has the form \(U(x) = ax + b\), where \(a\) and \(b\) are constants (\(a > 0\)).
5. Consider a utility function \(U\) for which \(U(0) = 0\) and \(U(100) = 1\). Suppose that a person who has this utility function is indifferent to either accepting a gamble from which his gain will be 0 dollars with probability 1/3 or 100 dollars with probability 2/3 or accepting 50 dollars as a sure thing. What is the value of \(U(50)\)?
6. Consider a utility function \(U\) for which \(U(0) = 5\), \(U(1) = 8\), and \(U(2) = 10\). Suppose that a person who has this utility function is indifferent to either of two gambles \(X\) and \(Y\), for which the probability distributions of the gains are as follows: \(\Pr(X = -1) = 0.6\), \(\Pr(X = 0) = 0.2\), \(\Pr(X = 2) = 0.2\); \(\Pr(Y = 0) = 0.9\), \(\Pr(Y = 1) = 0.1\). What is the value of \(U(-1)\)?
7. Suppose that a person must accept a gamble \(X\) of the following form: \(\Pr(X = a) = p\) and \(\Pr(X = 1 - a) = 1 - p\), where \(p\) is a given number such that \(0 < p < 1\). Suppose also that the person can choose and fix the value of \(a\) (\(0 \leq a \leq 1\)) to be used in this gamble. Determine the value of \(a\) that the person would choose if his utility function was \(U(x) = \log x\) for \(x > 0\).
8. Determine the value of \(a\) that a person would choose in Exercise 7 if his utility function was \(U(x) = x^{1/2}\) for \(x \geq 0\).
9. Determine the value of \(a\) that a person would choose in Exercise 7 if his utility function was \(U(x) = x\) for \(x \geq 0\).
10. Consider four gambles \(X_1\), \(X_2\), \(X_3\), and \(X_4\), for which the probability distributions of the gains are as follows: \(\Pr(X_1 = 0) = 0.2\), \(\Pr(X_1 = 1) = 0.5\), \(\Pr(X_1 = 2) = 0.3\); \(\Pr(X_2 = 0) = 0.4\), \(\Pr(X_2 = 1) = 0.2\), \(\Pr(X_2 = 2) = 0.4\); \(\Pr(X_3 = 0) = 0.3\), \(\Pr(X_3 = 1) = 0.3\), \(\Pr(X_3 = 2) = 0.4\); \(\Pr(X_4 = 0) = \Pr(X_4 = 2) = 0.5\). Suppose that a person's utility function is such that she prefers \(X_1\) to \(X_2\). If the person were forced to accept either \(X_3\) or \(X_4\), which one would she choose?
11. Suppose that a person has a given fortune \(A > 0\) and can bet any amount \(b\) of this fortune in a certain game (\(0 \leq b \leq A\)). If he wins the bet, then his fortune becomes \(A + b\); if he loses the bet, then his fortune becomes \(A - b\). In general, let \(X\) denote his fortune after he has won or lost. Assume that the probability of his winning is \(p\) (\(0 < p < 1\)) and the probability of his losing is \(1 - p\). Assume also that his utility function, as a function of his final fortune \(x\), is \(U(x) = \log x\) for \(x > 0\). If the person wishes to bet an amount \(b\) for which the expected utility of his fortune \(\mathbb{E}[U(X)]\) will be a maximum, what amount \(b\) should he bet?
12. Determine the amount \(b\) that the person should bet in Exercise 11 if his utility function is \(U(x) = x^{1/2}\) for \(x \geq 0\).
13. Determine the amount \(b\) that the person should bet in Exercise 11 if his utility function is \(U(x) = x\) for \(x \geq 0\).
14. Determine the amount \(b\) that the person should bet in Exercise 11 if his utility function is \(U(x) = x^2\) for \(x \geq 0\).
15. Suppose that a person has a lottery ticket from which she will win \(X\) dollars, where \(X\) has the uniform distribution on the interval \([0, 4]\). Suppose also that the person's utility function is \(U(x) = x^\alpha\) for \(x \geq 0\), where \(\alpha\) is a given positive constant. For how many dollars \(x_0\) would the person be willing to sell this lottery ticket?
16. Let \(Y\) be a random variable that we would like to predict. Suppose that we must choose a single number \(d\) as the prediction and that we will lose \((Y - d)^2\) dollars. Suppose that our utility for dollars is a square root function: \(U(x) = \sqrt{x}\) if \(x \geq 0\), and \(U(x) = -\sqrt{-x}\) if \(x < 0\). Prove that the value of \(d\) that maximizes expected utility is a median of the distribution of \(Y\).
17. Reconsider the conditions of Example 4.8.9. This time, suppose that \(p_0 = 1/2\) and that \(U(x) = x^{0.9}\) if \(x \geq 0\), and \(U(x) = x\) if \(x < 0\). Suppose also that \(P\) has pdf \(f(p) = 56p^6(1 - p)\) for \(0 < p < 1\). Decide whether or not it is better to abandon the treatment.
4.9 Supplementary Exercises
Exercise 4.1 (Exercise 4.9.1) Suppose that the random variable \(X\) has a continuous distribution with CDF \(F(x)\) and pdf \(f\). Suppose also that \(\mathbb{E}[X]\) exists. Prove that
\[ \lim_{x \rightarrow \infty}x[1 - F(x)] = 0. \]
Hint: Use the fact that if \(\mathbb{E}(X)\) exists, then
\[ \mathbb{E}[X] = \lim_{u \rightarrow \infty}\int_{-\infty}^{u}xf(x)dx. \]
Exercise 4.2 (Exercise 4.9.2) Suppose that the random variable \(X\) has a continuous distribution with CDF \(F(x)\). Suppose also that \(\Pr(X \geq 0) = 1\) and that \(\mathbb{E}[X]\) exists. Show that
\[ \mathbb{E}[X] = \int_{0}^{\infty}[1 - F(x)]dx. \]
Hint: You may use the result proven in Exercise 4.1.
Exercise 4.3 (Exercise 4.9.3) Consider again the conditions of Exercise 4.2, but suppose now that \(X\) has a discrete distribution with CDF \(F(x)\), rather than a continuous distribution. Show that the conclusion of Exercise 4.2 still holds.
Exercise 4.4 (Exercise 4.9.4) Suppose that \(X\), \(Y\), and \(Z\) are nonnegative random variables such that \(\Pr(X + Y + Z \leq 1.3) = 1\). Show that \(X\), \(Y\), and \(Z\) cannot possibly have a joint distribution under which each of their marginal distributions is the uniform distribution on the interval \([0, 1]\).
Exercise 4.5 (Exercise 4.9.5) Suppose that the random variable \(X\) has mean \(\mu\) and variance \(\sigma^2\), and that \(Y = aX + b\). Determine the values of \(a\) and \(b\) for which \(\mathbb{E}[Y] = 0\) and \(\text{Var}[Y] = 1\).
Exercise 4.6 (Exercise 4.9.6) Determine the expectation of the range of a random sample of size \(N\) from the uniform distribution on the interval \([0, 1]\).
Exercise 4.7 (Exercise 4.9.7) Suppose that an automobile dealer pays an amount \(X\) (in thousands of dollars) for a used car and then sells it for an amount \(Y\). Suppose that the random variables \(X\) and \(Y\) have the following joint pdf:
\[ f(x, y) = \begin{cases} \frac{1}{36}x &\text{for }0 < x < y < 6, \\ 0 &\text{otherwise.} \end{cases} \]
Determine the dealer’s expected gain from the sale.
Exercise 4.8 (Exercise 4.9.8) Suppose that \(X_1, \ldots, X_n\) form a random sample of size \(n\) from a continuous distribution with the following pdf:
\[ f(x) = \begin{cases} 2x &\text{for }0 < x < 1, \\ 0 &\text{otherwise.} \end{cases} \]
Let \(Y_n = \max\{X_1, \ldots, X_n\}\). Evaluate \(\mathbb{E}[Y_n]\).
Exercise 4.9 (Exercise 4.9.9) If \(m\) is a median of the distribution of \(X\), and if \(Y = r(X)\) is either a nondecreasing or a nonincreasing function of \(X\), show that \(r(m)\) is a median of the distribution of \(Y\).
Exercise 4.10 (Exercise 4.9.10) Suppose that \(X_1, \ldots, X_n\) are i.i.d. random variables, each of which has a continuous distribution with median \(m\). Let \(Y_n = \max\{X_1, \ldots, X_n\}\). Determine the value of \(\Pr(Y_n > m)\).
Exercise 4.11 (Exercise 4.9.11) Suppose that you are going to sell cola at a football game and must decide in advance how much to order. Suppose that the demand for cola at the game, in liters, has a continuous distribution with pdf \(f(x)\). Suppose that you make a profit of \(g\) cents on each liter that you sell at the game and suffer a loss of \(c\) cents on each liter that you order but do not sell. What is the optimal amount of cola for you to order so as to maximize your expected net gain?
Exercise 4.12 (Exercise 4.9.12) Suppose that the number of hours \(X\) for which a machine will operate before it fails has a continuous distribution with pdf \(f(x)\). Suppose that at the time at which the machine begins operating you must decide when you will return to inspect it. If you return before the machine has failed, you incur a cost of \(b\) dollars for having wasted an inspection. If you return after the machine has failed, you incur a cost of \(c\) dollars per hour for the length of time during which the machine was not operating after its failure. What is the optimal number of hours to wait before you return for inspection in order to minimize your expected cost?
Exercise 4.13 (Exercise 4.9.13) Suppose that \(X\) and \(Y\) are random variables for which \(\mathbb{E}[X] = 3\), \(\mathbb{E}[Y] = 1\), \(\text{Var}[X] = 4\), and \(\text{Var}[Y] = 9\). Let \(Z = 5X − Y + 15\). Find \(\mathbb{E}[Z]\) and \(\text{Var}[Z]\) under each of the following conditions: (a) \(X\) and \(Y\) are independent; (b) \(X\) and \(Y\) are uncorrelated; (c) the correlation of \(X\) and \(Y\) is \(0.25\).
Exercise 4.14 (Exercise 4.9.14) Suppose that \(X_0, X_1, \ldots, X_n\) are independent random variables, each having the same variance \(\sigma^2\). Let \(Y_j = X_j - X_{j-1}\) for \(j = 1, \ldots, n\), and let \(\overline{Y_n} = \frac{1}{n}\sum_{j=1}^nY_j\). Determine the value of \(\text{Var}\mkern-3mu\left[\overline{Y_n}\right]\).
Exercise 4.15 (Exercise 4.9.15) Suppose that \(X_1, \ldots, X_n\) are random variables for which \(\text{Var}[X_i]\) has the same value \(\sigma^2\) for \(i = 1, \ldots, n\) and \(\rho(X_i, X_j)\) has the same value \(\rho\) for every pair of values \(i\) and \(j\) such that \(i \neq j\). Prove that \(\rho \geq -\frac{1}{n-1}\).
Exercise 4.16 (Exercise 4.9.16) Suppose that the joint distribution of \(X\) and \(Y\) is the uniform distribution over a rectangle with sides parallel to the coordinate axes in the \(xy\)-plane. Determine the correlation of \(X\) and \(Y\).
Exercise 4.17 (Exercise 4.9.17) Suppose that \(n\) letters are put at random into \(n\) envelopes, as in the matching problem described in Section 1.10. Determine the variance of the number of letters that are placed in the correct envelopes.
Exercise 4.18 (Exercise 4.9.18) Suppose that the random variable \(X\) has mean \(\mu\) and variance \(\sigma^2\). Show that the third central moment of \(X\) can be expressed as \(\mathbb{E}[X^3] − 3\mu \sigma^2 − \mu^3\).
Exercise 4.19 (Exercise 4.9.19) Suppose that \(X\) is a random variable with MGF \(\psi(t)\), mean \(\mu\), and variance \(\sigma^2\); and let \(c(t) = \log[\psi(t)]\). Prove that \(c'(0) = \mu\) and \(c''(0) = \sigma^2\).
Exercise 4.20 (Exercise 4.9.20) Suppose that \(X\) and \(Y\) have a joint distribution with means \(\mu_X\) and \(\mu_Y\), standard deviations \(\sigma_X\) and \(\sigma_Y\), and correlation \(\rho\). Show that if \(\mathbb{E}[Y \mid X]\) is a linear function of \(X\), then
\[ \mathbb{E}[Y \mid X] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(X - \mu_X). \]
Exercise 4.21 (Exercise 4.9.21) Suppose that \(X\) and \(Y\) are random variables such that \(\mathbb{E}[Y \mid X] = 7 − (1/4)X\) and \(\mathbb{E}[X \mid Y] = 10 − Y\). Determine the correlation of \(X\) and \(Y\).
Exercise 4.22 (Exercise 4.9.22) Suppose that a stick having a length of 3 feet is broken into two pieces, and that the point at which the stick is broken is chosen in accordance with the pdf \(f(x)\). What is the correlation between the length of the longer piece and the length of the shorter piece?
Exercise 4.23 (Exercise 4.9.23) Suppose that \(X\) and \(Y\) have a joint distribution with correlation \(\rho > 1/2\) and that \(\text{Var}[X] = \text{Var}[Y] = 1\). Show that \(b = -\frac{1}{2\rho}\) is the unique value of \(b\) such that the correlation of \(X\) and \(X + bY\) is also \(\rho\).
Exercise 4.24 (Exercise 4.9.24) Suppose that four apartment buildings \(A\), \(B\), \(C\), and \(D\) are located along a highway at the points 0, 1, 3, and 5, as shown in ?fig-4-1. Suppose also that 10 percent of the employees of a certain company live in building \(A\), 20 percent live in \(B\), 30 percent live in \(C\), and 40 percent live in \(D\).
- Where should the company build its new office in order to minimize the total distance that its employees must travel?
- Where should the company build its new office in order to minimize the sum of the squared distances that its employees must travel?
Exercise 4.25 (Exercise 4.9.25) Suppose that \(X\) and \(Y\) have the following joint pdf:
\[ f(x, y) = \begin{cases} 8xy &\text{for }0 < y < x < 1, \\ 0 &\text{otherwise.} \end{cases} \]
Suppose also that the observed value of \(X\) is \(0.2\).
- What predicted value of \(Y\) has the smallest MSE?
- What predicted value of \(Y\) has the smallest MAE?
Exercise 4.26 (Exercise 4.9.26) For all random variables \(X\), \(Y\), and \(Z\), let \(\text{Cov}[X, Y \mid z]\) denote the covariance of \(X\) and \(Y\) in their conditional joint distribution given \(Z = z\). Prove that
\[ \text{Cov}[X, Y] = \mathbb{E}[\text{Cov}[X, Y \mid Z]] + \text{Cov}[\mathbb{E}[X \mid Z], \mathbb{E}[Y \mid Z]]. \]
Exercise 4.27 (Exercise 4.9.27) Consider the box of red and blue balls in Examples ?exr-4-2-4 and ?exr-4-2-5. Suppose that we sample \(N > 1\) balls with replacement, and let \(X\) be the number of red balls in the sample. Then we sample \(N\) balls without replacement, and we let \(Y\) be the number of red balls in the sample. Prove that \(\Pr(X = N) > \Pr(Y = N)\).
Exercise 4.28 (Exercise 4.9.28) Suppose that a person’s utility function is \(U(x) = x^2\) for \(x \geq 0\). Show that the person will always prefer to take a gamble in which she will receive a random gain of \(X\) dollars rather than receive the amount \(\mathbb{E}[X]\) with certainty, where \(\Pr(X \geq 0) = 1\) and \(\mathbb{E}[X] < \infty\).
Exercise 4.29 (Exercise 4.9.29) A person is given \(m\) dollars, which he must allocate between an event \(A\) and its complement \(A^c\). Suppose that he allocates \(a\) dollars to \(A\) and \(m − a\) dollars to \(A^c\). The person’s gain is then determined as follows: If \(A\) occurs, his gain is \(g_1a\); if \(A^c\) occurs, his gain is \(g_2(m − a)\). Here, \(g_1\) and \(g_2\) are given positive constants. Suppose also that \(\Pr(A) = p\) and the person’s utility function is \(U(x) = \log(x)\) for \(x > 0\). Determine the amount \(a\) that will maximize the person’s expected utility, and show that this amount does not depend on the values of \(g_1\) and \(g_2\).