12 Simulation
12.1 What Is Simulation?
Simulation is a way to use high-speed computer power to substitute for analytical calculation. The law of large numbers tells us that if we observe a large sample of i.i.d. random variables with finite mean, then the average of these random variables should be close to their mean. If we can get a computer to produce such a large i.i.d. sample, then we can average the random variables instead of trying (and possibly failing) to calculate their mean analytically. For a specific problem, one needs to figure out what types of random variables one needs, how to make a computer produce them, and how many one needs in order to have any confidence in the numerical result. Each of these issues will be addressed to some extent in this chapter.

Proof of Concept

We begin with some simple examples of simulation to answer questions that we can already answer analytically, just to show that simulation does what it advertises. Also, these simple examples will raise some of the issues to which one must attend when trying to answer more difficult questions using simulation.

Example 12.1.1 The Mean of a Distribution. The mean of the uniform distribution on the interval [0, 1] is known to be 1/2. If we had available a large number of i.i.d. uniform random variables on the interval [0, 1], say, \(X_1, \ldots, X_n\), the law of large numbers says that \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) should be close to the mean 1/2. Table 12.1 gives the averages of several different simulated samples of size n from the uniform distribution on [0, 1] for several different values of n. It is not difficult to see that the averages are close to 0.5 in most cases, but there is quite a bit of variation, especially for n = 100. There seems to be less variation for n = 1000, and even less for the two largest values of n.

Table 12.1 Results of several different simulations in Example 12.1.1

  n         Replications of the simulation
  100       0.485   0.481   0.484   0.569   0.441
  1000      0.497   0.506   0.480   0.498   0.499
  10,000    0.502   0.501   0.499   0.498   0.498
  100,000   0.502   0.499   0.500   0.498   0.499

Example 12.1.2 A Normal Probability. The probability that a standard normal random variable is at least 1.0 is known to be 0.1587. If we had available a large number of i.i.d. standard normal random variables, say, \(X_1, \ldots, X_n\), we could create Bernoulli random variables \(Y_1, \ldots, Y_n\) defined by \(Y_i = 1\) if \(X_i \geq 1.0\) and \(Y_i = 0\) if not. Then the law of large numbers says that \(\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\) should be close to the mean of \(Y_i\), namely, \(\Pr(X_i \geq 1.0) = 0.1587\). Notice that \(\bar{Y}\) is merely the proportion of the simulated \(X_i\) values that are at least 1.0. Table 12.2 gives the proportions of \(X_i \geq 1.0\) for several different simulated samples of size n from the standard normal distribution for several different values of n. It is not difficult to see that the proportions are somewhat close to 0.1587, but there is still quite a bit of variability from one simulation to the next.

Table 12.2 Results of several different simulations in Example 12.1.2

  n         Replications of the simulation
  100       0.16    0.18    0.17    0.22    0.14
  1000      0.135   0.171   0.174   0.159   0.171
  10,000    0.160   0.163   0.158   0.152   0.156
  100,000   0.158   0.158   0.158   0.159   0.161
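For readers who want to reproduce these two experiments themselves, the following is a minimal sketch of how the simulations behind Tables 12.1 and 12.2 might be carried out. A Python environment with NumPy is assumed; the sample sizes and the five replications match the tables above, but the pseudo-random numbers (and hence the printed values) will of course differ from those that produced the tables.

```python
import numpy as np

rng = np.random.default_rng()  # a seed could be supplied for reproducibility

for n in (100, 1000, 10_000, 100_000):
    # Example 12.1.1: average of n Uniform[0, 1] variables, five replications
    unif_averages = [rng.uniform(0.0, 1.0, size=n).mean() for _ in range(5)]
    # Example 12.1.2: proportion of n standard normals that are at least 1.0
    norm_proportions = [(rng.standard_normal(n) >= 1.0).mean() for _ in range(5)]
    print(n, np.round(unif_averages, 3), np.round(norm_proportions, 3))
```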
As we mentioned earlier, there is no need for simulation in the above examples. These were just to illustrate that simulation can do what it claims. However, one needs to be aware that, no matter how large a sample is simulated, the average of an i.i.d. sample of random variables is not necessarily going to be equal to its mean. One needs to be able to take the variability into account. The variability is apparent in Tables 12.1 and 12.2. We shall address the issue of the variability of simulations later in the chapter.

The reader might also be wondering how we obtained all of the uniform and normal random variables used in the examples. Virtually every commercial statistical software package has a simulator for i.i.d. uniform random variables on the interval [0, 1]. Later in the chapter, we shall discuss ways to make use of these for simulating other distributions. One such method was already discussed in Chapter 3 on page 170.

Examples in Which Simulation Might Help

Next, we present some examples where the basic questions are relatively simple to describe, but analytic solution would be tedious at best.

Example 12.1.3 Waiting for a Break. Two servers, A and B, in a fast-food restaurant start serving customers at the same time. They agree to meet for a break after each of them has served 10 customers. Presumably, one of them will finish before the other and have to wait. How long, on average, will one of the servers have to wait for the other? Suppose that we model all service times, regardless of the server, as i.i.d. random variables having the exponential distribution with parameter 0.3 customers per minute. Then the time it takes one server to serve 10 customers has the gamma distribution with parameters 10 and 0.3. Let X be the time it takes A to serve 10 customers, and let Y be the time it takes B to serve 10 customers. We are asked to compute the mean of |X − Y|. The most straightforward way of finding this mean analytically would require a two-dimensional integral over the union of two nonrectangular regions. On the other hand, suppose that a computer can provide us with as many independent gamma random variables as we desire. We can then obtain a pair (X, Y) and compute Z = |X − Y|. We then repeat this process independently as many times as we want and average all the observed Z values. The average should be close to the mean of Z. Without going into details, we actually simulated 10,000 pairs of (X, Y) values and averaged the resulting Z values to get 11.71 minutes. A histogram of the simulated Z values is in Fig. 12.1. As a confidence builder, we simulated another 10,000 pairs and got an average of 11.77.

[Figure 12.1: Histogram of a sample of 10,000 simulated waiting times Z in Example 12.1.3.]
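The "details" omitted from Example 12.1.3 might look like the following sketch (Python with NumPy assumed; the parameter values are those of the example, and the averages obtained will vary from run to run).

```python
import numpy as np

rng = np.random.default_rng()

n_pairs = 10_000
# Time for each server to serve 10 customers: gamma with shape 10 and rate 0.3,
# i.e., scale 1/0.3 in NumPy's parameterization.
x = rng.gamma(shape=10, scale=1 / 0.3, size=n_pairs)
y = rng.gamma(shape=10, scale=1 / 0.3, size=n_pairs)
z = np.abs(x - y)          # waiting time of whichever server finishes first
print(z.mean())            # estimate of E(|X - Y|); roughly 11.7 minutes
```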
Example 12.1.4 Long Run of Heads. You overheard someone say that they just got 12 consecutive heads while flipping a seemingly fair coin. The probability of getting 12 heads in a row in 12 independent flips of a fair coin is \((0.5)^{12}\), a very small number. If the person had obtained 12 tails in a row, you probably would have heard about that instead. Even so, the probability of 12 of the same side is only \((0.5)^{11}\). But then you learn that the person actually flipped the coin 100 times, and the 12 heads in a row appeared somewhere during those 100 flips. Presumably, you are less surprised to learn that the person got a run of 12 of the same side somewhere in a sequence of 100 flips. But how much larger is the probability of a run of 12 when one flips 100 times? Suppose that we can make a computer flip a fair coin as many times as we wish. We could ask it to flip 100 times and then check whether there was a run of length 12 or more. Let X = 1 if there is a run of 12 or more, and let X = 0 if not. We then repeat this process independently as many times as we want and average all the observed X values. The average should be close to the mean of X, which is the probability of obtaining a run of 12 or more in 100 flips. Without going into details, Fig. 12.2 shows a histogram of the longest runs in 10,000 repetitions of the experiment described above. For each of the 10,000 runs, we calculated X as above and found the average to be 0.0214, still a small number, but not nearly so small as \((0.5)^{11}\). We also repeated the calculation of the average with another 10,000 sets of 100 flips and got 0.0229.

[Figure 12.2: Histogram of a sample of 10,000 longest runs (head or tail), each observed in 100 flips of a fair coin.]

A number of details were left out of exactly how the simulations were performed in the above examples. However, it is clear what random variables we wanted to observe, namely, Z in Example 12.1.3 and X in Example 12.1.4. Many simulations can address more than one question. For instance, in Example 12.1.4, we recorded the 10,000 lengths of the longest runs even though our primary interest was in whether or not the longest run was 12 or more. We could also have tried to calculate the expected length of the longest run or other properties of the distribution of the longest run. In Example 12.1.3, we could have tried to approximate the probability that one person has to wait at least 15 minutes, etc.

Figures 12.1 and 12.2 illustrate that there is variation among the 10,000 repetitions of a simulated experiment. Furthermore, each of the examples showed that a complete rerunning of all 10,000 simulated experiments can be expected to produce a different answer to each of our questions. How much different the answers should be is a matter that we shall address in Sec. 12.2, where we use the Chebyshev inequality and the central limit theorem to help us decide how many times to repeat the basic experiment. Exactly how one simulates 100 flips of a coin or a pair of gamma random variables will be taken up in Sec. 12.3.

Summary

Suppose that we want to know the mean of some function g of a random variable or random vector W. For instance, in Example 12.1.3 we can let W = (X, Y) and g(W) = |X − Y|. If a computer can supply a large number of i.i.d. random variables (or random vectors) with the distribution of W, one can use the average of the simulated values of g(W) to approximate the mean of g(W). One must be careful to take the variability in g(W) into account when deciding how much confidence to place in the approximation.

Exercises

For each of the exercises in this section, you could also perform the simulations described with various numbers of replications if you have appropriate computer software available. Most of the distributions involved are commonly available in computer software. If a distribution is not available, the simulations can wait until methods for simulating specific distributions are introduced in Sec. 12.3.
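Several of the exercises below ask you to carry out simulations in the same spirit as the examples of this section. As a template, here is one way the longest-run experiment of Example 12.1.4 might be coded (Python with NumPy assumed; the estimate will differ slightly from the 0.0214 and 0.0229 reported above).

```python
import numpy as np

rng = np.random.default_rng()

def longest_run(flips):
    """Length of the longest run of identical outcomes in a 0/1 array."""
    best, current = 1, 1
    for i in range(1, len(flips)):
        current = current + 1 if flips[i] == flips[i - 1] else 1
        best = max(best, current)
    return best

n_reps = 10_000
runs = np.array([longest_run(rng.integers(0, 2, size=100)) for _ in range(n_reps)])
print((runs >= 12).mean())   # estimate of Pr(run of 12 or more in 100 flips)
```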
- Assume that one can simulate as many i.i.d. exponential random variables with parameter 1 as one wishes. Explain how one could use simulation to approximate the mean of the exponential distribution with parameter 1.
- If X has the p.d.f. \(1/x^2\) for \(x > 1\), the mean of X is infinite. What would you expect to happen if you simulated a large number of random variables with this p.d.f. and computed their average?
- If X has the Cauchy distribution, the mean of X does not exist. What would you expect to happen if you simulated a large number of Cauchy random variables and computed their average?
- Suppose that one can simulate as many i.i.d. Bernoulli random variables with parameter p as one wishes. Explain how to use these to approximate the mean of the geometric distribution with parameter p.
- Two servers A and B in a fast-food restaurant each start serving their first customers at the same time. After finishing her second customer, A notices that B has not yet finished his first customer. A then chides B for being slow, and B responds that A just got a couple of easier customers. Suppose that we model all service times, regardless of the server, as i.i.d. random variables having the exponential distribution with parameter 0.4. Let X be the sum of the first two service times for server A, and let Y be the first service time for server B. Assume that you can simulate as many i.i.d. exponential random variables with parameter 0.4 as you wish.
- Explain how to use such random variables to approximate Pr(X < Y ).
- Explain why Pr(X < Y) is the same no matter what the common parameter is of the exponential distributions. That is, we don't need to simulate exponentials with parameter 0.4. We could use any parameter that is convenient, and we should get the same answer.
- Find the joint p.d.f. of X and Y , and write the two-dimensional integral whose value would be Pr(X < Y ).
12.2 Why Is Simulation Useful?
Statistical simulations are used to estimate features of distributions such as means of functions, quantiles, and other features that we cannot compute in closed form. When using a simulation estimator, it is good to compute a measure of how precise the estimator is, in addition to the estimate itself.
12.2.1 Examples of Simulation
Simulation is a technique that can be used to help shed light on how a complicated system works, even if detailed analysis is unavailable. For example, engineers can simulate traffic patterns in the vicinity of a construction project to see what effects various proposed restrictions might have. A physicist can simulate the behavior of gas molecules under conditions that are covered by no known theory. Statistical simulations are used to estimate probabilistic features of our models that we cannot compute analytically. Because simulation introduces an element of randomness into an analysis, it is sometimes called Monte Carlo analysis, named after the famous European gambling center.

Example 12.2.1 The M.S.E. of the Sample Median. Suppose that we are about to observe a random sample of size n from a Cauchy distribution centered at an unknown value \(\mu\). The p.d.f. of each observation is
\[
f(x) = \frac{1}{\pi\left(1 + [x - \mu]^2\right)},
\]
and the parameter \(\mu\) is the median of the distribution. Suppose that we are interested in how well the sample median M performs as an estimator of \(\mu\). In particular, we want to calculate the M.S.E. \(E([M - \mu]^2)\). If we could generate a sample of n random variables from a Cauchy distribution centered at \(\mu\), we could compute the sample median M and calculate \(Y = (M - \mu)^2\). The M.S.E. is then \(\theta = E(Y)\). If we could generate a large number v of i.i.d. random variables with the same distribution as Y, say, \(Y^{(1)}, \ldots, Y^{(v)}\), then the law of large numbers would tell us that \(Z = \frac{1}{v}\sum_{i=1}^v Y^{(i)}\) should be close to \(\theta\). To do this, we could generate nv i.i.d. Cauchy random variables centered at \(\mu\). Then we could divide them into v sets of n each, use each set of n to compute a sample median \(M^{(i)}\) for \(i = 1, \ldots, v\), and then compute \(Y^{(i)} = (M^{(i)} - \mu)^2\). This is actually how several of the numbers in the tables in Sec. 10.7 were computed. These tables contain the M.S.E.'s of various estimators computed from random samples with various distributions. For example, the numbers corresponding to the sample median in Table 10.39 on page 675 are precisely what we have been discussing in this example.

Note: Notation to Distinguish Simulations. We shall use superscripts in parentheses to distinguish different simulated values of the same random variable from each other. For instance, in Example 12.2.1, we used \(Y^{(i)}\) to stand for the ith simulated value of Y. In what follows, we may be simulating subscripted random variables. For example, \(\mu_i^{(j)}\) would stand for the jth simulated value of \(\mu_i\).

Example 12.2.1 illustrates the main features of many statistical simulations. Suppose that the quantity in which we are interested can be expressed as the expected value of some random variable that has the distribution F. Then we should try to generate a large sample of random variables with the distribution F and average them. It is often the case, as in Example 12.2.1, that the distribution F is itself very complicated. In such cases, we need to construct random variables with the distribution F from simpler random variables whose distributions are more familiar. In Example 12.2.1, the M.S.E. is the mean of the random variable \(Y = (M - \mu)^2\), where M is itself the sample median of a sample of n Cauchy random variables centered at \(\mu\). We cannot easily simulate a random variable with the distribution of Y in one step, but we can simulate n Cauchy random variables, then find their sample median M, and finally compute \(Y = (M - \mu)^2\), which will have the desired distribution. We then repeat the simulation of Y many times.
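A minimal sketch of the simulation described in Example 12.2.1 might look as follows (Python with NumPy assumed; the sample size n = 20, the simulation size v, and the choice \(\mu = 0\) are illustrative, not values prescribed by the example).

```python
import numpy as np

rng = np.random.default_rng()

n, v = 20, 10_000
mu = 0.0                               # the M.S.E. does not depend on mu (see Example 12.2.11)
samples = mu + rng.standard_cauchy(size=(v, n))
medians = np.median(samples, axis=1)   # M^(1), ..., M^(v)
y = (medians - mu) ** 2                # Y^(i) = (M^(i) - mu)^2
print(y.mean())                        # Z, the simulation estimate of the M.S.E.
```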
Not all statistical simulations involve the mean of a random variable.

Example 12.2.2 The Median of a Complicated Distribution. Let X be an exponential random variable with unknown parameter \(\mu\). Suppose that \(\mu\) has a distribution with the p.d.f. g. We are interested in the median of X. The marginal distribution of X has the p.d.f.
\[
f(x) = \int_0^\infty \mu e^{-\mu x} g(\mu)\, d\mu.
\]
This integral might not be one that we can compute. However, suppose that we can generate a large sample of random variables \(\mu^{(1)}, \ldots, \mu^{(v)}\) having the p.d.f. g. Then, for each \(i = 1, \ldots, v\), we can simulate \(X^{(i)}\) having the exponential distribution with parameter \(\mu^{(i)}\). The random variables \(X^{(1)}, \ldots, X^{(v)}\) would then be a random sample from the distribution with the p.d.f. f. The median of the sample \(X^{(1)}, \ldots, X^{(v)}\) should be close to the median of the distribution with the p.d.f. f.

Example 12.2.3 A Clinical Trial. Consider the four treatment groups described in Example 2.1.4 on page 57. For i = 1, 2, 3, 4, let \(P_i\) be the probability that a patient in treatment group i will not relapse after treatment. We might be interested in how likely it is that the \(P_i\)'s differ by certain amounts. We might assume that the \(P_i\)'s are independent a priori with beta distributions having parameters \(\alpha_0\) and \(\beta_0\). The posterior distributions of the \(P_i\)'s are also independent beta distributions with parameters \(\alpha_0 + x_i\) and \(\beta_0 + n_i - x_i\), where \(n_i\) is the number of subjects in group i, and \(x_i\) is the number of patients in group i who do not relapse. We could simulate a large number v of vectors \((P_1, P_2, P_3, P_4)\) with the above beta distributions. Then we could try to answer any question we wanted about the posterior distribution of \((P_1, P_2, P_3, P_4)\). For example, we could estimate \(\Pr(P_i > P_4)\) for i = 1, 2, 3, where i = 4 stands for the placebo group. This probability tells us how likely it is that each treatment is better than no treatment. We could estimate \(\Pr(P_i > P_4)\) by finding the proportion of sampled \((P_1, P_2, P_3, P_4)\) vectors in which the ith coordinate is greater than the fourth coordinate. We could also estimate the probability that \(P_i\) is the largest, or the probability that all four \(P_i\) are within \(\epsilon\) of each other.
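As an illustration of Example 12.2.3, the following sketch simulates from the four posterior beta distributions and estimates \(\Pr(P_i > P_4)\) for each treatment group (Python with NumPy assumed). The prior parameters and the counts \(n_i\) and \(x_i\) below are made-up placeholders, not the data of Example 2.1.4.

```python
import numpy as np

rng = np.random.default_rng()

alpha0, beta0 = 1.0, 1.0                 # placeholder prior parameters
n = np.array([40, 40, 40, 40])           # placeholder group sizes
x = np.array([22, 25, 19, 15])           # placeholder counts of patients who do not relapse

v = 100_000
# One column of simulated P_i values per group, drawn from its posterior distribution.
p = np.column_stack([
    rng.beta(alpha0 + x[i], beta0 + n[i] - x[i], size=v) for i in range(4)
])
for i in range(3):
    print(f"Estimated Pr(P_{i + 1} > P_4):", (p[:, i] > p[:, 3]).mean())
```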
Example 12.2.4 Comparing Two Normal Means with Unequal Variances. On page 593 in Chapter 9, we considered how to test hypotheses concerning the means of two different normal distributions when the variances are unknown and different. This problem has a relatively simple solution in the Bayesian framework using simulation. Our parameters will be \(\mu_x\), \(\tau_x\), \(\mu_y\), and \(\tau_y\). Conditional on the parameters, let \(X_1, \ldots, X_m\) be i.i.d. having the normal distribution with mean \(\mu_x\) and precision \(\tau_x\). Also let \(Y_1, \ldots, Y_n\) be i.i.d. (and independent of the X's) having the normal distribution with mean \(\mu_y\) and precision \(\tau_y\). Assume that we use natural conjugate priors for the parameters with \((\mu_x, \tau_x)\) independent of \((\mu_y, \tau_y)\) in the prior distribution. (It is not necessary for the X parameters to be independent of the Y parameters, but it makes the presentation simpler.) Sec. 8.6 contains details on how to obtain the posterior distributions of the parameters. Since the X data and X parameters are independent of the Y data and Y parameters, we can calculate each posterior distribution separately. Let the hyperparameters of the posterior distribution of \((\mu_x, \tau_x)\) be \(\alpha_{x1}\), \(\beta_{x1}\), \(\mu_{x1}\), and \(\lambda_{x1}\). Similarly, let the hyperparameters of the posterior distribution of \((\mu_y, \tau_y)\) be \(\alpha_{y1}\), \(\beta_{y1}\), \(\mu_{y1}\), and \(\lambda_{y1}\). In order to test hypotheses about \(\mu_x - \mu_y\), we need the posterior distribution of \(\mu_x - \mu_y\). This distribution is not analytically tractable. If we can simulate a large collection of parameter vectors from their joint posterior distribution, we can compute \(\mu_x - \mu_y\) for each sampled vector, and these values will form a sample from the posterior distribution of \(\mu_x - \mu_y\). To be more specific, let v be a large number, and for each \(i = 1, \ldots, v\), we want to simulate \((\mu_x^{(i)}, \mu_y^{(i)}, \tau_x^{(i)}, \tau_y^{(i)})\) from the joint posterior distribution. To do this, we need to simulate independent gamma random variables \(\tau_x^{(i)}\) and \(\tau_y^{(i)}\) with the appropriate posterior distributions. After simulating these, we can simulate \(\mu_x^{(i)}\) from the normal distribution with mean \(\mu_{x1}\) and variance \(1/(\lambda_{x1}\tau_x^{(i)})\). Similarly, we can simulate \(\mu_y^{(i)}\) from the normal distribution with mean \(\mu_{y1}\) and variance \(1/(\lambda_{y1}\tau_y^{(i)})\). Then \(\mu_x^{(i)} - \mu_y^{(i)}\) for \(i = 1, \ldots, v\) is a sample from the posterior distribution of \(\mu_x - \mu_y\). We shall illustrate this methodology in Example 12.3.8 after we discuss some methods for simulating pseudo-random numbers with various distributions.

The simulation in Example 12.2.4 can be extended in a straightforward fashion to a comparison of three or more normal distributions with unequal variances. With more than two means to compare, questions arise about what exactly to calculate to summarize the comparison. That is, there is not just one difference like \(\mu_x - \mu_y\) that captures the differences between three or more means. We shall consider this situation in more detail in Examples 12.3.7 and 12.5.6.

Example 12.2.5 Estimating a Standard Deviation. Let X be a random variable whose standard deviation \(\theta\) is important to estimate. Suppose that we cannot calculate \(\theta\) in closed form, but we can simulate many pseudo-random values \(X^{(1)}, \ldots, X^{(v)}\) with the same distribution as X. Then we could compute the sample standard deviation
\[
S_v = \left[\frac{1}{v}\sum_{i=1}^v \left(X^{(i)} - \bar{X}\right)^2\right]^{1/2}
\]
as an estimator of \(\theta\), where \(\bar{X} = \frac{1}{v}\sum_{i=1}^v X^{(i)}\). Since \(S_v\) is not an average, the law of large numbers does not tell us that it converges in probability to \(\theta\). However, if we let \(Y^{(i)} = X^{(i)2}\), we can rewrite \(S_v\) as \((\bar{Y} - \bar{X}^2)^{1/2}\). In this form, we see that \(S_v = g(\bar{X}, \bar{Y})\), where \(g(x, y) = (y - x^2)^{1/2}\). Notice that g is continuous at every point (x, y) such that \(y \geq x^2\). The law of large numbers tells us that \(\bar{Y}\) converges in probability to \(E(X^2)\) and that \(\bar{X}\) converges in probability to E(X). Since \(E(X^2) \geq E(X)^2\), we can apply Exercise 16 in Sec. 6.2 to conclude that \(S_v\) converges in probability to \(g(E(X), E(X^2)) = \theta\).

All of the examples above involve the generation of a large number of random variables with specific distributions. Some discussion of this topic appeared in Chapter 3 beginning on page 170. Sections 12.3 and 12.5 will also discuss methods for generating random variables with specific distributions. Sections 12.4 and 12.6 will present particular classes of problems in which statistical simulation is used successfully.
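Before moving on, here is a sketch of the two-step posterior simulation described in Example 12.2.4 (Python with NumPy assumed). The posterior hyperparameter values below are arbitrary placeholders; in practice they would come from applying the formulas of Sec. 8.6 to the observed data.

```python
import numpy as np

rng = np.random.default_rng()

# Placeholder posterior hyperparameters for (mu_x, tau_x) and (mu_y, tau_y).
alpha_x1, beta_x1, mu_x1, lambda_x1 = 5.0, 8.0, 1.2, 11.0
alpha_y1, beta_y1, mu_y1, lambda_y1 = 6.0, 9.0, 0.7, 13.0

v = 100_000
# Step 1: simulate the precisions from their gamma posterior distributions.
tau_x = rng.gamma(shape=alpha_x1, scale=1 / beta_x1, size=v)
tau_y = rng.gamma(shape=alpha_y1, scale=1 / beta_y1, size=v)
# Step 2: simulate the means given the simulated precisions.
mu_x = rng.normal(mu_x1, np.sqrt(1 / (lambda_x1 * tau_x)))
mu_y = rng.normal(mu_y1, np.sqrt(1 / (lambda_y1 * tau_y)))

diff = mu_x - mu_y                 # sample from the posterior distribution of mu_x - mu_y
print((diff > 0).mean())           # e.g., posterior probability that mu_x > mu_y
```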
12.2.2 Which Mean Do You Mean?
Simulation analyses add an additional layer of probability distributions and sampling distributions of statistics to an already probability-laden statistical analysis. A typical statistical analysis involves a probability model for a random sample of data \(X_1, \ldots, X_n\). This probability model specifies the distribution of each \(X_i\), and this distribution might have parameters such as its mean, median, variance, and other measures that we are interested in estimating or testing. We then form statistics (functions of the data), say, Y. These functions might include sample versions of the very parameters that we wish to estimate, such as a sample mean, sample median, sample variance, and the like. The distribution of Y has been called its sampling distribution. This sampling distribution also might have a mean, median, variance, and other measures that we need to calculate or deal with in some way. So far, we have three versions of mean, median, variance, and others, and we have not even begun discussing simulation.

A simulation analysis might be used to try to estimate a parameter \(\theta\) of the sampling distribution of the statistic Y. Typically, one would simulate i.i.d. pseudo-random \(Y^{(1)}, \ldots, Y^{(v)}\), each with the same distribution as (the sampling distribution of) Y. We then compute a summary statistic Z of \(Y^{(1)}, \ldots, Y^{(v)}\) and use Z to estimate \(\theta\). This Z might itself be a sample mean, sample median, sample variance, or other measure of the \(Y^{(1)}, \ldots, Y^{(v)}\) sample. The distribution of Z will be called its simulation distribution or Monte Carlo distribution. Features of the simulation distribution, such as its mean, median, and variance, will be called the simulation mean, simulation median, and simulation variance to make clear to which level we have climbed in this ever-expanding tree of terminology. Here is an example to illustrate all of the various levels.

Example 12.2.6 Five or More Variances. Let \(X_1, \ldots, X_n\) be i.i.d. random variables with a continuous distribution having c.d.f. F. Let \(\psi\) denote the variance of \(X_i\). Suppose that we decide to use the sample variance \(Y = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\) to estimate \(\psi\). As part of deciding how good Y is as an estimator of \(\psi\), we are interested in its variance \(\theta = \mathrm{Var}(Y)\). That is, \(\theta\) is the variance of the sampling distribution of Y. Suppose that we cannot calculate \(\theta\) in closed form, but suppose that it is easy to simulate from the distribution F. We might then simulate nv values \(X_i^{(j)}\) for \(j = 1, \ldots, v\) and \(i = 1, \ldots, n\). For each j, we compute the sample variance \(Y^{(j)}\) of the sample \(X_1^{(j)}, \ldots, X_n^{(j)}\). That is, \(Y^{(j)} = \frac{1}{n}\sum_{i=1}^n \left(X_i^{(j)} - \bar{X}^{(j)}\right)^2\).
The \(Y^{(j)}\) values all have the same distribution as Y itself, the sampling distribution of Y. Since we are interested in Var(Y), we might compute the sample variance Z of the sample \(Y^{(1)}, \ldots, Y^{(v)}\). That is, \(Z = \frac{1}{v}\sum_{i=1}^v (Y^{(i)} - \bar{Y})^2\). We would then use Z to estimate \(\theta\). If Z is large, it suggests that Y has large variance, and so Y is not a very good estimator of \(\psi\). Unless we are willing to collect more data or search for a better estimator, we are stuck with a poor estimator of \(\psi\). Finally, Z might not be a good estimator of \(\theta\) because our simulation size v might not be large enough. If this is the case, we can simulate more \(Y^{(j)}\) values. That is, we can increase the simulation size v to get a better simulation estimator of \(\theta\). (This will not make Y a better estimator of \(\psi\), but it will give us a better idea of how good or bad an estimator it is.) Hence, we shall also try to estimate the variance of Z (its simulation variance). Precisely how to do this varies from one example to the next, so we shall not give any details here. However, we shall explain how to estimate the simulation variance of Z for the most popular types of simulation later in this section. This estimation of variance has to end somewhere, and we shall end it with Var(Z). That is, we shall not try to assess how good our estimator of Var(Z) is. All of these levels of distributions and estimation are illustrated in Table 12.3.

Table 12.3 Levels of probability distributions, statistics, and parameters in a typical simulation analysis

  Distribution (D) or sample (S)                          Parameter (P) or statistic (S)
  (D) Population distribution F                           (P) Mean, variance, median, etc., ψ
  (S) Sample X = (X_1, ..., X_n) from F                   (S) Estimator Y of ψ based on X, e.g., sample mean, sample variance, sample median, etc.
  (D) Sampling distribution G of Y                        (P) Mean, variance, median, etc., θ of the sampling distribution of Y
  (S) Simulated sample Y = (Y^(1), ..., Y^(v)) from G     (S) Estimator Z of θ based on Y, e.g., sample mean, sample variance, sample median, etc., of Y
  (D) Simulation distribution H of Z                      (P) Variance of simulation distribution (simulation variance)
  (S) Simulated data (differs by example)                 (S) Estimator of simulation variance (depends on specific example)

Example 12.2.6 is not intended to illustrate any simulation methodology. It is intended to illustrate the various levels at which probability concepts (such as variance) and their sample versions enter into a simulation study of a statistical analysis. It is important to be able to tell which variance or which sample variance is being discussed if one is to avoid becoming hopelessly confused. In this chapter, we shall focus on the features of the simulated samples, in particular the simulation distribution of statistics computed from the simulated samples. However, our examples will necessarily involve parameters and statistics that arose at earlier levels. Furthermore, the analysis of a simulation distribution will make use of the same methods (central limit theorem, law of large numbers, delta method, etc.) that we learned how to use with nonsimulated data.
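The nested structure of Example 12.2.6 and Table 12.3 is easy to see in code. The sketch below takes F to be the standard normal distribution purely so that it is self-contained (the example leaves F generic); Python with NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng()

n, v = 25, 10_000
# v simulated data sets of size n from F (here, standard normal for illustration).
x = rng.standard_normal(size=(v, n))
# Y^(j): the sample variance (dividing by n, as in the example) of the jth data set.
y = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
# Z: the sample variance of the Y^(j) values, which estimates theta = Var(Y).
z = ((y - y.mean()) ** 2).mean()
print(z)
```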
12.2.3 Assessing Uncertainty about Simulation Results
The last step in Example 12.2.6 (summarized in the last two rows of Table 12.3) is an important part of every simulation analysis. That is, we should always attempt to assess the uncertainty in a simulation. This uncertainty is most easily assessed via the simulation variance of the simulated quantity. For instance, in Example 12.2.1, let v = 1000 and \(\mu = 0\). We can create 1000 samples of n Cauchy random variables, calculate \(M^{(i)}\), the median of the ith sample, and compute the value \(Y^{(i)} = (M^{(i)} - 0)^2\). We can then average the 1000 values of \(Y^{(i)}\). We could repeat this exercise several times, and we would not get the same result every time. This is due to the fact that, even with a large v like 1000, an estimator such as \(Z = \frac{1}{v}\sum_{i=1}^v Y^{(i)}\) is still a random variable with positive variance (its simulation variance). The smaller the simulation variance is, the more certain we can be that our estimator Z is close to what we are trying to estimate. But we need to estimate or bound the simulation variance before we can assess the amount of uncertainty.

How we estimate the simulation variance of a result Z depends on whether Z is an average of simulated values, a smooth function of one or more averages, or a sample quantile of simulated values. The square root of our estimate of the simulation variance will be called the simulation standard error, and it is an estimate of the simulation standard deviation of Z. The simulation standard error is a popular way to summarize uncertainty about a simulation for two reasons. First, it has the same units of measurement as the quantity that was estimated (unlike the simulation variance). Second, the simulation standard error is useful for saying how likely it is that the simulation estimator is close to the parameter being estimated. We shall explain this second point in more detail after we show how to calculate the simulation standard error in several common cases.

Example 12.2.7 The Simulation Standard Error of an Average. Suppose that the goal of the simulation analysis is to estimate the mean \(\theta\) of some random variable Y. The simulation estimator Z will generally be the average of a large number of simulated values. A straightforward way to estimate the simulation variance for an average is the following: Suppose that we simulate some quantity Y a large number v of times in order to estimate the mean \(\theta\). That is, suppose that we simulate independent \(Y^{(1)}, \ldots, Y^{(v)}\) for large v. Suppose also that the estimator of \(\theta\) is \(Z = \frac{1}{v}\sum_{i=1}^v Y^{(i)}\), and each \(Y^{(i)}\) has mean \(\theta\) and finite variance \(\sigma^2\). The sample standard deviation of the sample \(Y^{(1)}, \ldots, Y^{(v)}\) is the square root of the sample variance, namely,
\[
\hat{\sigma} = \left[\frac{1}{v}\sum_{i=1}^v \left(Y^{(i)} - \bar{Y}\right)^2\right]^{1/2}. \qquad (12.2.1)
\]
If v is large, then \(\hat{\sigma}\) should be close to \(\sigma\). The central limit theorem says that Z should have approximately the normal distribution with mean \(\theta\) and variance \(\sigma^2/v\). Since we usually do not know \(\sigma^2\), we shall estimate it by \(\hat{\sigma}^2\). This makes our estimator of the simulation variance of Z equal to \(\hat{\sigma}^2/v\), and the simulation standard error is \(\hat{\sigma}/v^{1/2}\).
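In code, the simulation standard error of Example 12.2.7 takes only one extra line once the simulated values are in hand. The sketch below reuses the squared-median values of Example 12.2.1, but any simulated sample could be substituted (Python with NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng()

v, n = 10_000, 20
y = np.median(rng.standard_cauchy(size=(v, n)), axis=1) ** 2   # Y^(1), ..., Y^(v)

z = y.mean()                        # simulation estimate of theta
sigma_hat = y.std()                 # Eq. (12.2.1): NumPy's std divides by v, not v - 1
sim_se = sigma_hat / np.sqrt(v)     # simulation standard error of Z
print(z, sim_se)
```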
Example 12.2.8 The Simulation Standard Error of a Smooth Function of Another Estimator. Sometimes, after estimating a quantity \(\psi\), we also wish to estimate a smooth function of it, \(g(\psi)\). For example, we might need to estimate the square root or the logarithm of some mean. Or, we might have estimated a variance \(\theta^2\), and now we want an estimator of \(\theta\), the corresponding standard deviation. In general, suppose that the parameter that we wish to estimate by simulation is \(\theta = g(\psi)\), where we already have an estimator W of \(\psi\). Suppose further that our estimator W has approximately the normal distribution with mean \(\psi\) and variance \(\sigma^2/v\), where v is large compared to \(\sigma^2\). Finally, suppose that we also have an estimator \(\hat{\sigma}\) of \(\sigma\) that we obtained while calculating W. For example, W might itself be the average of v i.i.d. simulated random variables \(Y^{(i)}\) with mean \(\psi\) and variance \(\sigma^2\). In this case, Eq. (12.2.1) will be our estimator of \(\sigma\). Let \(Z = g(W)\) be our estimator of \(\theta\). The delta method (see Sec. 6.3) says that Z has approximately the normal distribution with mean \(\theta = g(\psi)\) and variance \([g'(\psi)]^2\sigma^2/v\). For example, if \(g(\psi) = \psi^{1/2}\), then \(W^{1/2}\) has approximately the normal distribution with mean \(\psi^{1/2}\) and variance \(\sigma^2/[4\psi v]\). We already have estimates of \(\sigma\) and \(\psi\), so our simulation standard error of Z is \(|g'(W)|\hat{\sigma}/v^{1/2}\).

Example 12.2.9 The Simulation Standard Error of a Sample Quantile. Suppose that the goal of a simulation analysis is to estimate the p quantile \(\theta_p\) of some distribution G. Typically, we simulate a large number v of pseudo-random values \(Y^{(1)}, \ldots, Y^{(v)}\) with distribution G and use the sample p quantile as our estimator. On page 676, we pointed out that the sample p quantile from a large random sample of size m has approximately the normal distribution with mean \(\theta_p\) and variance \(p(1-p)/[m g^2(\theta_p)]\), where g is the p.d.f. of the distribution G. All we care about right now is that this approximate variance has the form \(\sigma^2/m\), where \(\sigma^2 = p(1-p)/g^2(\theta_p)\) is some number that does not depend on m. Suppose that we simulate k independent random samples, each of size m, from the distribution G. Typically, this is done by choosing the size v of the original simulated sample \(Y^{(1)}, \ldots, Y^{(v)}\) to be v = km, and then splitting the v simulated values into k subsamples of size m each. Compute the sample p quantile of each of the k random samples, and call these simulated sample p quantiles \(Z_1, \ldots, Z_k\). To make use of the approximate normal distribution for the sample quantiles, m needs to be large. Next, compute the sample standard deviation of \(Z_1, \ldots, Z_k\):
\[
S = \left[\frac{1}{k}\sum_{i=1}^k \left(Z_i - \bar{Z}\right)^2\right]^{1/2}, \qquad (12.2.2)
\]
where \(\bar{Z}\) is the average of the k sample p quantiles. If we treat each \(Z_i\) as a single simulation, then \(S^2\) is an estimator of the variance of \(Z_i\). But we just pointed out that the variance of \(Z_i\) is approximately \(\sigma^2/m\). Hence, \(S^2\) is an estimator of \(\sigma^2/m\). In other words, an estimator of \(\sigma\) is \(\hat{\sigma} = m^{1/2}S\). Finally, combine all k samples into a single sample of size v = km, and compute the sample p quantile Z as our Monte Carlo estimator of \(\theta_p\). As we noted earlier, Z has approximately the normal distribution with mean \(\theta_p\) and variance \(\sigma^2/v\). We just constructed an estimator \(\hat{\sigma}\) of \(\sigma\), so our estimator of the simulation variance of Z is \(\hat{\sigma}^2/v = mS^2/v = S^2/k\), and the simulation standard error is \(S/k^{1/2}\).
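A sketch of Example 12.2.9, estimating the median (p = 0.5) of a distribution G together with its simulation standard error, might look as follows (Python with NumPy assumed). G is taken to be an exponential distribution only so that the sketch is self-contained, and k and m are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng()

p = 0.5
k, m = 20, 500
v = k * m
y = rng.exponential(scale=1.0, size=v)        # Y^(1), ..., Y^(v) from G

z = np.quantile(y, p)                         # Monte Carlo estimate of theta_p
subsample_quantiles = np.quantile(y.reshape(k, m), p, axis=1)   # Z_1, ..., Z_k
s = subsample_quantiles.std()                 # Eq. (12.2.2): divides by k
sim_se = s / np.sqrt(k)                       # simulation standard error of Z
print(z, sim_se)
```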
Example 12.2.10 The Simulation Standard Error of a Sample Variance. Suppose that the goal of a simulation analysis is to estimate the variance \(\theta\) of some estimator Y. (Example 12.2.6 was based on such a situation.) Suppose that we simulate \(Y^{(1)}, \ldots, Y^{(v)}\) and use \(Z = \frac{1}{v}\sum_{i=1}^v (Y^{(i)} - \bar{Y})^2\) to estimate \(\theta\). We now need to estimate the simulation variance of Z. We shall rewrite Z as a smooth function of two averages and then apply a two-dimensional generalization of the delta method (see Exercise 12) in order to estimate the simulation variance. Let \(W^{(i)} = Y^{(i)2}\), so that \(Z = \bar{W} - \bar{Y}^2\), where \(\bar{W}\) is the average of \(W^{(1)}, \ldots, W^{(v)}\). Now Z is a smooth function of two averages. The two-dimensional delta method developed in Exercise 12 can be applied. (The details for this very case can be derived in Exercise 13.) The results of Exercise 13 provide the following approximation to the asymptotic variance of Z. First, compute the sample variance of \(W^{(1)}, \ldots, W^{(v)}\) and call it V. Next, compute the sample covariance between the Y's and W's:
\[
C = \frac{1}{v}\sum_{i=1}^v \left(Y^{(i)} - \bar{Y}\right)\left(W^{(i)} - \bar{W}\right).
\]
The estimator of Var(Z) is then
\[
\widehat{\mathrm{Var}}(Z) = \frac{1}{v}\left(4\bar{Y}^2 Z - 4\bar{Y}C + V\right). \qquad (12.2.3)
\]
Also, the simulation distribution of Z is approximately the normal distribution with mean \(\theta\) and variance that is estimated by Eq. (12.2.3). The simulation standard error is the square root of (12.2.3).

Do We Have Enough Simulations?

Let Z be our Monte Carlo estimator of some parameter \(\theta\) based on v simulations. Now that we are able to estimate the simulation variance of Z, we can begin to answer questions about how close we think Z is to \(\theta\). We can also try to see if we need to do more simulations in order to be confident that Z is close enough to \(\theta\). Suppose, as in all of the cases considered so far, that Z has approximately the normal distribution with mean \(\theta\) and variance \(\sigma^2/v\), where \(\sigma^2\) is a number that does not depend on the simulation size. For each \(\epsilon > 0\),
\[
\Pr(|Z - \theta| \le \epsilon) \approx 2\Phi\!\left(\frac{\epsilon v^{1/2}}{\sigma}\right) - 1, \qquad (12.2.4)
\]
where \(\Phi\) is the standard normal c.d.f. We can use this type of approximation to help us to say how likely it is that Z is close to \(\theta\). We can replace \(v^{1/2}/\sigma\) by 1 over the simulation standard error of Z in Eq. (12.2.4) to approximate the probability that \(|Z - \theta| \le \epsilon\). We can also use (12.2.4) to decide how many more simulations to do if v was not large enough. For example, suppose that we want the probability in Eq. (12.2.4) to be \(\gamma\). Then we should let
\[
v = \left(\Phi^{-1}\!\left(\frac{1+\gamma}{2}\right)\frac{\sigma}{\epsilon}\right)^2. \qquad (12.2.5)
\]
Since we will hardly ever know \(\sigma\) ahead of time, it is common to estimate it by doing a preliminary simulation of size \(v_0\) and computing \(\hat{\sigma}\) based on that preliminary simulation.

Example 12.2.11 The M.S.E. of the Sample Median. It is not difficult to see that we can take \(\mu = 0\) in Example 12.2.1 without loss of generality. The reason is the following: Let \(M^{(i)}\) be the sample median of \(X_1^{(i)}, \ldots, X_n^{(i)}\), where each \(X_j^{(i)}\) is a Cauchy random variable centered at \(\mu\). Then \(M^{(i)} - \mu\) is also the sample median of \(X_1^{(i)} - \mu, \ldots, X_n^{(i)} - \mu\), and each \(X_j^{(i)} - \mu\) is a Cauchy random variable centered at 0. Because our calculation is based on the values \(Y^{(i)} = (M^{(i)} - \mu)^2\) for \(i = 1, \ldots, v\), we get the same result whether \(\mu = 0\) or not. So, let \(\mu = 0\). This makes \(Y^{(i)} = M^{(i)2}\), and \(\sigma^2\) is now the variance of \(M^{(i)2}\). (Even though a Cauchy random variable does not even have a first moment defined, it can be shown that the sample median of at least nine i.i.d. Cauchy random variables has a finite fourth moment.) Suppose that we want our estimator \(Z = \bar{Y}\) of \(\theta\) to be within \(\epsilon = 0.01\) of \(\theta\) with probability \(\gamma = 0.95\). That is, we want \(\Pr(|Z - \theta| \le 0.01) = 0.95\). Since Z is an average, we can compute an estimate \(\hat{\sigma}\) of \(\sigma\) by using Eq. (12.2.1). Suppose that we simulate \(v_0 = 1000\) samples of size n = 20 from a Cauchy distribution, compute the 1000 values of \(Y^{(i)}\), and then compute \(\hat{\sigma} = 0.3892\). According to Eq. (12.2.5) with \(\sigma\) replaced by 0.3892, we need \(v = [1.96 \times 0.3892/0.01]^2 = 5820\). Hence, we need approximately 4820 additional simulations. After performing any additional simulations suggested by Eq. (12.2.5), one should recompute \(\hat{\sigma}\). If it is much larger than the preliminary estimate, then additional simulations should be performed.

Example 12.2.12 The Median of a Complicated Distribution. In Example 12.2.2, suppose that the p.d.f. g is the p.d.f. of the gamma distribution with parameters 3 and 1. Suppose that we want the probability to be 0.99 that our estimator of the median is within 0.001 of the true median. We begin with an initial simulation of size \(v_0 = 10{,}000\). We then simulate \(\mu^{(1)}, \ldots, \mu^{(10{,}000)}\) from the gamma distribution with parameters 3 and 1. For each i, we simulate \(X^{(i)}\) having the exponential distribution with parameter \(\mu^{(i)}\). We treat \(X^{(1)}, \ldots, X^{(10{,}000)}\) as k = 20 samples of size m = 500 each, and we compute the sample medians \(Z_1, \ldots, Z_{20}\) of the 20 samples. After performing these initial simulations, suppose that we observe the value S = 0.01597 for Eq. (12.2.2). This makes \(\hat{\sigma} = 0.3570\). Plugging this value into (12.2.5) for \(\sigma\) with \(\gamma = 0.99\) and \(\epsilon = 0.001\) yields v = 845,747.4. This means that we need a total of 845,748 simulations to reach our desired level of confidence in the simulated result. Just to check, we simulated a total of 900,000 values and computed the sample median 0.2593 as well as a new value of \(S^2\) based on k = 100 subsamples of size m = 6200 each. The new value of \(\hat{\sigma}\) is 0.4529. Substituting 0.4529 for \(\sigma\) in Eq. (12.2.5) yields a new v = 1,360,939, which means that we still need another 460,939 simulations.
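The preliminary-simulation recipe of Examples 12.2.11 and 12.2.12 can be automated. Here is a sketch for the setting of Example 12.2.11 (Python with NumPy and SciPy assumed); the preliminary \(\hat{\sigma}\) will not equal 0.3892 exactly, so the required v will vary from run to run.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng()

n, v0 = 20, 1000
eps, gamma = 0.01, 0.95

# Preliminary simulation of Y^(i) = (sample median of n Cauchy variables)^2.
y0 = np.median(rng.standard_cauchy(size=(v0, n)), axis=1) ** 2
sigma_hat = y0.std()

# Eq. (12.2.5): total number of simulations needed for the desired accuracy.
v_needed = int(np.ceil((norm.ppf((1 + gamma) / 2) * sigma_hat / eps) ** 2))
print(sigma_hat, v_needed, max(v_needed - v0, 0))   # extra simulations required
```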
Simulating Real Processes

In many scientific fields, real physical or social processes are modeled as having random components. For example, stock prices are often modeled as having lognormal distributions, as in Example 5.6.10. Many processes involving waiting times and service are modeled using Poisson processes. The simple probability models that have been developed earlier in this text are merely the building blocks out of which such models of real processes are constructed. Here, we shall give two examples of slightly more complicated models that can be constructed using the distributions we already know. The analyses of these models can be simplified by the use of simulation.

Example 12.2.13 Option Pricing. In Example 5.6.10, we introduced the formula of Black and Scholes (1973) for pricing options. In that example, the option was to buy shares at price q of a stock whose value at time u (in the future) is a random variable \(S_u\) with a known lognormal distribution. Many financial analysts believe that the standard deviation \(\sigma\) of \(\log(S_u)\) in Example 5.6.10 should not be treated as a known constant. For example, we could treat \(\sigma\) as a random variable with a p.d.f. \(f(\sigma)\). To be precise, we shall continue to assume that
\[
S_u = S_0 e^{(r - \sigma^2/2)u + \sigma u^{1/2} Z},
\]
but now we shall assume that both Z and \(\sigma\) are random variables. For convenience, we shall assume that they are independent. We shall let Z have the standard normal distribution, and we shall let \(\tau = 1/\sigma^2\) have the gamma distribution with known parameters \(\alpha\) and \(\beta\). The parameters \(\alpha\) and \(\beta\) might result from estimating the variance of stock prices based on historical data combined with expert opinion of stock analysts. For example, they might be the posterior hyperparameters that result from applying a Bayesian analysis to a sample of stock prices. It is easy to see that \(E(S_u|\sigma) = S_0 e^{ru}\) for all \(\sigma\), and hence the law of total probability for expectations (Theorem 4.7.1) implies that \(E(S_u) = S_0 e^{ru}\). This is what we need for risk neutrality. The price for the option considered in Example 5.6.10 is the mean of the random variable \(e^{-ru}h(S_u)\), where
\[
h(s) = \begin{cases} s - q & \text{if } s > q, \\ 0 & \text{otherwise.} \end{cases}
\]
The Black-Scholes formula (5.6.18) is just the conditional mean of \(e^{-ru}h(S_u)\) given \(\sigma\). To estimate the marginal mean of \(e^{-ru}h(S_u)\), we could simulate a large number of values \(\sigma^{(i)}\) \((i = 1, \ldots, v)\) from the distribution of \(\sigma\), substitute each \(\sigma^{(i)}\) into (5.6.18), and average the results. As an example, suppose that we take the same numerical situation from the end of Example 5.6.10 with u = 1, r = 0.06, and \(q = S_0\). This time, suppose that \(1/\sigma^2\) has the gamma distribution with parameters 2 and 0.0127. (These numbers make \(E(\sigma) = 0.1\), but \(\sigma\) has substantial variability.) We can sample v = 1,000,000 values of \(\sigma\) from this distribution and compute (5.6.18) for each value. The average, in our simulation, is \(0.0756 S_0\), and the simulation standard error is \(1.814 \times 10^{-5} S_0\). The option price is only slightly higher than it was when we assumed that we knew \(\sigma\).

When the distribution of \(S_u\) is even more complicated, one can simulate \(S_u\) directly and estimate the mean of \(h(S_u)\).
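As a sketch of the Monte Carlo approach in Example 12.2.13, the following code simulates \(S_u\) directly, drawing both \(\sigma\) and Z, and averages the discounted payoff \(e^{-ru}h(S_u)\). The example itself averages the Black-Scholes formula (5.6.18) over simulated \(\sigma\) values, which has a smaller simulation variance; simulating \(S_u\) directly is shown here because it works even when no closed-form conditional price is available (Python with NumPy assumed; prices are reported as multiples of \(S_0\)).

```python
import numpy as np

rng = np.random.default_rng()

u, r = 1.0, 0.06
s0 = 1.0                      # work in units of S_0
q = s0                        # strike price, as in the example
alpha, beta = 2.0, 0.0127     # 1/sigma^2 has the gamma distribution with these parameters

v = 1_000_000
tau = rng.gamma(shape=alpha, scale=1 / beta, size=v)
sigma = 1 / np.sqrt(tau)
z = rng.standard_normal(v)
s_u = s0 * np.exp((r - sigma**2 / 2) * u + sigma * np.sqrt(u) * z)
payoff = np.exp(-r * u) * np.maximum(s_u - q, 0.0)

price = payoff.mean()                  # estimate of the option price (times S_0)
sim_se = payoff.std() / np.sqrt(v)     # simulation standard error of the estimate
print(price, sim_se)
```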
In the following example, each simulation requires a large number of steps, but each step is relatively simple. The combination of several simple steps into one complicated simulation is very common in simulations of real processes.

Example 12.2.14 A Service Queue with Impatient Customers. Consider a queue to which customers arrive according to a Poisson process with rate \(\lambda\) per hour. Suppose that the queue has a single server. Each customer who arrives at the queue counts the length r of the queue (including the customer being served) and decides to leave with probability \(p_r\), for \(r = 1, 2, \ldots\). A customer who leaves does not enter the queue. Each customer who enters the queue waits in the order of arrival until the customer immediately in front is done being served, and then moves to the head of the queue. The time (in hours) to serve a customer, after reaching the head of the queue, is an exponential random variable with parameter \(\mu\). Assume that all service times are independent of each other and of all arrival times.

We can use simulation to learn about the behavior of such a queue. For example, we could estimate the expected number of customers in the queue at a particular time t after the queue opens for business. To do this, we could simulate many, say, v, realizations of the queue operation. For each realization i, we count how many customers \(N^{(i)}\) are in the queue at time t. Then our estimator is \(\frac{1}{v}\sum_{i=1}^v N^{(i)}\).

To simulate a single realization, we could proceed as follows: Simulate interarrival times \(X_1, X_2, \ldots\) of the Poisson process as i.i.d. exponential random variables with parameter \(\lambda\). Let \(T_j = \sum_{i=1}^j X_i\) be the time at which customer j arrives. Stop simulating at the first k such that \(T_k > t\). Only the first k − 1 customers have even arrived at the queue by time t. For each \(j = 1, \ldots, k - 1\), simulate a service time \(Y_j\) having the exponential distribution with parameter \(\mu\). Let \(Z_j\) stand for the time at which the jth customer reaches the head of the queue, and let \(W_j\) stand for the time at which the jth customer leaves the queue. For example, \(Z_1 = X_1\) and \(W_1 = X_1 + Y_1\). For j > 1, the jth customer first counts the length of the queue and decides whether or not to leave. Let \(U_{i,j} = 1\) if customer i is still in the queue when customer j arrives (i < j), and let \(U_{i,j} = 0\) if customer i has already left the queue. Then
\[
U_{i,j} = \begin{cases} 1 & \text{if } W_i \ge T_j, \\ 0 & \text{otherwise.} \end{cases}
\]
The number of customers in the queue when the jth customer arrives is \(r = \sum_{i=1}^{j-1} U_{i,j}\). We then simulate a random variable \(V_j\) having the Bernoulli distribution with parameter \(p_r\). If \(V_j = 1\), customer j leaves the queue, so that \(W_j = T_j\). If customer j stays in the queue, then this customer reaches the head of the queue at time \(Z_j = \max\{T_j, W_1, \ldots, W_{j-1}\}\). That is, the jth customer either reaches the head of the queue immediately upon arrival (if nobody is still being served) or as soon as all of the previous j − 1 customers have left, whichever comes later. Also, \(W_j = Z_j + Y_j\) if customer j stays. For each \(j = 1, \ldots, k - 1\), the jth customer is in the queue at time t if and only if \(W_j \ge t\).

As a numerical example, suppose that \(\lambda = 2\), \(\mu = 1\), t = 3, and \(p_r = 1 - 1/r\) for \(r \ge 1\). Suppose that the first k = 6 simulated interarrival times are 0.215, 0.713, 1.44, 0.174, 0.342, 0.382. The sum of the first five of these times is 2.884, but the sum of all six is 3.266. So, at most five customers are in the queue at time t = 3. Suppose that the simulated service times for the first five customers are 0.251, 2.215, 2.855, 0.666, 2.505. We cannot simulate the \(V_j\)'s in advance, because we do not yet know how many customers will be in the queue when each customer j arrives. Figure 12.3 shows a time line of the simulation of the process that we are about to describe. Begin with customer 1, who has \(T_1 = Z_1 = 0.215\) and \(W_1 = 0.215 + 0.251 = 0.466\). For customer 2, \(T_2 = T_1 + 0.713 = 0.928 > W_1\), so nobody is in the queue when customer 2 arrives and \(Z_2 = T_2 = 0.928\). Then \(W_2 = Z_2 + 2.215 = 3.143\). For customer 3, \(T_3 = T_2 + 1.44 = 2.368 < W_2\), so r = 1.
Because \(p_1 = 0\), customer 3 stays, and there is no need to simulate \(V_3\). Then \(Z_3 = W_2 = 3.143\), and \(W_3 = Z_3 + 2.855 = 5.998\). For customer 4, \(T_4 = T_3 + 0.174 = 2.542\). Since \(W_1 < T_4 < W_2, W_3\), we have r = 2 customers in the queue. We then simulate \(V_4\) having the Bernoulli distribution with parameter \(p_2 = 1/2\). Suppose that we simulate \(V_4 = 1\), so customer 4 leaves, and we ignore the fourth simulated service time. This makes \(W_4 = T_4 = 2.542\). For customer 5, \(T_5 = T_4 + 0.342 = 2.884\), and customers 2 and 3 are still in the queue. We need to simulate \(V_5\) having the Bernoulli distribution with parameter \(p_2 = 1/2\). Suppose that \(V_5 = 0\), so customer 5 stays. Then \(Z_5 = W_3 = 5.998\), and \(W_5 = Z_5 + 2.505 = 8.503\). Finally, \(W_j \ge 3\) for j = 2, 3, 5. This means that there are \(N^{(1)} = 3\) customers in the queue at time t = 3, as illustrated in Fig. 12.3. Needless to say, a computer should be programmed to do this calculation for a large simulation.

[Figure 12.3: One simulation of a service queue; the time line for Example 12.2.14. Each customer is represented by one horizontal line segment. The vertical line at t = 3 crosses the horizontal lines for those customers still in the queue at time t = 3.]
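One way to program the calculation that the example describes is sketched below (Python with NumPy assumed; the parameter values are those of the numerical illustration, and many realizations are averaged to estimate the expected queue length at time t).

```python
import numpy as np

rng = np.random.default_rng()

def queue_size_at_t(lam=2.0, mu=1.0, t=3.0, p=lambda r: 1 - 1 / r):
    """Simulate one realization of the queue described in Example 12.2.14 and
    return N, the number of customers still in the queue at time t."""
    # Arrival times T_j of all customers who arrive by time t.
    arrivals = []
    time = rng.exponential(1 / lam)
    while time <= t:
        arrivals.append(time)
        time += rng.exponential(1 / lam)

    w = []  # departure times W_j; a customer who leaves on arrival gets W_j = T_j
    for t_j in arrivals:
        r = sum(1 for w_i in w if w_i >= t_j)    # queue length seen on arrival
        if r >= 1 and rng.random() < p(r):       # customer leaves with probability p_r
            w.append(t_j)
            continue
        z_j = max([t_j] + w)                     # time of reaching the head of the queue
        w.append(z_j + rng.exponential(1 / mu))  # leave after an Exp(mu) service time
    return sum(1 for w_j in w if w_j >= t)       # customers with W_j >= t are still present

# Average over many realizations to estimate the expected queue length at time t.
print(np.mean([queue_size_at_t() for _ in range(10_000)]))
```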
12.2.4 Summary
If we wish to compute the expected value \(\theta\) of some random variable Y, but cannot perform the necessary calculation in closed form, we can use simulation. In general, we would simulate a large random sample \(Y^{(1)}, \ldots, Y^{(v)}\) from the same distribution as Y and then compute the sample mean Z as our estimator. We can also estimate a quantile \(\theta_p\) of a distribution in a similar fashion: if \(Y^{(1)}, \ldots, Y^{(v)}\) is a large sample from the distribution, we can compute the sample p quantile Z. It is always a good idea to compute some measure of how good a simulation estimator is. One common measure is the simulation standard error of Z, an estimate of the standard deviation of the simulation distribution of Z. Alternatively, one could perform enough simulations to make sure that the probability is high that Z is close to the parameter being estimated.
12.2.5 Exercises
- Eq. (12.2.4) is based on the assumption that Z has approximately a normal distribution. Occasionally, the normal approximation is not good enough. In such cases, one can let
\[
v = \frac{\sigma^2}{\epsilon^2(1-\gamma)}. \qquad (12.2.6)
\]
To be precise, let Z be the average of v independent random variables with mean \(\mu\) and variance \(\sigma^2\). Prove that if v is at least as large as the number in Eq. (12.2.6), then \(\Pr(|Z - \mu| \le \epsilon) \ge \gamma\). Hint: Use the Chebyshev inequality (6.2.3).
- In Example 12.2.11, how large would v need to be according to Eq. (12.2.6)?
- Suppose that we have available as many i.i.d. standard normal random variables as we desire. Let X stand for a random variable having the normal distribution with mean 2 and variance 49. Describe a method for estimating E(log(|X| + 1)) using simulation.
- Use a pseudo-random number generator to simulate a sample of 15 independent observations in which 13 of the 15 are drawn from the uniform distribution on the interval [−1, 1] and the other two are drawn from the uniform distribution on the interval [−10, 10]. For the 15 values that are obtained, calculate the values of (a) the sample mean, (b) the trimmed means for k = 1, 2, 3, and 4 (see Sec. 10.7), and (c) the sample median. Which of these estimators is closest to 0?
- Repeat Exercise 4 ten times, using a different pseudo-random sample each time. In other words, construct 10 independent samples, each of which contains 15 observations and each of which satisfies the conditions of Exercise 4.
- For each sample, which of the estimators listed in Exercise 4 is closest to 0?
- For each of the estimators listed in Exercise 4, determine the square of the distance between the estimator and 0 in each of the 10 samples, and determine the average of these 10 squared distances. For which of the estimators is this average squared distance from 0 smallest?
- Suppose that X and Y are independent, that X has the beta distribution with parameters 3.5 and 2.7, and that Y has the beta distribution with parameters 1.8 and 4.2. We are interested in the mean of X/(X + Y). You may assume that you have the ability to simulate as many random variables with whatever beta distributions you wish.
- Describe a simulation plan that will produce a good estimator of the mean of X/(X + Y) if enough simulations are performed.
- Suppose that you want to be 98 percent confident that your estimator is no more than 0.01 away from the actual value of E[X/(X + Y)]. Describe how you would determine an appropriate size for the simulation.
- Consider the numbers in Table 10.40 on page 676. Suppose that you have available as many standard normal random variables and as many uniform random variables on the interval [0, 1] as you desire. You want to perform a simulation to obtain the number in the "Sample median" row and \(\epsilon = 0.05\) column.
- Describe how to perform such a simulation. Hint: Let X and U be independent such that X has the standard normal distribution and U has the uniform distribution on the interval [0, 1]. Let \(0 < \epsilon < 1\), and find the distribution of \(Y = X\) if \(U > \epsilon\), and \(Y = 10X\) if \(U < \epsilon\).
- Perform the simulation on a computer.
- Consider the same situation described in Exercise 7. This time, consider the number in the "Trimmed mean for k = 2" row and \(\epsilon = 0.1\) column.
- Describe how to perform a simulation to produce this number.
- Perform the simulation on a computer.
- In Example 12.2.12, we can actually compute the median θ of the distribution of the Xi in closed form. Calculate the true median, and see how far the simulated value was from the true value. Hint: Find the marginal p.d.f. of X by using the law of total probability for random variables (3.6.12) together with Eq. (5.7.10). The c.d.f. and quantile function are then easy to derive.
- Let X1, . . . , X21 be i.i.d. with the exponential distribution that has parameter λ. Let M stand for the sample median. We wish to compute the M.S.E. of M as an estimator of the median of the distribution of the Xi’s.
- Determine the median of the distribution of X1.
- Let θ be the M.S.E. of the sample median when λ = 1. Prove that the M.S.E. of the sample median equals \(\theta/\lambda^2\) in general.
- Describe a simulation method for estimating θ.
- In Example 12.2.4, there is a slightly simpler way to simulate a sample from the posterior distribution of μx − μy. Suppose that we can simulate as many independent t pseudo-random variables as we wish with whatever degrees of freedom we want. Explain how we could use these t random variables to simulate a sample from the posterior distribution of μx − μy.
- Let \((Y_1, W_1), \ldots, (Y_n, W_n)\) be an i.i.d. sample of random vectors with finite covariance matrix
\[
\Sigma = \begin{pmatrix} \sigma_{yy} & \sigma_{yw} \\ \sigma_{yw} & \sigma_{ww} \end{pmatrix}.
\]
Let \(\bar{Y}\) and \(\bar{W}\) be the sample averages. Let g(y, w) be a function with continuous partial derivatives \(g_1\) and \(g_2\) with respect to y and w, respectively. Let \(Z = g(\bar{Y}, \bar{W})\). The two-dimensional Taylor expansion of g around a point \((y_0, w_0)\) is
\[
g(y, w) = g(y_0, w_0) + g_1(y_0, w_0)(y - y_0) + g_2(y_0, w_0)(w - w_0), \qquad (12.2.7)
\]
plus an error term that we shall ignore here. Let \((y, w) = (\bar{Y}, \bar{W})\) and \((y_0, w_0) = (E(Y), E(W))\) in Eq. (12.2.7). To the level of approximation of Eq. (12.2.7), prove that
\[
\mathrm{Var}(Z) = \frac{1}{n}\left[g_1(E(Y), E(W))^2\sigma_{yy} + 2g_1(E(Y), E(W))g_2(E(Y), E(W))\sigma_{yw} + g_2(E(Y), E(W))^2\sigma_{ww}\right].
\]
Hint: Use the formula for the variance of a linear combination of random variables derived in Sec. 4.6.

13. Use the two-dimensional delta method from Exercise 12 to derive the estimator of the simulation variance of a sample variance as given in Eq. (12.2.3). Hint: Replace E(Y) and E(W) by \(\bar{Y}\) and \(\bar{W}\), respectively, and replace \(\Sigma\) by the sample variances and sample covariance.

14. Let Y be a random variable with some distribution. Suppose that you have available as many pseudo-random variables as you want with the same distribution as Y. Describe a simulation method for estimating the skewness of the distribution of Y. (See Definition 4.4.1.)

15. Suppose that the price of a stock at time u in the future is a random variable \(S_u = S_0 e^{\alpha u + W_u}\), where \(S_0\) is the current price, \(\alpha\) is a constant, and \(W_u\) is a random variable with known distribution. Suppose that you have available as many i.i.d. random variables as you wish with the distribution of \(W_u\). Suppose that the m.g.f. \(\psi(t)\) of \(W_u\) is known and finite on an interval that contains t = 1.
a. What number should \(\alpha\) equal in order that \(E(S_u) = e^{ru}S_0\)?
b. We wish to price an option to purchase one share of this stock at time u for the price q. Describe how you could use simulation to estimate the price of such an option.

16. Consider a queue to which customers arrive according to a Poisson process with rate \(\lambda\) per hour. Suppose that the queue has two servers. Each customer who arrives at the queue counts the length r of the queue (including any customers being served) and decides to leave with probability \(p_r\), for \(r = 2, 3, \ldots\). A customer who leaves does not enter the queue. Each customer who enters the queue waits in the order of arrival until at least one of the two servers is available, and then begins being served by the available server. If both servers are available, the customer chooses randomly between the two servers with probability 1/2 for each, independent of all other random variables. For server i (i = 1, 2), the time (in hours) to serve a customer, after beginning service, is an exponential random variable with parameter \(\mu_i\). Assume that all service times are independent of each other and of all arrival times. Describe how to simulate the number of customers in the queue (including any being served) at a specific time \(t\).
12.3 Simulating Specific Distributions
In order to perform statistical simulations, we must be able to obtain pseudorandom values from a variety of distributions. In this section, we introduce some methods for simulating from specific distributions. Most computer packages with statistical capability are able to generate pseudorandom numbers with the uniform distribution on the interval [0, 1].We shall assume throughout the remainder of this section that one has available an arbitrarily large sample of what appear to be i.i.d. random variables (pseudo-random numbers) with the uniform distribution on the interval [0, 1]. Usually, we need random variables with other distributions, and the purpose of this section is to review some common methods for transforming uniform random variables into random variables with other distributions. The Probability Integral Transformation In Chapter 3, we introduced the probability integral transformation for transforming a uniform random variable X on the interval [0, 1] into a random variable Y with a continuous strictly increasing c.d.f.G. The method is to set Y = G −1(X). This method works well if G −1 is easily computed. Example 12.3.1 Generating Exponential Pseudo-Random Variables. Suppose that we want Y to have the exponential distribution with parameter λ, where λ is a known constant. The c.d.f. of Y is G(y) = 1− e −λy if y ≥ 0, 0 if y <0. We can easily invert this function to obtain G −1(x)=−log(1− x)/λ, if 0 < x <1. If X has the uniform distribution on the interval [0, 1], then − log(1− X)/λ has the exponential distribution with parameter λ. Special-Purpose Algorithms There are cases in which the desired c.d.f. G is not easy to invert. For example, if G is the standard normal c.d.f., then G −1 must be obtained by numerical approximation.
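As a quick illustration of Example 12.3.1, here is a minimal sketch of the probability integral transformation in Python; the function name and the check against the known mean \(1/\lambda\) are ours, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def exponential_from_uniform(lam, size, rng):
    """Probability integral transformation of Example 12.3.1:
    if X ~ Uniform[0, 1], then -log(1 - X)/lam ~ Exponential(lam)."""
    x = rng.uniform(size=size)
    return -np.log1p(-x) / lam      # log1p(-x) = log(1 - x), numerically safer

lam = 0.3
y = exponential_from_uniform(lam, size=100_000, rng=rng)
print(y.mean(), 1.0 / lam)          # sample mean should be close to 1/lambda
```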
However, there is a clever method for transforming two independent uniform random variables on the interval [0, 1] into two standard normal random variables. The method was described by Box and Muller (1958).

Example 12.3.2 Generating Two Independent Standard Normal Variables. Let \(X_1, X_2\) be independent with the uniform distribution on the interval [0, 1]. The joint p.d.f. of \((X_1, X_2)\) is \(f(x_1, x_2) = 1\), for \(0 < x_1, x_2 < 1\). Define
\[ Y_1 = [-2\log(X_1)]^{1/2}\sin(2\pi X_2), \qquad Y_2 = [-2\log(X_1)]^{1/2}\cos(2\pi X_2). \]
The inverse of this transformation is
\[ X_1 = \exp[-(Y_1^2 + Y_2^2)/2], \qquad X_2 = \frac{1}{2\pi}\arctan(Y_1/Y_2). \]
Using the methods of Sec. 3.9, we compute the Jacobian, which is the determinant of the matrix of partial derivatives of the inverse function:
\[ \begin{pmatrix} -y_1\exp[-(y_1^2 + y_2^2)/2] & -y_2\exp[-(y_1^2 + y_2^2)/2] \\[4pt] \dfrac{1}{2\pi y_2}\cdot\dfrac{1}{1 + (y_1/y_2)^2} & -\dfrac{y_1}{2\pi y_2^2}\cdot\dfrac{1}{1 + (y_1/y_2)^2} \end{pmatrix}. \]
The determinant of this matrix is \(J = \exp[-(y_1^2 + y_2^2)/2]/(2\pi)\). The joint p.d.f. of \((Y_1, Y_2)\) is then
\[ g(y_1, y_2) = f\!\left(\exp[-(y_1^2 + y_2^2)/2],\ \arctan(y_1/y_2)/(2\pi)\right)|J| = \frac{\exp[-(y_1^2 + y_2^2)/2]}{2\pi}. \]
This is the joint p.d.f. of two independent standard normal variables.

Acceptance/Rejection

Many other special-purpose methods exist for other distributions, also. We would like to present here one more general-purpose method that has wide applicability. The method is called acceptance/rejection. Let f be a p.d.f., and assume that we would like to sample a pseudo-random variable with this p.d.f. Assume that there exists another p.d.f. g with the following two properties:

- We know how to simulate a pseudo-random variable with p.d.f. g.
- There exists a constant k such that \(kg(x) \ge f(x)\) for all x.

To simulate a single Y with p.d.f. f, perform the following steps:

1. Simulate a pseudo-random X with p.d.f. g and an independent uniform pseudo-random variable U on the interval [0, 1].
2. If
\[ \frac{f(X)}{g(X)} \ge kU, \qquad (12.3.1) \]
let Y = X, and stop the process.
3. If (12.3.1) fails, throw away X and U, and return to the first step.

If we need more than one Y, we repeat the entire process as often as needed. We now show that the p.d.f. of each Y is f.

Theorem 12.3.1 The p.d.f. of Y in the acceptance/rejection method is f.

Proof First, we note that the distribution of Y is the conditional distribution of X given that (12.3.1) holds. That is, let A be the event that (12.3.1) holds, and let \(h(x, u|A)\) be the conditional joint p.d.f. of (X, U) given A. Then the p.d.f. of Y is \(\int h(x, u|A)\,du\). This is because Y is constructed to be X conditional on (12.3.1) holding. The conditional p.d.f. of (X, U) given A is
\[ h(x, u|A) = \begin{cases} \dfrac{1}{\Pr(A)}\,g(x) & \text{if } f(x)/g(x) \ge ku \text{ and } 0 < u < 1, \\ 0 & \text{otherwise.} \end{cases} \]
It is straightforward to calculate Pr(A), that is, the probability that \(U \le f(X)/[kg(X)]\):
\[ \Pr(A) = \int_{-\infty}^{\infty}\int_0^{f(x)/[kg(x)]} g(x)\,du\,dx = \int_{-\infty}^{\infty}\frac{1}{k}f(x)\,dx = \frac{1}{k}. \]
So,
\[ h(x, u|A) = \begin{cases} kg(x) & \text{if } f(x)/g(x) \ge ku \text{ and } 0 < u < 1, \\ 0 & \text{otherwise.} \end{cases} \]
The integral of this function over all u values for fixed x is the p.d.f. of Y evaluated at x:
\[ \int h(x, u|A)\,du = \int_0^{f(x)/[kg(x)]} kg(x)\,du = f(x). \]

Here is an example of the use of acceptance/rejection.

Example 12.3.3 Simulating a Beta Distribution. Suppose that we wish to simulate a random variable Y having the beta distribution with parameters 1/2 and 1/2. The p.d.f. of Y is
\[ f(y) = \frac{1}{\pi}y^{-1/2}(1 - y)^{-1/2}, \qquad \text{for } 0 < y < 1. \]
Note that this p.d.f. is unbounded. However, it is easy to see that
\[ f(y) \le \frac{1}{\pi}\left(y^{-1/2} + (1 - y)^{-1/2}\right), \qquad (12.3.2) \]
for all 0 < y < 1. The right side of Eq. (12.3.2) can be written as kg(y) with \(k = 4/\pi\) and
\[ g(y) = \frac{1}{2}\left[\frac{1}{2y^{1/2}} + \frac{1}{2(1 - y)^{1/2}}\right]. \]
This g is a half-and-half mixture of two p.d.f.'s \(g_1\) and \(g_2\):
\[ g_1(x) = \frac{1}{2x^{1/2}}, \qquad g_2(x) = \frac{1}{2(1 - x)^{1/2}}, \qquad \text{for } 0 < x < 1. \qquad (12.3.3) \]
We can easily simulate random variables from these distributions using the probability integral transformation. To simulate a random variable X with p.d.f. g, simulate three independent random variables \(U_1, U_2, U_3\) with uniform distributions on the
interval [0, 1]. If U1 ≤ 1/2, simulate X from g1 using the probability integral transformation applied to U2. If U1 > 1/2, simulate X from g2 using the probability integral transformation and U2. If f (X)/g(X) ≥ kU3, let Y = X. If not, repeat the process. When using the acceptance/rejection method, one must usually reject simulated values and resimulate. The probability of accepting a value is Pr(A) in the proof of Theorem 12.3.1, namely, 1/k. The larger k is, the harder it will be to accept. In Exercise 5, you will prove that the expected number of iterations until the first acceptance is k. A common special case of acceptance/rejection is the simulation of a random variable conditional on some event. For example, let X be a random variable with the p.d.f. g, and suppose that we want the conditional distribution of X given that X>2. Then the conditional p.d.f. of X givenX>2 is f (x) = kg(x) if x >2, 0 ifx ≤ 2, where k = 1/ ∞ 2 g(x) dx. Note that f (x) ≤ kg(x) for all x, so acceptance/rejection is applicable. In fact, since f (X)/g(X) only takes the two values k and 0, we don’t need to simulate the uniform U in the acceptance/rejection algorithm.We don’t even need to compute the value k.We just reject each X ≤ 2. Here is a version of the same algorithm to solve a question that was left open in Sec. 11.8. Example 12.3.4 Computing the Size of a Two-Stage Test. In Sec. 11.8, we studied the analysis of data from a two-way layout with replication. In that section, we introduced a two-stage testing procedure. First, we tested the hypotheses (11.8.11), and then, if we accepted the null hypothesis, we proceeded to test the hypotheses (11.8.13). Unfortunately, we were unable to compute the conditional size of the second test given that the first test accepted the null hypothesis. That is, we could not calculate (11.8.15) in closed form. However, we can use simulation to estimate the conditional size. The two tests are based on U2 AB, defined in Eq. (11.8.12), and V 2 A, defined in Eq. (11.8.16). The first test rejects the null hypothesis in (11.8.11) if U2 AB ≥ d, where d is a quantile of the appropriate F distribution. The second test rejects its null hypothesis if V 2 A ≥ c, where c is yet to be determined. The random variables U2 AB and V 2 A are both ratios of various mean squares. In particular, they share a common denominator MSResid = S2 Resid/[IJ(K − 1)]. In order to determine an appropriate critical value c for the second test, we need the conditional distribution of V 2 A given that U2 AB < d, and given that both null hypotheses are true. We can sample from that conditional distribution as follows: Let the interaction mean square be MSAB = S2 Int/[(I − 1)(J − 1)], and let the mean square for factorAbeMSA = S2 A/(I − 1). Then U2 AB = MSAB/MSResid and V 2 A = MSA/MSResid. All of these mean squares are independent, and they all have different gamma distributions when the null hypotheses are both true. Most statistical computer packages will allow simulation of gamma random variables. So, we start by simulating many triples (MSAB, MSResid, MSA). Then, for each simulated triple, we compute U2 AB and V 2 A. If U2 AB ≥ d, we discard the corresponding V 2 A. The undiscarded V 2 A values are a random sample from the conditional distribution that we need. The efficiency of this algorithm could be improved slightly by simulating MSA and then computing V 2 A only when U2 AB < d is observed. 
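To make the acceptance/rejection recipe of Example 12.3.3 and Eq. (12.3.3) concrete, here is a minimal sketch in Python. The helper names are ours; the mixture sampling and the acceptance condition (12.3.1) follow the steps described above.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(y):                       # target: beta(1/2, 1/2) p.d.f.
    return 1.0 / (np.pi * np.sqrt(y * (1.0 - y)))

def g(y):                       # envelope: the mixture from Eq. (12.3.3)
    return 0.25 * (y ** -0.5 + (1.0 - y) ** -0.5)

k = 4.0 / np.pi                 # k*g(y) >= f(y) for all 0 < y < 1

def sample_beta_half_half(rng):
    """Acceptance/rejection as described for Example 12.3.3."""
    while True:
        u1, u2, u3 = rng.uniform(size=3)
        # simulate X from g via the half-and-half mixture of g1 and g2,
        # using the probability integral transformation applied to u2
        x = u2 ** 2 if u1 <= 0.5 else 1.0 - (1.0 - u2) ** 2
        if f(x) / g(x) >= k * u3:        # condition (12.3.1)
            return x

draws = np.array([sample_beta_half_half(rng) for _ in range(20_000)])
print(draws.mean())              # should be near 1/2, the beta(1/2, 1/2) mean
```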
808 Chapter 12 Simulation Generating Functions of Other Random Variables It often happens that there is more than one way to simulate from a particular distribution. For example, suppose that a distribution is defined as the distribution of a particular function of other random variables (in the way that the χ2, t , and F distributions are). In such cases, there is a straightforward way to simulate the desired distribution. First, simulate the random variables in terms of which the distribution is defined, and then calculate the appropriate function. Example 12.3.5 Alternate Method for Simulating a Beta Distribution. In Exercise 6 in Sec. 5.8, you proved the following: IfU and V are independent, withU having the gamma distribution with parameters α1 and β, and V having the gamma distribution with parameters α2 and β, then U/(U + V ) has the beta distribution with parameters α1 and α2. So, if we have a method for simulating gamma random variables, we can simulate beta random variables. The case handled in Example 12.3.3 is α1 = α2 = 1/2. Let β = 1/2 so that U and V would both have gamma distributions with parameters 1/2 and 1/2, also known as the χ2 distribution with one degree of freedom. If we simulate two independent standard normal random variables X1, X2 (for example, by the method of Example 12.3.2), then X2 1 and X2 2 are independent and have the χ2 distribution with one degree of freedom. It follows that Y = X2 1/(X2 1 + X2 2) has the beta distribution with parameters 1/2 and 1/2. As another example, to simulate a χ2 random variable with 10 degrees of freedom, one could simulate 10 i.i.d. standard normals, square them, and add up the squares. Alternatively, one could simulate five random variables having the exponential distribution with parameter 1/2 and add them up. Example 12.3.6 Generating Pseudo-Random Bivariate Normal Vectors. Suppose that we wish to simulate a bivariate normal vector with the p.d.f. given in Eq. (5.10.2). This p.d.f. was constructed as the joint p.d.f. of X1 = σ1Z1 + μ1, X2 = σ2 # ρZ1 + (1− ρ2)1/2Z2 $ + μ2, (12.3.4) where Z1 and Z2 are i.i.d. with the standard normal distribution. If we use the method of Example 12.3.2 to generate independent Z1 and Z2 with the standard normal distribution, we can use the formulas in (12.3.4) to transform these into X1 and X2, which will then have the desired bivariate normal distribution. Most statistical computer packages have the capability of simulating pseudorandom variables with each of the continuous distributions that have been named in this text. The techniques of this section are really needed only for simulating less common distributions or when a statistical package is not available. Some Examples Involving Simulation of Common Distributions Example 12.3.7 Bayesian Analysis of One-Way Layout. We can perform a Bayesian analysis of a oneway layout using the same statistical model presented in Sec. 11.6 together with an improper prior for the model parameters. (We could use a proper prior, but the additional calculations would divert our attention from the simulation issues.) Let τ = 1/σ 2, as we did in Sec. 8.6. The usual improper prior for the parameters (μ1, . . . , μp, τ) has “p.d.f.” 1/τ . The posterior joint p.d.f. is then proportional to 1/τ 12.3 Simulating Specific Distributions 809 times the likelihood. The observed data are yij for j = 1, . . . , ni and i = 1, . . . , p. The likelihood function is (2π) −n/2τ n/2 exp ⎛ ⎝−τ 2 p i=1 ni j=1 (yij − μi)2 ⎞ ⎠, where n = n1 + . . . + np. 
To simplify the likelihood function, we can rewrite the sum of squares that appears in the exponent as
\[ \sum_{i=1}^{p}\sum_{j=1}^{n_i}(y_{ij} - \mu_i)^2 = \sum_{i=1}^{p} n_i(\overline{y}_{i+} - \mu_i)^2 + S^2_{\text{Resid}}, \]
where \(\overline{y}_{i+}\) is the average of \(y_{i1}, \ldots, y_{in_i}\) and \(S^2_{\text{Resid}} = \sum_{i=1}^{p}\sum_{j=1}^{n_i}(y_{ij} - \overline{y}_{i+})^2\) is the residual sum of squares. Then the posterior p.d.f. is proportional to
\[ \tau^{p/2}\exp\!\left(-\frac{\tau}{2}\sum_{i=1}^{p} n_i(\overline{y}_{i+} - \mu_i)^2\right)\tau^{(n-p)/2 - 1}\exp\!\left(-\frac{\tau}{2}S^2_{\text{Resid}}\right). \]
This expression is easily recognized as the product of the gamma p.d.f. for \(\tau\) with parameters \((n - p)/2\) and \(S^2_{\text{Resid}}/2\) and the product of p normal p.d.f.'s for \(\mu_1, \ldots, \mu_p\) with means \(\overline{y}_{i+}\) and precisions \(n_i\tau\) for i = 1, . . . , p. Hence, the posterior joint distribution of the parameters is the following: Conditional on \(\tau\), the \(\mu_i\)'s are independent, with \(\mu_i\) having the normal distribution with mean \(\overline{y}_{i+}\) and precision \(n_i\tau\). The marginal distribution of \(\tau\) is the gamma distribution with parameters \((n - p)/2\) and \(S^2_{\text{Resid}}/2\).
If we simulate a large sample of parameters from the posterior distribution, we could begin to answer questions about what we have learned from the data. To do this, we would first simulate a large number of \(\tau\) values \(\tau^{(1)}, \ldots, \tau^{(v)}\). Most statistical programs allow the user to simulate gamma random variables with arbitrary first parameter and second parameter 1. So, we could simulate \(T^{(1)}, \ldots, T^{(v)}\) having the gamma distribution with parameters \((n - p)/2\) and 1. We could then let \(\tau^{(\ell)} = 2T^{(\ell)}/S^2_{\text{Resid}}\) for \(\ell = 1, \ldots, v\). Then, for each \(\ell\), simulate independent \(\mu_1^{(\ell)}, \ldots, \mu_p^{(\ell)}\), with \(\mu_i^{(\ell)}\) having the normal distribution with mean \(\overline{y}_{i+}\) and variance \(1/[n_i\tau^{(\ell)}]\).
As a specific example, consider the hot dog data in Example 11.6.2. We begin by simulating v = 60,000 sets of parameters as described above. Now we can address the question of how much difference there is between the means. There are several ways to do this. We could compute the probability that all \(|\mu_i - \mu_j| > c\) for each positive c. We could compute the probability that at least one \(|\mu_i - \mu_j| > c\) for each positive c. We could compute the quantiles of \(\max_{i,j}|\mu_i - \mu_j|\), of \(\min_{i,j}|\mu_i - \mu_j|\), or of the average of all \(|\mu_i - \mu_j|\). For example, in 99 percent of the 60,000 simulations, at least one \(|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 27.94\). The simulation standard error of this estimator of the 0.99 quantile of \(\max_{i,j}|\mu_i - \mu_j|\) is 0.1117. (For the remainder of this example, we shall present only the simulation estimates and not their simulation standard errors.) In about 1/2 of the simulations, all \(|\mu_i^{(\ell)} - \mu_j^{(\ell)}| > 2.379\). And in 99 percent of the simulations, the average of the differences was at least 14.59. Whether 27.94, 14.59, or 2.379 count as large differences depends on what decisions we need to make about the hot dogs.
A useful way to summarize all of these calculations is through a plot of the sample c.d.f.'s of the largest, smallest, and average of the six \(|\mu_i - \mu_j|\) differences. (The sample c.d.f. of a set of numbers is defined at the very beginning of Sec. 10.6.)

Figure 12.4 Sample c.d.f.'s of the maximum, average, and minimum of the six \(|\mu_i - \mu_j|\) differences for Example 12.3.7.

Table 12.4 Posterior probabilities that each \(\mu_i\) is largest and smallest in Example 12.3.7

| Type | Beef | Meat | Poultry | Specialty |
|---|---|---|---|---|
| \(i\) | 1 | 2 | 3 | 4 |
| \(\Pr(\mu_i \text{ largest} \mid y)\) | 0.1966 | 0.3211 | 0 | 0.4823 |
| \(\Pr(\mu_i \text{ smallest} \mid y)\) | 0 | 0 | 1 | 0 |

Figure 12.4 contains such a plot for this example. If we are simply concerned with whether or not there are any differences at all between the four types of hot dogs, then the "Maximum" curve in Fig. 12.4 is the one to examine. (Can you explain why this is the case?) We can also attempt to answer questions that we would have great difficulty addressing in the ANOVA framework of Chapter 11.
For example, we could ask what is the probability that each μi is the largest or smallest of the four. For each i, let Ni be the number of simulations j such that μ (j ) i is the smallest of μ (j ) 1 , . . . , μ (j ) 4 . Also let Mi be the number of simulations j such that μ (j ) i is the largest of the four means. Then Ni/60,000 is our simulation estimate of the probability that μi is the smallest mean, and Mi/60,000 is our estimate of the probability that μi is the largest mean. The results are summarized in Table 12.4. We see that μ3 is almost certainly the smallest, while μ4 has almost a 50 percent chance of being the largest. Example 12.3.8 Comparing Copper Ores. We shall illustrate the method of Example 12.2.4 using the data on copper ores from Example 9.6.5. Suppose that the prior distributions for all parameters are improper. The observed data consist of one sample of size 8 and another sample of size 10 with X = 2.6, 8 i=1(Xi − X)2 = 0.32, Y = 2.3, and 10 j=1(Yj − Y)2 = 0.22. The posterior distributions then have hyperparameters μx1 = 2.6, λx1 = 8, αx1 = 3.5, βx1 = 0.16, μy1 = 1.15, λy1 = 10, αy1 = 4.5, and βy1 = 0.11. The posterior distributions of τx and τy are, respectively, the gamma distribution with parameters 3.5 and 0.16 and the gamma distribution with parameters 4.5 and 12.3 Simulating Specific Distributions 811 Figure 12.5 Histogram of simulated μx − μy values together with posterior c.d.f. of |μx − μy | for Example 12.3.8. 0 0.5 1.0 3000 2000 1000 Histogram 0.2 0.4 0.6 0.8 1.0 d 1.0 0.8 0.6 0.4 0.2 0 Distribution function Pr(⏐mx2 my⏐ , d ) mx2 my 0.11. We can easily simulate, say, 10,000 pseudo-random values from each of these two distributions. For each simulated τx, we simulate a μx that has the normal distribution with mean 2.6 and variance 1/(8τx). For each simulated τy, we simulate a μy that has the normal distribution with mean 2.3 and variance 1/(10τy). Figure 12.5 contains a histogram of the 10,000 simulated μx − μy values together with the sample c.d.f. of |μx − μy |. It appears that μx − μy is almost always positive; indeed, it was positive for over 99 percent of the sampled values. The probability is quite high that |μx − μy | < 0.5, so that if 0.5 is not a large difference in this problem, we can be confident thatμx andμy are pretty close.On the other hand, if 0.1 is a large difference, we can be confident that μx and μy are pretty far apart. If all we care about in Example 12.3.8 is the distribution ofμx − μy, then we could simulate μx and μy directly without first simulating τx and τy. Since μx and μy are independent in this example, we could simulate each of them from their respective marginal distributions. Example 12.3.9 Power of the t Test. In Theorem 9.5.3, we showed how the power function of the t test can be computed from the noncentral t distribution function. Not all statistical packages compute noncentral t probabilities.We can use simulation to estimate these probabilities. Let Y have the noncentral t distribution withmdegrees of freedom and noncentrality parameter ψ. Then Y has the distribution of X1/(X2/m)1/2 where X1 and X2 are independent with X1 having the normal distribution with mean ψ and variance 1 and X2 having the χ2 distribution with m degrees of freedom. A simple way to estimate the c.d.f. of Y is to simulate a large number of (X1, X2) pairs and compute the sample c.d.f. of the values of X1/(X2/m)1/2. 
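A minimal sketch of the simulation in Example 12.3.9 follows. The particular values of m, the noncentrality, and the cutoff c are hypothetical, as are the function names.

```python
import numpy as np

rng = np.random.default_rng(4)

def noncentral_t_sample(m, psi, size, rng):
    """Simulate Y = X1 / sqrt(X2/m) with X1 ~ N(psi, 1) and X2 ~ chi^2_m,
    as described in Example 12.3.9."""
    x1 = rng.normal(loc=psi, scale=1.0, size=size)
    x2 = rng.chisquare(df=m, size=size)
    return x1 / np.sqrt(x2 / m)

m, psi, c = 14, 2.0, 2.145          # hypothetical df, noncentrality, and cutoff
y = noncentral_t_sample(m, psi, size=100_000, rng=rng)
power = np.mean(y >= c)             # estimate of Pr(Y >= c), e.g., a t-test power
se = np.sqrt(power * (1 - power) / len(y))
print(power, se)
```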
The Simulation Standard Error of a Sample c.d.f In Examples 12.3.7 and 12.3.8, we plotted the sample c.d.f.’s of functions of simulated data.We did not associate simulation standard errors with these functions.We could compute simulation standard errors for every value of the sample c.d.f., but there is a simpler way to summarize the uncertainty about a sample c.d.f.We can make use of the Glivenko-Cantelli lemma (Theorem 10.6.1). To summarize that result in the context of simulation, let Y (i), (i = 1, . . . , v) be a simulated i.i.d. sample with c.d.f. G. Let Gv be the sample c.d.f. For each real x, Gv(x) is the proportion of the simulated sample that is less than or equal to x. That is, Gv(x) is 1/v times the number of i’s such that Y (i) ≤ x. 812 Chapter 12 Simulation Theorem 10.6.1 says that if v is large, then Pr
|Gv(x) − G(x)| ≤ t v1/2 , for all x ≈ H(t), where H is the function in Table 10.32 on page 661. In particular, with t = 2, H(t) = 0.9993. So we can declare (at least approximately) that |Gv(x) − G(x)| ≤ 2/v1/2 simultaneously for all x with probability 0.9993. In Example 12.3.7, we had v = 60, 000, so each curve in Fig. 12.4 should be accurate to within 0.008 with probability 0.9993. Indeed, all three curves simultaneously should be accurate to within 0.008 with probability 0.9979. (Prove this in Exercise 14.) Simulating a Discrete Random Variable All of the examples so far in this section have concerned simulations of random variables with continuous distributions. Occasionally, one needs random variables with discrete distributions. Algorithms for simulating discrete random variables exist, and we shall describe some here. Example 12.3.10 Simulating a Bernoulli Random Variable. It is simple to simulate a pseudo-random Bernoulli random variable X with parameter p. Start with U having the uniform distribution on the interval [0, 1], and let X = 1 if U ≤ p. Otherwise, let X = 0. Since Pr(U ≤ p) = p, X has the correct distribution. This method can be used to simulate from any distribution that is supported on only two values. If f (x) = ⎧⎨ ⎩ p if x = t1, 1− p if x = t2, 0 otherwise, then let X = t1 is U ≤ p, and let X = t2 otherwise. Example 12.3.11 Simulating a Discrete Uniform Random Variable. Suppose that we wish to simulate pseudo-random variables from a distribution that has the p.f. f (x) = 1 n if x ∈ {t1, . . . , tn }, 0 otherwise. (12.3.5) The uniform distribution on the integers 1, . . . , n is an example of such a distribution. Asimple way to simulate a random variable with the p.f. (12.3.5) is the following: Let U have the uniform distribution on the interval [0, 1], and letZ be the greatest integer less than or equal to nU + 1. It is easy to see that Z takes the values 1, . . . , n with equal probability, and so X = tZ has the p.f. (12.3.5). The method described in Example 12.3.11 does not apply to more general discrete distributions. However, the method of Example 12.3.11 is useful in simulations that are done in bootstrap analyses described in Sec. 12.6. For general discrete distributions, there is an analog to the probability integral transformation. Suppose that a discrete distribution is concentrated on the values t1 < . . . < tn and that the c.d.f. is F(x) = ⎧⎪⎨ ⎪⎩ 0 ifx <t1, qi if ti ≤x <ti +1, for i = 1, . . . , n − 1, 1 ifx ≥ tn. (12.3.6) 12.3 Simulating Specific Distributions 813 The following is the quantile function from Definition 3.3.2: F −1(p) = ⎧⎪⎨ ⎪⎩ t1 if 0<p ≤ q1, ti+1 if qi <p ≤ qi+1, for i = 1, . . . , n − 2, tn if qn−1<p <1. (12.3.7) You can prove (see Exercise 13) that if U has the uniform distribution on the interval [0, 1], then F −1(U) has the c.d.f. in Eq. (12.3.6). This gives a straightforward, but inefficient, method for simulating arbitrary discrete distributions. Notice that the restriction that n be finite is not actually necessary. Even if the distribution has infinitely many possible values, F −1 can be defined by (12.3.7) by replacing n − 2 by∞and removing the last branch. Example 12.3.12 Simulating a Geometric Random Variable. Suppose that we wish to simulate a pseudorandom X having the geometric distribution with parameter p. In the notation of Eq. (12.3.7), ti = i − 1 for i = 1, 2, . . . , and qi = 1− (1− p)i . Using the probability integral transformation, we would first simulate U with the uniform distribution on the interval [0, 1]. 
Then we would compare U to qi for i = 1, 2, . . . , until the first time that qi <U and set X = i. In this example, we can avoid the sequence of comparisons because we have a simple formula for qi . The first i such that qi <U is the greatest integer strictly less than log(1− U)/ log(1− p). The probability integral transformation is very inefficient for discrete distributions that do not have a simple formula for qi if the number of possible values is large. Walker (1974) and Kronmal and Peterson (1979) describe a more efficient method called the alias method. The alias method works as follows: Let f be the p.f. from which we wish to simulate a random variable X. Suppose that f (x)>0 for only n different values of x. First, we write f as an average of n p.f.’s that are concentrated on one or two values each. That is, f (x) = 1 n [g1(x) + . . . + gn(x)], (12.3.8) where each gi is the p.f. of a distribution concentrated on one or two values only.We shall show how to do this in Example 12.3.13. To simulate X, first simulate an integer I that has the uniform distribution over the integers 1, . . . , n. (Use the method of Example 12.3.11.) Then simulate X from the distribution with the p.f. gI . The reader can prove in Exercise 17 that X has the p.f. f . Example 12.3.13 Simulating a Binomial Random Variable Using the Alias Method. Suppose that we need to simulate many random variables with a binomial distribution having parameters 9 and 0.4. The p.f. f of this distribution is given in a table at the end of this book. The distribution has n = 10 different values with positive probability. Since the n probabilities must add to 1, there must be x1 and y1 such that f (x1) ≤ 1/n and f (y1) ≥ 1/n. For example, x1 = 0 and y1 = 2 have f (x1) = 0.0101 and f (y1) = 0.1612. Define the first two-point p.f., g1, as g1(x) = ⎧⎨ ⎩ nf (x1) if x = x1, 1− nf (x1) if x = y1, 0 otherwise. 814 Chapter 12 Simulation In our case, g1(0) = 0.101 and g1(2) = 0.899. We then write f as f (x) = g1(x)/n + f ∗ 1 (x), where f ∗ 1 (x) = ⎧⎨ ⎩ 0 if x = x1, f (y1) − g1(y1)/n if x = y1, f (x) otherwise. In our example, f ∗ 1 (2) = 0.0713. Now, f ∗ 1 is positive at only n − 1 different values, and the sum of the positive values of f ∗ 1 is (n − 1)/n. Hence, there must exist x2 and y2 such that f ∗ 1 (x2) ≤ 1/n and f ∗ 1 (y2) ≥ 1/n. For example, x2 = 2 and y2 = 3 have f ∗ 1 (x2) = 0.0713 and f ∗ 1 (y2) = 0.2508. Define g2 by g2(x) = ⎧⎨ ⎩ nf ∗ 1 (x2) if x = x2, 1− nf ∗ 1 (x2) if x = y2, 0 otherwise. Here, g2(2) = 0.713. Now write f ∗ 1 (x) = g2(x)/n + f ∗ 2(x), where f ∗ 2(x) = ⎧⎪⎨ ⎪⎩ 0 if x = x2, f ∗ 1 (y2) − g2(y2)/n if x = y2, f ∗ 1 (x) otherwise. In our example, f ∗ 2(3) = 0.2221. Now, f ∗ 2 takes only n − 2 positive values that add up to (n − 2)/n.We can repeat this process n − 3 more times, obtaining g1, . . . , gn−1 and f ∗ n−1. Here, f ∗ n−1(x) takes only one positive value, at x = xn, say, and f ∗ n−1(xn) = 1/n. Let gn be a degenerate distribution at xn. Then f (x) = [g1(x) + . . . + gn(x)]/n for all x. After all of this initial setup, the alias method allows rapid simulation from f as follows: Simulate independent U and I with U having the uniform distribution on the interval [0, 1] and I having the uniform distribution on the integers 1, . . . , n (n = 10 in our example). If U ≤ gI (xI ), set X = xI. IfU >gI (xI ), set X = yI . 
Here, the values we need to perform the simulation are

| \(i\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| \(x_i\) | 0 | 2 | 1 | 6 | 7 | 3 | 8 | 9 | 4 | 5 |
| \(y_i\) | 2 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 5 | — |
| \(g_i(x_i)\) | 0.101 | 0.713 | 0.605 | 0.743 | 0.212 | 0.781 | 0.035 | 0.003 | 0.327 | 1 |

There is even a clever way to replace the two simulations of U and I with a single simulation. Simulate Y with the uniform distribution on the interval [0, 1], and let I be the greatest integer less than or equal to nY + 1. Then let U = nY + 1 − I. (See Exercise 19.) As an example, suppose that we simulate Y with the uniform distribution on the interval [0, 1], and we obtain Y = 0.4694. Then I = 5 and U = 0.694. Since \(0.694 > g_5(x_5) = 0.212\), we set \(X = y_5 = 3\).

Figure 12.6 Histogram of 10,000 simulated binomial random variables in Example 12.3.13. The X marks appear at heights equal to 10,000f(x) to illustrate the close agreement of the simulated and actual distributions.

Figure 12.6 shows a histogram of 10,000 simulated values using the alias method. All of the overhead required to set up the alias method is worth the effort only if we are going to simulate many random variables with the same discrete distribution.

Summary

We have seen several examples of how to transform pseudo-random uniform variables into pseudo-random variables with other distributions. The acceptance/rejection method is widely applicable, but it might require many rejected simulations for each accepted one. Also, we have seen how we can simulate random variables that are functions of other random variables (such as a noncentral t random variable). Several examples illustrated how we can make use of simulated random variables with some of the common distributions. Readers who desire a thorough treatment of the generation of pseudo-random variables with distributions other than uniform can consult Devroye (1986).
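Before turning to the exercises, here is a minimal sketch of the alias construction and the single-uniform sampling trick of Example 12.3.13. Choosing \(x_i\) as a remaining value of smallest probability and \(y_i\) as one of largest probability is just one convenient way to satisfy the inequalities \(f^*(x_i) \le 1/n\) and \(f^*(y_i) \ge 1/n\) described above; the function names are ours.

```python
import numpy as np
from scipy.stats import binom

def alias_setup(values, probs):
    """Build the tables (x_i, y_i, g_i(x_i)) by the construction of Example 12.3.13.
    probs must sum to 1."""
    n = len(probs)
    f = dict(zip(values, probs))
    x, y, cut = [], [], []
    for _ in range(n - 1):
        small = min(f, key=f.get)            # a value with f*(x) <= 1/n
        big = max(f, key=f.get)              # a value with f*(y) >= 1/n
        x.append(small); y.append(big); cut.append(n * f[small])
        f[big] -= (1.0 - n * f[small]) / n   # leftover mass stays with y
        del f[small]
    last = next(iter(f))
    x.append(last); y.append(last); cut.append(1.0)   # degenerate last component
    return np.array(x), np.array(y), np.array(cut)

def alias_sample(x, y, cut, size, rng):
    """One uniform per draw, split into I and U as described in the text."""
    n = len(x)
    z = rng.uniform(size=size) * n           # n * Y
    i = z.astype(int)                        # I - 1 in 0-based indexing
    u = z - i                                # the leftover uniform part
    return np.where(u <= cut[i], x[i], y[i])

rng = np.random.default_rng(5)
vals = np.arange(10)                         # support of the binomial(9, 0.4)
x, y, cut = alias_setup(vals, binom.pmf(vals, 9, 0.4))
draws = alias_sample(x, y, cut, 10_000, rng)
print(np.bincount(draws, minlength=10) / 10_000)   # compare to binom.pmf(vals, 9, 0.4)
```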
12.3.1 Exercises
- Return to Exercise 10 in Sec. 12.2. Now that we know how to simulate exponential random variables, perform the simulation developed in that exercise as follows:
- Perform v0 = 2000 simulations and compute both the estimate of θ and its simulation standard error.
- Suppose that we want our estimator of θ to be within 0.01 of θ with probability 0.99. How many simulations should we perform?
- Describe how to convert a random sample U1, . . . , Un from the uniform distribution on the interval [0, 1] to a random sample of size n from the uniform distribution on the interval [a, b].
- Show how to use the probability integral transformation to simulate random variables with the two p.d.f.’s in Eq. (12.3.3).
- Show how to simulate Cauchy random variables using the probability integral transformation.
- Prove that the expected number of iterations of the acceptance/rejection method until the first acceptance is \(k\). (Hint: Think of each iteration as a Bernoulli trial. What is the expected number of trials, not failures, until the first success?)
- Show how to simulate a random variable having the Laplace distribution with parameters 0 and 1. The p.d.f. of the Laplace distribution with parameters \(\theta\) and \(\sigma\) is given in Eq. (10.7.5).
- Show how to simulate a standard normal random variable by first simulating a Laplace random variable and then using acceptance/rejection. Hint: Maximize \(e^{-x^2/2}/e^{-x}\) for \(x \ge 0\), and notice that both distributions are symmetric around 0. (A sketch of one possible solution appears after this exercise list.)
- Suppose that you have available as many i.i.d. standard normal pseudo-random numbers as you desire. Describe how you could simulate a pseudo-random number with an F distribution with four and seven degrees of freedom.
- Let X and Y be independent random variables with X having the t distribution with five degrees of freedom and Y having the t distribution with seven degrees of freedom. We are interested in E(|X − Y |). 816 Chapter 12 Simulation
- Simulate 1000 pairs of (Xi, Yi) each with the above joint distribution and estimate E(|X − Y |).
- Use your 1000 simulated pairs to estimate the variance of |X − Y | also.
- Based on your estimated variance, how many simulations would you need to be 99 percent confident that your estimator of E(|X − Y |) is within 0.01 of the actual mean?
- Show how to use acceptance/rejection to simulate random variables with the following p.d.f.: f (x) = ⎧⎪⎪⎪⎨ ⎪⎪⎪⎩ 4 3x if 0 < x ≤ 0.5, 2 3 if 0.5< x ≤ 1.5, 8 3 − 4 3x if 1.5< x ≤ 2, 0 otherwise.
- Implement the simulation in Example 12.2.3 for the clinical trial of Example 2.1.4 on page 57. Simulate 5000 parameter vectors. Use a prior distribution with α0 = 1 and β0 = 1. Estimate the probability that the imipramine group has the highest probability of no relapse. Calculate how many simulations you would need to be 95 percent confident that your estimator is within 0.01 of the true probability.
- In Example 12.3.7, we simulated the τ values by first simulating gamma random variables with parameters (n − p)/2 and 1. Suppose that our statistical software allows us to simulate χ2 random variables instead. Which χ2 distribution should we use and how would we convert the simulated χ2’s to have the appropriate gamma distribution?
- Use the blood pressure data in Table 9.2 that was described in Exercise 10 of Sec. 9.6. Suppose now that we are not confident that the variances are the same for the two treatment groups. Perform a simulation of the sort done in Example 12.3.8 to obtain a sample from the posterior distribution of the parameters when we allow the variances to be unequal.
- Draw a plot of the sample c.d.f. of the absolute value of the difference between the two group means.
- Draw a histogram of the logarithm of the ratio of the two variances to see how close together they seem to be.
- Let F −1 be defined as in Eq. (12.3.7). Let U have the uniform distribution on the interval [0, 1]. Prove that F −1(U) has the c.d.f. in Eq. (12.3.6).
- Refer to the three curves in Fig. 12.4. Call those three sample c.d.f.’s Gv,1, Gv,2, and Gv,3, and call the three c.d.f.’s that they estimate G1, G2, and G3. Use the Glivenko-Cantelli lemma (Theorem 10.6.1) to show that Pr |Gv,i(x) − Gi(x)| ≤ 0.0082, for all x and all i is about 0.9979 or larger. Hint: Use the Bonferroni inequality (Theorem 1.5.8).
- Prove that the acceptance/rejection method works for discrete distributions. That is, let f and g be p.f.’s rather than p.d.f.’s, but let the rest of the acceptance/rejection method be exactly as stated. Hint: The proof can be translated by replacing integrals over x by sums. Integrals over u should be left as integrals.
- Describe how to use the discrete version of the probability integral transformation to simulate a Poisson pseudo-random variable with mean θ.
- Let f be a p.f., and assume that Eq. (12.3.8) holds, where each gi is another p.f. Assume that X is simulated using the method described immediately after Eq. (12.3.8). Prove that X has the p.f. f .
- Use the alias method to simulate a random variable having the Poisson distribution with mean 5. Use the table of Poisson probabilities in the back of the book, and assume that 16 is the largest value that a Poisson random variable can equal. Assume that all of the probability not accounted for by the values 0, . . . , 15 is the value of the p.f. at k = 16.
- Let Y have the uniform distribution on the interval [0, 1]. Define I to be the greatest integer less than or equal to nY + 1, and define U = nY + 1− I . Prove that I and U are independent and that U has uniform distribution on the interval [0, 1].
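As a companion to the Laplace acceptance/rejection exercise above, here is a minimal sketch of one possible solution. It uses the smallest valid constant \(k = \sqrt{2e/\pi}\), obtained by maximizing \(f(x)/g(x)\); the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(6)

def laplace01(rng):
    """Probability integral transformation for the Laplace(0, 1) distribution."""
    u = rng.uniform()
    return np.log(2.0 * u) if u < 0.5 else -np.log(2.0 * (1.0 - u))

def standard_normal_by_rejection(rng):
    """Acceptance/rejection with a Laplace(0, 1) envelope; k = sqrt(2e/pi)."""
    k = np.sqrt(2.0 * np.e / np.pi)
    while True:
        x = laplace01(rng)
        u = rng.uniform()
        # f(x)/g(x) with f standard normal and g the Laplace(0, 1) p.d.f.
        f_over_g = np.sqrt(2.0 / np.pi) * np.exp(np.abs(x) - 0.5 * x * x)
        if f_over_g >= k * u:
            return x

z = np.array([standard_normal_by_rejection(rng) for _ in range(20_000)])
print(z.mean(), z.var())     # should be near 0 and 1
```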
12.4 Importance Sampling Many integrals can usefully be rewritten as means of functions of random variables. If we can simulate large numbers of random variables with the appropriate distributions, we can use these to estimate integrals that might not be possible to compute in closed form. Simulation methods are particularly well suited to estimating means of random variables. If we can simulate many random variables with the appropriate distribution, 12.4 Importance Sampling 817 we can average the simulated values to estimate the mean. Because means of random variables with continuous distributions are integrals, we might wonder whether other integrals can also be estimated by simulation methods. In principle, all finite integrals can be estimated by simulation, although some care is needed to insure that the simulation results have finite variance. Suppose that we wish to calculate b a g(x) dx for some function g with a and b both finite. We can rewrite this integral as b a g(x) dx = b a (b − a)g(x) 1 b − a dx = E[(b − a)g(X)], (12.4.1) where X is a random variable with the uniform distribution on the interval [a, b]. A simple Monte Carlo method is to simulate a large number of pseudo-random valuesX1, . . . , Xv with the uniform distribution on the interval [a, b]and estimate the integral by b−a v v i=1 g(Xi).The method just described has two commonly recognized drawbacks. First, it cannot be applied to estimate integrals over unbounded regions. Second, it can be very inefficient. If g is much larger over one portion of the interval than over another, then the values g(Xi) will have large variance, and it will take a very large value v to get a good estimator of the integral. A method that attempts to overcome both of the shortcomings just mentioned is called importance sampling. The idea of importance sampling is to do something very much like what we did in Eq. (12.4.1). That is, we shall rewrite the integral as the mean of some function of X, where X has a distribution that we can simulate easily. Suppose that we are able to simulate a pseudo-random variable X with the p.d.f. f where f (x)>0 whenever g(x) > 0. Then we can write g(x) dx = g(x) f (x) f (x) dx = E(Y), (12.4.2) where Y = g(X)/f (X). (If f (x) = 0 for some x such that g(x) > 0, then the two integrals in Eq. (12.4.2) might not be equal.) If we simulate v independent values X1, . . . , Xv with the p.d.f. f , we can estimate the integral by 1 v v i=1 Yi where Yi = g(Xi)/f (Xi). The p.d.f. f is called the importance function. It is acceptable, although inefficient, to have f (x)>0 for some x such that g(x) = 0. The key to efficient importance sampling is choosing a good importance function. The smaller the variance of Y , the better the estimator should be. That is, we would like g(X)/f (X) to be close to being a constant random variable. Example 12.4.1 Choosing an Importance Function. Suppose that we want to estimate 1 0 e −x/ (1+ x2)dx. Here are five possible choices of importance function: f0(x) = 1, for 0 < x <1, f1(x) = e −x, for 0 < x <∞, f2(x) = (1+ x2) −1/π, for −∞< x <∞, f3(x) = e −x/(1− e −1), for 0 < x <1, f4(x) = 4(1+ x2) −1/π, for 0 < x <1. Each of these p.d.f.’s is positive wherever g is positive, and each one can be simulated using the probability integral transformation. As an example, we have simulated 10,000 uniforms on the interval [0, 1], U(1), . . . , U(10,000). 
We then applied the five probability integral transformations to this single set of uniforms so that our comparisons do not suffer from variation due to different underlying uniform samples. 818 Chapter 12 Simulation Table 12.5 Monte Carlo estimates and ˆσj for Example 12.4.1 j 0 1 2 3 4 Y j 0.5185 0.5110 0.5128 0.5224 0.5211 ˆσ j 0.2440 0.4217 0.9312 0.0973 0.1409 Since the five p.d.f.’s are positive over different ranges, we should define g(x) = e −x/(1+ x2) if 0 < x <1, 0 otherwise. Let Fj stand for the c.d.f. corresponding to fj , and let X(i) j = F −1 j (U(i)) for i = 1, . . . , 10, 000 and j = 0, . . . , 4. Let Y (i) j = g(X(i) j )/fj(X(i) j ). Then we obtain five different estimators of g(x) dx, namely, Y j = 1 10,000 1 0,000 i=1 Y (i) j , for j = 0, . . . , 4. For each j , we also compute the sample variance of the Y (i) j values, ˆσ 2 j = 1 10,000 1 0,000 i=1 (Y (i) j − Y j )2. The simulation standard error of Y j is ˆσj/100. We list the five estimates together with the corresponding values of ˆσj in Table 12.5. The estimates are relatively close together, but some values of ˆσj are almost 10 times others. This can be understood in terms of how well each fj approximates the function g. First, note that the two worst cases are those in which fj is positive on an unbounded interval. This causes us to simulate a large number of X(i) j values for which g(X(i) j ) = 0 and hence Y (i) j = 0. This is highly inefficient. For example, with j = 2, 75 percent of the X(i) 2 values are outside of the interval (0, 1). The remaining Y (i) 2 values must be very large in order for the average to come out near the correct answer. In other words, because the Y (i) 2 values are so spread out (they range between 0 and π), we get a large value of ˆσ2. On the other hand, with j = 3, there are no 0 values for Y (i) 3 . Indeed, the Y (i) 3 values only range from 0.3161 to 0.6321. This allows ˆσ3 to be quite small. The goal in choosing an importance function is to make the Y (i) values have small variance. This is achieved by making the ratio g/f as close to constant as we can. Example 12.4.2 Calculating a Mean with No Closed-Form Expression. Let X have the gamma distribution with parameters α and 1. Suppose that we want the mean of 1/(1+ X + X2). We might wish to think of this mean as ∞ 0 1 1+ x + x2 fα(x) dx, (12.4.3) where fα is the p.d.f. of the gamma distribution with parameters α and 1. If α is not small, fα(x) is close to 0 near x = 0 and is only sizeable for x near α. For large x, 12.4 Importance Sampling 819 1/(1+ x + x2) is a lot like 1/x2. If α and x are both large, the integrand in (12.4.3) is approximately x −2fα(x). Since x −2fα(x) is a constant times fα−2(x), we could do importance sampling with importance function fα−2. For example, with α = 5, we simulate 10,000 pseudo-random variables X(1), . . . , X(10,000) having the gamma distribution with parameters 3 and 1. The sample mean of [1/(1+ X(i) + X(i)2)]f5(X(i))/f3(X(i)) is 0.05184 with sample standard deviation 0.01465. For comparison, we also simulate 10,000 pseudo-random variables Y (1), . . . , Y(10,000) with the gamma distribution having parameters 5 and 1. The average of the values of 1/(1+ Y (i) + Y (i)2) is 0.05226 with sample standard deviation 0.05103, about 3.5 times as large as we get using the f3 importance function.With α = 3, however, the two methods have nearly equal sample standard deviations. 
With α = 10, the importance sampling has sample standard deviation about one-tenth as large as sampling directly from the distribution of X. As we noted earlier, when α is large, 1/x2 is a better approximation to 1/(1+ x + x2) than it is when α is small. Example 12.4.3 Bivariate Normal Probabilities. Let (X1, X2) have a bivariate normal distribution, and suppose that we are interested in the probability of the event {X1 ≤ c1, X2 ≤ c2} for specific values c1, c2. In general, we cannot explicitly calculate the double integral c2 −∞ c1 −∞ f (x1, x2) dx1 dx2, (12.4.4) where f (x1, x2) is the joint p.d.f. of (X1, X2).We can write the joint p.d.f. as f (x1, x2) = g1(x1|x2)f2(x2), where g1 is the conditional p.d.f. of X1 given X2 = x2 and f2 is the marginal p.d.f. of X2. Both of these p.d.f.’s are normal p.d.f.’s, as we learned in Sec. 5.10. In particular, the conditional distribution of X1 given X2 = x2 is the normal distribution with mean and variance given by Eq. (5.10.8).We can explicitly perform the inner integration in (12.4.4) as c1 −∞ f (x1, x2) dx1 = c1 −∞ g(x1|x2)f2(x2) dx1 = f2(x2)
\(\Phi\!\left(\dfrac{c_1 - \mu_1 - \rho\sigma_1(x_2 - \mu_2)/\sigma_2}{\sigma_1(1 - \rho^2)^{1/2}}\right)\), where \(\Phi\) is the standard normal c.d.f. The integral in (12.4.4) is then the integral of this last expression with respect to \(x_2\). An efficient importance function might be the conditional p.d.f. of \(X_2\) given that \(X_2 \le c_2\). That is, let h be the p.d.f.
\[ h(x_2) = \frac{(2\pi\sigma_2^2)^{-1/2}\exp\!\left(-\frac{1}{2\sigma_2^2}(x_2 - \mu_2)^2\right)}{\Phi\!\left(\frac{c_2 - \mu_2}{\sigma_2}\right)}, \qquad \text{for } -\infty < x_2 \le c_2. \qquad (12.4.5) \]
It is not difficult to see that if U has the uniform distribution on the interval [0, 1], then
\[ W = \mu_2 + \sigma_2\,\Phi^{-1}\!\left(U\,\Phi\!\left(\frac{c_2 - \mu_2}{\sigma_2}\right)\right) \qquad (12.4.6) \]
has the p.d.f. h. (See Exercise 5.) If we use h as an importance function and simulate \(W^{(1)}, \ldots, W^{(v)}\) with this p.d.f., then our estimator of the integral (12.4.4) is
\[ \frac{1}{v}\sum_{i=1}^{v}\Phi\!\left(\frac{c_1 - \mu_1 - \rho\sigma_1(W^{(i)} - \mu_2)/\sigma_2}{\sigma_1(1 - \rho^2)^{1/2}}\right)\Phi\!\left(\frac{c_2 - \mu_2}{\sigma_2}\right). \]

It is not always possible to guarantee that an importance sampling estimator will have finite variance. In the examples in this section, we have managed to find importance functions with the following property: the ratio of the function being integrated to the importance function is bounded. This property guarantees finite variance for the importance sampling estimator. (See Exercise 8.)

Stratified Importance Sampling

Suppose that we are trying to estimate \(\theta = \int g(x)\,dx\), and that we contemplate using the importance function f. The simulation variance of the importance sampling estimator of \(\theta\) arises from the variance of \(Y = g(X)/f(X)\), where X has the p.d.f. f. Indeed, if we simulate an importance sample of size n, the simulation variance of our estimator is \(\sigma^2/n\), where \(\sigma^2 = \mathrm{Var}(Y)\). Stratified importance sampling attempts to reduce the simulation variance by splitting \(\theta\) into \(\theta = \sum_{j=1}^{k}\theta_j\) and then estimating each \(\theta_j\) with much smaller simulation variance.
The stratified importance sampling algorithm is easiest to describe when X is simulated using the probability integral transformation. Let F be the c.d.f. corresponding to the p.d.f. f. First, we split \(\theta\) as follows. Define \(q_0 = -\infty\), \(q_j = F^{-1}(j/k)\) for j = 1, . . . , k − 1, and \(q_k = \infty\). Then define
\[ \theta_j = \int_{q_{j-1}}^{q_j} g(x)\,dx, \qquad \text{for } j = 1, \ldots, k. \]
Clearly, \(\theta = \sum_{j=1}^{k}\theta_j\). Next, we estimate each \(\theta_j\) by importance sampling using the same importance function f, but restricted to the range of integration for \(\theta_j\). That is, we estimate \(\theta_j\) using importance sampling with the importance function
\[ f_j(x) = \begin{cases} kf(x) & \text{if } q_{j-1} \le x < q_j, \\ 0 & \text{otherwise.} \end{cases} \]
(See Exercise 9 to see that \(f_j\) is indeed a p.d.f.) To simulate a random variable with the p.d.f. \(f_j\), let V have the uniform distribution on the interval [(j − 1)/k, j/k] and set \(X_j = F^{-1}(V)\). The reader can prove (see Exercise 9) that \(X_j\) has the p.d.f. \(f_j\).
Let \(\sigma_j^2\) be the variance of \(g(X_j)/f_j(X_j)\). Suppose that, for each j = 1, . . . , k, we simulate an importance sample of size m with the same distribution as \(X_j\). The variance of the estimator of \(\theta_j\) will be \(\sigma_j^2/m\). Since the k estimators of \(\theta_1, \ldots, \theta_k\) are independent, the variance of the estimator of \(\theta\) will be \(\sum_{j=1}^{k}\sigma_j^2/m\). To facilitate comparison to nonstratified importance sampling, let n = mk. Stratification will be an improvement if its variance is smaller than \(\sigma^2/n\). Since n = mk, we would like to prove that at least
\[ \sigma^2 \ge k\sum_{j=1}^{k}\sigma_j^2, \qquad (12.4.7) \]
and preferably with strict inequality.
To prove (12.4.7), we note a close connection between the random variables \(X_j\) with the p.d.f. \(f_j\) and X with the p.d.f. f. Let J be a random variable with the discrete uniform distribution on the integers 1, . . . , k. Define \(X^* = X_J\), so that the conditional p.d.f. of \(X^*\) given J = j is \(f_j\). You can prove (Exercise 11) that \(X^*\) and X have the same p.d.f. Let \(Y = g(X)/f(X)\) and
\[ Y^* = \frac{g(X^*)}{f_J(X^*)} = \frac{g(X^*)}{kf(X^*)}. \]
Then \(\mathrm{Var}(Y^*|J = j) = \sigma_j^2\), and \(kY^*\) has the same distribution as Y. So,
\[ \sigma^2 = \mathrm{Var}(Y) = \mathrm{Var}(kY^*) = k^2\,\mathrm{Var}(Y^*). \qquad (12.4.8) \]
Theorem 4.7.4 says that
\[ \mathrm{Var}(Y^*) = E\left[\mathrm{Var}(Y^*|J)\right] + \mathrm{Var}\left[E(Y^*|J)\right]. \qquad (12.4.9) \]
By construction, \(E(Y^*|J = j) = \theta_j\) and \(\mathrm{Var}(Y^*|J = j) = \sigma_j^2\). Also, \(\mathrm{Var}[E(Y^*|J)] \ge 0\), with strict inequality if the \(\theta_j\) are not all the same. Since \(\Pr(J = j) = 1/k\) for j = 1, . . . , k, we have
\[ E\left[\mathrm{Var}(Y^*|J)\right] = \frac{1}{k}\sum_{j=1}^{k}\sigma_j^2. \qquad (12.4.10) \]
Combining Eqs. (12.4.8), (12.4.9), and (12.4.10), we obtain (12.4.7), with strict inequality if the \(\theta_j\) are not all equal.
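The following sketch implements both plain and stratified importance sampling for the integral of Example 12.4.1, using the importance function \(f_3\); it anticipates the comparison reported in Example 12.4.4 below. The names and the random seed are ours.

```python
import numpy as np

rng = np.random.default_rng(7)

g  = lambda x: np.exp(-x) / (1.0 + x**2)                  # integrand on (0, 1)
f3 = lambda x: np.exp(-x) / (1.0 - np.exp(-1.0))          # importance function f3
F3inv = lambda u: -np.log1p(-u * (1.0 - np.exp(-1.0)))    # inverse c.d.f. of f3

# Plain importance sampling with v = 10,000 draws
v = 10_000
x = F3inv(rng.uniform(size=v))
y = g(x) / f3(x)
print("plain     :", y.mean(), y.std(ddof=1) / np.sqrt(v))

# Stratified importance sampling with k = 10 strata of m = 1000 draws each
k, m = 10, 1_000
est, var = 0.0, 0.0
for j in range(k):
    u = rng.uniform(low=j / k, high=(j + 1) / k, size=m)
    xj = F3inv(u)
    yj = g(xj) / (k * f3(xj))        # importance function on stratum j is k*f3
    est += yj.mean()
    var += yj.var(ddof=1) / m
print("stratified:", est, np.sqrt(var))
```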
Example 12.4.4 Illustration of Stratified Importance Sampling. Consider the integral that we wanted to estimate in Example 12.4.1. The best importance function appeared to be f3, with a simulation standard error of ˆσ3/100 = 9.73 × 10−4. In the present example, we allocate 10,000 simulations among k = 10 subsets of size m = 1000 each and do stratified importance sampling by dividing the range of integration [0, 1] into 10 equal-length subintervals. Doing this,we get a Monte Carlo estimate of the integral of 0.5248. To estimate the simulation standard error, we need to estimate each σj by ˆσ ∗ j and compute 10 j=1 ˆσ ∗2 j /1000. In the simulation that we are discussing, the simulation standard error for stratified importance sampling is 1.05 × 10−4, about one-tenth as small as the unstratified version.We can also do stratified importance sampling using k = 100 subsets of size m = 100. In our simulation, the estimate of the integral is the same with simulation standard error of 1.036 × 10−5. The reason that stratified importance sampling works so well in Example 12.4.4 is that the function g(x)/f3(x) is monotone, and this makes θj change about as much as it can as j changes. Hence, Var[E(Y ∗|J)] is large, making stratification very effective. Summary We introduced the method of importance sampling for calculating integrals by simulation. The idea of importance sampling for estimating g(x) dx is to choose a p.d.f. f from which we can simulate and such that g(x)/f (x) is nearly constant. Then we rewrite the integral as [g(x)/f (x)]f (x) dx.We can estimate this last integral by averaging g(X(i))/f (X(i)) where X(1), . . . , X(v) form a random sample with the p.d.f. f. A stratified version of importance sampling can produce estimators with even smaller variance. Exercises 1. Prove that the formula in Eq. (12.4.1) is the same as importance sampling in which the importance function is the p.d.f. of the uniform distribution on the interval [a, b]. 2. Let g be a function, and suppose that we wish to compute the mean of g(X) where X has the p.d.f. f . Suppose that we can simulate pseudo-random values with the p.d.f. f . Prove that the following are the same: Simulate X(i) values with the p.d.f. f , and average the values of g(X(i)) to obtain the estimator. Do importance sampling with importance function f to estimate the integral g(x)f (x) dx. 822 Chapter 12 Simulation 3. Let Y have the F distribution with m and n degrees of freedom. We wish to estimate Pr(Y > c). Consider the p.d.f. f (x) = (n/2)cn/2 xn/2+1 if x >c, 0 otherwise. a. Explain how to simulate pseudo-random numbers with the p.d.f. f . b. Explain how to estimate Pr(Y > c) using importance sampling with the importance function f . c. Look at the form of the p.d.f. of Y , Eq. (9.7.2), and explain why importance sampling might be more efficient than sampling i.i.d. F random variables with m and n degrees of freedom if c is not small. 4. We would like to calculate the integral ∞ 0 log(1 + x) exp(−x) dx. a. Simulate 10,000 exponential random variables with parameter 1 and use these to estimate the integral. Also, find the simulation standard error of your estimator. b. Simulate 10,000 gamma random variables with parameters 1.5 and 1 and use these to estimate the integral (importance sampling). Find the simulation standard error of the estimator. (In case you do not have the gamma function available, (1.5) = √ π/2.) c. Which of the two methods appears to be more efficient? Can you explain why? 5. 
Let U have the uniform distribution on the interval [0, 1]. Show that the random variable W defined in Eq. (12.4.6) has the p.d.f. h defined in Eq. (12.4.5). 6. Suppose that we wish to estimate the integral ∞ 1 x2 √ 2π e −0.5x2 dx. In parts (a) and (b) below, use simulation sizes of 1000. a. Estimate the integral by importance sampling using random variables having a truncated normal distribution. That is, the importance function is 1 √ 2π[1− (1)] e −0.5x2 , for x >1. b. Estimate the integral by importance sampling using random variables with the p.d.f. x exp(0.5[1− x2]), forx >1. Hint: Prove that such random variables can be obtained as follows: Start with a random variable that has the exponential distribution with parameter 0.5, add 1, then take the square root. c. Compute and compare simulation standard errors for the two estimators in parts (a) and (b). Can you explain why one is so much smaller than the other? 7. Let (X1, X2) have the bivariate normal distribution with both means equal to 0, both variances equal to 1, and the correlation equal to 0.5. We wish to estimate θ = Pr(X1 ≤ 2, X2 ≤ 1) using simulation. a. Simulate a sample of 10,000 bivariate normal vectors with the above distribution. Use the proportion of vectors satisfying the two inequalities X1 ≤ 2 and X2 ≤ 1 as the estimator Z of θ. Also compute the simulation standard error of Z. b. Use the method described in Example 12.4.3 with 10,000 simulations to produce an alternative estimator Z of θ. Compute the simulation standard error of Z and compare Z to the estimate in part (a). 8. Suppose that we wish to approximate the integral g(x) dx. Suppose that we have a p.d.f. f that we shall use as an importance function. Suppose that g(x)/f (x) is bounded. Prove that the importance sampling estimator has finite variance. 9. Let F be a continuous strictly increasing c.d.f. with p.d.f. f . Let V have the uniform distribution on the interval [a, b] with 0 ≤ a < b ≤ 1. Prove that the p.d.f. of X = F −1(V ) is f (x)/(b − a) for F −1(a) ≤ x ≤ F −1(b). (If a = 0, let F −1(a)=−∞. If b = 1, let F −1(b)=∞.) 10. For the situation described in Exercise 6, use stratified importance sampling as follows: Divide the interval (1,∞) into five intervals that each have probability 0.2 under the importance distribution. Sample 200 observations from each interval. Compute the simulation standard error. Compare this simulation to the simulation in Exercise 6 for each of parts (a) and (b). 11. In the notation used to develop stratified importance sampling, prove that X ∗ = XJ and X have the same distribution. Hint: The conditional p.d.f. of X ∗ given J = j is fj . Use the law of total probability. 12. Consider again the situation described in Exercise 15 of Sec. 12.2. Suppose that Wu has the Laplace distribution with parameters θ = 0 and σ = 0.1u1/2. See Eq. (10.7.5) for the p.d.f. a. Prove that the m.g.f. of Wu is ψ(t) = 1− t2u 100 −1 , for −10u −1/2< t<10u −1/2. b. Let r = 0.06 be the risk-free interest rate. Simulate a large number v of values of Wu with u = 1 and use these to estimate the price of an option to buy one share of this stock at time u = 1 in the future for the current price S0. Also compute the simulation standard error. c. Use importance sampling to improve on the simulation in part (b). Instead of simulating Wu values directly, simulate from the conditional distribution of Wu given that Su > S0. How much smaller is the simulation standard error?
- The method of control variates is a technique for reducing the variance of a simulation estimator. Suppose that we wish to estimate θ = E(W).Acontrol variate is another random variable V that is positively correlated with W and whose mean μ we know. Then, for every constant k > 0, E(W − kV + kμ) = θ. Also, if k is chosen carefully, Var(W − kV + kμ)<Var(W). In this exercise,we shall see how to use control variates for importance sampling, but the method is very general. Suppose that we wish to compute g(x) dx, andwewish to use the importance function f . Suppose that there is a function h such that h is similar to g but h(x) dx is known to equal the value c. Let k be a constant. Simulate X(1), . . . , X(v) with the p.d.f. f , and define W(i) = g(X(i)) f (X(i)) , V (i) = h(X(i)) f (X(i)) , Y (i) = W(i) − kV (i), for all i. Our estimator of g(x) dx is then Z = 1 v v i=1 Y (i) + kc.
- Prove that E(Z) = g(x) dx.
- Let Var(W(i)) = σ2 W and Var(V (i)) = σ2 V . Let ρ be the correlation between W(i) andV (i). Prove that the value of k that makes Var(Z) the smallest is k = σWρ/σV .
- Suppose that we wish to integrate the same function g(x) as in Example 12.4.1.
- Use the method of control variates that was described in Exercise 13 to estimate g(x) dx. Let h(x) = 1/(1+ x2) for 0 < x <1, and k = e −0.5. (This makes h about the same size as g.) Let f (x) be the function f3 in Example 12.4.1. How does the simulation standard error using control variates compare to not using control variates?
- Estimate the variances and correlation of the W(i)’s and V (i)’s (notation of Exercise 13) to see what a good value for k might be.
- The method of antithetic variates is a technique for reducing the variance of simulation estimators. Antithetic variates are negatively correlated random variables that share a common mean and common variance. The variance of the average of two antithetic variates is smaller than the variance of the average of two i.i.d. variables. In this exercise, we shall see how to use antithetic variates for importance sampling, but the method is very general. Suppose that we wish to compute g(x) dx, and we wish to use the importance function f . Suppose that we generate pseudo-random variables with the p.d.f. f using the probability integral transformation. That is, for i = 1, . . . , v, let X(i) = F −1(U(i)), where U(i) has the uniform distribution on the interval [0, 1] and F is the c.d.f. corresponding to the p.d.f. f . For each i = 1, . . . , v, define T (i) = F −1(1− U(i)), W(i) = g(X(i)) f (X(i)) , V (i) = g(T (i)) f (T (i)) , Y (i) = 0.5 # W(i) + V (i) $ . Our estimator of g(x) dx is then Z = 1 v v i=1 Y (i).
- Prove that T (i) has the same distribution as X(i).
- Prove that E(Z) = ∫ g(x) dx.
- If g(x)/f (x) is a monotone function, explain why we would expect W(i) and V (i) to be negatively correlated.
- If W(i) and V (i) are negatively correlated, show that Var(Z) is less than the variance one would get with 2v simulations without antithetic variates.
- Use the method of antithetic variates that was described in Exercise 15. Let g(x) be the function that we tried to integrate in Example 12.4.1. Let f(x) be the function f3 in Example 12.4.1. Estimate Var(Y(i)), and compare it to the estimate σ̂_3² from Example 12.4.1.
- For each of the exercises in this section that requires a simulation, see if you can think of a way to use control variates or antithetic variates to reduce the variance of the simulation estimator.
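As an illustration of the control-variate construction in Exercise 13 (and of the pilot estimation of k suggested in Exercise 14), here is a minimal Python sketch. Because the g and f3 of Example 12.4.1 are not reproduced in this section, the integrand below is only a stand-in: g(x) = e^(−x)/(1 + x²) on (0, 1), with control function h(x) = 1/(1 + x²), whose integral over (0, 1) is c = π/4, and with the uniform p.d.f. on (0, 1) as the importance function.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in integrand and control function (not the g and f3 of Example 12.4.1).
def g(x):
    return np.exp(-x) / (1.0 + x ** 2)

def h(x):
    return 1.0 / (1.0 + x ** 2)

c = np.pi / 4.0            # known value of the integral of h over (0, 1)

v = 10_000
X = rng.uniform(size=v)    # draws from the importance function f = uniform p.d.f. on (0, 1)
W = g(X)                   # W(i) = g(X(i))/f(X(i)); here f(x) = 1
V = h(X)                   # V(i) = h(X(i))/f(X(i))

# Pilot estimate of the optimal k = rho * sigma_W / sigma_V, as in Exercise 14(b),
# computed from the same draws (a common shortcut).
k = np.cov(W, V, bias=True)[0, 1] / V.var()

Y = W - k * V
Z = Y.mean() + k * c                       # control-variate estimator of the integral
se_control = Y.std() / np.sqrt(v)          # simulation standard error with the control variate
se_plain = W.std() / np.sqrt(v)            # simulation standard error without it
print(Z, se_control, se_plain)

Because h tracks the shape of g closely here, W and V are strongly positively correlated, so the control-variate standard error should come out noticeably smaller than the plain importance-sampling standard error.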
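The antithetic-variate construction of Exercise 15 can be sketched in a similar way. Again the integrand is only a stand-in, chosen so that g(x)/f(x) is monotone: here g(x) = e^(−x)/(1 + x) on (0, ∞) and f is the standard exponential p.d.f., so that F⁻¹(u) = −log(1 − u) and g(X)/f(X) = 1/(1 + X).

import numpy as np

rng = np.random.default_rng(0)

# Stand-in integrand: integral of exp(-x)/(1+x) over (0, infinity).
def g(x):
    return np.exp(-x) / (1.0 + x)

def f(x):                      # importance function: standard exponential p.d.f.
    return np.exp(-x)

def F_inv(u):                  # quantile function corresponding to f
    return -np.log1p(-u)

v = 10_000
U = rng.uniform(size=v)
X = F_inv(U)                   # X(i) = F^{-1}(U(i))
T = F_inv(1.0 - U)             # antithetic partner T(i) = F^{-1}(1 - U(i))

W = g(X) / f(X)
V = g(T) / f(T)
Y = 0.5 * (W + V)              # Y(i) = [W(i) + V(i)]/2

Z = Y.mean()                                 # antithetic estimator of the integral
se_antithetic = Y.std() / np.sqrt(v)
# For comparison: 2v independent draws without antithetic pairing.
X2 = F_inv(rng.uniform(size=2 * v))
W2 = g(X2) / f(X2)
se_plain = W2.std() / np.sqrt(2 * v)
print(Z, se_antithetic, se_plain)

Since g/f is monotone, W(i) and V(i) are negatively correlated, and the antithetic standard error should be smaller than the standard error based on 2v independent draws.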
12.5 Markov Chain Monte Carlo The techniques described in Sec. 12.3 for generating pseudo-random numbers with particular distributions are most useful for univariate distributions. They can be applied in many multivariate cases, but they often become unwieldy. A method based on Markov chains (see Sec. 3.10) became popular after publications by Metropolis et al. (1953) and Gelfand and Smith (1990). We shall present only the simplest form of Markov chain Monte Carlo in this section. 824 Chapter 12 Simulation The Gibbs Sampling Algorithm We shall begin with an attempt to simulate a bivariate distribution. Suppose that the joint p.d.f. of (X1, X2) is f (x1, x2) = cg(x1, x2), where we know the function g but not necessarily the value of the constant c. This type of situation arises often when computing posterior distributions. If X1 and X2 are the parameters, the function g might be the product of the prior p.d.f. times the likelihood function (in which the data are treated as known values). The constant c = 1/ g(x1, x2) dx1 dx2 makes cg(x1, x2) the posterior p.d.f. Often it is difficult to compute c, although the methods of Sec. 12.4 might be helpful. Even if we can approximate the constant c, there are other features of the posterior distribution that we might not be able to compute easily, so simulation would be useful. If the function g(x1, x2) has a special form, then there is a powerful algorithm for simulating vectors with the p.d.f. f . The required form can be described as follows: First, consider g(x1, x2) as a function of x1 for fixed x2. This function needs to look like a p.d.f. (for X1) from which we know how to simulate pseudo-random values. Similarly, if we consider g(x1, x2) as a function of x2 for fixed x1, the function needs to look like a p.d.f. for X2 from which can simulate. Example 12.5.1 Sample from a Normal Distribution. Suppose that we have observed a sample from the normal distribution with unknown mean μ and unknown precision τ . Suppose that we use a natural conjugate prior of the form described in Sec. 8.6. The product of the prior and the likelihood is given by Eq. (8.6.7) without the appropriate constant factor. We reproduce a version of that equation here for convenience: ξ(μ, τ|x) ∝ τ α1+1/2−1 exp
−τ 1 2 λ1(μ − μ1)2 + β1 , where α1, β1, μ1, and λ1 are known values once the data have been observed. Considering this as a function of μ for fixed τ , it looks like the p.d.f. of the normal distribution with mean μ1 and variance (τλ1) −1. Considering it as a function of τ for fixed μ, it looks like the p.d.f. of the gamma distribution with parameters α1 + 1/2 and λ1(μ − μ1)2/2 + β1. Both of these distributions are easy to simulate. When we consider g(x1, x2) as a function of x1 for fixed x2, we are looking at the conditional p.d.f. of X1 given X2 = x2, except for a multiplicative factor that does not depend on x1. (See Exercise 1.) Similarly, when we consider g(x1, x2) as a function of x2 for fixed x1, we are looking at the conditional p.d.f. of X2 given X1 = x1. Once we have determined that the function g(x1, x2) has the desired form, our algorithm proceeds as follows: 1. Pick a starting value x(0) 2 for X2, and set i = 0. 2. Simulate a new value x(i+1) 1 from the conditional distribution of X1 given X2 = x(i) 2 . 3. Simulate a new value x(i+1) 2 from the conditional distribution of X2 given X1 = x(i+1) 1 . 4. Replace i by i + 1 and return to step 2. The algorithm typically terminates when i reaches a sufficiently large value. Although there currently are no truly satisfactory convergence criteria, we shall introduce one convergence criterion later in this section. This algorithm is commonly called Gibbs 12.5 Markov Chain Monte Carlo 825 sampling.The name derives from an early use of the technique byGemanandGeman (1984) for sampling from a distribution that was known as the Gibbs distribution. Some Theoretical Justification So far, we have given no justification for the Gibbs sampling algorithm. The justification stems from the fact that the successive pairs (x(1) 1 , x(1) 2 ), (x(2) 1 , x(2) 2 ) . . . form the observed sequence of states from a Markov chain. This Markov chain is much more complicated than any of the Markov chains encountered in Sec. 3.10 for two reasons. First, the states are two-dimensional, and second, the number of possible states is infinite rather than finite. Even so, one can easily recognize the basic structure of a Markov chain in the description of the Gibbs sampling algorithm. Suppose that i is the current value of the iteration index. The conditional distribution of the next state pair (X(i+1) 1 , X(i+1) 2 ) given all of the available state pairs (X(1) 1 , X(1) 2 ), . . . , (X(i) 1 , X(i) 2 ) depends only on (X(i) 1 , X(i) 2 ), the current state pair. This is the same as the defining property of finite Markov chains in Sec. 3.10. Even if we agree that the sequence of pairs forms a Markov chain, why should we believe that they come from the desired distribution? The answer lies in a generalization of the second part of Theorem 3.10.4 to more general Markov chains. The generalization is mathematically too involved to present here, and it requires conditions that involve concepts that we have not introduced in this book. Nevertheless, the Gibbs sampler is constructed from a joint distribution that one can show (see Exercise 2) is a stationary distribution for the resulting Markov chain. For the cases that we illustrate in this book, the distribution of the Gibbs sampler Markov chain does indeed converge to this stationary distribution as the number of transitions increases. (For a more general discussion, see Tierney, 1994.) 
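A minimal Python sketch of this two-coordinate Gibbs sampler for Example 12.5.1 follows. The function name and array layout are ours, and the hyperparameter values in the call are placeholders; note that NumPy's gamma generator takes a scale parameter, the reciprocal of the rate used in the conditional distribution above.

import numpy as np

def gibbs_normal_gamma(alpha1, beta1, mu1, lam1, n_iter, tau_start=1.0, seed=0):
    """Gibbs sampler for the posterior in Example 12.5.1.

    Alternates between the two conditional distributions given in the text:
      mu  | tau ~ normal with mean mu1 and variance 1/(tau*lam1)
      tau | mu  ~ gamma with parameters alpha1 + 1/2 and lam1*(mu - mu1)**2/2 + beta1
    """
    rng = np.random.default_rng(seed)
    tau = tau_start                   # step 1: starting value for the second coordinate
    mus = np.empty(n_iter)
    taus = np.empty(n_iter)
    for i in range(n_iter):
        mu = rng.normal(mu1, 1.0 / np.sqrt(tau * lam1))     # step 2
        rate = lam1 * (mu - mu1) ** 2 / 2.0 + beta1
        tau = rng.gamma(alpha1 + 0.5, 1.0 / rate)           # step 3 (scale = 1/rate)
        mus[i] = mu
        taus[i] = tau
    return mus, taus

# Placeholder hyperparameter values, for illustration only.
mus, taus = gibbs_normal_gamma(alpha1=2.0, beta1=1.0, mu1=0.0, lam1=1.0, n_iter=5000)

Each pass through the loop performs steps 2 and 3 of the algorithm once; step 4 is the loop itself.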
Because of the close connection with Markov chains, Gibbs sampling (and several related techniques) are often called Markov chain Monte Carlo. When Does the Markov Chain Converge? Although the distribution of a Markov chain may converge to its stationary distribution, after any finite time the distribution will not necessarily be the stationary distribution. In general, the distribution will get pretty close to the stationary distribution in finite time, but how do we tell, in a particular application, if we have sampled the Markov chain long enough to be confident that we are sampling from something close to the stationary distribution? Much work has been done to address this question, but there is no foolproof method. Several methods for assessing convergence of a Markov chain in a Monte Carlo analysis were reviewed by Cowles and Carlin (1996). Here we present one simple technique. Begin by sampling several versions of the Markov chain starting at k different initial values x(0) 2,1, . . . , x(0) 2, k. These k Markov chains will be useful not only for assessing convergence but also for estimating the variances of our simulation estimators. It is wise to choose the initial values x(0) 2,1, . . . , x(0) 2, k to be quite spread out. This will help us to determine whether we have a Markov chain that is very slow to converge. Next, apply the Gibbs sampling algorithm starting at each of the k initial values. This gives us k independent Markov chains, all with the same stationary distribution. If the k Markov chains have been sampled for m iterations, we can think of the observed values of X1 (or of X2) as k samples of size m each. For ease of notation, let Ti,j stand for either the value of X1 or the value of X2 from the jth iteration of the 826 Chapter 12 Simulation ith Markov chain. (We shall repeat the following analysis once for X1 and once for X2.) Now, treat Ti,j for j = 1, . . . , m as a sample of size m from the ith of k distributions for i = 1, . . . , k. If we have sampled long enough for the Markov chains to have converged approximately, then all k of these distributions should be nearly the same. This suggests that we use the F statistic from the discussion of analysis of variance (Sec. 11.6) to measure how close the k distributions are. The F statistic can be written as F = B/W where B = m k − 1 k i=1 (T i+ − T ++)2, W = 1 k(m − 1) k i=1 m j=1 (Tij − T i+)2. Here we have used the same notation as in Sec. 11.6 in which the + subscript appears in a position wherever we have averaged over all values of the subscript in that position. If the k distributions are different, then F should be large. If the distributions are the same, then F should be close to 1. As we mentioned earlier, we compute two F statistics, one using the X1 coordinates and one using the X2 coordinates. Then we could declare that we have sampled long enough when both F statistics are simultaneously less than some number slightly larger than 1. Gelman et al. (1995) describe essentially this same procedure and recommend comparing the maximum of the two F statistics to 1+ 0.44m. It is probably a good idea to start with at least m = 100 (if the iterations are fast enough) before beginning to compute the F statistics. This will help to avoid accidentally declaring success due to some “lucky” early simulations. The initial sequence of iterations of the Markov chain, before we declare convergence, is commonly called burn-in. 
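A sketch of this convergence check, assuming the draws from the k chains are stored in a NumPy array with one row per chain, might look as follows; the function names are ours, and the 1 + 0.44m cutoff follows the discussion above.

import numpy as np

def f_statistic(T):
    """F = B/W for one coordinate; T has shape (k, m): k chains, m iterations each."""
    k, m = T.shape
    chain_means = T.mean(axis=1)                  # T-bar_{i+}
    grand_mean = T.mean()                         # T-bar_{++}
    B = m / (k - 1) * np.sum((chain_means - grand_mean) ** 2)
    W = np.sum((T - chain_means[:, None]) ** 2) / (k * (m - 1))
    return B / W

def burned_in(chains):
    """Check the criterion max F < 1 + 0.44*m over all coordinates.

    `chains` has shape (k, m, p): k chains, m iterations, p coordinates.
    """
    k, m, p = chains.shape
    fs = [f_statistic(chains[:, :, j]) for j in range(p)]
    return max(fs) < 1.0 + 0.44 * m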
After the burn-in iterations, one would typically treat the ensuing iterations as observations from the stationary distribution. It is common to discard the burn-in iterations because we are not confident that their distribution is close to the stationary distribution. Iterations of a Markov Chain are dependent, however, so one should not treat them as an i.i.d. sample. Even though we computed an F statistic from the various dependent observations, we did not claim that the statistic had an F distribution. Nor did we compare the statistic to a quantile of an F distribution to make our decision about convergence.We merely used the statistic as an ad hoc measure of how different the k Markov chains are. Example 12.5.2 Nursing Homes in NewMexico. We shall use the data from Sec. 8.6 on the numbers of medical in-patient days in 18 nonrural nursing homes in New Mexico in 1988. There, we modeled the observations as a random sample from the normal distribution with unknown mean μ and unknown precision τ . We used a natural conjugate prior and found the posterior hyperparameters to be α1 = 11, β1 = 50925.37, μ1 = 183.95, and λ1 = 20. We shall illustrate the above convergence diagnostic for the Gibbs sampling algorithm described in Example 12.5.1. As we found in Example 12.5.1, the conditional distribution of μ given τ is the normal distribution with mean 183.95 and variance (20τ) −1. The conditional distribution of τ given μ is the gamma distribution with parameters 11.5 and 50925.37 + 20(μ − 183.95)2. We shall start with the following k = 5 initial values for μ: 182.17, 227, 272, 137, 82. These were chosen by making a crude approximation to the posterior standard deviation of μ, namely, (β1/[λ1α1])1/2 ≈ 15, and then using the posterior mean together with values 3 and 6 posterior standard deviations above and below the posterior mean. We have to run the five Markov chains to the m = 2 iteration before we can compute the F statistics. 12.5 Markov Chain Monte Carlo 827 In our simulation, at m = 2, the larger of the two F statistics was already as low as 0.8862, and it stayed very close to 1 all the way to m = 100, at which time it seemed clear that we should stop the burn-in. Estimation Based on Gibbs Sampling So far, we have argued (without proof) that if we run the Gibbs sampling algorithm for many iterations (through burn-in), we should start to see pairs (X(i) 1 , X(i) 2 ) whose joint p.d.f. is nearly the function f from which we wanted to sample. Unfortunately, the successive pairs are not independent of each other even if they do have the same distribution. The law of large numbers does not tell us that the average of dependent random variables with the same distribution converges. However, the type of dependence that we get from a Markov chain is sufficiently regular that there are theorems that guarantee convergence of averages and even that the averages are asymptotically normal. That is, suppose that we wish to estimate the mean μ of some function h(X1, X2) based on m observations from the Markov chain. We can still assume that 1 m m i=1 h(X(i) 1 , X(i) 2 ) converges to μ, and that it has approximately the normal distribution with mean μ and variance σ2/m. However, the convergence will typically be slower than for i.i.d. samples, and σ2 will be larger than the variance of h(X1, X2). The reason for this is that the successive values of h(X(i) 1 , X(i) 2 ) are usually positively correlated. 
The variance of an average of positively correlated identically distributed random variables is higher than the variance of an average of the same number of i.i.d. random variables. (See Exercise 4.) We shall deal with the problems caused by correlated samples by making use of the same k independent Markov chains that we used for determining how much burnin to do. Discard the burn-in and continue to sample each Markov chain for m0 more iterations. From each Markov chain, we compute our desired estimator, either an average, a sample quantile, a sample variance, or other measure, Zj for j = 1, . . . , k. We then compute S as in Eq. (12.2.2); that is, S = ⎛ ⎝1 k k j=1 (Zj − Z)2 ⎞ ⎠ 1/2 . (12.5.1) Then S2 is an estimator of the simulation variance of the Zj ’s. Write the simulation variance as σ2/m0 and estimate σ2 by ˆσ 2 = m0S2 as we did in Example 12.2.9. Also, combine all samples from all k chains into a single sample, and use this single sample to form the overall estimator Z. The simulation standard error of our estimator Z is then (ˆσ 2/(m0k))1/2 = S/k1/2. In addition, we may wish to determine how many simulations to perform in order to obtain a precise estimator. We can substitute ˆσ for σ in Eq. (12.2.5) to get a proposed number of simulations v. These v simulations would be divided between the k Markov chains so that each chain would be run for at least v/k iterations if v/k > m0. Some Examples Example 12.5.3 Nursing Homes in New Mexico. We actually do not need Gibbs sampling in order to simulate a sample from the posterior distribution in Example 12.5.1. The reason is that we have a closed-form expression for the joint distribution of μ and τ in that example. Each of the marginal and conditional distributions are known and easy 828 Chapter 12 Simulation . . . . ………………………………………………………………. . . . t quantiles Simulated values of m Simulated values of t 24 22 2 4 120 140 160 180 200 220 240 ……………………………………………………… … . .. . . Gamma quantiles 5 10 15 20 25 0.0001 0.0002 0.0003 0.0004 0.0005 0.0006 Figure 12.7 Quantile plots of μ and τ values simulated from the posterior distribution in Example 12.5.3. The line on each plot shows the quantiles of the actual posterior distribution as found in Sec. 8.6. The horizontal axis on the left plot is labeled by quantiles of the t distribution with 22 degrees of freedom. The actual posterior of μ is a rescaling and shifting of this t distribution. The horizontal axis on the right plot is labeled by quantiles of the gamma distribution with parameters 11 and 1. The actual posterior of τ is a rescaling of this gamma distribution. to simulate. Gibbs sampling is most useful when only the conditionals are easy to simulate. However, we can illustrate the use of Gibbs sampling in Example 12.5.1 and compare the simulated results to the known marginal distributions of μ and τ . In Example 12.5.2, we started k = 5 Markov chains and burned them in for 100 iterations. Now we wish to produce a sample of (μ, τ ) pairs from the joint posterior distribution. After burn-in, we run anotherm0 = 1000 iterations for each chain.These iterations produce five correlated sequence of (μ, τ ) pairs. The correlations between successive pairs of μ values are quite small. The same is true of successive τ values. To compare the results with the known posterior distributions found in Sec. 8.6, Fig. 12.7 has a t quantile plot of the μ values and a gamma quantile plot of the τ values. (Normal quantile plots were introduced on page 720. 
Gamma and t quantile plots are constructed in the same way using gamma and t quantile functions in place of the standard normal quantile function.) The simulated values seem to lie close to the lines drawn on the plots in Fig. 12.7. (A few points in the tails stray a bit from the lines, but this occurs with virtually all quantile plots.) The lines in Fig. 12.7 show the quantiles of the actual posterior distributions, which are a t distribution with 22 degrees of freedom multiplied by 15.21 and centered at 183.95 for μ and the gamma distribution with parameters 11 and 50925.37 for τ . We can use the sample of (μ, τ ) pairs to estimate the posterior mean of an arbitrary function of (μ, τ ). For example, suppose that we are interested in the mean θ of μ + 1.645/τ 1/2, which is the 0.95 quantile of the unknown distribution of the original observations. The average of our 5000 simulated values of μ + 1.645/τ 1/2 is Z = 299.67. The value of S from Eq. (12.5.1) is 0.4119, giving us a value of ˆσ = 13.03. The simulation standard error of Z is then ˆσ/50001/2 = 0.1842. The true posterior mean of μ + 1.645/τ 1/2 can be computed exactly in this example, and it is μ1 + 1.645β 1/2 1 (α1 − .5) (α1) = 299.88, a bit more than 1 simulation standard error away from our simulated value of Z. Suppose that we want our estimator of θ to be within 0.01 of the true value with 12.5 Markov Chain Monte Carlo 829 probability 0.99. Substituting these values and ˆσ = 13.03 into Eq. (12.2.5), we find that we need v = 12,358,425 total simulations. Each of our five Markov chains would have to be run for 2,251,685 iterations. The true value of Gibbs sampling begins to emerge in problems with more than two parameters. The general Gibbs sampling algorithm for p random variables (X1, . . . , Xp) with p.d.f. f (x) = cg(x) is as follows. First, verify that g looks like an easy-to-simulate p.d.f. as a function of each variable for fixed values of all the others. Then perform these steps: 1. Pick starting values x(0) 2 , . . . , x(0) p for X2, . . . , Xp, and set i = 0. 2. Simulate a new value x(i+1) 1 from the conditional distribution of X1 given X2 = x(i) 2 , . . . , Xp = x(i) p . 3. Simulate a new value x(i+1) 2 from the conditional distribution of X2 given X1 = x(i+1) 1 , X3 = x(i) 3 , . . . , Xp = x(i) p . … p + 1. Simulate a new value x(i+1) p from the conditional distribution of Xp given X1 = x(i+1) 1 , . . . , Xp−1 = x(i+1) p−1 . p + 2. Replace i by i + 1, and return to step 2. The sequence of successive p-tuples of (X1, . . . , Xp) values produced by this algorithm is a Markov chain in the same sense as before. The stationary distribution of this Markov chain has the p.d.f. f , and the distribution of an iteration many steps after the start should be approximately the stationary distribution. Example 12.5.4 Multiple Regression with an Improper Prior. Consider a problem in which we observe data consisting of triples (Yi, x1i, x2i) for i = 1, . . . , n.We assume that the xji values are known, and we model the distribution of Yi as the normal distribution with mean β0 + β1x1i + β2x2i and precision τ . This is the multiple regression model introduced in Sec. 11.5 with the variance replaced by 1 over the precision. Suppose that we use the improper prior ξ(β0, β1, β2, τ) = 1/τ for the parameters. The posterior p.d.f. of the parameters is then proportional to the likelihood times 1/τ , which is a constant times τ n/2−1 exp −τ 2 n i=1 (yi − β0 − β1x1i − β2x2i)2 . 
(12.5.2) To simplify the ensuing formulas, we shall define some summaries of the data: x1 = 1 n n i=1 x1i, x2 = 1 n n i=1 x2i, y = 1 n n i=1 yi, s11 = n i=1 x2 1i, s22 = n i=1 x2 2i, s12 = n i=1 x1ix2i, s1y = n i=1 x1iyi, s2y = n i=1 x2iyi, syy = n i=1 y2 i . Looking at (12.5.2) as a function of τ for fixed values of β0, β1, and β2, it looks like the p.d.f. of the gamma distribution with parameters n/2 and n i=1(yi − β0 − β1x1i − β2x2i)2/2. Looking at (12.5.2) as a function of βj for fixed values of the other parameters, it is e to the power of a quadratic in βj with negative coefficient on the 830 Chapter 12 Simulation β2 j term. As such, it looks like the p.d.f. of a normal random variable with a mean that depends on the data and the other β’s and a variance that equals 1/τ times some function of the data.We can be more specific if we complete the square in the expression n i=1(yi − β0 − β1x1i − β2x2i)2 three times, each time treating a different βj as the variable of interest. For example, treating β0 as the variable of interest, we get n i=1 (yi − β0 − β1x1i − β2x2i)2 = n β0 − [y − β1x1 − β2x2] 2 , plus a term that does not depend on β0. So, the conditional distribution of β0 given the remaining parameters is the normal distribution with mean y − β1x1 − β2x2 and variance 1/[nτ]. Treating β1 as the variable of interest, we get n i=1 (yi − β0 − β1x1i − β2x2i)2 = s11(β1 − w1)2, plus a term that does not depend on β1, where w1 = 1 s11 s1y − β0nx1 − β2s12 . This means that the conditional distribution of β1 given the other parameters is the normal distribution with mean w1 and variance (τ s11) −1. Similarly, the conditional distribution of β2 given the other parameters is the normal distribution with mean w2 and variance (τ s22) −1, where w2 = 1 s22 s2y − β0nx2 − β1s12 . Example 12.5.5 Unemployment in the 1950s. In Example 11.5.9, we saw that unemployment data from the years 1951–1959 appeared to satisfy the assumptions of the multiple regression model better than the data that included the year 1950. Let us use just the last nine years of data from this example (in Table 11.12).We shall use an improper prior and Gibbs sampling to obtain samples from the posterior distribution of the parameters. The necessary conditional distributions were all given in Example 12.5.4. We just need the values of the summary statistics and n = 9: x1 = 140.7778, x2 = 6, y = 2.789, s11 = 179585, s22 = 384, s12 = 7837, s1y = 3580.9, s2y = 169.2, syy = 78.29. Once again, we shall run k = 5 Markov chains. In this problem, there are four coordinates to the parameter: βi for i = 0, 1, 2 and τ . So, we compute four F statistics and burn-in until the largest F is less than 1+ 0.44m. Suppose that this occurs at m = 4546. We then sample 10,000 more iterations from each Markov chain. Suppose that we want an interval [a, b] that contains 90 percent of the posterior distribution of β1. The numbers a and b will be the sample 0.05 and 0.95 quantiles. Based on our combined sample of 50,000 values of β1, the interval is [−0.1178, −0.0553]. In order to assess the uncertainty in the endpoints, we compute the 0.05 and 0.95 sample quantiles for each of the five Markov chains. Those values are 0.05 quantiles: − 0.1452, −0.1067, −0.1181, −0.1079, −0.1142 0.95 quantiles: − 0.0684, −0.0610, −0.0486, −0.0594, −0.0430. 12.5 Markov Chain Monte Carlo 831 The value of S based on the sample 0.05 quantiles is 0.01567, and the value of S based on the sample 0.95 quantiles is 0.01142. 
To be safe, we shall use the larger of these two to estimate the simulation standard errors of our interval endpoints. Since each chain was run for m0 =10,000 iterations, we have ˆσ = Sm 1/2 0 = 1.567. Suppose that we want each endpoint of the interval to be within 0.01 of the corresponding true quantile of the distribution of β1 with probability 0.95. (The probability that both endpoints are within 0.01 would be a bit smaller, but is harder to compute.) We could use Eq. (12.2.5) to compute how many simulations we would need. That equation yields v = 94,386, which means that each of our five chains would need to be run 18,878 iterations, about twice what we already have. For comparison, a 90 percent confidence interval for β1 constructed using the methods of Sec. 11.3 is [−0.1124, −0.0579]. This is quite close to the posterior probability interval. Although we did not do so in this text, we could have found the posterior distribution for Example 12.5.5 in closed form. Indeed, the 90 percent confidence interval calculated at the end of the example contains 90 percent of the posterior distribution in much the same way that coefficient 1− α0 confidence intervals contain posterior probability 1− α0 in Sec. 11.4 when we use improper priors. The next example is one in which a closed-form solution is not available. Example 12.5.6 Bayesian Analysis of One-Way Layout with Unequal Variances. Consider the one-way layout that was introduced in Sec. 11.6. There, we assumed that data would be observed from each of p normal distributions with possibly different means but the same variance. In order to illustrate the added power of Gibbs sampling, we shall drop the assumption that each normal distribution has the same variance. That is, for i = 1, . . . , p, we shall assume that Yi1, . . . , Yini have the normal distribution with mean μi and precision τi , and all observations are independent conditional on all parameters. Our prior distribution for the parameters will be the following: Let μ1, . . . , μp be conditionally independent given all other parameters with μi having the normal distribution with meanψ and precision λ0τi . Here, ψ is another parameter that also needs a distribution.We introduce this parameter ψ as a way of saying that we think that the μi ’s all come from a common distribution, but we are not willing to say for sure where that distribution is located.We then say that ψ has the normal distribution with meanψ0 and precision u0. For an improper prior, we could set u0 = 0 in what follows, and then ψ0 would not be needed either. Next, we model τ1, . . . , τp as i.i.d. having the gamma distribution with parameters α0 and β0. We model ψ and the τi ’s as independent. For an improper prior, we could set α0 = β0 = 0. The type of model just described is called a hierarchical model because of the way that the distributions fall into a hierarchy of levels. Figure 12.8 illustrates the levels of the hierarchy in this example. Figure 12.8 Diagram of hierarchical model in Example 12.5.6. The parameter ψ influences the distributions of the μi ’s, while the (μi, τi) parameters influence the distributions of the Yij ’s. Y11Y12 …Y1n1 m1 t1 Y21Y22 …Y2n2 m2 t2 Y31Y32 …Y3n3 m3 t3 Y41Y42 …Y4n4Y11 m4 t4 C 832 Chapter 12 Simulation The joint p.d.f. of the observations and the parameters is the product of the likelihood function (the p.d.f. 
of the observations given the μi ’s and τi ’s) times the product of the conditional prior p.d.f.’s of the μi ’s given the τi ’s and ψ, times the prior p.d.f.’s of the τi ’s times the prior p.d.f. for ψ. Aside from constants that depend neither on the data nor on the parameters, this product has the form exp −u0(ψ − ψ0)2 2 − p i=1 τi β0 + ni(μi − yi)2 + wi + λ0(μi − ψ)2 2 × !p i=1 τ α0+[ni +1]/2−1 i , (12.5.3) where wi = ni j=1(yij − yi)2 for i = 1, . . . , p. We have arranged terms in (12.5.3) so that the terms involving each parameter are close together. This will facilitate describing the Gibbs sampling algorithm. In order to set up Gibbs sampling, we need to examine (12.5.3) as a function of each parameter separately. The parameters are μ1, . . . , μp; τ1, . . . , τp; and ψ. As a function of τi , (12.5.3) looks like the p.d.f. of the gamma distribution with parameters α0+ (ni + 1)/2 and β0 + [ni(μi − yi)2 + wi + λ0(μi − ψ)2]/2. As a function of ψ, it looks like the p.d.f. of the normal distribution with mean [u0ψ0 + λ0 p i=1 τiμi]/[u0 + λ0 p i=1 τi] and precision u0 + λ0 p i=1 τi . This is obtained by completing the square for all terms involving ψ. Similarly, by completing the square for all terms involving μi , we find that (12.5.3) looks like the normal p.d.f. with mean [niyi + λ0ψ]/[ni + λ0] and precision τi(ni + λ0) as a function of μi . All of these distributions are easy to simulate. As an example, use the hot dog calorie data from Example 11.6.2. In this example, p = 4.We shall use a prior distribution in which λ0 = α0 = 1, β0 = 0.1, u0 = 0.001, and ψ0 = 170. We use k = 6 Markov chains and do m = 100 burn-in simulations, which turn out to be more than enough to make the maximum of all nine F statistics less than 1+ 0.44m.We then run each of the six Markov chains another 10,000 iterations. The samples from the posterior distribution allow us to answer any questions that we might have about the parameters, including some that we would not have been able to answer using the analysis done in Chapter 11. For example, the posterior means and standard deviations of some of the parameters are listed in Table 12.6. Notice how much different the variances 1/τi seem to be in the four groups. For example, we can compute the probability that 1/τ4 > 4/τ1 by counting up the number of iterations in which 1/τ ( ) 4 > 4/τ ( ) 1 and dividing by 60,000. The result is 0.5087, indicating that there is a large chance that at least some of the variances are quite different. If the variances are different, the ANOVA calculations in Chapter 11 are not justified. We can also address the question of how much difference there is between the μi ’s. For comparison, we shall do the same calculations that we did in Example 12.3.7. In 99 percent of the 60,000 simulations, at least one |μ( ) i − μ( ) j | > 24.66. In about onehalf of the simulations, all |μ( ) i − μ( ) j | > 2.268. And in 99 percent of the simulations, the average of the differences was at least 13.07. Figure 12.9 contains a plot of the sample c.d.f.’s of the largest, smallest, and average of the six |μi − μj | differences. Careful examination of the results in this example shows that the four μi ’s appear to be closer together than we would have thought after the analysis of Example 12.3.7. This is typical of what occurs when we use a proper prior in a hierarchical model. In Example 12.3.7, the μi ’s were all independent, and they did not have a common unknown mean in the prior. 
In Example 12.5.6, the μi ’s all have a common prior distribution with meanψ, which is an additional unknown parameter. The estimation 12.5 Markov Chain Monte Carlo 833 Table 12.6 Posterior means and standard deviations for some parameters in Example 12.5.6 Type Beef Meat Poultry Specialty i 1 2 3 4 E(μi |y) 156.7 158.4 120.7 159.7 (V ar(μi |y))1/2 3.498 5.241 6.160 10.55 E(1/τi |y) 252.3 487.3 670.8 1100 (V ar(1/τi |y))1/2 84.70 179.1 250.6 586.9 E(ψ|y) = 152.8 (V ar(ψ|y))1/2 = 10.42 Figure 12.9 Sample c.d.f.’s of the maximum, average, and minimum of the six |μi − μj | differences for Example 12.5.6. Difference Sample d.f. 0 20 40 60 80 100 0.2 0.4 0.6 0.8 1.0 Maximum difference Average difference Minimum difference of this additional parameter allows the posterior distributions of the μi ’s to be pulled toward a location that is near the average of all of the samples. With these data, the overall sample average is 147.60. Prediction All of the calculations done in the examples of this section have concerned functions of the parameters. The sample from the posterior distribution that we obtain from Gibbs sampling can also be used to make predictions and form prediction intervals for future observations. The most straightforward way to make predictions is to simulate the future data conditional on each value of the parameter from the posterior sample. Although there are more efficient methods for predicting, this method is easy to describe and evaluate. Example 12.5.7 Calories in Hot Dogs. In Example 12.5.6, we might be concerned with how different we should expect the calorie counts of two hot dogs to be. For example, let Y1 and Y3 be future calorie counts for hot dogs of the beef and poultry varieties, respectively. We can form a prediction interval for D = Y1 − Y3 as follows: For each iteration , let 834 Chapter 12 Simulation the simulated parameter vector be θ( ) = μ( ) 1 , μ( ) 2 , μ( ) 3 , μ( ) 4 , τ( ) 1 , τ( ) 2 , τ( ) 3 , τ( ) 4 , ψ( ), β( ) . For each , simulate a beef hot dog calorie count Y ( ) 1 having the normal distribution with mean μ( ) 1 and variance 1/τ ( ) 1 . Also simulate a poultry hot dog calorie count Y ( ) 3 having the normal distribution with mean μ( ) 3 and variance 1/τ ( ) 3 . Then compute D( ) = Y ( ) 1 − Y ( ) 3 . Sample quantiles of the values D(1), . . . , D(60,000) can be used to estimate quantiles of the distribution of D. For example, suppose that we want a 90 percent prediction interval for D. We simulate 60,000D( ) values as above and find the 0.05 and 0.95 sample quantiles to be −14.86 and 87.35, which are then the endpoints of our prediction interval. To assess how close the simulation estimators are to the actual quantiles of the distribution of D, we compute the simulation standard errors of the two endpoints. For the samples from each of the k = 6 Markov chains, we can compute the sample 0.05 quantiles of ourD values.We can then use these values as Z1, . . . , Z6 in Eq. (12.5.1) to compute a value S. Our simulation standard error is then S/61/2.We can then repeat this for the sample 0.95 quantiles. For the two endpoints of our interval, the simulation standard errors are 0.2447 and 0.3255, respectively. These simulation standard errors are fairly small compared to the length of the prediction interval. Example 12.5.8 Censored Arsenic Measurements. Frey and Edwards (1997) describe the National Arsenic Occurrence Survey (NAOS). 
Several hundred community water systems submitted samples of their untreated water in an attempt to help characterize the distribution of arsenic across the nation. Arsenic is one of several contaminants that the Environmental Protection Agency (EPA) is required to regulate. One difficulty in modeling the occurrence of a substance like arsenic is that concentrations are often too low to be measured accurately. In such cases, the measurements are censored. That is, we only know that the concentration of arsenic is less than some censoring point, but not how much less. In the NAOS data set, the censoring point is 0.5 microgram per liter. Each concentration less than 0.5 microgram per liter is censored. Gibbs sampling can help us to estimate the distribution of arsenic in spite of the censored observations. Lockwood et al. (2001) do an extensive analysis of the NAOS and other data and show how the distribution of arsenic differs from one state to the next and from one type of water source to the next. For convenience, let us focus our attention on the 24 observations from one state, Ohio. Of those 24 observations, 11 were taken from groundwater sources (wells). The other 13 came from surface water sources (e.g., rivers and lakes). The following are seven uncensored groundwater observations from Ohio: 9.62, 10.50, 2.30, 0.80, 17.04, 9.90, 1.32. The other four groundwater observations were censored. Suppose that we model groundwater arsenic concentrations in Ohio as having the lognormal distribution with parameters μ and σ2. One popular way to deal with censored observations is to treat them like unknown parameters. That is, let Y1, . . . , Y4 be the four unknown concentrations from the four wells where the measurements were censored. Let X1, . . . , X7 stand for the seven uncensored values. Suppose that μ and τ = 1/σ 2 have the normal-gamma prior distribution with hyperparameters μ0, λ0, α0, and β0. The joint p.d.f. of X1, . . . , X7, Y1, . . . , Y4, and μ and τ is proportional to 12.5 Markov Chain Monte Carlo 835 τ β0+(7+4+1)/2−1 exp ⎛ ⎝−τ 2 λ0(μ − μ0)2 + 7 i=1 (log(xi) − μ)2 + 4 j=1 (log(yi) − μ)2 + 2β0 ⎞ ⎠. The observed data consist of the values x1, . . . , x7 of X1, . . . , X7 together with the fact that Yj ≤ 0.5 for j = 1, . . . , 4. The conditional distributions of μ and τ given the data and the other parameters are just like what we obtained in Example 12.5.1. To be precise, μ has the normal distribution with mean λ0μ0 + 7 i=1 log(xi) + 4 j=1 log(yj ) λ0 + 11 and precision τ(λ0 + 11) conditional on τ , the Yj ’s, and the data. Also, τ has the gamma distribution with parameters α0 + (11+ 1)/2 and β0 + 1 2 ⎛ ⎝ 7 i=1 (log(xi) − μ)2 + 4 j=1 (log(yi) − μ)2 + λ0(μ − μ0)2 ⎞ ⎠, conditional on μ, the Yj ’s, and the data. The conditional distribution of the Yj ’s given μ, τ , and the data is that of i.i.d. random variables with the lognormal distribution having parameters μ and 1/τ but conditional on Yj < 0.5. That is, the conditional c.d.f. of each Yj is F(y) = ([log(y) − μ]τ 1/2) ([log(0.5) − μ]τ 1/2) , for y <0.5. We can simulate random variables with c.d.f. F so long as we can compute the standard normal c.d.f. and quantile function. Let U have the uniform distribution on the interval [0, 1]. Then Y = exp μ + τ −1/2 −1[U ([log(0.5) − μ]τ 1/2)] has the desired c.d.f., F. One example of the type of inference that is needed in an analysis of this sort is to predict arsenic concentrations for different water systems. 
Knowing the likely sizes of arsenic measurements can help water systems choose economical treatments that will meet the standards set by the EPA. For simplicity, we shall simulate one arsenic concentration at each iteration of the Markov chain. For example, suppose that (μ(i), τ(i)) are the simulated values of μ and τ at the ith iteration of the Markov chain. Then we can simulate Y (i) = exp(μ(i) + Z(τ (i)) −1/2), where Z is a standard normal random variable. Figure 12.10 shows a histogram of the simulated log(Y (i)) values from 10 Markov chains of length 10,000 each. The proportion of predicted values that are below the censoring point of log(0.5) is 0.335, with a simulation standard error of 0.001. The median predicted value on the logarithmic scale is 0.208 with a simulation standard error of 0.007.We can transform this back to the original scale of measurement using the delta method as described in Example 12.2.8. The median predicted arsenic concentration is exp(0.208) = 1.231 micrograms per liter with a simulation standard error of 0.007 exp(0.208) = 0.009. Note: There Are More-General Markov Chain Monte Carlo Algorithms. Gibbs sampling requires a special structure for the distribution we wish to simulate. We need to be able to simulate the conditional distribution of each coordinate given the other coordinates. In many problems, this is not possible for at least some, if not all, of the coordinates. If only one coordinate is difficult to simulate, one might try using an acceptance/rejection simulator for that one coordinate. If even this does not work, 836 Chapter 12 Simulation Figure 12.10 Histogram of simulated log(arsenic) values for 10,000 iterations from each of 10 Markov chains in Example 12.5.8. The vertical line is at the censoring point, log(0.5). 210 25 0 5 10 5000 10000 15000 Predicted log(arsenic) values (micrograms per liter) Count there are more-general Markov chain Monte Carlo algorithms that can be used. The simplest of these is the Metropolis algorithm introduced by Metropolis et al. (1953). An introduction to the Metropolis algorithm can be found in chapter 11 of Gelman et al. (1995) together with a further generalization due to Hastings (1970). Summary We introduced the Gibbs sampling algorithm that produces a Markov chain of observations from a joint distribution of interest. The joint distribution must have a special form. As a function of each variable, the joint p.d.f. must look like a p.d.f. from which it is easy to simulate pseudo-random variables. The Gibbs sampling algorithm cycles through the coordinates, simulating each one conditional on the values of the others.The algorithm requires a burn-in period during which the distribution of states in the Markov chain converges to the desired distribution. Assessing convergence and computing simulation standard errors of simulated values are both facilitated by running several independent Markov chains simultaneously. Exercises 1. Let f (x1, x2) = cg(x1, x2) be a joint p.d.f. for (X1, X2). For each x2, let h2(x1) = g(x1, x2). That is, h2 is what we get by considering g(x1, x2) as a function of x1 for fixed x2. Show that there is a multiplicative factor c2 that does not depend on x1 such that h2(x1)c2 is the conditional p.d.f. of X1 given X2 = x2. 2. Let f (x1, x2) be a joint p.d.f. Suppose that (X(i) 1 , X(i) 2 ) has the joint p.d.f. f . Let (X(i+1) 1 , X(i+1) 2 ) be the result of applying steps 2 and 3 of the Gibbs sampling algorithm on page 824. 
Prove that (X(i+1) 1 , X(i) 2 ) and (X(i+1) 1 , X(i+1) 2 ) also have the joint p.d.f. f . 3. Let Z1, Z2, . . . form a Markov chain, and assume that the distribution of Z1 is the stationary distribution. Show that the joint distribution of (Z1, Z2) is the same as the joint distribution of (Zi, Zi+1) for all i > 1. For convenience, you may assume that the Markov chain has finite state space, but the result holds in general. 4. LetX1, . . . ,Xn be uncorrelated, each with variance σ2. Let Y1, . . . , Yn be positively correlated, each with variance σ2. Prove that the variance of X is smaller than the variance of Y . 5. Use the data consisting of 30 lactic acid concentrations in cheese, 10 from Example 8.5.4 and 20 from Exercise 16 in Sec. 8.6. Fit the same model used in Example 8.6.2 with the same prior distributon, but this time use the Gibbs sampling algorithm described in Example 12.5.1. Simulate 10,000 pairs of (μ, τ ) parameters. Estimate the posterior mean of ( √ τμ) −1, and compute the simulation standard error of the estimator. 12.5 Markov Chain Monte Carlo 837 6. Use the data on dishwasher shipments inTable 11.13 on page 744. Suppose that we wish to fit a multiple linear regression model for predicting dishwasher shipments from time (year minus 1960) and private residential investment. Suppose that the parameters have the improper prior proportional to 1/τ . Use the Gibbs sampling algorithm to obtain a sample of size 10,000 from the joint posterior distribution of the parameters. a. Let β1 be the coefficient of time. Draw a plot of the sample c.d.f. of |β1| using your posterior sample. b. We are interested in predicting dishwasher shipments for 1986. i. Draw a histogram of the values of β0 + 26β1 + 67.2β2 from your posterior distribution. ii. For each of your simulated parameters, simulate a dishwasher sales figure for 1986 (time = 26 and private residential investment = 67.2). Compute a 90 percent prediction interval from the simulated values and compare it to the interval found in Example 11.5.7. iii. Draw a histogram of the simulated 1986 sales figures, and compare it to the histogram in part i. Can you explain why one sample seems to have larger variance than the other? 7. Use the data inTable 11.19 on page 762.This time fit the model developed in Example 12.5.6. Use the prior hyperparameters λ0 = α0 = 1, β0 = 0.1, u0 = 0.001, andψ0 = 800. Obtain a sample of 10,000 from the posterior joint distribution of the parameters. Estimate the posterior means of the three parameters μ1, μ2, and μ3. 8. In this problem, we shall outline a form of robust linear regression. Assume throughout the exercise that the data consist of pairs (Yi, xi) for i = 1, . . . , n. Assume also that the xi ’s are all known and the Yi ’s are independent random variables. We shall only deal with simple regression here, but the method easily extends to multiple regression. a. Let β0, β1, and σ stand for unknown parameters, and let a be a known positive constant. Prove that the following two models are equivalent. That is, prove that the joint distribution of (Y1, . . . , Yn) is the same in both models. Model 1: For each i, [Yi − (β0 + β1xi)]/σ has the t distribution with a degrees of freedom. Model 2: For each i, Yi has the normal distribution with mean β0 + β1xi and variance 1/τi conditional on τi . Also, τ1, . . . , τn are i.i.d. having the gamma distribution with parameters a/2 and aσ2/2. Hint: Use the same argument that produced the marginal distribution of μ in Sec. 
8.6 when μ and τ had a normal-gamma distribution. b. Now consider Model 2 from part (a). Let η = σ², and assume that η has a prior distribution that is the gamma distribution with parameters b/2 and f/2, where b and f are known constants. Assume that the parameters β0 and β1 have an improper prior with "p.d.f." 1. Show that the product of likelihood and prior "p.d.f." is a constant times

η^((na+b)/2−1) ∏_{i=1}^n τ_i^((a+1)/2−1) exp( −(1/2)[ fη + Σ_{i=1}^n τ_i ( aη + (y_i − β0 − β1x_i)² ) ] ).   (12.5.4)

c. Consider (12.5.4) as a function of each parameter for fixed values of the others. Show that Table 12.7 specifies the appropriate conditional distribution for each parameter given all of the others.

Table 12.7 Parameters and conditional distributions for Exercise 8 (for each parameter, (12.5.4) looks like the p.d.f. of the listed distribution):
η: the gamma distribution with parameters (na + b)/2 and (f + a Σ_{i=1}^n τ_i)/2
τ_i: the gamma distribution with parameters (a + 1)/2 and [aη + (y_i − β0 − β1x_i)²]/2
β0: the normal distribution with mean Σ_{i=1}^n τ_i(y_i − β1x_i)/Σ_{i=1}^n τ_i and precision Σ_{i=1}^n τ_i
β1: the normal distribution with mean Σ_{i=1}^n τ_i x_i(y_i − β0)/Σ_{i=1}^n τ_i x_i² and precision Σ_{i=1}^n τ_i x_i²

9. Use the data in Table 11.5 on page 699. Suppose that Yi is the logarithm of pressure and xi is the boiling point for the ith observation, i = 1, . . . , 17. Use the robust regression scheme described in Exercise 8 with a = 5, b = 0.1, and f = 0.1. Estimate the posterior means and standard deviations of the parameters β0, β1, and η. 10. In this problem, we shall outline a Bayesian solution to the problem described in Example 7.5.10 on page 423. Let τ = 1/σ² and use a proper normal-gamma prior of the form described in Sec. 8.6. In addition to the two parameters μ and τ, introduce n additional parameters. For i = 1, . . . , n, let Yi = 1 if Xi came from the normal distribution with mean μ and precision τ, and let Yi = 0 if Xi came from the standard normal distribution. a. Find the conditional distribution of μ given τ; Y1, . . . , Yn; and X1, . . . , Xn. b. Find the conditional distribution of τ given μ; Y1, . . . , Yn; and X1, . . . , Xn. c. Find the conditional distribution of Yi given μ; τ; X1, . . . , Xn; and the other Yj's. d. Describe how to find the posterior distribution of μ and τ using Gibbs sampling. e. Prove that the posterior mean of Yi is the posterior probability that Xi came from the normal distribution with unknown mean and variance. 11. Consider, once again, the model described in Example 7.5.10. Assume that n = 10 and the observed values of X1, . . . , X10 are −0.92, −0.33, −0.09, 0.27, 0.50, −0.60, 1.66, −1.86, 3.29, 2.30. a. Fit the model to the observed data using the Gibbs sampling algorithm developed in Exercise 10. Use the following prior hyperparameters: α0 = 1, β0 = 1, μ0 = 0, and λ0 = 1. b. For each i, estimate the posterior probability that Xi came from the normal distribution with unknown mean and variance. 12. Let X1, . . . , Xn be i.i.d. with the normal distribution having mean μ and precision τ. Gibbs sampling allows one to use a prior distribution for (μ, τ) in which μ and τ are independent. Let the prior distribution of μ be the normal distribution with mean μ0 and variance γ0. Let the prior distribution of τ be the gamma distribution with parameters α0 and β0. a. Show that Table 12.8 specifies the appropriate conditional distribution for each parameter given the other. b. Use the New Mexico nursing home data (Examples 12.5.2 and 12.5.3). Let the prior hyperparameters be α0 = 2, β0 = 6300, μ0 = 200, and γ0 = 6.35 × 10⁻⁴. Implement a Gibbs sampler to find the posterior distribution of (μ, τ). In particular, calculate an interval containing 95 percent of the posterior distribution of μ.

Table 12.8 Parameters and conditional distributions for Exercise 12 (for each parameter, the prior times likelihood looks like the p.d.f. of the listed distribution):
τ: the gamma distribution with parameters α0 + n/2 and β0 + 0.5 Σ_{i=1}^n (x_i − x̄)² + 0.5n(x̄ − μ)²
μ: the normal distribution with mean (γ0μ0 + nτ x̄)/(γ0 + nτ) and precision γ0 + nτ

13. Consider again the situation described in Exercise 12. This time, we shall let the prior distribution of μ be more like it was in the conjugate prior. Introduce another parameter γ, whose prior distribution is the gamma distribution with parameters a0 and b0. Let the prior distribution of μ conditional on γ be the normal distribution with mean μ0 and precision γ. a. Prove that the marginal prior distribution of μ specifies that
(a0/b0)^(1/2)(μ − μ0) has the t distribution with 2a0 degrees of freedom. Hint: Look at the derivation of the marginal distribution of μ in Sec. 8.6. b. Suppose that we want the marginal prior distributions of both μ and τ to be the same as they were with the conjugate prior in Sec. 8.6. How must the prior hyperparameters be related in order to make this happen? c. Show that Table 12.9 specifies the appropriate conditional distribution for each parameter given the others.

Table 12.9 Parameters and conditional distributions for Exercise 13 (for each parameter, the prior times likelihood looks like the p.d.f. of the listed distribution):
τ: the gamma distribution with parameters α0 + n/2 and β0 + 0.5 Σ_{i=1}^n (x_i − x̄)² + 0.5n(x̄ − μ)²
μ: the normal distribution with mean (γμ0 + nτ x̄)/(γ + nτ) and precision γ + nτ
γ: the gamma distribution with parameters a0 + 1/2 and b0 + 0.5(μ − μ0)²

d. Use the New Mexico nursing home data (Examples 12.5.2 and 12.5.3). Let the prior hyperparameters be α0 = 2, β0 = 6300, μ0 = 200, a0 = 2, and b0 = 3150. Implement a Gibbs sampler to find the posterior distribution of (μ, τ, γ). In particular, calculate an interval containing 95 percent of the posterior distribution of μ. 14. Consider the situation described in Example 12.5.8. In addition to the 11 groundwater sources, there are 13 observations taken from surface water sources in Ohio. Of the 13 surface water measurements, only one was censored. The 12 uncensored surface water arsenic concentrations from Ohio are 1.93, 0.99, 2.21, 2.29, 1.15, 1.81, 2.26, 3.10, 1.18, 1.00, 2.67, 2.15. a. Fit the same model as described in Example 12.5.8, and predict a logarithm of surface water concentration for each iteration of the Markov chain.
- Compare a histogram of your predicted measurements to the histogram of the underground well predictions in Fig. 12.10. Describe the main differences.
- Estimate the median of the distribution of predicted surface water arsenic concentration and compare it to the median of the distribution of predicted groundwater concentration.
- Let X1, . . . , Xn+m be a random sample from the exponential distribution with parameter θ. Suppose that θ has the gamma prior distribution with known parameters α and β. Assume that we get to observe X1, . . . , Xn, but Xn+1, . . . , Xn+m are censored.
- First, suppose that the censoring works as follows: For i = 1, . . . , m, if Xn+i ≤ c, then we learn only that Xn+i ≤ c, but not the precise value of Xn+i . Set up a Gibbs sampling algorithm that will allow us to simulate the posterior distribution of θ in spite of the censoring.
- Next, suppose that the censoring works as follows: For i = 1, . . . , m, if Xn+i ≥ c, then we learn only that Xn+i ≥ c, but not the precise value of Xn+i . Set up a Gibbs sampling algorithm that will allow us to simulate the posterior distribution of θ in spite of the censoring.
- Suppose that the time to complete a task is the sum of two parts X and Y. Let (Xi, Yi) for i = 1, . . . , n be a random sample of the times to complete the two parts of the task. However, for some observations, we get to observe only Zi = Xi + Yi. To be precise, suppose that we observe (Xi, Yi) for i = 1, . . . , k and we observe Zi for i = k + 1, . . . , n. Suppose that all Xi and Yj are independent with every Xi having the exponential distribution with parameter λ and every Yj having the exponential distribution with parameter μ.
- Prove that the conditional distribution of Xi given Zi = z has the c.d.f. G(x|z) = [1 − exp(−x[λ − μ])]/[1 − exp(−z[λ − μ])], for 0 < x < z. (A sketch of how to sample from G by inversion appears after this list of exercises.)
- Suppose that the prior distribution of (λ, μ) is as follows: The two parameters are independent with λ having the gamma distribution with parameters a and b, and μ having the gamma distribution with parameters c and d. The four numbers a, b, c, and d are all known constants. Set up a Gibbs sampling algorithm that allows us to simulate the posterior distribution of (λ, μ).
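For the last exercise, the conditional c.d.f. G(x|z) in part (a) can be sampled by inversion, which is the step needed inside the Gibbs sampler of part (b). The following Python sketch assumes λ ≠ μ; when λ = μ, the conditional distribution reduces to the uniform distribution on (0, z). The function name and the numerical values in the call are ours.

import numpy as np

def draw_x_given_z(z, lam, mu, rng):
    """Draw X from the conditional c.d.f. G(x | z) by inversion.

    Assumes lam != mu; when lam == mu, G is the uniform c.d.f. on (0, z).
    """
    d = lam - mu
    u = rng.uniform()
    if abs(d) < 1e-12:
        return u * z
    # Solve u = [1 - exp(-x*d)] / [1 - exp(-z*d)] for x.
    return -np.log1p(u * np.expm1(-d * z)) / d

rng = np.random.default_rng(0)
x = draw_x_given_z(z=3.0, lam=1.2, mu=0.7, rng=rng)   # hypothetical parameter values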
12.6 The Bootstrap The parametric and nonparametric bootstraps are methods for replacing an unknown distribution F with a known distribution in a probability calculation. If we have a sample of data from the distribution F, we first approximate F by ˆ F and then perform the desired calculation. If ˆ F is a good approximation to F, the bootstrap can be successful. If the desired calculation is sufficiently difficult, we typically resort to simulation. Introduction Assume that we have a sample X = (X1, . . . , Xn) of data from some unknown distribution F. Suppose that we are interested in some quantity that involves both F and X, for example, the bias of a statistic g(X) as an estimator of the median of F. The main idea behind bootstrap analysis in the simplest cases is the following: First, replace the unknown distribution F with a known distribution ˆ F. Next, let X∗ be a sample from the distribution ˆ F. Finally, compute the quantity of interest based on ˆ F and X∗, for example, the bias of g(X∗ ) as an estimator of the median of ˆ F. Consider the following overly simple example. Example 12.6.1 The Variance of the Sample Mean. Let X = (X1, . . . , Xn) be a random sample from a distribution with a continuous c.d.f. F. For the moment, we shall assume nothing more aboutF than that it has finite meanμand finite variance σ2. Suppose that we are interested in the variance of the sample mean X.We already know that this variance equals σ2/n, but we do not know σ2. In order to estimate σ2/n, the bootstrap replaces 840 Chapter 12 Simulation the unknown distribution F with a known distribution ˆ F, which also has finite mean ˆμ and finite variance ˆσ 2. If X∗ = (X ∗ 1, . . . , X ∗ n) is a random sample from ˆ F, then the variance of the sample mean X ∗ = 1 n n i=1 X ∗ i is ˆσ 2/n. Since the distribution ˆ F is known, we should be able to compute ˆσ 2/n, and we can then use this value to estimate σ2/n. One popular choice of known distribution ˆ F is the sample c.d.f. Fn defined in Sec. 10.6. This sample c.d.f. is the discrete c.d.f. that has jumps of size 1/n at each of the observed values x1, . . . , xn of the random sample X1, . . . , Xn. So, if X∗ = (X ∗ 1, . . . , X ∗ n) is a random sample from ˆ F, then each X ∗ i is a discrete random variable with the p.f. f (x) = 1 n if x ∈ {x1, . . . , xn }, 0 otherwise. It is relatively simple to compute the variance ˆσ 2 of a random variable X ∗ i whose p.f. is f . The variance is ˆσ 2 = 1 n n i=1 (xi − x)2, where x is the average of the observed values x1, . . . , xn.Thus, our bootstrap estimate of the variance of X is ˆσ 2/n. The key step in a bootstrap analysis is the choice of the known distribution ˆ F. The particular choice made in Example 12.6.1, namely, the sample c.d.f., leads to what is commonly called the nonparametric bootstrap. The reason for this name is that we do not assume that the distribution belongs to a parametric family when choosing ˆ F = Fn. If we are willing to assume that F belongs to a parametric family, then we can choose ˆ F to be a member of that family and perform a parametric bootstrap analysis as illustrated next. Example 12.6.2 The Variance of the Sample Mean. Let X = (X1, . . . , Xn) be a random sample from the normal distribution with mean μ and variance σ2. Suppose, as in Example 12.6.1, that we are interested in estimating σ2/n, the variance of the sample mean X. To apply the parametric bootstrap, we replace F by ˆ F, a member of the family of normal distributions. 
For this example, we shall choose ˆ F to be the normal distribution with mean and variance equal to the M.L.E.’s x and ˆσ 2, respectively, although other choices could be made.We then estimate σ2/n by the variance of the sample meanX ∗ of a random sample from the distribution ˆ F. The variance of X ∗ is easily computed as ˆσ 2/n. In this case, the parametric bootstrap yields precisely the same answer as the nonparametric bootstrap. In Examples 12.6.1 and 12.6.2, it was very simple to compute the variance of the sample mean of a random sample from the distribution ˆ F. In typical applications of the bootstrap, it is not so simple to compute the quantity of interest. For example, there is no simple formula for the variance of the sample median of a sample X∗ from ˆ F in Examples 12.6.1 and 12.6.2. In such cases, one resorts to simulation techniques in order to approximate the desired calculation. Before presenting examples of the use of simulation in the bootstrap, we shall first describe the general class of situations in which bootstrap analysis is used. 12.6 The Bootstrap 841 Table 12.10 Correspondence between statistical model and bootstrap analysis Statistical model Bootstrap Distribution Unknown F Known ˆ F Data i.i.d. sample X from F i.i.d. sample X∗ from ˆ F Function of interest η(X, F) η(X∗ , ˆ F) Parameter/estimate Mean, median, variance, etc. of η(X, F) Mean, median, variance, etc. of η(X∗ , ˆ F) The Bootstrap in General Let η(X, F) be a quantity of interest that possibly depends on both a distribution F and a sample X drawn from F. For example, if the distribution F has the p.d.f. f, we might be interested in η(X, F) = 1 n n i=1 Xi − xf (x) dx 2 . (12.6.1) In Examples 12.6.1 and 12.6.2, we wanted the variance of the sample average, which equals the mean of the quantity in Eq. (12.6.1). In general, we might wish to estimate the mean or a quantile or some other probabilistic feature of η(X, F). The bootstrap estimates the mean or a quantile or some other feature of η(X, F) by the mean or quantile or the other feature of η(X∗ , ˆ F), where X∗ is a random sample drawn from the distribution ˆ F, and ˆ F is some distribution that we hope is close to F. Table 12.10 shows the correspondence between the original statistical model for the data and the quantities that are involved in a bootstrap analysis. The function η of interest must be something that exists for all distributions under consideration and all samples from those distributions. Other quantities that might be of interest include the quantiles of the distribution of a statistic, the M.A.E. or M.S.E. of an estimator, the bias of an estimator, probabilities that statistics lie in various intervals, and the like. In the simple examples considered so far, the distribution of η(X∗ , ˆ F) was both known and easy to compute. It will often be the case that the distribution of η(X∗ , ˆ F) is too complicated to allow analytic computation of its features. In such cases, one approximates the bootstrap estimate using simulation. First, draw a large number (say, v) of random samplesX∗(1), . . . , X∗(v) from the distribution ˆ F and then compute T (i) = η(X∗(i), ˆ F) for i = 1, . . . , v. Finally, compute the desired feature of the sample c.d.f. of the values T (1), . . . , T (v). Example 12.6.3 The M.S.E. of the Sample Median. Suppose that we model our data X = (X1, . . . , Xn) as coming from some continuous distribution with the c.d.f. F having median θ. Suppose also that we are interested in using the sample median M as an estimator of θ. 
We would like to estimate the M.S.E. of M as an estimator of θ. That is, let η(X, F) = (M − θ)², and try to estimate the mean of η(X, F). Let F̂ be a known distribution that we hope is similar to F, and let X∗ be a random sample of size n from F̂. Regardless of what distribution F̂ we choose, it is very difficult to compute the bootstrap estimate, the mean of η(X∗, F̂). Instead, we would simulate a large number v of samples X∗(1), . . . , X∗(v) with the distribution F̂ and then compute the sample median of each sample, M(1), . . . , M(v). Then we compute T(i) = (M(i) − θ̂)² for i = 1, . . . , v, where θ̂ is the median of the distribution F̂. Our simulation approximation to the bootstrap estimate is then the average of the values T(1), . . . , T(v).

[Figure 12.11: Histogram of the sample medians of 10,000 bootstrap samples in Example 12.6.3. Horizontal axis: sample median; vertical axis: count.]

As an example, suppose that our sample consists of the n = 25 values y1, . . . , y25 listed in Table 10.33 on page 662. For a nonparametric bootstrap analysis, we would use F̂ = Fn, which is also listed in Table 10.33. Notice that the median of the distribution F̂ is the sample median of the original sample, θ̂ = 0.40. Next, we simulate v = 10,000 random samples of size 25 from the distribution F̂. This is done by selecting 25 numbers with replacement from the yi values and repeating for a total of 10,000 samples of size 25. (Solve Exercise 2 to show why this provides the desired samples X∗(1), . . . , X∗(v).) For example, here is one of the 10,000 bootstrap samples:

1.64 0.88 0.70 −1.23 −0.15 1.40 −0.07 −2.46 −2.46 −0.10 −0.15 1.62 0.27 0.44 −0.42 −2.46 1.40 −0.10 0.88 0.44 −1.23 1.07 0.81 −0.02 1.62

If we sort the numbers in this sample, we find that the sample median is 0.27. In fact, there were 1485 bootstrap samples out of 10,000 that had sample median equal to 0.27. Figure 12.11 contains a histogram of all 10,000 sample medians from the bootstrap samples. The four largest and four smallest observations in the original sample never appeared as sample medians in the 10,000 bootstrap samples. For each of the 10,000 bootstrap samples i, we compute the sample median M(i) and its squared error T(i) = (M(i) − θ̂)², where θ̂ = 0.40 is the median of the distribution F̂. We then average all of these values over the 10,000 samples and obtain the value 0.0887. This is our simulation approximation to the nonparametric bootstrap estimate of the M.S.E. of the sample median. The sample variance of the simulated T(i) values is σ̂² = 0.0135, and the simulation standard error of the bootstrap estimate is σ̂/√10,000 = 1.163 × 10⁻³.

Note: Simulation Approximation of Bootstrap Estimates. The bootstrap is an estimation technique. As such, it produces estimates of parameters of interest. When a bootstrap estimate is too difficult to compute, we resort to simulation. Simulation provides an estimator of the bootstrap estimate. In this text, we shall refer to the simulation estimator of a bootstrap estimate as an approximation. We do this merely to avoid having to refer to estimators of estimates.

The bootstrap was introduced by Efron (1979), and there have been many applications since then. Readers interested in more detail about the bootstrap should see Efron and Tibshirani (1993) or Davison and Hinkley (1997). Young (1994) gives a review of much of the literature on the bootstrap and contains many useful references.
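To make the procedure in Example 12.6.3 concrete, here is a minimal Python sketch of the simulation approximation to the nonparametric bootstrap M.S.E. of the sample median. The data array is only a placeholder, since the 25 values of Table 10.33 are not reproduced here, so the output will not match the 0.0887 reported above; substituting the actual observations should roughly reproduce it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the n = 25 observations of Table 10.33 (not reproduced here);
# substitute the actual data to approximate the result of Example 12.6.3.
y = rng.standard_normal(25)

n = len(y)
v = 10_000                       # number of bootstrap samples
theta_hat = np.median(y)         # median of F_hat = F_n is the sample median

# Draw v bootstrap samples of size n with replacement, compute each sample
# median M(i), and form T(i) = (M(i) - theta_hat)^2.
boot = rng.choice(y, size=(v, n), replace=True)
medians = np.median(boot, axis=1)
T = (medians - theta_hat) ** 2

mse_estimate = T.mean()                    # simulation approximation to the bootstrap M.S.E.
sim_std_err = T.std(ddof=1) / np.sqrt(v)   # simulation standard error of that approximation

print(mse_estimate, sim_std_err)
```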
In the remainder of this section, we shall present several examples of both the parametric and nonparametric bootstraps and illustrate how simulation is used to approximate the desired bootstrap estimates.

The Nonparametric Bootstrap

Example 12.6.4 Confidence Interval for the Interquartile Range. The interquartile range (IQR) of a distribution was introduced in Definition 4.3.2. It is defined to be the difference between the upper and lower quartiles, the 0.75 and 0.25 quantiles. The central 50 percent of the distribution lies between the lower and upper quartiles, so the IQR is the length of the interval that contains the middle half of the distribution. For example, if F is the normal distribution with variance σ², then the IQR is 1.35σ. Suppose that we desire a 90 percent confidence interval for the IQR θ of the unknown distribution F from which we have a random sample X1, . . . , Xn. There are many ways to form confidence intervals, so we shall restrict attention to those that are based on the relationship between θ and the sample IQR θ̂. Since the IQR is a scale feature, it might be reasonable to base our confidence interval on the distribution of θ̂/θ. That is, let the 0.05 and 0.95 quantiles of the distribution of θ̂/θ be a and b, so that

$$\Pr\left(a \le \frac{\hat{\theta}}{\theta} \le b\right) = 0.9.$$

Because a ≤ θ̂/θ ≤ b is equivalent to θ̂/b ≤ θ ≤ θ̂/a, we conclude that (θ̂/b, θ̂/a) is a 90 percent confidence interval for θ.

The nonparametric bootstrap can be used to estimate the quantiles a and b as follows: Let η(X, F) = θ̂/θ be the ratio of the sample IQR of the sample X to the IQR of the distribution F. Let F̂ = Fn, and notice that the IQR of F̂ is θ̂, the sample IQR. Next, let X∗ be a sample of size n from F̂. Let θ̂∗ be the sample IQR calculated from X∗, so that η(X∗, F̂) = θ̂∗/θ̂. The 0.05 and 0.95 quantiles of the distribution of η(X, F) are estimated by the 0.05 and 0.95 quantiles of the distribution of η(X∗, F̂). These last quantiles, in turn, are typically approximated by simulation. We simulate a large number, say, v, of bootstrap samples X∗(i) for i = 1, . . . , v. For each bootstrap sample i, we compute the sample IQR θ̂∗(i) and divide it by θ̂. Call the ratio T(i). The q quantile of θ̂∗/θ̂ is approximated by the sample q quantile of the sample T(1), . . . , T(v). The confidence interval constructed by this method is called a percentile bootstrap confidence interval.

We can illustrate this with the data in Table 10.33 on page 662. The IQR of the distribution Fn is 1.46, the difference between the 19th and 6th observations. We simulate 10,000 random samples of size 25 from the distribution Fn. For the ith sample, we compute the sample IQR θ̂∗(i) and divide it by 1.46 to obtain T(i). The 500th and 9500th ordered values from T(1), . . . , T(10,000) are 0.5822 and 1.6301. We then compute the percentile bootstrap confidence interval (1.46/1.6301, 1.46/0.5822) = (0.8956, 2.5077).

Example 12.6.5 Confidence Interval for a Location Parameter. Let X1, . . . , Xn be a random sample from the distribution F. Suppose that we want a confidence interval for the median θ of F. We can base a confidence interval on the sample median M. For example, our interval could be of the form [M − c1, M + c2]. Since M − c1 ≤ θ ≤ M + c2 is equivalent to −c2 ≤ M − θ ≤ c1, we might want −c2 and c1 to be quantiles of the distribution of M − θ. Without making assumptions about the distribution F, it might be very difficult to approximate quantiles of the distribution of M − θ.
To compute a percentile bootstrap confidence interval, let η(X, F) = M − θ and then approximate quantiles (such as α0/2 and 1 − α0/2) of the distribution of η(X, F) by the corresponding quantiles of η(X∗, F̂). Here, F̂ is the sample c.d.f., Fn, whose median is M, and X∗ is a random sample from F̂. We then choose a large number v and simulate many samples X∗(i) for i = 1, . . . , v. For each sample, we compute the sample median M∗(i) and then find the sample quantiles of the values M∗(i) − M for i = 1, . . . , v.

How well the percentile bootstrap interval performs in Example 12.6.5 depends on how closely the distribution of M∗ − M approximates the distribution of M − θ. (Here, M∗ is the median of a sample X∗ of size n from F̂.) The situation of Example 12.6.5 is one in which there is a possible improvement to the approximation. One thing that can make the distribution of M∗ − M different from the distribution of M − θ is that one of these distributions is more or less spread out than the other. We can use a different bootstrap approximation that suffers less from differences in spread. Instead of constructing an interval of the form [M − c1, M + c2], we could let our interval be [M − d1Y, M + d2Y], where Y is a statistic that measures the spread of the data. One possibility for Y is the sample IQR. Another possible spread measure is the sample median absolute deviation (the sample median of the values |X1 − M|, . . . , |Xn − M|). Now, we see that M − d1Y ≤ θ ≤ M + d2Y is equivalent to

$$-d_2 \le \frac{M - \theta}{Y} \le d_1.$$

So, we want −d2 and d1 to be quantiles of the distribution of (M − θ)/Y. This type of interval resembles the t confidence interval developed in Sec. 8.5. Indeed, the interval we are constructing is called a percentile-t bootstrap confidence interval. To construct the percentile-t bootstrap confidence interval, we would use each bootstrap sample X∗ as follows: Compute the sample median M∗ and the scale statistic Y∗ from the bootstrap sample X∗. Then calculate T = (M∗ − M)/Y∗. Repeat this procedure many times, producing T(1), . . . , T(v) from a large number v of bootstrap samples. Then let −d2 and d1 be sample quantiles (such as α0/2 and 1 − α0/2) of the T(i) values.

Example 12.6.6 Percentile-t Confidence Interval for a Median. Consider the n = 10 lactic acid concentrations in cheese from Example 8.5.4. We shall do v = 10,000 bootstrap simulations to find a coefficient 1 − α0 = 0.90 confidence interval for the median lactic acid concentration θ. The median of the sample values is M = 1.41, and the median absolute deviation is Y = 0.245. The 0.05 and 0.95 sample quantiles of the (M∗(i) − M)/Y∗(i) values are −2.133 and 1.581. This makes the percentile-t bootstrap confidence interval (1.41 − 1.581 × 0.245, 1.41 + 2.133 × 0.245) = (1.023, 1.933). For comparison, the 0.05 and 0.95 sample quantiles of the values of M∗(i) − M are −0.32 and 0.16, respectively. This makes the percentile bootstrap interval equal to (1.41 − 0.16, 1.41 + 0.32) = (1.25, 1.73).

The percentile-t interval in Example 12.6.6 is considerably wider than the percentile interval. This reflects the fact that the Y∗ values from the bootstrap samples are quite spread out. This in turn suggests that the spread that we should expect to see in a sample has substantial variability. Hence, it is probably not a good idea to assume that the spread of the distribution of M∗ − M is the same as the spread of the distribution of M − θ.
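The percentile and percentile-t intervals of Examples 12.6.5 and 12.6.6 can be computed with a few lines of code. The following Python sketch uses the 10 lactic acid concentrations that we believe appear in Example 8.5.4 (they reproduce M = 1.41 and Y = 0.245 quoted above), but the values should be checked against that example; with a different seed the simulated quantiles will differ slightly from −2.133 and 1.581.

```python
import numpy as np

rng = np.random.default_rng(1)

# Lactic acid concentrations believed to be from Example 8.5.4; verify against the text.
x = np.array([0.86, 1.53, 1.57, 1.81, 0.99, 1.09, 1.29, 1.78, 1.29, 1.58])

def mad(sample):
    """Sample median absolute deviation: median of |x_i - median(x)|."""
    return np.median(np.abs(sample - np.median(sample)))

n, v, alpha0 = len(x), 10_000, 0.10
M, Y = np.median(x), mad(x)

boot = rng.choice(x, size=(v, n), replace=True)
M_star = np.median(boot, axis=1)
Y_star = np.array([mad(b) for b in boot])   # could be 0 in rare resamples; ignored in this sketch

# Percentile interval: quantiles of M* - M approximate quantiles of M - theta.
lo, hi = np.quantile(M_star - M, [alpha0 / 2, 1 - alpha0 / 2])
percentile_interval = (M - hi, M - lo)

# Percentile-t interval: quantiles of (M* - M)/Y* approximate quantiles of (M - theta)/Y.
t_lo, t_hi = np.quantile((M_star - M) / Y_star, [alpha0 / 2, 1 - alpha0 / 2])
percentile_t_interval = (M - t_hi * Y, M - t_lo * Y)

print(percentile_interval, percentile_t_interval)
```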
The percentile-t bootstrap interval is generally preferred to the percentile bootstrap interval when both are available. This is due to the fact that the distribution of (M∗ − M)/Y∗ depends less on F̂ than does the distribution of M∗ − M. In particular, (M∗ − M)/Y∗ does not depend on any scale parameter of the distribution F̂. For this reason, we expect more similarity between the distributions of (M∗ − M)/Y∗ and (M − θ)/Y than we expect between the distributions of M∗ − M and M − θ.

Example 12.6.7 Features of the Distribution of a Sample Correlation. Let (X, Y) have a bivariate joint distribution F with finite variances for both coordinates, so that it makes sense to talk about correlation. Suppose that we observe a random sample (X1, Y1), . . . , (Xn, Yn) from the distribution F. Suppose further that we are interested in the distribution of the sample correlation:

$$R = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\left\{\left[\sum_{i=1}^{n}(X_i - \bar{X})^{2}\right]\left[\sum_{i=1}^{n}(Y_i - \bar{Y})^{2}\right]\right\}^{1/2}}. \tag{12.6.2}$$

We might be interested in the variance of R, or the bias of R, or some other feature of R as an estimator of the correlation ρ between X and Y. Whatever our goal is, we can make use of the nonparametric bootstrap. For example, consider the bias of R as an estimator of ρ. This bias equals the mean of η(X, F) = R − ρ. We begin by replacing the joint distribution F by the sample distribution Fn of the observed pairs. This Fn is a discrete joint distribution on pairs of real numbers, and it assigns probability 1/n to each of the n observed sample pairs. If (X∗, Y∗) has the distribution Fn, it is easy to check (see Exercise 8) that the correlation between X∗ and Y∗ is R. We then choose a large number v and simulate v samples of size n from Fn. For each i, we compute the sample correlation R(i) by plugging the ith bootstrap sample into Eq. (12.6.2). For each i, we compute T(i) = R(i) − R, and we estimate the mean of R − ρ by the average of T(1), . . . , T(v).

As a numerical example, consider the flea beetle data from Example 5.10.2. The sample correlation is R = 0.6401. We sample v = 10,000 bootstrap samples of size n = 31. The average sample correlation in the 10,000 bootstrap samples is 0.6354 with a simulation standard error of 0.001. We then estimate the bias of the sample correlation to be 0.6354 − 0.6401 = −0.0047.

The Parametric Bootstrap

Example 12.6.8 Correcting the Bias in the Coefficient of Variation. The coefficient of variation of a distribution is the ratio of the standard deviation to the mean. (Typically, people only compute the coefficient of variation for distributions of positive random variables.) If we believe that our data X1, . . . , Xn come from a lognormal distribution with parameters μ and σ², then the coefficient of variation is θ = (exp(σ²) − 1)^(1/2). The M.L.E. of the coefficient of variation is θ̂ = (exp(σ̂²) − 1)^(1/2), where σ̂ is the M.L.E. of σ. We expect the M.L.E. of the coefficient of variation to be a biased estimator because it is so nonlinear. Computing the bias is a difficult task. However, we can use the parametric bootstrap to estimate the bias. The M.L.E. σ̂ of σ is the square root of the sample variance of log(X1), . . . , log(Xn). The M.L.E. μ̂ of μ is the sample average of log(X1), . . . , log(Xn). We can simulate a large number of random samples of size n from the lognormal distribution with parameters μ̂ and σ̂².
For each i, we compute σ̂∗(i), the sample standard deviation of the logarithms of the ith bootstrap sample. We estimate the bias of θ̂ by the sample average of the values T(i) = (exp([σ̂∗(i)]²) − 1)^(1/2) − θ̂.

As an example, consider the failure times of ball bearings introduced in Example 5.6.9. If we model these data as lognormal, the M.L.E.'s of the parameters are μ̂ = 4.150 and σ̂ = 0.5217. The M.L.E. of θ is θ̂ = 0.5593. We could draw 10,000 random samples of size 23 from a lognormal distribution and compute the sample variances of the logarithms. However, there is an easier way to do this simulation. The distribution of [σ̂∗(i)]² is that of a χ² random variable with 22 degrees of freedom times 0.5217²/23. Hence, we shall just sample 10,000 χ² random variables with 22 degrees of freedom, multiply each one by 0.5217²/23, and call the ith one [σ̂∗(i)]². After doing this, the sample average of the 10,000 T(i) values is −0.01825, which is our parametric bootstrap estimate of the bias of θ̂. (The simulation standard error is 9.47 × 10⁻⁴.) Because our estimate of the bias is negative, this means that we expect θ̂ to be smaller than θ. To "correct" the bias, we could add 0.01825 to our original estimate θ̂ and produce the new estimate 0.5593 + 0.01825 = 0.5776.

Example 12.6.9 Estimating the Standard Deviation of a Statistic. Suppose that X1, . . . , Xn is a random sample from the normal distribution with mean μ and variance σ². We are interested in the probability that a random variable having this same distribution is at most c. That is, we are interested in estimating θ = Φ([c − μ]/σ). The M.L.E. of θ is θ̂ = Φ([c − X̄]/σ̂). It is not easy to calculate the standard deviation of θ̂ in closed form. However, we can draw many, say, v, bootstrap samples of size n from the normal distribution with mean x̄ and variance σ̂². For the ith bootstrap sample, we compute a sample average x̄∗(i), a sample standard deviation σ̂∗(i), and, finally, θ̂∗(i) = Φ([c − x̄∗(i)]/σ̂∗(i)). We estimate the mean of θ̂ by θ̄∗, the average of the v values θ̂∗(i). (This can also be used, as in Example 12.6.8, to estimate the bias of θ̂.) The standard deviation of θ̂ can then be estimated by the sample standard deviation of the θ̂∗(i) values,

$$Z = \left[\frac{1}{v}\sum_{i=1}^{v}\left(\hat{\theta}^{\,*(i)} - \bar{\theta}^{\,*}\right)^{2}\right]^{1/2}.$$

For example, we can use the nursing home data from Sec. 8.6. There are n = 18 observations, and we might be interested in Φ([200 − μ]/σ). The M.L.E.'s of μ and σ are μ̂ = 182.17 and σ̂ = 72.22. The observed value of θ̂ is Φ([200 − 182.17]/72.22) = 0.5975. We simulate 10,000 samples of size 18 from the normal distribution with mean 182.17 and variance (72.22)². For the ith sample, we find the value θ̂∗(i) for i = 1, . . . , 10,000, and the average of these is θ̄∗ = 0.6020 with sample standard deviation Z = 0.09768. We can compute the simulation standard error of the approximation to the bootstrap estimate in two steps. First, apply the method of Example 12.2.10. This gives the simulation standard error of Z², the sample variance of the θ̂∗(i)'s. In our example, this yields the value 1.365 × 10⁻⁴. Second, use the delta method, as in Example 12.2.8, to find the simulation standard error of the square root of Z². In our example, this second step yields the value 6.986 × 10⁻⁴.

Example 12.6.10 Comparing Means When Variances Are Unequal. Suppose that we have two samples X1, . . . , Xm and Y1, . . . , Yn from two possibly different normal distributions. That is, X1, . . . , Xm are i.i.d.
from the normal distribution with mean μ1 and variance σ²₁, while Y1, . . . , Yn are i.i.d. from the normal distribution with mean μ2 and variance σ²₂. In Sec. 9.6, we saw how to test the null hypothesis H0: μ1 = μ2 versus the alternative hypothesis H1: μ1 ≠ μ2 if we are willing to assume that we know the ratio k = σ²₂/σ²₁. If we are not willing to assume that we know the ratio k, we have seen only approximate tests. Suppose that we choose to use the usual two-sample t test even though we do not claim to know k. That is, suppose that we choose to reject H0 when |U| > c, where U is the statistic defined in Eq. (9.6.3) and c is the 1 − α0/2 quantile of the t distribution with m + n − 2 degrees of freedom. This test will not necessarily have level α0 if k ≠ 1. We can use the parametric bootstrap to try to compute the level of this test. In fact, we can use the parametric bootstrap to help us choose a different critical value c∗ for the test so that we at least estimate the type I error probability to be α0.

As an example, we shall use the data from Example 9.6.5 again. The M.L.E.'s of the variances of the two distributions were σ̂²₁ = 0.04 (for the X data) and σ̂²₂ = 0.022 (for the Y data). The probability of type I error is the probability of rejecting the null hypothesis given that the null hypothesis is true, that is, given that μ1 = μ2. Hence, we must simulate bootstrap samples in which the X data and Y data have the same mean. Since the sample averages of the X and Y data are subtracted from each other in the calculation of U, it will not matter what common mean we choose for the two samples. So, the parametric bootstrap can proceed as follows: First, choose a large number v, and for i = 1, . . . , v, simulate (X̄∗(i), Ȳ∗(i), S_X^{2∗(i)}, S_Y^{2∗(i)}), where all four random variables are independent with the following distributions:

- X̄∗(i) has the normal distribution with mean 0 and variance σ̂²₁/m.
- Ȳ∗(i) has the normal distribution with mean 0 and variance σ̂²₂/n.
- S_X^{2∗(i)} is σ̂²₁ times a random variable having the χ² distribution with m − 1 degrees of freedom.
- S_Y^{2∗(i)} is σ̂²₂ times a random variable having the χ² distribution with n − 1 degrees of freedom.

Then compute

$$U^{(i)} = \frac{(m + n - 2)^{1/2}\left(\bar{X}^{*(i)} - \bar{Y}^{*(i)}\right)}{\left(\dfrac{1}{m} + \dfrac{1}{n}\right)^{1/2}\left(S_X^{2*(i)} + S_Y^{2*(i)}\right)^{1/2}}$$

for each i. Our simulation approximation to the bootstrap estimate of the probability of type I error for the usual two-sample t test would be the proportion of simulations in which |U(i)| > c.

With v = 10,000, we shall perform the analysis described above for several different c values. We set c equal to the 1 − α0/2 quantile of the t distribution with 16 degrees of freedom with α0 = j/1000 for each j = 1, . . . , 999. Figure 12.12 shows a plot of the simulation approximation to the bootstrap estimate of the type I error probability against the nominal level α0 of the test. There is remarkably close agreement between the two, although the bootstrap estimate is generally slightly larger. For example, when α0 = 0.05, the bootstrap estimate is 0.065.

[Figure 12.12: Plot of the bootstrap estimated type I error probability of the t test versus the nominal type I error probability in Example 12.6.10. The dashed line is the diagonal along which the two error probabilities would be equal.]
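Here is a minimal Python sketch, not taken from the text, of the parametric bootstrap just described, run at the single nominal level α0 = 0.05. The sample sizes m = 8 and n = 10 are an assumption consistent with the 16 degrees of freedom mentioned above and should be checked against Example 9.6.5. The last line also records the sample 1 − α0 quantile of the |U(i)| values, which is used next to calibrate the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Assumed sample sizes (m + n - 2 = 16) and the M.L.E.'s of the variances
# from Example 9.6.5: sigma1_sq for the X data, sigma2_sq for the Y data.
m, n = 8, 10
sigma1_sq, sigma2_sq = 0.04, 0.022

v = 10_000
alpha0 = 0.05
c = stats.t.ppf(1 - alpha0 / 2, df=m + n - 2)   # usual two-sample t critical value

# Simulate the four independent building blocks of U for each bootstrap sample.
xbar = rng.normal(0.0, np.sqrt(sigma1_sq / m), size=v)
ybar = rng.normal(0.0, np.sqrt(sigma2_sq / n), size=v)
sx2 = sigma1_sq * rng.chisquare(m - 1, size=v)
sy2 = sigma2_sq * rng.chisquare(n - 1, size=v)

U = np.sqrt(m + n - 2) * (xbar - ybar) / (np.sqrt(1 / m + 1 / n) * np.sqrt(sx2 + sy2))

type_I_error_est = np.mean(np.abs(U) > c)          # bootstrap estimate of the actual level
Z = np.quantile(np.abs(U), 1 - alpha0)             # calibrated critical value, used below
print(type_I_error_est, Z)
```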
Next, we use the bootstrap analysis to correct the level of the two-sample t test in this example. To do this, let Z be the sample 1 − α0 quantile of our simulated |U(i)| values. If we want a level α0 test, we can replace the critical value c in the two-sample t test with Z and reject the null hypothesis if |U| > Z. For example, with α0 = 0.05, the 0.975 quantile of the t distribution is 2.12, while in our simulation Z = 2.277. The simulation standard error of Z (based on splitting the 10,000 bootstrap samples into 10 subsamples of 1000 each) is 0.0089.

Example 12.6.11 The Bias of the Sample Correlation. In Example 12.6.7, we made no assumptions about the distribution F of (X, Y) except that X and Y have finite variances. Now suppose that we also assume that (X, Y) has a bivariate normal distribution. We can compute the M.L.E.'s of all of the parameters as in Exercise 24 in Sec. 7.6. We could then simulate v samples of size n from the bivariate normal distribution with parameters equal to the M.L.E.'s, as in Example 12.3.6. For each sample i, for i = 1, . . . , v, we could compute the sample correlation R(i) by substituting the ith sample into Eq. (12.6.2). Our estimate of the bias would be the average of the R(i) values minus ρ̂. Note that ρ̂, the M.L.E. of ρ, is the same as R.

As a numerical example, consider the flea beetle data from Example 5.10.2. The sample correlation is R = 0.6401. We construct v = 10,000 samples of size n = 31 from a bivariate normal distribution with correlation 0.6401. The means and variances do not affect the distribution of R. (See Exercise 12.) The average sample correlation in the 10,000 bootstrap samples is 0.6352 with a simulation standard error of 0.001. We then estimate the bias of the sample correlation to be 0.6352 − 0.6401 = −0.0049. This is pretty much the same as we obtained using the nonparametric bootstrap in Example 12.6.7.

Summary

The bootstrap is a method for estimating probabilistic features of a function η of our data X and their unknown distribution F. That is, suppose that we are interested in the mean, a quantile, or some other feature of η(X, F). The first step in the bootstrap is to replace F by a known distribution F̂ that is like F in some way. Next, replace X by data X∗ sampled from F̂. Finally, compute the mean, quantile, or other feature of η(X∗, F̂) as the bootstrap estimate. This last step generally requires simulation except in the simplest examples. There are two varieties of bootstrap that differ by how F̂ is chosen. In the nonparametric bootstrap, the sample c.d.f. is used as F̂. In the parametric bootstrap, F is assumed to be a member of some parametric family, and F̂ is chosen by replacing the unknown parameter by its M.L.E. or some other estimate.
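Before turning to the exercises, here is a minimal Python sketch of the parametric bootstrap computation in Example 12.6.11. By Exercise 12, only the correlation affects the distribution of R, so the simulation uses means 0 and variances 1; the seed is arbitrary, so the bias approximation will differ slightly from −0.0049.

```python
import numpy as np

rng = np.random.default_rng(3)

# Parametric bootstrap for the bias of the sample correlation (Example 12.6.11).
n, v = 31, 10_000
R = 0.6401                                   # sample correlation of the original data

cov = np.array([[1.0, R], [R, 1.0]])         # standardized margins; only R matters
samples = rng.multivariate_normal(np.zeros(2), cov, size=(v, n))   # shape (v, n, 2)

# Sample correlation of each simulated bivariate normal sample.
R_star = np.array([np.corrcoef(s[:, 0], s[:, 1])[0, 1] for s in samples])

bias_estimate = R_star.mean() - R
sim_std_err = R_star.std(ddof=1) / np.sqrt(v)
print(bias_estimate, sim_std_err)
```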
Exercises
1. Suppose that X1, . . . , Xn form a random sample from an exponential distribution with parameter θ. Explain how to use the parametric bootstrap to estimate the variance of the sample average X̄. (No simulation is required.)
2. Let x1, . . . , xn be the observed values of a random sample X = (X1, . . . , Xn). Let Fn be the sample c.d.f. Let J1, . . . , Jn be a random sample with replacement from the numbers {1, . . . , n}. Define X∗ᵢ = xJᵢ for i = 1, . . . , n. Show that X∗ = (X∗₁, . . . , X∗ₙ) is an i.i.d. sample from the distribution Fn.
3. Let n be odd, and let X = (X1, . . . , Xn) be a sample of size n from some distribution. Suppose that we wish to use the nonparametric bootstrap to estimate some feature of the sample median. Compute the probability that the sample median of a nonparametric bootstrap sample will be the smallest observation from the original data X.
4. Use the data in the first column of Table 11.5 on page 699. These data give the boiling points of water at 17 different locations from Forbes' experiment. Let F be the distribution from which these boiling points were drawn. We might not be willing to make many assumptions about F. Suppose that we are interested in the bias of the sample median as an estimator of the median of the distribution F. Use the nonparametric bootstrap to estimate this bias. First, do a pilot run to compute the simulation standard error of the simulation approximation, and then see how many bootstrap samples you need in order for your bias estimate (for the distribution F̂) to be within 0.02 of the true bias (for the distribution F̂) with probability at least 0.9.
5. Use the data in Table 10.6 on page 640. We are interested in the bias of the sample median as an estimator of the median of the distribution.
   (a) Use the nonparametric bootstrap to estimate this bias.
   (b) How many bootstrap samples does it appear that you need in order to estimate the bias to within .05 with probability 0.99?
6. Use the data in Exercise 16 of Sec. 10.7.
   (a) Use the nonparametric bootstrap to estimate the variance of the sample median.
   (b) How many bootstrap samples does it appear that you need in order to estimate the variance to within .005 with probability 0.95?
7. Use the blood pressure data in Table 9.2 that was described in Exercise 10 of Sec. 9.6. Suppose now that we are not confident that the variances are the same for the two treatment groups. Perform a parametric bootstrap analysis of the sort done in Example 12.6.10. Use v = 10,000 bootstrap simulations.
   (a) Estimate the probability of type I error for a two-sample t test whose nominal level is α0 = 0.1.
   (b) Correct the level of the two-sample t test by computing the appropriate quantile of the bootstrap distribution of |U(i)|.
   (c) Compute the simulation standard error for the quantile in part (b).
8. In Example 12.6.7, let (X∗, Y∗) be a random draw from the sample distribution Fn. Prove that the correlation between X∗ and Y∗ is R in Eq. (12.6.2).
9. Use the data on fish prices in Table 11.6 on page 707. Suppose that we assume only that the distribution of fish prices in 1970 and 1980 is a continuous joint distribution with finite variances. We are interested in the properties of the sample correlation coefficient. Construct 1000 nonparametric bootstrap samples for solving this exercise.
   (a) Approximate the bootstrap estimate of the variance of the sample correlation.
   (b) Approximate the bootstrap estimate of the bias of the sample correlation.
   (c) Compute simulation standard errors of each of the above bootstrap estimates.
10. Use the beef hot dog data in Exercise 7 of Sec. 8.5. Form 10,000 nonparametric bootstrap samples to solve this exercise.
   (a) Approximate a 90 percent percentile bootstrap confidence interval for the median calorie count in beef hot dogs.
   (b) Approximate a 90 percent percentile-t bootstrap confidence interval for the median calorie count in beef hot dogs.
   (c) Compare these intervals to the 90 percent interval formed using the assumption that the data came from a normal distribution.
11. The skewness of a random variable was defined in Definition 4.4.1. Suppose that X1, . . . , Xn form a random sample from a distribution F. The sample skewness is defined as

$$M_3 = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^{3}}{\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^{2}\right]^{3/2}}.$$

One might use M3 as an estimator of the skewness θ of the distribution F. The bootstrap can be used to estimate the bias and standard deviation of the sample skewness as an estimator of θ.
   (a) Prove that M3 is the skewness of the sample distribution Fn.
   (b) Use the 1970 fish price data in Table 11.6 on page 707. Compute the sample skewness, and then simulate 1000 bootstrap samples. Use the bootstrap samples to estimate the bias and standard deviation of the sample skewness.
12. Suppose that (X1, Y1), . . . , (Xn, Yn) form a random sample from a bivariate normal distribution with means μ_x and μ_y, variances σ_x² and σ_y², and correlation ρ. Let R be the sample correlation. Prove that the distribution of R depends only on ρ, not on μ_x, μ_y, σ_x², or σ_y².
12.7 Supplementary Exercises
1. Test the standard normal pseudo-random number generator on your computer by generating a sample of size 10,000 and drawing a normal quantile plot. How straight does the plot appear to be?
2. Test the gamma pseudo-random number generator on your computer. Simulate 10,000 gamma pseudo-random variables with parameters a and 1 for a = 0.5, 1, 1.5, 2, 5. Then draw gamma quantile plots.
3. Test the t pseudo-random number generator on your computer. Simulate 10,000 t pseudo-random variables with m degrees of freedom for m = 1, 2, 5, 10, 20. Then draw t quantile plots.
4. Let X and Y be independent random variables with X having the t distribution with five degrees of freedom and Y having the t distribution with three degrees of freedom. We are interested in E(|X − Y|).
   (a) Simulate 1000 pairs of (Xi, Yi) each with the above joint distribution and estimate E(|X − Y|).
   (b) Use your 1000 simulated pairs to estimate the variance of |X − Y| also.
   (c) Based on your estimated variance, how many simulations would you need to be 99 percent confident that your estimator of E(|X − Y|) is within 0.01 of the actual mean?
5. Consider the power calculation done in Example 9.5.5.
   (a) Simulate v₀ = 1000 i.i.d. noncentral t pseudo-random variables with 14 degrees of freedom and noncentrality parameter 1.936.
   (b) Estimate the probability that a noncentral t random variable with 14 degrees of freedom and noncentrality parameter 1.936 is at least 1.761. Also, compute the simulation standard error.
   (c) Suppose that we want our estimator of the noncentral t probability in part (b) to be closer than 0.01 to the true value with probability 0.99. How many noncentral t random variables do we need to simulate?
6. The χ² goodness-of-fit test (see Chapter 10) is based on an asymptotic approximation to the distribution of the test statistic. For small to medium samples, the asymptotic approximation might not be very good. Simulation can be used to assess how good the approximation is. Simulation can also be used to estimate the power function of a goodness-of-fit test. For this exercise, assume that we are performing the test that was done in Example 10.1.6. The idea illustrated in this exercise applies in all such problems.
   (a) Simulate v = 10,000 samples of size n = 23 from the normal distribution with mean 3.912 and variance 0.25. For each sample, compute the χ² goodness-of-fit statistic Q using the same four intervals that were used in Example 10.1.6. Use the simulations to estimate the probability that Q is greater than or equal to the 0.9, 0.95, and 0.99 quantiles of the χ² distribution with three degrees of freedom.
   (b) Suppose that we are interested in the power function of a χ² goodness-of-fit test when the actual distribution of the data is the normal distribution with mean 4.2 and variance 0.8. Use simulation to estimate the power function of the level 0.1, 0.05, and 0.01 tests at the alternative specified.
7. In Sec. 10.2, we discussed χ² goodness-of-fit tests for composite hypotheses. These tests required computing M.L.E.'s based on the numbers of observations that fell into the different intervals used for the test. Suppose instead that we use the M.L.E.'s based on the original observations. In this case, we claimed that the asymptotic distribution of the χ² test statistic was somewhere between two different χ² distributions. We can use simulation to better approximate the distribution of the test statistic. In this exercise, assume that we are trying to test the same hypotheses as in Example 10.2.5, although the methods will apply in all such cases.
   (a) Simulate v = 1000 samples of size n = 23 from each of 10 different normal distributions. Let the normal distributions have means of 3.8, 3.9, 4.0, 4.1, and 4.2. Let the distributions have variances of 0.25 and 0.8. Use all 10 combinations of mean and variance. For each simulated sample, compute the χ² statistic Q using the usual M.L.E.'s of μ and σ². For each of the 10 normal distributions, estimate the 0.9, 0.95, and 0.99 quantiles of the distribution of Q.
   (b) Do the quantiles change much as the distribution of the data changes?
   (c) Consider the test that rejects the null hypothesis if Q ≥ 5.2. Use simulation to estimate the power function of this test at the following alternative: For each i, (Xi − 3.912)/0.5 has the t distribution with five degrees of freedom.
8. In Example 12.5.6, we used a hierarchical model. In that model, the parameters μ1, . . . , μp were independent random variables with μi having the normal distribution with mean ψ and precision λ0τi conditional on ψ and τ1, . . . , τp. To make the model more general, we could also replace λ0 by an unknown parameter λ. That is, let the μi's be independent with μi having the normal distribution with mean ψ and precision λτi conditional on ψ, λ, and τ1, . . . , τp. Let λ have the gamma distribution with parameters γ0 and δ0, and let λ be independent of ψ and τ1, . . . , τp. The remaining parameters have the prior distributions stated in Example 12.5.6.
   (a) Write the product of the likelihood and the prior as a function of the parameters μ1, . . . , μp, τ1, . . . , τp, ψ, and λ.
   (b) Find the conditional distributions of each parameter given all of the others. Hint: For all the parameters besides λ, the distributions should be almost identical to those given in Example 12.5.6. Wherever λ0 appears, of course, something will have to change.
   (c) Use a prior distribution in which α0 = 1, β0 = 0.1, u0 = 0.001, γ0 = δ0 = 1, and ψ0 = 170. Fit the model to the hot dog calorie data from Example 11.6.2. Compute the posterior means of the four μi's and 1/τi's.
9. In Example 12.5.6, we modeled the parameters τ1, . . . , τp as i.i.d. having the gamma distribution with parameters α0 and β0. We could have added a level to the hierarchical model that would allow the τi's to come from a distribution with an unknown parameter. For example, suppose that we model the τi's as conditionally independent having the gamma distribution with parameters α0 and β given β. Let β be independent of ψ and μ1, . . . , μp with β having the gamma distribution with parameters 0 and φ0. The rest of the prior distributions are as specified in Example 12.5.6.
   (a) Write the product of the likelihood and the prior as a function of the parameters μ1, . . . , μp, τ1, . . . , τp, ψ, and β.
   (b) Find the conditional distributions of each parameter given all of the others. Hint: For all the parameters besides β, the distributions should be almost identical to those given in Example 12.5.6. Wherever β0 appears, of course, something will have to change.
   (c) Use a prior distribution in which α0 = λ0 = 1, u0 = 0.001, 0 = 0.3, φ0 = 3.0, and ψ0 = 170. Fit the model to the hot dog calorie data from Example 11.6.2. Compute the posterior means of the four μi's and 1/τi's.
10. Let X1, . . . , Xk be independent random variables such that Xi has the binomial distribution with parameters ni and pi. We wish to test the null hypothesis H0: p1 = . . . = pk versus the alternative hypothesis H1 that H0 is false. Assume that the numbers n1, . . . , nk are known constants.
   (a) Show that the likelihood ratio test procedure is to reject H0 if the following statistic is greater than or equal to some constant c:

$$\frac{\prod_{i=1}^{k} X_i^{X_i}\,(n_i - X_i)^{\,n_i - X_i}}{\left(\sum_{j=1}^{k} X_j\right)^{\sum_{j=1}^{k} X_j}\left(\sum_{j=1}^{k} (n_j - X_j)\right)^{\sum_{j=1}^{k} (n_j - X_j)}}.$$

   (b) Describe how you could use simulation techniques to estimate the constant c in order to make the likelihood ratio test have a desired level of significance α0. (Assume that you can simulate as many binomial pseudo-random variables as you wish.)
   (c) Consider the depression study in Example 2.1.4. Let pi stand for the probability of success (no relapse) for the subjects in group i of Table 2.1 on page 57, where i = 1 means imipramine, i = 2 means lithium, i = 3 means combination, and i = 4 means placebo. Test the null hypothesis that p1 = p2 = p3 = p4 by computing the p-value for the likelihood ratio test.
11. Consider the problem of testing the equality of two normal means when the variances are unequal. This problem was introduced on page 593 in Sec. 9.6. The data are two independent samples X1, . . . , Xm and Y1, . . . , Yn. The Xi's are i.i.d. having the normal distribution with mean μ1 and variance σ²₁, while the Yj's are i.i.d. having the normal distribution with mean μ2 and variance σ²₂.
   (a) Assume that μ1 = μ2. Prove that the random variable V in Eq. (9.6.14) has a distribution that depends on the parameters only through the ratio σ₂/σ₁.
   (b) Let ν be the approximate degrees of freedom for Welch's procedure from Eq. (9.6.17). Prove that the distribution of ν depends on the parameters only through the ratio σ₂/σ₁.
   (c) Use simulation to assess the approximation in Welch's procedure. In particular, set the ratio σ₂/σ₁ equal to each of the numbers 1, 1.5, 2, 3, 5, and 10 in succession. For each value of the ratio, simulate 10,000 samples of sizes n = 11 and m = 10 (or the appropriate summary statistics). For each simulated sample, compute the test statistic V and the 0.9, 0.95, and 0.99 quantiles of the approximate t distribution that corresponds to the data in that simulation. Keep track of the proportion of simulations in which V is greater than each of the three quantiles. How do these proportions compare to the nominal values 0.1, 0.05, and 0.01?
12. Consider again the situation described in Exercise 11. This time, use simulation to assess the performance of the usual two-sample t test. That is, use the same simulations as in part (c) of Exercise 11 (or ones just like them if you do not have the same simulations). This time, for each simulated sample compute the statistic U in Eq. (9.6.3) and keep track of the proportion of simulations in which U is greater than each of the nominal t quantiles, T⁻¹₁₉(1 − α0) for α0 = 0.1, 0.05, and 0.01. How do these proportions compare to the nominal α0 values?
13. Suppose that our data comprise a set of pairs (Yi, xi), for i = 1, . . . , n. Here, each Yi is a random variable and each xi is a known constant. Suppose that we use a simple linear regression model in which E(Yi) = β0 + β1xi. Let β̂1 stand for the least squares estimator of β1. Suppose, however, that the Yi's are actually random variables with translated and scaled t distributions. In particular, suppose that (Yi − β0 − β1xi)/σ are i.i.d. having the t distribution with k ≥ 5 degrees of freedom for i = 1, . . . , n. We can use simulation to estimate the standard deviation of the sampling distribution of β̂1.
   (a) Prove that the variance of the sampling distribution of β̂1 does not depend on the values of the parameters β0 and β1.
   (b) Prove that the variance of the sampling distribution of β̂1 is equal to vσ², where v does not depend on any of the parameters β0, β1, and σ.
   (c) Describe a simulation scheme to estimate the value v from part (b).
14. Use the simulation scheme developed in Exercise 13 and the data in Table 11.5 on page 699. Suppose that we think that the logarithms of pressure are linearly related to boiling point, but that the logarithms of pressure have translated and scaled t distributions with k = 5 degrees of freedom. Estimate the value v from part (b) of Exercise 13 using simulation.
15. In Sec. 7.4, we introduced Bayes estimators. For simple loss functions, such as squared error and absolute error, we were able to derive general forms for Bayes estimators. In many real problems, loss functions are not so simple. Simulation can often be used to approximate Bayes estimators. Suppose that we are able to simulate a sample θ(1), . . . , θ(v) (either directly or by Gibbs sampling) from the posterior distribution of some parameter θ given some observed data X = x. Here, θ can be real valued or multidimensional. Suppose that we have a loss function L(θ, a), and we want to choose a so as to minimize the posterior mean E[L(θ, a)|x].
   (a) Describe a general method for approximating the Bayes estimate in the situation described above.
   (b) Suppose that the simulation variance of the approximation to the Bayes estimate is proportional to 1 over the size of the simulation. How could one compute a simulation standard error for the approximation to the Bayes estimate?
16. In Example 12.5.2, suppose that the State of New Mexico wishes to estimate the mean number μ of medical in-patient days in nonrural nursing homes. The parameter is θ = (μ, τ). The loss function will be asymmetric to reflect different costs of underestimating and overestimating. Suppose that the loss function is

$$L(\theta, a) = \begin{cases} 30(a - \mu) & \text{if } a \ge \mu, \\ (\mu - a)^{2} & \text{if } \mu > a. \end{cases}$$

Use the method developed in your solution to Exercise 15 to approximate the Bayes estimate and compute a simulation standard error.