Bootstrapping Hypothesis Tests

David J. Olive∗

Southern Illinois University

December 28, 2014

Abstract

Consider testing $H_0: \mu = c$ versus $H_1: \mu \neq c$ where $\mu$ is an $r \times 1$ parameter vector and $c$ is a known $r \times 1$ vector. Let $\hat{\mu}$ be a consistent estimator of $\mu$, make a bootstrap sample $w_i = \hat{\mu}_i^* - c$ for $i = 1, ..., B$, form a prediction region for the $w_i$, and determine whether $0$ is in the prediction region. It will be shown that this prediction region method generalizes the percentile method for $r = 1$ to $r \geq 1$. Bootstrapping test statistics is well known, and the prediction region method can be regarded as a special case where the test statistic is the squared Mahalanobis distance $D_i^{2*} = (T_i^* - \overline{T}^*)^T [S_T^*]^{-1} (T_i^* - \overline{T}^*)$ where $w_i = T_i^*$, and $\overline{T}^*$ and $S_T^*$ are the sample mean and sample covariance matrix of $T_1^*, ..., T_B^*$. There are many applications, including bootstrapping variable selection models and bootstrapping a correlation matrix.

KEY WORDS: bagging, bootstrap, confidence region, multiple linear regression, prediction region, variable selection

∗David J. Olive is Professor, Department of Mathematics, Southern Illinois University, Carbondale, IL 62901, USA.

1 Introduction

Consider testing $H_0: \mu = c$ versus $H_1: \mu \neq c$ where $c$ is a known $r \times 1$ vector. If a confidence region can be constructed for $\mu - c$, then fail to reject $H_0$ if $0$ is in the confidence region, and reject $H_0$ if $0$ is not in the confidence region.

Given training data $w_1, ..., w_n$, a large sample $100(1-\delta)\%$ prediction region for a future test value $w_f$ is a set $A_n$ such that $P(w_f \in A_n) \to 1 - \delta$ as $n \to \infty$, while a large sample confidence region for a parameter $\mu$ is a set $A_n$ such that $P(\mu \in A_n) \to 1 - \delta$ as $n \to \infty$. The duality between hypothesis tests and confidence regions is well known. For bootstrap samples, the percentile method is simultaneously a large sample confidence interval for $\mu$ and a large sample prediction interval for a future bootstrap statistic $T_{f,n}^*$. It is natural to try to extend the percentile method to vector valued statistics and parameters by using a large sample prediction region for a future bootstrap statistic $T_{f,n}^*$ as a large sample confidence region for $\mu$, but practical large sample prediction regions for dimensions greater than one with an unknown underlying distribution have only recently been developed. See Olive (2013, 2014b) and Lei, Robins, and Wasserman (2013).

The percentile method uses an interval that contains $d_B \approx k_B = \lceil B(1-\delta) \rceil$ of the $T_{i,n}^*$ from a bootstrap sample $T_{1,n}^*, ..., T_{B,n}^*$, where the statistic $T_{i,n}$ is an estimator of $\mu$ based on a sample of size $n$. Often the $n$ is suppressed. Here $\lceil x \rceil$ is the smallest integer $\geq x$, e.g. $\lceil 7.8 \rceil = 8$. Let $T_{(1)}^*, T_{(2)}^*, ..., T_{(B)}^*$ be the order statistics of the bootstrap sample. Then one version of the percentile method discards the largest and smallest $\lceil B\delta/2 \rceil$ order statistics, resulting in an interval $(L_B, U_B)$ that is a large sample $100(1-\delta)\%$ confidence interval for $\mu$ and also a large sample $100(1-\delta)\%$ prediction interval for a future bootstrap value $T_{f,n}^*$. Janssen and Pauls (2003) and Mammen (1992) suggest that the bootstrap works if there is a central limit theorem for the statistic $T_{i,n}$.

Olive (2014a, p. 283) recommends using the shorth($c$) estimator for the percentile method. Let $c = k_B$ and let $W_i = T_{i,n}^*$. Let $W_{(1)}, ..., W_{(B)}$ be the order statistics of the $W_i$. Compute $W_{(c)} - W_{(1)}, W_{(c+1)} - W_{(2)}, ..., W_{(B)} - W_{(B-c+1)}$, and let $[W_{(s)}, W_{(s+c-1)}]$ be the closed interval with the smallest width. Then reject $H_0: \mu = \mu_0$ if $\mu_0$ is not in the interval.
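The shorth($c$) interval is easy to compute. A minimal R sketch is given below; the function name shorth.ci and the exponential example data are illustrative only (the lregpack function shorth2 described in Section 6 serves a similar purpose), and Frey's (2013) corrected coverage proportion, discussed immediately after this sketch, is used for $c$.

# shorth(c) interval from a scalar bootstrap sample Tstar,
# using Frey's (2013) corrected coverage proportion
shorth.ci <- function(Tstar, delta = 0.05) {
  B <- length(Tstar)
  cc <- min(B, ceiling(B * (1 - delta + 1.12 * sqrt(delta / B))))
  W <- sort(Tstar)
  widths <- W[cc:B] - W[1:(B - cc + 1)]   # widths of intervals covering cc order statistics
  s <- which.min(widths)                  # shortest such closed interval
  c(W[s], W[s + cc - 1])
}

# Example: bootstrap a sample mean, then reject H0: mu = mu0 if mu0 is outside the interval.
set.seed(1)
x <- rexp(100)
Tstar <- replicate(1000, mean(sample(x, replace = TRUE)))
shorth.ci(Tstar)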
The shorth interval tends to be shorter than the interval that deletes the smallest and largest $\lceil B\delta/2 \rceil$ observations $W_i$ when the $W_i$ do not come from a symmetric distribution. Frey (2013) showed that for large $B\delta$ and iid data, the shorth($k_B$) prediction interval has undercoverage $\approx 1.12\sqrt{\delta/B}$, and used the shorth($c$) estimator as the large sample $100(1-\delta)\%$ prediction interval where $c = \lceil B[1 - \delta + 1.12\sqrt{\delta/B}\,] \rceil$. Hence if $B = 1000$, there may be about 1% undercoverage when $c = k_B$ is used.

Some notation is needed to give the Olive (2013) prediction region used to bootstrap a hypothesis test. Suppose $w_1, ..., w_n$ are iid $p \times 1$ random vectors with mean $\mu$ and nonsingular covariance matrix $\Sigma_w$. Let a future test observation $w_f$ be independent of the $w_i$ but from the same distribution. Let $(\overline{w}, S)$ be the sample mean and sample covariance matrix, where
\[
\overline{w} = \frac{1}{n}\sum_{i=1}^n w_i \quad \mbox{and} \quad S = S_w = \frac{1}{n-1}\sum_{i=1}^n (w_i - \overline{w})(w_i - \overline{w})^T. \tag{1}
\]
Then the $i$th squared sample Mahalanobis distance is the scalar
\[
D_w^2 = D_w^2(\overline{w}, S) = (w - \overline{w})^T S^{-1} (w - \overline{w}). \tag{2}
\]
Let $D_i^2 = D_{w_i}^2$ for each observation $w_i$, and let $D_{(c)}$ be the $c$th order statistic of $D_1, ..., D_n$. Following Olive (2013), a large sample $100(1-\delta)\%$ prediction region for $w_f$ is the hyperellipsoid
\[
A_n = \{w : D_w^2(\overline{w}, S) \leq D_{(c)}^2\} = \{w : D_w(\overline{w}, S) \leq D_{(c)}\}. \tag{3}
\]
If $n$ is large, one can use $c = k_n = \lceil n(1-\delta) \rceil$. If $n$ is not large, using $c = d_n$, where $d_n$ decreases to $k_n$, can improve small sample performance. Olive (2013) showed that (3) is a large sample $100(1-\delta)\%$ prediction region for a large class of distributions, although regions with smaller volumes may exist. The result follows since, if $\Sigma_w$ and $S$ are nonsingular, the Mahalanobis distance is a continuous function of $(\overline{w}, S)$. Let $D = D(\mu, \Sigma_w)$. Then $D_i \stackrel{D}{\to} D$ and $D_i^2 \stackrel{D}{\to} D^2$. Hence the sample percentiles of the $D_i$ are consistent estimators of the population percentiles of $D$ at continuity points of the cumulative distribution function (cdf) of $D$. The prediction region estimates the highest density region for a large class of elliptically contoured distributions. See Olive (2014b) for more on prediction regions.

The prediction region method makes a bootstrap sample $w_i = \hat{\mu}_i^* - c$ for $i = 1, ..., B$, forms the prediction region (3) for the $w_i$, and determines whether $0$ is in the prediction region. As shown below, the prediction region method is a special case of the percentile method and a special case of bootstrapping a test statistic.

Consider testing $H_0: \mu = c$ versus $H_1: \mu \neq c$, and the statistic $T_i = \hat{\mu} - c$. If $E(T_i) = \theta$ and $\mathrm{Cov}(T_i) = \Sigma_T$ were known, then the squared Mahalanobis distance $D_i^2(\theta, \Sigma_T) = (T_i - \theta)^T \Sigma_T^{-1} (T_i - \theta)$ would be a natural statistic to use if the percentile $D_{1-\delta}^2(\theta, \Sigma_T)$ were known. The prediction region method bootstraps the squared Mahalanobis distances, forming the bootstrap sample $w_i = T_i^* = \hat{\mu}_i^* - c$ and the squared Mahalanobis distances
\[
D_i^{2*} = D_i^2(\overline{T}^*, S_T^*) = (T_i^* - \overline{T}^*)^T [S_T^*]^{-1} (T_i^* - \overline{T}^*),
\]
where $\overline{T}^* = \frac{1}{B}\sum_{i=1}^B T_i^*$ and $S_T^* = \frac{1}{B-1}\sum_{i=1}^B (T_i^* - \overline{T}^*)(T_i^* - \overline{T}^*)^T$ are the sample mean and sample covariance matrix of $T_1^*, ..., T_B^*$. Then the percentile method is applied to the distances: the closed interval $[0, D_{(d_B)}] = [0, D_{(d_B)}^*]$ contains the smallest $d_B \approx B(1-\delta)$ of the distances. If $H_0$ is true and $E[\hat{\mu}] = c$, then $\theta = 0$. Let $D_0^2 = \overline{T}^{*T} [S_T^*]^{-1} \overline{T}^*$, and fail to reject $H_0$ if $D_0 \leq D_{(d_B)}$ and reject $H_0$ if $D_0 > D_{(d_B)}$.
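A minimal R sketch of this test is given below. It is similar in spirit to the lregpack function predreg described in Section 6, but the function name predregion.test and its arguments are illustrative. The input is a $B \times r$ matrix whose $i$th row is $w_i = \hat{\mu}_i^* - c$; the output gives $D_0$, the cutoff $D_{(d_B)}$, and the test decision.

# prediction region method test: w is a B x r matrix with rows w_i = muhat*_i - c
predregion.test <- function(w, delta = 0.05) {
  w <- as.matrix(w)
  B <- nrow(w); r <- ncol(w)
  Tbar <- colMeans(w)                              # sample mean of the w_i
  ST <- cov(w)                                     # sample covariance matrix S*_T
  D2 <- mahalanobis(w, center = Tbar, cov = ST)    # squared distances D_i^2
  dB <- ceiling(B * (1 - delta))                   # a Frey-type correction could also be used
  cutoff <- sqrt(sort(D2)[dB])                     # D_(dB)
  D0 <- sqrt(mahalanobis(rep(0, r), center = Tbar, cov = ST))
  list(D0 = D0, cutoff = cutoff, reject = (D0 > cutoff))
}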
This percentile method is equivalent to computing the prediction region (3) on the $w_i = T_i^*$ and checking whether $0$ is in the prediction region. Note that the percentile method makes an interval that contains $d_B \approx B(1-\delta)$ of the scalar valued $T_i^*$. The prediction region method makes a hyperellipsoid that contains $d_B$ of the $r \times 1$ vectors $T_i^* = w_i$, and equivalently makes an interval $[0, D_{(d_B)}]$ that contains $d_B$ of the $D_i^* = D_i$. When $r = 1$, a hyperellipsoid is an interval, so the prediction region method is a special case of the percentile method. Suppose the parameter of interest is $\mu$ and there is a bootstrap sample $T_1^*, ..., T_B^*$. Let $\overline{T}^*$ and $S_T^{2*}$ be the sample mean and variance of the $T_i^*$. Then the squared Mahalanobis distance condition $D_\mu^2 = (\mu - \overline{T}^*)^2 / S_T^{2*} \leq D_{(d_B)}^2$ is equivalent to $\mu \in [\overline{T}^* - S_T^* D_{(d_B)},\ \overline{T}^* + S_T^* D_{(d_B)}]$, an interval centered at $\overline{T}^*$ just long enough to cover $d_B \approx B(1-\delta)$ of the $T_i^*$. Hence this interval is a version of the percentile method.

Bootstrapping test statistics is well known, and the prediction region method is a special case of this method using $D_i^{2*} = D_i^2(\overline{T}^*, S_T^*)$ as the test statistic. The point of the above discussion is that the prediction region method can be thought of as a variant of two widely used methods. The method should be regarded as exploratory until theory proves that it is a large sample test, but similar remarks apply to other bootstrap methods such as the percentile method. Polansky (2008, p. 73) summarizes a bootstrap percentile confidence region suggested by Efron and Tibshirani (1998).

Section 2 explains why the prediction region method works, Section 3 examines the method for multiple linear regression, Section 4 examines the method for variable selection, and Section 5 gives an example and some simulations.

2 Why the Prediction Region Method Works

Several estimators of $\mu$ could be used, including $T_n = \hat{\mu}_n$, $T_n = t(\hat{\theta})$ where $\hat{\theta}$ is a $p \times 1$ vector and $\mu = t(\theta)$, or $T_n = t(W)$ where all of the data is collected into an $n \times k$ matrix $W$. If the rows of $W$ are iid cases $w_i$, then $t(W) = t(F)$ denotes that an iid sample from a distribution with cdf $F$ was used, while $t(W^*) = t(F_n) = T_n^*$ means that $t$ was computed from an iid sample from the empirical distribution: a sample of size $n$ drawn with replacement from the observed sample $w_1, ..., w_n$. It is often assumed that $t(F) - t(F_n) \stackrel{P}{\to} 0$. When the bootstrap is used, a large sample confidence region for a parameter $\mu$ is a set $A_{n,B}$ such that $P(\mu \in A_{n,B}) \to 1 - \delta$ as $n, B \to \infty$.

For this section, assume $n S_T^* \stackrel{P}{\to} \Sigma_T$ as $n, B \to \infty$ where $\Sigma_T$ and $S_T^*$ are nonsingular $r \times r$ matrices, $\mu$ is an $r \times 1$ parameter, and $T_n$ is an estimator of $\mu$ such that
\[
\sqrt{n}\,(T_n - \mu) \stackrel{D}{\to} X \tag{4}
\]
as $n \to \infty$. Then
\[
\sqrt{n}\,\Sigma_T^{-1/2}(T_n - \mu) \stackrel{D}{\to} \Sigma_T^{-1/2} X = Z, \quad
n\,(T_n - \mu)^T \hat{\Sigma}_T^{-1} (T_n - \mu) \stackrel{D}{\to} Z^T Z = D^2
\]
as $n \to \infty$, where $\hat{\Sigma}_T$ is a consistent estimator of $\Sigma_T$, and
\[
(T_n - \mu)^T [S_T^*]^{-1} (T_n - \mu) \stackrel{D}{\to} D^2 \tag{5}
\]
as $n, B \to \infty$. Hence
\[
\{w : (w - T_n)^T [S_T^*]^{-1} (w - T_n) \leq D_{1-\delta}^2\} = \{w : D_w^2(T_n, S_T^*) \leq D_{1-\delta}^2\} \tag{6}
\]
is a large sample $100(1-\delta)\%$ confidence region for $\mu$, where $P(D^2 \leq D_{1-\delta}^2) = 1 - \delta$. Often, by a central limit theorem or the multivariate delta method, $\sqrt{n}\,(T_n - \mu) \stackrel{D}{\to} N_r(0, \Sigma_T)$ and $D^2 \sim \chi_r^2$.

The prediction region method uses
\[
\{w : (w - \overline{T}^*)^T [S_T^*]^{-1} (w - \overline{T}^*) \leq D_{(d_B)}^2\} = \{w : D_w^2(\overline{T}^*, S_T^*) \leq D_{(d_B)}^2\}. \tag{7}
\]
Competitors include
\[
\{w : (w - T_n)^T [S_T^*]^{-1} (w - T_n) \leq D_{(d_B)}^2\} = \{w : D_w^2(T_n, S_T^*) \leq D_{(d_B)}^2\} \tag{8}
\]
and
\[
\{w : (w - T_n)^T [S_T^*]^{-1} (w - T_n) \leq \chi_{r,1-\delta}^2\} = \{w : D_w^2(T_n, S_T^*) \leq \chi_{r,1-\delta}^2\}. \tag{9}
\]
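As a small illustration of the difference between (7) and (9), the data-driven cutoff $D^2_{(d_B)}$ and the $\chi^2_{r,1-\delta}$ cutoff can be computed from the same bootstrap sample; a minimal R sketch follows (Tstar is an assumed $B \times r$ matrix of bootstrap statistics, and the function name cutoff.compare is hypothetical). Region (8) recenters (7) at $T_n$; only the cutoffs are compared here.

# compare the cutoff of region (7) with the chi-square cutoff of region (9)
cutoff.compare <- function(Tstar, delta = 0.05) {
  B <- nrow(Tstar); r <- ncol(Tstar)
  D2 <- mahalanobis(Tstar, colMeans(Tstar), cov(Tstar))
  dB <- ceiling(B * (1 - delta))
  c(D2.dB = sort(D2)[dB], chisq.cutoff = qchisq(1 - delta, df = r))
}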
Plugging bootstrap quantities into classical methods is not new, so (6) is not new. Often $D_{1-\delta}^2$ is unknown, so what is new is using $D_{(d_B)}^2$ to estimate $D_{1-\delta}^2$ instead of assuming $D^2 \sim \chi_r^2$ as in (9).

Suppose the $T_i^* = T_{i,n}^*$ are iid from some distribution with cdf $\tilde{F}_n$. For example, if $T_{i,n}^* = t(F_n)$ where iid samples from $F_n$ are used, then $\tilde{F}_n$ is the cdf of $t(F_n)$. Fix $n$, and let $E(T_{i,n}^*) = \mu_n$ and $\mathrm{Cov}(T_{i,n}^*) = \Sigma_n$. With respect to $\tilde{F}_n$, $\mu_n$ and $\Sigma_n$ are parameters, but with respect to $F$, $\mu_n$ is a random vector and $\Sigma_n$ is a random matrix. For example, using least squares and the residual bootstrap for the multiple linear regression model in Section 3, $\mu_n = \hat{\beta}$, $\Sigma_n = \frac{n-p}{n} MSE\,(X^TX)^{-1}$, and $\Sigma_T = \sigma^2 \lim_{n\to\infty} (X^TX/n)^{-1}$. Then by the multivariate central limit theorem,
\[
\sqrt{B}\,(\overline{T}^* - \mu_n) \stackrel{D}{\to} N_r(0, \Sigma_n) \quad \mbox{and} \quad B\,(\overline{T}^* - \mu_n)^T [S_T^*]^{-1} (\overline{T}^* - \mu_n) \stackrel{D}{\to} \chi_r^2
\]
as $B \to \infty$. If the $T_i^*$ are iid and
\[
D_{T_i^*}^2(\overline{T}^*, S_T^*) = (T_i^* - \overline{T}^*)^T [S_T^*]^{-1} (T_i^* - \overline{T}^*) \stackrel{D}{\to} D^2 \tag{10}
\]
as $n, B \to \infty$, then $D_{(d_B)}^2 \stackrel{P}{\to} D_{1-\delta}^2$. Equation (10) holds if
\[
D_{T_i^*}^2(\mu_n, S_T^*) = (T_i^* - \mu_n)^T [S_T^*]^{-1} (T_i^* - \mu_n) \stackrel{D}{\to} D^2 \tag{11}
\]
as $n, B \to \infty$ and $B^{-1/2}[S_T^*]^{-1/2}(T_i^* - \mu_n) \stackrel{P}{\to} 0$. Note that (11) holds if $\sqrt{n}\,(T_n - \mu) \stackrel{D}{\to} X$ and $\sqrt{n}\,(T_i^* - \mu_n) \stackrel{D}{\to} X$, which is often assumed with $\mu_n = T_n$. Following Efron (2014), $\overline{T}^*$ is the bagging or smoothed bootstrap estimator of $\mu$.

For the prediction region method (7) to be a large sample $100(1-\delta)\%$ confidence region for $\mu$, we need $D_{(d_B)}^2 \stackrel{P}{\to} D_{1-\delta}^2$ and $D_\mu^2(\overline{T}^*, S_T^*) \stackrel{D}{\to} D^2$ as $n, B \to \infty$. Now
\[
D_\mu^2(\overline{T}^*, S_T^*) = (\mu - T_n + T_n - \overline{T}^*)^T [S_T^*]^{-1} (\mu - T_n + T_n - \overline{T}^*)
\]
\[
= (\mu - T_n)^T [S_T^*]^{-1} (\mu - T_n) + 2(\mu - T_n)^T [S_T^*]^{-1} (T_n - \overline{T}^*) + (T_n - \overline{T}^*)^T [S_T^*]^{-1} (T_n - \overline{T}^*) \stackrel{D}{\to} D^2
\]
if the last two terms converge to 0 in probability as $n, B \to \infty$. This will occur, for example, if $[S_T^*]^{-1/2}(T_n - \overline{T}^*) \stackrel{P}{\to} 0$, or if the final quadratic term converges to 0 in probability as $n, B \to \infty$ (the cross term then also converges to 0 since the first term is bounded in probability). The final term is often $O_P(B^{-1})$, for example if $\mu_n = T_n$, as is often the case.

For $r = 1$, Efron (2014) uses confidence intervals $\overline{T}^* \pm z_{1-\delta}\,SE(\overline{T}^*)$ where $P(Z \leq z_{1-\delta}) = 1 - \delta$ if $Z \sim N(0,1)$. Efron uses a delta method estimate of $SE(\overline{T}^*)$ to avoid the computationally expensive double bootstrap. The prediction region method avoids assuming a normal limiting distribution and estimates the cutoff using quantiles of the Mahalanobis distances of the $T_{i,n}^*$ from $\overline{T}^*$.

3 Bootstrap Tests for Multiple Linear Regression

Consider the multiple linear regression model $Y_i = x_i^T\beta + e_i$ for $i = 1, ..., n$, written in matrix form as $Y = X\beta + e$ where $Y$ is $n \times 1$ and $X$ is $n \times p$. Consider testing $H_0: A\beta = c$ where $A$ is an $r \times p$ matrix with full rank $r$ and $\mu = A\beta$. To perform the test, suppose a bootstrap sample $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$ has been generated. Form the prediction region (3) for $w_1 = A\hat{\beta}_1^* - c, ..., w_B = A\hat{\beta}_B^* - c$. If $0$ is in the prediction region, fail to reject $H_0$; otherwise reject $H_0$. It is useful to compare the bootstrap tests with classical tests.

Methods for bootstrapping this model are well known. The estimated covariance matrix of the (ordinary) least squares estimator is
\[
\widehat{\mathrm{Cov}}(\hat{\beta}_{OLS}) = MSE\,(X^TX)^{-1}.
\]
The residual bootstrap computes the least squares estimator and obtains the $n$ residuals and fitted values $r_1, ..., r_n$ and $\hat{Y}_1, ..., \hat{Y}_n$. Then a sample of size $n$ is selected with replacement from the residuals, resulting in $r_{11}^*, ..., r_{1n}^*$. Hence the empirical distribution of the residuals is used. Then a vector $Y_1^* = (Y_{11}^*, ..., Y_{1n}^*)^T$ is formed where $Y_{1j}^* = \hat{Y}_j + r_{1j}^*$, and $Y_1^*$ is regressed on $X$, resulting in the estimator $\hat{\beta}_1^*$. This process is repeated $B$ times, resulting in the estimators $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$. This method should have $n > 10p$ so that the residuals $r_i$ are close to the errors $e_i$.
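A minimal R sketch of the residual bootstrap just described is given below (similar in spirit to the lregpack function regboot used in Section 5; the function name resid.boot is illustrative). Here x is an $n \times (p-1)$ matrix of nontrivial predictors, and a constant is included automatically by lsfit.

# residual bootstrap for least squares multiple linear regression
resid.boot <- function(x, y, B = 1000) {
  x <- as.matrix(x); n <- length(y)
  fit <- lsfit(x, y)                      # least squares fit of the full model (with intercept)
  yhat <- y - fit$residuals               # fitted values
  betas <- matrix(0, nrow = B, ncol = ncol(x) + 1)
  for (i in 1:B) {
    ystar <- yhat + sample(fit$residuals, n, replace = TRUE)   # Y*_ij = Yhat_j + r*_ij
    betas[i, ] <- lsfit(x, ystar)$coef                         # betahat*_i
  }
  betas
}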
Efron (1982, p. 36) notes that for the residual bootstrap, the sample covariance matrix of the $\hat{\beta}_i^*$ is estimating the population bootstrap matrix $\frac{n-p}{n} MSE\,(X^TX)^{-1}$ as $B \to \infty$. Hence the residual bootstrap standard error $SE(\hat{\beta}_i^*) \approx \sqrt{\frac{n-p}{n}}\,SE(\hat{\beta}_{i,OLS})$.

If the $z_i = (Y_i, x_i^T)^T$ are iid observations from some population, then a sample of size $n$ can be drawn with replacement from $z_1, ..., z_n$. The response and predictor variables are then formed into a vector $Y_1^*$ and a design matrix $X_1^*$, and $Y_1^*$ is regressed on $X_1^*$, resulting in the estimator $\hat{\beta}_1^*$. This process is repeated $B$ times, resulting in the estimators $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$. If the $z_i$ are the rows of a matrix $Z$, then this rowwise bootstrap uses the empirical distribution of the $z_i$ (a minimal sketch is given at the end of this section). This method appears to need a larger sample size $n$ than the residual bootstrap when $n > 10p$, but may be useful if $n$ is large but $n < 5p$.

Following Seber and Lee (2003, p. 100), the classical test statistic for testing $H_0$ is
\[
F = \frac{(A\hat{\beta} - c)^T [MSE\,A(X^TX)^{-1}A^T]^{-1} (A\hat{\beta} - c)}{r},
\]
and when $H_0$ is true, $rF \stackrel{D}{\to} \chi_r^2$ for a large class of error distributions. The sample covariance matrix $S_w$ of the $w_i$ is estimating $\frac{n-p}{n} MSE\,A(X^TX)^{-1}A^T$, and $\overline{w} \approx 0$ when $H_0$ is true. Thus under $H_0$, the squared distance
\[
D_i^2 = (w_i - \overline{w})^T S_w^{-1} (w_i - \overline{w}) \approx \frac{n}{n-p}\,(A\hat{\beta}_i^* - c)^T [MSE\,A(X^TX)^{-1}A^T]^{-1} (A\hat{\beta}_i^* - c),
\]
and we expect $D_{(d_B)}^2 \approx \frac{n}{n-p}\,\chi_{r,1-\delta}^2$ for large $n$ and $B$ and small $p$. Hence the prediction region method is closely related to the Mammen (1993) suggestion of bootstrapping the test statistic $F$ for this model.
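A minimal R sketch of the rowwise (cases) bootstrap described earlier in this section follows (similar in spirit to the lregpack function rowboot; the function name row.boot is illustrative).

# rowwise bootstrap: resample the cases z_i = (Y_i, x_i^T)^T with replacement
row.boot <- function(x, y, B = 1000) {
  x <- as.matrix(x); n <- length(y)
  betas <- matrix(0, nrow = B, ncol = ncol(x) + 1)
  for (i in 1:B) {
    idx <- sample(1:n, n, replace = TRUE)
    betas[i, ] <- lsfit(x[idx, , drop = FALSE], y[idx])$coef
  }
  betas
}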
4 Bootstrapping the Variable Selection Estimator

Variable selection, also called subset or model selection, is the search for a subset of predictor variables that can be deleted without important loss of information. By treating a variable selection estimator $\hat{\beta}$ of $\beta$ as a shrinkage estimator, the bootstrap can be used to examine variable selection. Forward selection, backward elimination, stepwise regression, and all subsets variable selection can be used if there is a criterion that selects the submodel, such as AIC or $C_p$. Similar ideas can be used to bootstrap other shrinkage estimators.

Consider testing $H_0: A\beta = c$ where $A$ is an $r \times p$ matrix with full rank $r$. Now let $\hat{\beta}$ be a variable selection estimator of $\beta$. To perform the test, suppose a bootstrap sample $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$ has been generated. Form the prediction region (3) for $w_1 = A\hat{\beta}_1^* - c, ..., w_B = A\hat{\beta}_B^* - c$. If $0$ is in the prediction region, fail to reject $H_0$; otherwise reject $H_0$.

A model for variable selection in multiple linear regression can be described by
\[
Y = x^T\beta + e = \beta^T x + e = x_S^T\beta_S + x_E^T\beta_E + e = x_S^T\beta_S + e \tag{12}
\]
where $e$ is an error, $Y$ is the response variable, $x = (x_S^T, x_E^T)^T$ is a $p \times 1$ vector of predictors, $x_S$ is a $k_S \times 1$ vector, and $x_E$ is a $(p - k_S) \times 1$ vector. Given that $x_S$ is in the model, $\beta_E = 0$, and $E$ denotes the subset of terms that can be eliminated given that the subset $S$ is in the model. Since $S$ is unknown, candidate subsets will be examined. Following Olive and Hawkins (2005), let $x_I$ be the vector of $k$ terms from a candidate subset indexed by $I$, and let $x_O$ be the vector of the remaining predictors (out of the candidate submodel). Then
\[
Y = x_I^T\beta_I + x_O^T\beta_O + e. \tag{13}
\]
The model $Y = x^T\beta + e$ that uses all of the predictors is called the full model. A model $Y = x_I^T\beta_I + e$ that only uses a subset $x_I$ of the predictors is called a submodel. Suppose that $S$ is a subset of $I$ and that model (12) holds. Then
\[
x^T\beta = x_S^T\beta_S = x_S^T\beta_S + x_{I/S}^T\beta_{(I/S)} + x_O^T 0 = x_I^T\beta_I \tag{14}
\]
where $x_{I/S}$ denotes the predictors in $I$ that are not in $S$. Since this is true regardless of the values of the predictors, $\beta_O = 0$ if $S \subseteq I$.

For multiple linear regression, if the candidate model of $x_I$ has $k$ terms (including the constant), then the partial $F$ statistic for testing whether the $p - k$ predictor variables in $x_O$ can be deleted is
\[
F_I = \frac{SSE(I) - SSE}{(n-k) - (n-p)} \Big/ \frac{SSE}{n-p} = \frac{n-p}{p-k}\left[\frac{SSE(I)}{SSE} - 1\right]
\]
where $SSE$ is the error sum of squares from the full model and $SSE(I)$ is the error sum of squares from the candidate submodel. An important criterion for variable selection is the $C_p$ criterion
\[
C_p(I) = \frac{SSE(I)}{MSE} + 2k - n = (p - k)(F_I - 1) + k
\]
where $MSE$ is the error mean square for the full model. Olive and Hawkins (2005) show that submodels with $C_p(I) \leq \min(2k, p)$ are especially interesting. The AIC criterion is similar to $C_p$.

Suppose the variable selection method, such as forward selection or all subsets, produces $J$ models. Let model $I_{min}$ be the model that minimizes the criterion, e.g. $C_p(I)$ or $AIC(I)$. Following Seber and Lee (2003, p. 448) and Nishii (1984), the probability that model $I_{min}$ from $C_p$ or AIC underfits goes to zero as $n \to \infty$. Since there are a finite number of regression models $I$ that contain the true model, and each such model gives a consistent estimator of $\beta$, the probability that $I_{min}$ picks one of these models goes to one as $n \to \infty$. Hence $\hat{\beta}_{I_{min}}$ is a consistent estimator of $\beta$ under model (12).

Other automated variable selection methods may work better than $I_{min}$. For the $C_p$ criterion, find the submodel $I_I$ with the fewest predictors such that $C_p(I_I) \leq C_p(I_{min}) + 1$. For AIC, Burnham and Anderson (2004) suggest that if $\Delta(I) = AIC(I) - AIC(I_{min})$, then models with $\Delta(I) \leq 2$ are good; find the submodel $I_I$ with the smallest number of predictors such that $\Delta(I_I) \leq 2$. It is possible that $I_I = I_{min}$ or that $I_I$ is the full model. Do not use more predictors than model $I_I$, to avoid overfitting.

Suppose model $I$ is selected after variable selection. Then least squares output for the model $Y = X_I\beta_I + e$ can be obtained, but the least squares output is not correct for inference. In particular, $MSE(I)(X_I^TX_I)^{-1}$ is not the correct estimated covariance matrix of $\hat{\beta}_I$. The selected model tends to fit the data too well, so $SE(\hat{\beta}_i)$ from the incorrect estimated covariance matrix is too small. Hence the confidence intervals for $\beta_i$ are too short, and hypothesis tests reject $H_0: \beta_i = 0$ too often.

Hastie, Tibshirani, and Friedman (2009, p. 57) note that variable selection is a shrinkage estimator: the coefficients are shrunk to 0 for the omitted variables. Suppose $n > 10p$. If $\hat{\beta}_I$ is $k \times 1$, form $\hat{\beta}$ from $\hat{\beta}_I$ by adding 0s corresponding to the omitted variables. Then $\hat{\beta}$ is a nonlinear estimator of $\beta$, and the residual bootstrap method can be applied. For example, suppose $\hat{\beta}$ is formed from the model $I_{min}$ that minimizes $C_p$ from some variable selection method such as forward selection, backward elimination, stepwise selection, or all subsets variable selection. Instead of computing the least squares estimator from regressing $Y_i^*$ on $X$, perform variable selection on $Y_i^*$ and $X$, fit the model that minimizes the criterion, and add 0s corresponding to the omitted variables, resulting in the estimators $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$.
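A minimal R sketch of this variable selection bootstrap follows. It is similar in spirit to the lregpack function vselboot used in Section 5, but the function name vsel.boot is illustrative and the leaps package is assumed. Each bootstrap response vector is refit with the minimum $C_p$ model from all subsets variable selection, and 0s are filled in for the omitted variables.

# residual bootstrap of the minimum Cp all subsets variable selection estimator
library(leaps)
vsel.boot <- function(x, y, B = 1000) {
  x <- as.matrix(x); n <- length(y); p <- ncol(x) + 1
  fullfit <- lsfit(x, y)                    # full model least squares
  yhat <- y - fullfit$residuals
  betas <- matrix(0, nrow = B, ncol = p)
  for (i in 1:B) {
    ystar <- yhat + sample(fullfit$residuals, n, replace = TRUE)
    out <- leaps(x, ystar, method = "Cp")            # all subsets
    best <- out$which[which.min(out$Cp), ]           # predictors in the I_min model
    coefs <- lsfit(x[, best, drop = FALSE], ystar)$coef
    betas[i, c(TRUE, best)] <- coefs                 # add 0s for omitted variables
  }
  betas
}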
Before Efron (2014), inference techniques for the variable selection model had not had much success. Efron (2014) lets $t(Z)$ be a scalar valued statistic, based on all of the data $Z$, that estimates a parameter of interest $\mu$. Form a bootstrap sample $Z_i^*$ and $t(Z_i^*)$ for $i = 1, ..., B$. Then $\tilde{\mu} = s(Z) = \frac{1}{B}\sum_{i=1}^B t(Z_i^*)$, a "bootstrap smoothing" or "bagging" estimator. In the regression setting with variable selection, the $Z_i^*$ can be formed with the rowwise or residual bootstrap using the full model. The prediction region method can also be applied to $t(Z)$. For example, when $A$ is $1 \times p$, the prediction region method uses $\mu = A\beta - c$, $t(Z) = A\hat{\beta} - c$, and $\overline{T}^* = \tilde{\mu}$. Efron (2014) uses the confidence interval $\overline{T}^* \pm z_{1-\delta}\,SE(\overline{T}^*)$, which is symmetric about $\overline{T}^*$. The prediction region method uses $\overline{T}^* \pm S_T^* D_{(d_B)}$, which is also a symmetric interval centered at $\overline{T}^*$. If both the prediction region method and Efron's method give large sample confidence intervals for $\mu$, then they have the same asymptotic length (perhaps scaled by multiplying by $\sqrt{n}$), since otherwise the shorter interval would have lower asymptotic coverage. Since the prediction region interval is a percentile type interval, the shorth($k_B$) interval could have much shorter length than the Efron interval and the prediction region interval if the bootstrap distribution is not symmetric. The prediction region method can be used for vector valued statistics and parameters, and does not need the statistic to be asymptotically normal. These features are likely useful for variable selection models.

Prediction intervals and regions can have higher than the nominal coverage $1 - \delta$ if the distribution is discrete or a mixture of a discrete distribution and some other distribution. In particular, coverage can be high if the $w_i$ distribution is a mixture of a point mass at 0 and some other distribution, since the method checks whether 0 is in the prediction region. Such a mixture often occurs for variable selection methods and the lasso. The bootstrap sample for the $W_i = \hat{\beta}_{ij}^*$ can contain many zeroes and be highly skewed if the $j$th predictor is weak. Then the program may fail because $S_w$ is singular, but if all or nearly all of the $\hat{\beta}_{ij}^* = 0$, then there is strong evidence that the $j$th predictor is not needed, given that the other predictors are in the variable selection method.

As an extreme simulation case, suppose $\hat{\beta}_{ij}^* = 0$ for $i = 1, ..., B$ and for each run in the simulation. Consider testing $H_0: \beta_j = 0$. Then regardless of the nominal coverage $1 - \delta$, the closed interval $[0, 0]$ will contain 0 for each run, and the observed coverage will be $1 > 1 - \delta$. Using the open interval $(0, 0)$ would give observed coverage 0. Also, intervals $[0, b]$ and $[a, 0]$ correctly suggest failing to reject $\beta_j = 0$, while intervals $(0, b)$ and $(a, 0)$ incorrectly suggest rejecting $H_0: \beta_j = 0$. Hence closed regions and intervals make sense.
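For $r = 1$, the prediction region interval and the shorth($k_B$) interval discussed above can be computed from the same scalar bootstrap sample; a minimal R sketch follows (the function name r1.intervals is illustrative, and shorth.ci is the sketch from Section 1). Efron's interval requires his delta method standard error estimate and is not computed here.

# r = 1: prediction region interval and shorth interval from a scalar bootstrap sample
r1.intervals <- function(Tstar, delta = 0.05) {
  B <- length(Tstar)
  dB <- ceiling(B * (1 - delta))
  Tbar <- mean(Tstar); ST <- sd(Tstar)
  DdB <- sort(abs(Tstar - Tbar) / ST)[dB]                  # D_(dB) when r = 1
  list(predregion = c(Tbar - ST * DdB, Tbar + ST * DdB),   # symmetric about Tbar*
       shorth = shorth.ci(Tstar, delta))                   # need not be symmetric
}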
5 Example and Simulations

Example. Cook and Weisberg (1999, pp. 351, 433, 447) give a data set on 82 mussels sampled off the coast of New Zealand. Let the response variable be the logarithm log(M) of the muscle mass M, and let the predictors be the length L and height H of the shell in mm, the logarithm log(W) of the shell width W, the logarithm log(S) of the shell mass S, and a constant. Inference for the full model is shown below, along with the shorth(c) nominal 95% confidence intervals for $\beta_i$ computed using the rowwise and residual bootstraps. As expected, the residual bootstrap intervals are close to the classical least squares confidence intervals $\approx \hat{\beta}_i \pm 2\,SE(\hat{\beta}_i)$.

The minimum $C_p$ model used a constant, H, and log(S). The shorth(c) nominal 95% confidence intervals for $\beta_i$ using the residual bootstrap are shown. Note that the intervals for H and log(W) are right skewed and contain 0 when closed intervals are used instead of open intervals. It was expected that log(S) may be the only predictor needed, along with a constant, since log(S) and log(M) are both log(mass) measurements and likely highly correlated.

Hence we want to test $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ with the $I_{min}$ model selected by all subsets variable selection. (Of course this test would be easy to do with the full model using least squares theory.) Then $H_0: A\beta = (\beta_2, \beta_3, \beta_4)^T = 0$. Using the prediction region method with least squares gave the interval [0, 2.937] with $D_0 = 1.594$. Note that $\sqrt{\chi^2_{3,0.95}} = 2.795$. So fail to reject $H_0$. The prediction region method using $I_{min}$ had $[0, D_{(d_B)}] = [0, 3.282]$ while $D_0 = 1.137$. So fail to reject $H_0$.

large sample full model inference
          Estimate  Std.Err  t-value  Pr(>|t|)  rowboot           resboot
constant   -1.2493   0.8388  -1.4894    0.1405  [-2.720,-0.015]   [-3.065, 0.110]
L          -0.0006   0.0023  -0.2829    0.7780  [-0.005, 0.003]   [-0.005, 0.003]
logW        0.1298   0.3738   0.3471    0.7295  [-0.390, 0.710]   [-0.549, 0.885]
H           0.0075   0.0050   1.5044    0.1366  [-0.001, 0.017]   [-0.002, 0.016]
logS        0.6404   0.1686   3.7989    0.0003  [ 0.209, 1.025]   [ 0.337, 0.947]

incorrect min Cp submodel inference
          Estimate  Std.Err  t-value  Pr(>|t|)  95% shorth bootstrap CI
constant   -0.9573   0.1519  -6.3018    0.0000  [-3.214,-0.593]
L           0                                   [-0.005, 0.003]
logW        0                                   [ 0.000, 0.977]
H           0.0072   0.0047   1.5490    0.1254  [ 0.000, 0.015]
logS        0.6530   0.1160   5.6297    0.0000  [ 0.358, 0.933]

The R code used to produce the above output is shown below.

library(leaps)
y <- log(mussels[,5])
x <- mussels[,1:4]
x[,4] <- log(x[,4])
x[,2] <- log(x[,2])
out <- regboot(x,y,B=1000)
tem <- rowboot(x,y,B=1000)
outvs <- vselboot(x,y,B=1000)
# get bootstrap CIs
apply(out$betas,2,shorth2)
apply(tem$betas,2,shorth2)
apply(outvs$betas,2,shorth2)
ls.print(outvs$full)
ls.print(outvs$sub)
# test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4]    # prediction region method with residual bootstrap
predreg(Abeta)
Abeta <- outvs$betas[,2:4]  # prediction region method with Imin
predreg(Abeta)

A small simulation study was done in R using $B = \max(1000, n)$ and 5000 runs. The regression model used $\beta = (1, 1, 0, 0)^T$ with $n = 100$, $p = 4$, and various zero mean iid error distributions. The design matrix $X$ consisted of iid N(0,1) random variables. Hence the full model least squares confidence intervals for $\beta_i$ should have length near $2\,t_{99,0.975}\,\sigma/\sqrt{n} \approx 2(1.96)\sigma/10 = 0.392\sigma$ when the iid zero mean errors have variance $\sigma^2$. The simulation computed the shorth($k_B$) interval for each $\beta_i$ and used the prediction region method to test $H_0: \beta_3 = \beta_4 = 0$. The nominal coverage was 0.95 with $\delta = 0.05$. Observed coverage between 0.94 and 0.96 would suggest coverage is close to the nominal value. Observed coverage near 0.94 would not be surprising since with $B = 1000$, about 1% undercoverage is expected. The regression models used the residual bootstrap on the full model least squares estimator and on the all subsets variable selection estimator for the model $I_{min}$. The residuals were from least squares applied to the full model in both cases.
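A minimal R sketch of one cell of this simulation for the full model is given below (similar in spirit to the lregpack function regbootsim). It uses the sketch functions resid.boot, shorth.ci, and predregion.test defined earlier, and nruns is kept small here for speed.

# simulate shorth CI coverage for each beta_j and the prediction region test of H0: beta_3 = beta_4 = 0
sim.reg <- function(n = 100, p = 4, nruns = 100, delta = 0.05) {
  beta <- c(1, 1, rep(0, p - 2))
  B <- max(1000, n)
  cicov <- numeric(p); testcov <- 0
  for (run in 1:nruns) {
    x <- matrix(rnorm(n * (p - 1)), ncol = p - 1)       # iid N(0,1) predictors
    y <- drop(cbind(1, x) %*% beta) + rnorm(n)          # N(0,1) errors
    betas <- resid.boot(x, y, B = B)
    for (j in 1:p) {
      ci <- shorth.ci(betas[, j], delta)
      if (ci[1] <= beta[j] && beta[j] <= ci[2]) cicov[j] <- cicov[j] + 1
    }
    w <- betas[, 3:p, drop = FALSE]                     # w_i = (betahat*_i3, betahat*_i4)^T - 0
    if (!predregion.test(w, delta)$reject) testcov <- testcov + 1
  }
  list(cicov = cicov / nruns, testcov = testcov / nruns)
}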
Results are shown in Table 1 for iid errors $e_i \sim N(0, 1)$. Table 1 shows two rows for each model, giving the observed confidence interval coverages and average lengths of the confidence intervals. The term "reg" is for the full model regression, and the term "vs" is for the all subsets variable selection. The column for the test gives the coverage = P(fail to reject $H_0$) and the average length of the interval $[0, D_{(d_B)}]$, where $D_{(d_B)}$ is the cutoff for the prediction region. The volume of the prediction region will decrease to 0 as $n \to \infty$, and the cutoff will often be near $\sqrt{\chi^2_{r,0.95}}$ if the statistic $T$ is asymptotically normal. Note that $\sqrt{\chi^2_{2,0.95}} = 2.448$ is very close to the value 2.4491 for the full model regression bootstrap test.

The coverages were near 0.94 for the regression bootstrap on the full model. For $I_{min}$ the coverages were near 0.94 for $\beta_1$ and $\beta_2$, but higher for the other three tests since zeroes often occurred for $\hat{\beta}_j^*$ for $j = 3, 4$. The average lengths and coverages were similar for the full model and the all subsets variable selection $I_{min}$ for $\beta_1$ and $\beta_2$, but the lengths are shorter for $I_{min}$ for $\beta_3$ and $\beta_4$. Volumes of the hyperellipsoids were not computed, but the average cutoff of 2.69 for the variable selection test suggests that the test statistic was not multivariate normal, which is not surprising since many zeroes were produced for $\hat{\beta}_j^*$ for $j = 3, 4$.

Table 1: Bootstrapping Regression and Variable Selection

model     reg              vs
          cov      len     cov      len
β1        0.9370   0.3825  0.9346   0.3819
β2        0.9328   0.3843  0.9352   0.3850
β3        0.9386   0.3855  0.9962   0.3037
β4        0.9328   0.3848  0.9958   0.3033
test      0.9360   2.4491  0.9932   2.6879

Larger sample sizes $n$ are needed as $r$ increases. Olive (2013) suggested that for iid data $x_i$, where $x_i$ is $p \times 1$, the coverage started to get close to the nominal value when $n > 20p$, but volume ratios needed $n > 50p$.

Consider testing whether the correlations in a correlation matrix are 0. There are $r = p(p-1)/2$ correlations $\rho_{ij} = \mathrm{cor}(X_i, X_j)$ where $i < j$. The simulation generated iid data $w$ and set $x = Aw$, where $A_{ij} = \psi$ for $i \neq j$ and $A_{ii} = 1$. Hence $\mathrm{cor}(X_i, X_j) = [2\psi + (p-2)\psi^2]/[1 + (p-1)\psi^2]$ for $i \neq j$. Let $\mu = (\rho_{12}, ..., \rho_{1p}, \rho_{23}, ..., \rho_{2p}, ..., \rho_{p-1,p})^T$. Table 2 shows the results for multivariate normal data with $p = 4$, so $r = 6$, for testing $H_0: \mu = 0$. The nominal coverage was 0.95. For $n = 100$ and $\psi = 0$, the test failed to reject $H_0$ 84.5% of the time, but 93% of the time for $n = 400$. Note that $\sqrt{\chi^2_{6,0.95}} = 3.548$. With $n = 400$ and $\psi > 0$, the test coverage = 1 − power. For $\psi = 0.03$ the simulated power was 0.55, and the power was 1.0 for $\psi = 0.1$.

Table 2: Bootstrapping the Correlation Matrix

n         100             400             400             400
ψ         0               0               0.03            0.1
          cov     len     cov     len     cov     len     cov     len
ρ12       0.929   0.378   0.940   0.192   0.942   0.191   0.935   0.183
ρ13       0.928   0.378   0.938   0.191   0.946   0.191   0.939   0.183
ρ14       0.923   0.378   0.942   0.192   0.940   0.191   0.944   0.183
ρ23       0.929   0.379   0.942   0.192   0.945   0.191   0.939   0.183
ρ24       0.932   0.378   0.938   0.192   0.941   0.191   0.946   0.183
ρ34       0.927   0.378   0.943   0.192   0.949   0.191   0.944   0.183
test      0.845   3.551   0.930   3.559   0.450   3.559   0.000   3.561
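A minimal R sketch of the nonparametric (rowwise) bootstrap of the correlation matrix follows (similar in spirit to the lregpack functions corboot and corbootsim; the function name cor.boot is illustrative). The output can be fed to the prediction region method test sketched in Section 1.

# rowwise bootstrap of the p(p-1)/2 off-diagonal correlations
cor.boot <- function(x, B = 1000) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x)
  out <- matrix(0, nrow = B, ncol = p * (p - 1) / 2)
  for (i in 1:B) {
    R <- cor(x[sample(1:n, n, replace = TRUE), ])
    out[i, ] <- t(R)[lower.tri(R)]   # (rho_12, ..., rho_1p, rho_23, ..., rho_{p-1,p})
  }
  out
}
# test H0: all correlations are 0 with the prediction region method:
# predregion.test(cor.boot(x))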
6 Conclusions

Applications of the prediction region method are numerous, but may need $n \geq 50r$ and $B \geq \max(1000, n)$. A similar technique can be used to estimate a $100(1-\delta)\%$ Bayesian credible region for $\theta$: generate $B = \max(100000, n)$ values of $\hat{\theta}$ from the posterior distribution and compute the prediction region (3). Olive (2014a, p. 364) used the shorth estimator to estimate Bayesian credible intervals.

The mussels data was obtained from (http://lagrange.math.siu.edu/Olive/lregdata.txt). Simulations were done in R; see R Development Core Team (2011). The collection of R functions lregpack, available at (http://lagrange.math.siu.edu/Olive/lregpack.txt), has some useful functions for the prediction region method. The function vselboot bootstraps the minimum $C_p$ model from all subsets variable selection. The function shorth2 can be used to find the shorth intervals for the $\hat{\mu}_i$. The function predreg computes the prediction region and the Mahalanobis distance of the zero vector corresponding to $A\theta - c = 0$. The functions rowboot and regboot do the rowwise and residual bootstrap for the full model. The functions regbootsim and vsbootsim can be used to simulate the bootstrap tests for multiple linear regression and for the all subsets variable selection model that minimizes $C_p$. The functions corboot and corbootsim can be used to bootstrap the correlation matrix.

7 References

Burnham, K.P., and Anderson, D.R. (2004), "Multimodel Inference: Understanding AIC and BIC in Model Selection," Sociological Methods & Research, 33, 261-304.

Cook, R.D., and Weisberg, S. (1999), Applied Regression Including Computing and Graphics, Wiley, New York, NY.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, SIAM, Philadelphia, PA.

Efron, B. (2014), "Estimation and Accuracy After Model Selection," (with discussion), Journal of the American Statistical Association, 109, 991-1007.

Efron, B., and Tibshirani, R.J. (1998), "The Problem of Regions," The Annals of Statistics, 26, 1687-1718.

Frey, J. (2013), "Data-Driven Nonparametric Prediction Intervals," Journal of Statistical Planning and Inference, 143, 1039-1048.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York, NY.

Janssen, A., and Pauls, T. (2003), "How Do Bootstrap and Permutation Tests Work?," The Annals of Statistics, 31, 768-806.

Lei, J., Robins, J., and Wasserman, L. (2013), "Distribution-Free Prediction Sets," Journal of the American Statistical Association, 108, 278-287.

Mammen, E. (1992), "Bootstrap, Wild Bootstrap, and Asymptotic Normality," Probability Theory and Related Fields, 93, 439-455.

Mammen, E. (1993), "Bootstrap and Wild Bootstrap for High Dimensional Linear Models," The Annals of Statistics, 21, 255-285.

Nishii, R. (1984), "Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression," The Annals of Statistics, 12, 758-765.

Olive, D.J. (2013), "Asymptotically Optimal Regression Prediction Intervals and Prediction Regions for Multivariate Data," International Journal of Statistics and Probability, 2, 90-100.

Olive, D.J. (2014a), Statistical Theory and Inference, Springer, New York, NY.

Olive, D.J. (2014b), "Highest Density Region Prediction," unpublished manuscript at (http://lagrange.math.siu.edu/Olive/pphdrpred.pdf).

Olive, D.J., and Hawkins, D.M. (2005), "Variable Selection for 1D Regression Models," Technometrics, 47, 43-50.

Polansky, A.M. (2008), Observed Confidence Levels: Theory and Applications, Chapman & Hall/CRC, Boca Raton, FL.

R Development Core Team (2011), "R: a Language and Environment for Statistical Computing," R Foundation for Statistical Computing, Vienna, Austria, (www.R-project.org).

Seber, G.A.F., and Lee, A.J. (2003), Linear Regression Analysis, 2nd ed., Wiley, New York, NY.