Bootstrapping Hypothesis Tests

David J. Olive∗

Southern Illinois University

December 28, 2014

Abstract

Consider testing $H_0: \mu = c$ versus $H_1: \mu \neq c$ where $\mu$ is an $r \times 1$ parameter vector and $c$ is a known $r \times 1$ vector. Let $\hat{\mu}$ be a consistent estimator of $\mu$, make a bootstrap sample $w_i = \hat{\mu}_i^* - c$ for $i = 1, ..., B$, form a prediction region for the $w_i$, and determine whether $0$ is in the prediction region. It will be shown that this prediction region method generalizes the percentile method for $r = 1$ to $r \geq 1$. Bootstrapping test statistics is well known, and the prediction region method can be regarded as a special case where the test statistic is the squared Mahalanobis distance $D_i^{2*} = (T_i^* - \overline{T}^*)^T [S_T^*]^{-1} (T_i^* - \overline{T}^*)$ where $w_i = T_i^*$, and $\overline{T}^*$ and $S_T^*$ are the sample mean and sample covariance matrix of $T_1^*, ..., T_B^*$. There are many applications, including bootstrapping variable selection models and bootstrapping a correlation matrix.

KEY WORDS: bagging, bootstrap, confidence region, multiple linear regression, prediction region, variable selection

∗David J. Olive is Professor, Department of Mathematics, Southern Illinois University, Carbondale, IL 62901, USA.

1 Introduction

Consider testing $H_0: \mu = c$ versus $H_1: \mu \neq c$ where $c$ is a known $r \times 1$ vector. If a confidence region can be constructed for $\mu - c$, then fail to reject $H_0$ if $0$ is in the confidence region, and reject $H_0$ if $0$ is not in the confidence region.

Given training data $w_1, ..., w_n$, a large sample $100(1-\delta)\%$ prediction region for a future test value $w_f$ is a set $A_n$ such that $P(w_f \in A_n) \to 1 - \delta$ as $n \to \infty$, while a large sample confidence region for a parameter $\mu$ is a set $A_n$ such that $P(\mu \in A_n) \to 1 - \delta$ as $n \to \infty$. The duality between hypothesis tests and confidence regions is well known. For bootstrap samples, the percentile method is simultaneously a large sample confidence interval for $\mu$ and a large sample prediction interval for a future bootstrap statistic $T_{f,n}^*$. It is natural to try to extend the percentile method to vector valued statistics and parameters by using a large sample prediction region for a future bootstrap statistic $T_{f,n}^*$ as a large sample confidence region for $\mu$, but practical large sample prediction regions for dimensions greater than one with an unknown underlying distribution have only recently been developed. See Olive (2013, 2014b) and Lei, Robins, and Wasserman (2013).

The percentile method uses an interval that contains $d_B \approx k_B = \lceil B(1-\delta) \rceil$ of the $T_{i,n}^*$ from a bootstrap sample $T_{1,n}^*, ..., T_{B,n}^*$, where the statistic $T_{i,n}$ is an estimator of $\mu$ based on a sample of size $n$. Often the $n$ is suppressed. Here $\lceil x \rceil$ is the smallest integer $\geq x$, e.g. $\lceil 7.8 \rceil = 8$. Let $T_{(1)}^*, T_{(2)}^*, ..., T_{(B)}^*$ be the order statistics of the bootstrap sample. Then one version of the percentile method discards the largest and smallest $\lceil B\delta/2 \rceil$ order statistics, resulting in an interval $(L_B, U_B)$ that is a large sample $100(1-\delta)\%$ confidence interval for $\mu$ and also a large sample $100(1-\delta)\%$ prediction interval for a future bootstrap value $T_{f,n}^*$. Janssen and Pauls (2003) and Mammen (1992) suggest that the bootstrap works if there is a central limit theorem for the statistic $T_{i,n}$.

Olive (2014a, p. 283) recommends using the shorth($c$) estimator for the percentile method. Let $c = k_B$ and let $W_i = T_{i,n}^*$. Let $W_{(1)}, ..., W_{(B)}$ be the order statistics of the $W_i$. Compute $W_{(c)} - W_{(1)}, W_{(c+1)} - W_{(2)}, ..., W_{(B)} - W_{(B-c+1)}$, and let $[W_{(s)}, W_{(s+c-1)}]$ be the closed interval with the smallest width. Then reject $H_0: \mu = \mu_0$ if $\mu_0$ is not in the interval.
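The shorth($c$) interval is easy to compute. A minimal R sketch is given below; the function name shorth.ci and the exponential example data are illustrative only (the lregpack function shorth2 described in Section 6 serves a similar purpose), and Frey's (2013) corrected coverage proportion, discussed immediately after this sketch, is used for $c$.

# shorth(c) interval from a scalar bootstrap sample Tstar,
# using Frey's (2013) corrected coverage proportion
shorth.ci <- function(Tstar, delta = 0.05) {
  B <- length(Tstar)
  cc <- min(B, ceiling(B * (1 - delta + 1.12 * sqrt(delta / B))))
  W <- sort(Tstar)
  widths <- W[cc:B] - W[1:(B - cc + 1)]   # widths of intervals covering cc order statistics
  s <- which.min(widths)                  # shortest such closed interval
  c(W[s], W[s + cc - 1])
}

# Example: bootstrap a sample mean, then reject H0: mu = mu0 if mu0 is outside the interval.
set.seed(1)
x <- rexp(100)
Tstar <- replicate(1000, mean(sample(x, replace = TRUE)))
shorth.ci(Tstar)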
The shorth interval tends to be shorter than the interval that deletes the smallest and largest $\lceil B\delta/2 \rceil$ observations $W_i$ when the $W_i$ do not come from a symmetric distribution. Frey (2013) showed that for large $B\delta$ and iid data, the shorth($k_B$) prediction interval has undercoverage $\approx 1.12\sqrt{\delta/B}$, and used the shorth($c$) estimator as the large sample $100(1-\delta)\%$ prediction interval where $c = \lceil B[1 - \delta + 1.12\sqrt{\delta/B}\,] \rceil$. Hence if $B = 1000$, there may be about 1% undercoverage when $c = k_B$ is used.

Some notation is needed to give the Olive (2013) prediction region used to bootstrap a hypothesis test. Suppose $w_1, ..., w_n$ are iid $p \times 1$ random vectors with mean $\mu$ and nonsingular covariance matrix $\Sigma_w$. Let a future test observation $w_f$ be independent of the $w_i$ but from the same distribution. Let $(\overline{w}, S)$ be the sample mean and sample covariance matrix, where
\[
\overline{w} = \frac{1}{n}\sum_{i=1}^n w_i \quad \mbox{and} \quad S = S_w = \frac{1}{n-1}\sum_{i=1}^n (w_i - \overline{w})(w_i - \overline{w})^T. \tag{1}
\]
Then the $i$th squared sample Mahalanobis distance is the scalar
\[
D_w^2 = D_w^2(\overline{w}, S) = (w - \overline{w})^T S^{-1} (w - \overline{w}). \tag{2}
\]
Let $D_i^2 = D_{w_i}^2$ for each observation $w_i$, and let $D_{(c)}$ be the $c$th order statistic of $D_1, ..., D_n$. Following Olive (2013), a large sample $100(1-\delta)\%$ prediction region for $w_f$ is the hyperellipsoid
\[
A_n = \{w : D_w^2(\overline{w}, S) \leq D_{(c)}^2\} = \{w : D_w(\overline{w}, S) \leq D_{(c)}\}. \tag{3}
\]
If $n$ is large, one can use $c = k_n = \lceil n(1-\delta) \rceil$. If $n$ is not large, using $c = d_n$, where $d_n$ decreases to $k_n$, can improve small sample performance. Olive (2013) showed that (3) is a large sample $100(1-\delta)\%$ prediction region for a large class of distributions, although regions with smaller volumes may exist. The result follows since, if $\Sigma_w$ and $S$ are nonsingular, the Mahalanobis distance is a continuous function of $(\overline{w}, S)$. Let $D = D(\mu, \Sigma_w)$. Then $D_i \stackrel{D}{\to} D$ and $D_i^2 \stackrel{D}{\to} D^2$. Hence the sample percentiles of the $D_i$ are consistent estimators of the population percentiles of $D$ at continuity points of the cumulative distribution function (cdf) of $D$. The prediction region estimates the highest density region for a large class of elliptically contoured distributions. See Olive (2014b) for more on prediction regions.

The prediction region method makes a bootstrap sample $w_i = \hat{\mu}_i^* - c$ for $i = 1, ..., B$, forms the prediction region (3) for the $w_i$, and determines whether $0$ is in the prediction region. As shown below, the prediction region method is a special case of the percentile method and a special case of bootstrapping a test statistic.

Consider testing $H_0: \mu = c$ versus $H_1: \mu \neq c$, and the statistic $T_i = \hat{\mu} - c$. If $E(T_i) = \theta$ and $\mathrm{Cov}(T_i) = \Sigma_T$ were known, then the squared Mahalanobis distance $D_i^2(\theta, \Sigma_T) = (T_i - \theta)^T \Sigma_T^{-1} (T_i - \theta)$ would be a natural statistic to use if the percentile $D_{1-\delta}^2(\theta, \Sigma_T)$ were known. The prediction region method bootstraps the squared Mahalanobis distances, forming the bootstrap sample $w_i = T_i^* = \hat{\mu}_i^* - c$ and the squared Mahalanobis distances
\[
D_i^{2*} = D_i^2(\overline{T}^*, S_T^*) = (T_i^* - \overline{T}^*)^T [S_T^*]^{-1} (T_i^* - \overline{T}^*),
\]
where $\overline{T}^* = \frac{1}{B}\sum_{i=1}^B T_i^*$ and $S_T^* = \frac{1}{B-1}\sum_{i=1}^B (T_i^* - \overline{T}^*)(T_i^* - \overline{T}^*)^T$ are the sample mean and sample covariance matrix of $T_1^*, ..., T_B^*$. Then the percentile method is applied to the distances: the closed interval $[0, D_{(d_B)}] = [0, D_{(d_B)}^*]$ contains the smallest $d_B \approx B(1-\delta)$ of the distances. If $H_0$ is true and $E[\hat{\mu}] = c$, then $\theta = 0$. Let $D_0^2 = \overline{T}^{*T} [S_T^*]^{-1} \overline{T}^*$, and fail to reject $H_0$ if $D_0 \leq D_{(d_B)}$ and reject $H_0$ if $D_0 > D_{(d_B)}$.
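A minimal R sketch of this test is given below. It is similar in spirit to the lregpack function predreg described in Section 6, but the function name predregion.test and its arguments are illustrative. The input is a $B \times r$ matrix whose $i$th row is $w_i = \hat{\mu}_i^* - c$; the output gives $D_0$, the cutoff $D_{(d_B)}$, and the test decision.

# prediction region method test: w is a B x r matrix with rows w_i = muhat*_i - c
predregion.test <- function(w, delta = 0.05) {
  w <- as.matrix(w)
  B <- nrow(w); r <- ncol(w)
  Tbar <- colMeans(w)                              # sample mean of the w_i
  ST <- cov(w)                                     # sample covariance matrix S*_T
  D2 <- mahalanobis(w, center = Tbar, cov = ST)    # squared distances D_i^2
  dB <- ceiling(B * (1 - delta))                   # a Frey-type correction could also be used
  cutoff <- sqrt(sort(D2)[dB])                     # D_(dB)
  D0 <- sqrt(mahalanobis(rep(0, r), center = Tbar, cov = ST))
  list(D0 = D0, cutoff = cutoff, reject = (D0 > cutoff))
}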
This percentile method is equivalent to computing the prediction region (3) on the $w_i = T_i^*$ and checking whether $0$ is in the prediction region. Note that the percentile method makes an interval that contains $d_B \approx B(1-\delta)$ of the scalar valued $T_i^*$. The prediction region method makes a hyperellipsoid that contains $d_B$ of the $r \times 1$ vectors $T_i^* = w_i$, and equivalently makes an interval $[0, D_{(d_B)}]$ that contains $d_B$ of the $D_i^* = D_i$. When $r = 1$, a hyperellipsoid is an interval, so the prediction region method is a special case of the percentile method. Suppose the parameter of interest is $\mu$ and there is a bootstrap sample $T_1^*, ..., T_B^*$. Let $\overline{T}^*$ and $S_T^{2*}$ be the sample mean and variance of the $T_i^*$. Then the squared Mahalanobis distance condition $D_\mu^2 = (\mu - \overline{T}^*)^2 / S_T^{2*} \leq D_{(d_B)}^2$ is equivalent to $\mu \in [\overline{T}^* - S_T^* D_{(d_B)},\ \overline{T}^* + S_T^* D_{(d_B)}]$, an interval centered at $\overline{T}^*$ just long enough to cover $d_B \approx B(1-\delta)$ of the $T_i^*$. Hence this interval is a version of the percentile method.

Bootstrapping test statistics is well known, and the prediction region method is a special case of this method using $D_i^{2*} = D_i^2(\overline{T}^*, S_T^*)$ as the test statistic. The point of the above discussion is that the prediction region method can be thought of as a variant of two widely used methods. The method should be regarded as exploratory until theory proves that it is a large sample test, but similar remarks apply to other bootstrap methods such as the percentile method. Polansky (2008, p. 73) summarizes a bootstrap percentile confidence region suggested by Efron and Tibshirani (1998).

Section 2 explains why the prediction region method works, Section 3 examines the method for multiple linear regression, Section 4 examines the method for variable selection, and Section 5 gives an example and some simulations.

2 Why the Prediction Region Method Works

Several estimators of $\mu$ could be used, including $T_n = \hat{\mu}_n$, $T_n = t(\hat{\theta})$ where $\hat{\theta}$ is a $p \times 1$ vector and $\mu = t(\theta)$, or $T_n = t(W)$ where all of the data is collected into an $n \times k$ matrix $W$. If the rows of $W$ are iid cases $w_i$, then $t(W) = t(F)$ denotes that an iid sample from a distribution with cdf $F$ was used, while $t(W^*) = t(F_n) = T_n^*$ means that $t$ was computed from an iid sample from the empirical distribution: a sample of size $n$ drawn with replacement from the observed sample $w_1, ..., w_n$. It is often assumed that $t(F) - t(F_n) \stackrel{P}{\to} 0$. When the bootstrap is used, a large sample confidence region for a parameter $\mu$ is a set $A_{n,B}$ such that $P(\mu \in A_{n,B}) \to 1 - \delta$ as $n, B \to \infty$.

For this section, assume $n S_T^* \stackrel{P}{\to} \Sigma_T$ as $n, B \to \infty$ where $\Sigma_T$ and $S_T^*$ are nonsingular $r \times r$ matrices, $\mu$ is an $r \times 1$ parameter, and $T_n$ is an estimator of $\mu$ such that
\[
\sqrt{n}\,(T_n - \mu) \stackrel{D}{\to} X \tag{4}
\]
as $n \to \infty$. Then
\[
\sqrt{n}\,\Sigma_T^{-1/2}(T_n - \mu) \stackrel{D}{\to} \Sigma_T^{-1/2} X = Z, \quad
n\,(T_n - \mu)^T \hat{\Sigma}_T^{-1} (T_n - \mu) \stackrel{D}{\to} Z^T Z = D^2
\]
as $n \to \infty$, where $\hat{\Sigma}_T$ is a consistent estimator of $\Sigma_T$, and
\[
(T_n - \mu)^T [S_T^*]^{-1} (T_n - \mu) \stackrel{D}{\to} D^2 \tag{5}
\]
as $n, B \to \infty$. Hence
\[
\{w : (w - T_n)^T [S_T^*]^{-1} (w - T_n) \leq D_{1-\delta}^2\} = \{w : D_w^2(T_n, S_T^*) \leq D_{1-\delta}^2\} \tag{6}
\]
is a large sample $100(1-\delta)\%$ confidence region for $\mu$, where $P(D^2 \leq D_{1-\delta}^2) = 1 - \delta$. Often, by a central limit theorem or the multivariate delta method, $\sqrt{n}\,(T_n - \mu) \stackrel{D}{\to} N_r(0, \Sigma_T)$ and $D^2 \sim \chi_r^2$.

The prediction region method uses
\[
\{w : (w - \overline{T}^*)^T [S_T^*]^{-1} (w - \overline{T}^*) \leq D_{(d_B)}^2\} = \{w : D_w^2(\overline{T}^*, S_T^*) \leq D_{(d_B)}^2\}. \tag{7}
\]
Competitors include
\[
\{w : (w - T_n)^T [S_T^*]^{-1} (w - T_n) \leq D_{(d_B)}^2\} = \{w : D_w^2(T_n, S_T^*) \leq D_{(d_B)}^2\} \tag{8}
\]
and
\[
\{w : (w - T_n)^T [S_T^*]^{-1} (w - T_n) \leq \chi_{r,1-\delta}^2\} = \{w : D_w^2(T_n, S_T^*) \leq \chi_{r,1-\delta}^2\}. \tag{9}
\]
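As a small illustration of the difference between (7) and (9), the data-driven cutoff $D^2_{(d_B)}$ and the $\chi^2_{r,1-\delta}$ cutoff can be computed from the same bootstrap sample; a minimal R sketch follows (Tstar is an assumed $B \times r$ matrix of bootstrap statistics, and the function name cutoff.compare is hypothetical). Region (8) recenters (7) at $T_n$; only the cutoffs are compared here.

# compare the cutoff of region (7) with the chi-square cutoff of region (9)
cutoff.compare <- function(Tstar, delta = 0.05) {
  B <- nrow(Tstar); r <- ncol(Tstar)
  D2 <- mahalanobis(Tstar, colMeans(Tstar), cov(Tstar))
  dB <- ceiling(B * (1 - delta))
  c(D2.dB = sort(D2)[dB], chisq.cutoff = qchisq(1 - delta, df = r))
}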
Plugging bootstrap quantities into classical methods is not new, so (6) is not new. Often $D_{1-\delta}^2$ is unknown, so what is new is using $D_{(d_B)}^2$ to estimate $D_{1-\delta}^2$ instead of assuming $D^2 \sim \chi_r^2$ as in (9).

Suppose the $T_i^* = T_{i,n}^*$ are iid from some distribution with cdf $\tilde{F}_n$. For example, if $T_{i,n}^* = t(F_n)$ where iid samples from $F_n$ are used, then $\tilde{F}_n$ is the cdf of $t(F_n)$. Fix $n$, and let $E(T_{i,n}^*) = \mu_n$ and $\mathrm{Cov}(T_{i,n}^*) = \Sigma_n$. With respect to $\tilde{F}_n$, $\mu_n$ and $\Sigma_n$ are parameters, but with respect to $F$, $\mu_n$ is a random vector and $\Sigma_n$ is a random matrix. For example, using least squares and the residual bootstrap for the multiple linear regression model in Section 3, $\mu_n = \hat{\beta}$, $\Sigma_n = \frac{n-p}{n} MSE\,(X^TX)^{-1}$, and $\Sigma_T = \sigma^2 \lim_{n\to\infty} (X^TX/n)^{-1}$. Then by the multivariate central limit theorem,
\[
\sqrt{B}\,(\overline{T}^* - \mu_n) \stackrel{D}{\to} N_r(0, \Sigma_n) \quad \mbox{and} \quad B\,(\overline{T}^* - \mu_n)^T [S_T^*]^{-1} (\overline{T}^* - \mu_n) \stackrel{D}{\to} \chi_r^2
\]
as $B \to \infty$. If the $T_i^*$ are iid and
\[
D_{T_i^*}^2(\overline{T}^*, S_T^*) = (T_i^* - \overline{T}^*)^T [S_T^*]^{-1} (T_i^* - \overline{T}^*) \stackrel{D}{\to} D^2 \tag{10}
\]
as $n, B \to \infty$, then $D_{(d_B)}^2 \stackrel{P}{\to} D_{1-\delta}^2$. Equation (10) holds if
\[
D_{T_i^*}^2(\mu_n, S_T^*) = (T_i^* - \mu_n)^T [S_T^*]^{-1} (T_i^* - \mu_n) \stackrel{D}{\to} D^2 \tag{11}
\]
as $n, B \to \infty$ and $B^{-1/2}[S_T^*]^{-1/2}(T_i^* - \mu_n) \stackrel{P}{\to} 0$. Note that (11) holds if $\sqrt{n}\,(T_n - \mu) \stackrel{D}{\to} X$ and $\sqrt{n}\,(T_i^* - \mu_n) \stackrel{D}{\to} X$, which is often assumed with $\mu_n = T_n$. Following Efron (2014), $\overline{T}^*$ is the bagging or smoothed bootstrap estimator of $\mu$.

For the prediction region method (7) to be a large sample $100(1-\delta)\%$ confidence region for $\mu$, we need $D_{(d_B)}^2 \stackrel{P}{\to} D_{1-\delta}^2$ and $D_\mu^2(\overline{T}^*, S_T^*) \stackrel{D}{\to} D^2$ as $n, B \to \infty$. Now
\[
D_\mu^2(\overline{T}^*, S_T^*) = (\mu - T_n + T_n - \overline{T}^*)^T [S_T^*]^{-1} (\mu - T_n + T_n - \overline{T}^*)
\]
\[
= (\mu - T_n)^T [S_T^*]^{-1} (\mu - T_n) + 2(\mu - T_n)^T [S_T^*]^{-1} (T_n - \overline{T}^*) + (T_n - \overline{T}^*)^T [S_T^*]^{-1} (T_n - \overline{T}^*) \stackrel{D}{\to} D^2
\]
if the last two terms converge to 0 in probability as $n, B \to \infty$. This will occur, for example, if $[S_T^*]^{-1/2}(T_n - \overline{T}^*) \stackrel{P}{\to} 0$, or if the final quadratic term converges to 0 in probability as $n, B \to \infty$ (the cross term then also converges to 0 since the first term is bounded in probability). The final term is often $O_P(B^{-1})$, for example if $\mu_n = T_n$, as is often the case.

For $r = 1$, Efron (2014) uses confidence intervals $\overline{T}^* \pm z_{1-\delta}\,SE(\overline{T}^*)$ where $P(Z \leq z_{1-\delta}) = 1 - \delta$ if $Z \sim N(0,1)$. Efron uses a delta method estimate of $SE(\overline{T}^*)$ to avoid the computationally expensive double bootstrap. The prediction region method avoids assuming a normal limiting distribution and estimates the cutoff using quantiles of the Mahalanobis distances of the $T_{i,n}^*$ from $\overline{T}^*$.

3 Bootstrap Tests for Multiple Linear Regression

Consider the multiple linear regression model $Y_i = x_i^T\beta + e_i$ for $i = 1, ..., n$, written in matrix form as $Y = X\beta + e$ where $Y$ is $n \times 1$ and $X$ is $n \times p$. Consider testing $H_0: A\beta = c$ where $A$ is an $r \times p$ matrix with full rank $r$ and $\mu = A\beta$. To perform the test, suppose a bootstrap sample $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$ has been generated. Form the prediction region (3) for $w_1 = A\hat{\beta}_1^* - c, ..., w_B = A\hat{\beta}_B^* - c$. If $0$ is in the prediction region, fail to reject $H_0$; otherwise reject $H_0$. It is useful to compare the bootstrap tests with classical tests.

Methods for bootstrapping this model are well known. The estimated covariance matrix of the (ordinary) least squares estimator is
\[
\widehat{\mathrm{Cov}}(\hat{\beta}_{OLS}) = MSE\,(X^TX)^{-1}.
\]
The residual bootstrap computes the least squares estimator and obtains the $n$ residuals and fitted values $r_1, ..., r_n$ and $\hat{Y}_1, ..., \hat{Y}_n$. Then a sample of size $n$ is selected with replacement from the residuals, resulting in $r_{11}^*, ..., r_{1n}^*$. Hence the empirical distribution of the residuals is used. Then a vector $Y_1^* = (Y_{11}^*, ..., Y_{1n}^*)^T$ is formed where $Y_{1j}^* = \hat{Y}_j + r_{1j}^*$, and $Y_1^*$ is regressed on $X$, resulting in the estimator $\hat{\beta}_1^*$. This process is repeated $B$ times, resulting in the estimators $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$. This method should have $n > 10p$ so that the residuals $r_i$ are close to the errors $e_i$.
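A minimal R sketch of the residual bootstrap just described is given below (similar in spirit to the lregpack function regboot used in Section 5; the function name resid.boot is illustrative). Here x is an $n \times (p-1)$ matrix of nontrivial predictors, and a constant is included automatically by lsfit.

# residual bootstrap for least squares multiple linear regression
resid.boot <- function(x, y, B = 1000) {
  x <- as.matrix(x); n <- length(y)
  fit <- lsfit(x, y)                      # least squares fit of the full model (with intercept)
  yhat <- y - fit$residuals               # fitted values
  betas <- matrix(0, nrow = B, ncol = ncol(x) + 1)
  for (i in 1:B) {
    ystar <- yhat + sample(fit$residuals, n, replace = TRUE)   # Y*_ij = Yhat_j + r*_ij
    betas[i, ] <- lsfit(x, ystar)$coef                         # betahat*_i
  }
  betas
}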
Efron (1982, p. 36) notes that for the residual bootstrap, the sample covariance matrix of the $\hat{\beta}_i^*$ is estimating the population bootstrap matrix $\frac{n-p}{n} MSE\,(X^TX)^{-1}$ as $B \to \infty$. Hence the residual bootstrap standard error $SE(\hat{\beta}_i^*) \approx \sqrt{\frac{n-p}{n}}\,SE(\hat{\beta}_{i,OLS})$.

If the $z_i = (Y_i, x_i^T)^T$ are iid observations from some population, then a sample of size $n$ can be drawn with replacement from $z_1, ..., z_n$. The response and predictor variables are then formed into a vector $Y_1^*$ and a design matrix $X_1^*$, and $Y_1^*$ is regressed on $X_1^*$, resulting in the estimator $\hat{\beta}_1^*$. This process is repeated $B$ times, resulting in the estimators $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$. If the $z_i$ are the rows of a matrix $Z$, then this rowwise bootstrap uses the empirical distribution of the $z_i$ (a minimal sketch is given at the end of this section). This method appears to need a larger sample size $n$ than the residual bootstrap when $n > 10p$, but may be useful if $n$ is large but $n < 5p$.

Following Seber and Lee (2003, p. 100), the classical test statistic for testing $H_0$ is
\[
F = \frac{(A\hat{\beta} - c)^T [MSE\,A(X^TX)^{-1}A^T]^{-1} (A\hat{\beta} - c)}{r},
\]
and when $H_0$ is true, $rF \stackrel{D}{\to} \chi_r^2$ for a large class of error distributions. The sample covariance matrix $S_w$ of the $w_i$ is estimating $\frac{n-p}{n} MSE\,A(X^TX)^{-1}A^T$, and $\overline{w} \approx 0$ when $H_0$ is true. Thus under $H_0$, the squared distance
\[
D_i^2 = (w_i - \overline{w})^T S_w^{-1} (w_i - \overline{w}) \approx \frac{n}{n-p}\,(A\hat{\beta}_i^* - c)^T [MSE\,A(X^TX)^{-1}A^T]^{-1} (A\hat{\beta}_i^* - c),
\]
and we expect $D_{(d_B)}^2 \approx \frac{n}{n-p}\,\chi_{r,1-\delta}^2$ for large $n$ and $B$ and small $p$. Hence the prediction region method is closely related to the Mammen (1993) suggestion of bootstrapping the test statistic $F$ for this model.
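A minimal R sketch of the rowwise (cases) bootstrap described earlier in this section follows (similar in spirit to the lregpack function rowboot; the function name row.boot is illustrative).

# rowwise bootstrap: resample the cases z_i = (Y_i, x_i^T)^T with replacement
row.boot <- function(x, y, B = 1000) {
  x <- as.matrix(x); n <- length(y)
  betas <- matrix(0, nrow = B, ncol = ncol(x) + 1)
  for (i in 1:B) {
    idx <- sample(1:n, n, replace = TRUE)
    betas[i, ] <- lsfit(x[idx, , drop = FALSE], y[idx])$coef
  }
  betas
}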
4 Bootstrapping the Variable Selection Estimator

Variable selection, also called subset or model selection, is the search for a subset of predictor variables that can be deleted without important loss of information. By treating a variable selection estimator $\hat{\beta}$ of $\beta$ as a shrinkage estimator, the bootstrap can be used to examine variable selection. Forward selection, backward elimination, stepwise regression, and all subsets variable selection can be used if there is a criterion that selects the submodel, such as AIC or $C_p$. Similar ideas can be used to bootstrap other shrinkage estimators.

Consider testing $H_0: A\beta = c$ where $A$ is an $r \times p$ matrix with full rank $r$. Now let $\hat{\beta}$ be a variable selection estimator of $\beta$. To perform the test, suppose a bootstrap sample $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$ has been generated. Form the prediction region (3) for $w_1 = A\hat{\beta}_1^* - c, ..., w_B = A\hat{\beta}_B^* - c$. If $0$ is in the prediction region, fail to reject $H_0$; otherwise reject $H_0$.

A model for variable selection in multiple linear regression can be described by
\[
Y = x^T\beta + e = \beta^T x + e = x_S^T\beta_S + x_E^T\beta_E + e = x_S^T\beta_S + e \tag{12}
\]
where $e$ is an error, $Y$ is the response variable, $x = (x_S^T, x_E^T)^T$ is a $p \times 1$ vector of predictors, $x_S$ is a $k_S \times 1$ vector, and $x_E$ is a $(p - k_S) \times 1$ vector. Given that $x_S$ is in the model, $\beta_E = 0$, and $E$ denotes the subset of terms that can be eliminated given that the subset $S$ is in the model. Since $S$ is unknown, candidate subsets will be examined. Following Olive and Hawkins (2005), let $x_I$ be the vector of $k$ terms from a candidate subset indexed by $I$, and let $x_O$ be the vector of the remaining predictors (out of the candidate submodel). Then
\[
Y = x_I^T\beta_I + x_O^T\beta_O + e. \tag{13}
\]
The model $Y = x^T\beta + e$ that uses all of the predictors is called the full model. A model $Y = x_I^T\beta_I + e$ that only uses a subset $x_I$ of the predictors is called a submodel. Suppose that $S$ is a subset of $I$ and that model (12) holds. Then
\[
x^T\beta = x_S^T\beta_S = x_S^T\beta_S + x_{I/S}^T\beta_{(I/S)} + x_O^T 0 = x_I^T\beta_I \tag{14}
\]
where $x_{I/S}$ denotes the predictors in $I$ that are not in $S$. Since this is true regardless of the values of the predictors, $\beta_O = 0$ if $S \subseteq I$.

For multiple linear regression, if the candidate model of $x_I$ has $k$ terms (including the constant), then the partial $F$ statistic for testing whether the $p - k$ predictor variables in $x_O$ can be deleted is
\[
F_I = \frac{SSE(I) - SSE}{(n-k) - (n-p)} \Big/ \frac{SSE}{n-p} = \frac{n-p}{p-k}\left[\frac{SSE(I)}{SSE} - 1\right]
\]
where $SSE$ is the error sum of squares from the full model and $SSE(I)$ is the error sum of squares from the candidate submodel. An important criterion for variable selection is the $C_p$ criterion
\[
C_p(I) = \frac{SSE(I)}{MSE} + 2k - n = (p - k)(F_I - 1) + k
\]
where $MSE$ is the error mean square for the full model. Olive and Hawkins (2005) show that submodels with $C_p(I) \leq \min(2k, p)$ are especially interesting. The AIC criterion is similar to $C_p$.

Suppose the variable selection method, such as forward selection or all subsets, produces $J$ models. Let model $I_{min}$ be the model that minimizes the criterion, e.g. $C_p(I)$ or $AIC(I)$. Following Seber and Lee (2003, p. 448) and Nishii (1984), the probability that model $I_{min}$ from $C_p$ or AIC underfits goes to zero as $n \to \infty$. Since there are a finite number of regression models $I$ that contain the true model, and each such model gives a consistent estimator of $\beta$, the probability that $I_{min}$ picks one of these models goes to one as $n \to \infty$. Hence $\hat{\beta}_{I_{min}}$ is a consistent estimator of $\beta$ under model (12).

Other automated variable selection methods may work better than $I_{min}$. For the $C_p$ criterion, find the submodel $I_I$ with the fewest predictors such that $C_p(I_I) \leq C_p(I_{min}) + 1$. For AIC, Burnham and Anderson (2004) suggest that if $\Delta(I) = AIC(I) - AIC(I_{min})$, then models with $\Delta(I) \leq 2$ are good; find the submodel $I_I$ with the smallest number of predictors such that $\Delta(I_I) \leq 2$. It is possible that $I_I = I_{min}$ or that $I_I$ is the full model. Do not use more predictors than model $I_I$, to avoid overfitting.

Suppose model $I$ is selected after variable selection. Then least squares output for the model $Y = X_I\beta_I + e$ can be obtained, but the least squares output is not correct for inference. In particular, $MSE(I)(X_I^TX_I)^{-1}$ is not the correct estimated covariance matrix of $\hat{\beta}_I$. The selected model tends to fit the data too well, so $SE(\hat{\beta}_i)$ from the incorrect estimated covariance matrix is too small. Hence the confidence intervals for $\beta_i$ are too short, and hypothesis tests reject $H_0: \beta_i = 0$ too often.

Hastie, Tibshirani, and Friedman (2009, p. 57) note that variable selection is a shrinkage estimator: the coefficients are shrunk to 0 for the omitted variables. Suppose $n > 10p$. If $\hat{\beta}_I$ is $k \times 1$, form $\hat{\beta}$ from $\hat{\beta}_I$ by adding 0s corresponding to the omitted variables. Then $\hat{\beta}$ is a nonlinear estimator of $\beta$, and the residual bootstrap method can be applied. For example, suppose $\hat{\beta}$ is formed from the model $I_{min}$ that minimizes $C_p$ from some variable selection method such as forward selection, backward elimination, stepwise selection, or all subsets variable selection. Instead of computing the least squares estimator from regressing $Y_i^*$ on $X$, perform variable selection on $Y_i^*$ and $X$, fit the model that minimizes the criterion, and add 0s corresponding to the omitted variables, resulting in the estimators $\hat{\beta}_1^*, ..., \hat{\beta}_B^*$.
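A minimal R sketch of this variable selection bootstrap follows. It is similar in spirit to the lregpack function vselboot used in Section 5, but the function name vsel.boot is illustrative and the leaps package is assumed. Each bootstrap response vector is refit with the minimum $C_p$ model from all subsets variable selection, and 0s are filled in for the omitted variables.

# residual bootstrap of the minimum Cp all subsets variable selection estimator
library(leaps)
vsel.boot <- function(x, y, B = 1000) {
  x <- as.matrix(x); n <- length(y); p <- ncol(x) + 1
  fullfit <- lsfit(x, y)                    # full model least squares
  yhat <- y - fullfit$residuals
  betas <- matrix(0, nrow = B, ncol = p)
  for (i in 1:B) {
    ystar <- yhat + sample(fullfit$residuals, n, replace = TRUE)
    out <- leaps(x, ystar, method = "Cp")            # all subsets
    best <- out$which[which.min(out$Cp), ]           # predictors in the I_min model
    coefs <- lsfit(x[, best, drop = FALSE], ystar)$coef
    betas[i, c(TRUE, best)] <- coefs                 # add 0s for omitted variables
  }
  betas
}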
Before Efron (2014), inference techniques for the variable selection model had not had much success. Efron (2014) lets $t(Z)$ be a scalar valued statistic, based on all of the data $Z$, that estimates a parameter of interest $\mu$. Form a bootstrap sample $Z_i^*$ and $t(Z_i^*)$ for $i = 1, ..., B$. Then $\tilde{\mu} = s(Z) = \frac{1}{B}\sum_{i=1}^B t(Z_i^*)$, a "bootstrap smoothing" or "bagging" estimator. In the regression setting with variable selection, the $Z_i^*$ can be formed with the rowwise or residual bootstrap using the full model. The prediction region method can also be applied to $t(Z)$. For example, when $A$ is $1 \times p$, the prediction region method uses $\mu = A\beta - c$, $t(Z) = A\hat{\beta} - c$, and $\overline{T}^* = \tilde{\mu}$. Efron (2014) uses the confidence interval $\overline{T}^* \pm z_{1-\delta}\,SE(\overline{T}^*)$, which is symmetric about $\overline{T}^*$. The prediction region method uses $\overline{T}^* \pm S_T^* D_{(d_B)}$, which is also a symmetric interval centered at $\overline{T}^*$. If both the prediction region method and Efron's method give large sample confidence intervals for $\mu$, then they have the same asymptotic length (perhaps scaled by multiplying by $\sqrt{n}$), since otherwise the shorter interval would have lower asymptotic coverage. Since the prediction region interval is a percentile type interval, the shorth($k_B$) interval could have much shorter length than the Efron interval and the prediction region interval if the bootstrap distribution is not symmetric. The prediction region method can be used for vector valued statistics and parameters, and does not need the statistic to be asymptotically normal. These features are likely useful for variable selection models.

Prediction intervals and regions can have higher than the nominal coverage $1 - \delta$ if the distribution is discrete or a mixture of a discrete distribution and some other distribution. In particular, coverage can be high if the $w_i$ distribution is a mixture of a point mass at 0 and some other distribution, since the method checks whether 0 is in the prediction region. Such a mixture often occurs for variable selection methods and the lasso. The bootstrap sample for the $W_i = \hat{\beta}_{ij}^*$ can contain many zeroes and be highly skewed if the $j$th predictor is weak. Then the program may fail because $S_w$ is singular, but if all or nearly all of the $\hat{\beta}_{ij}^* = 0$, then there is strong evidence that the $j$th predictor is not needed, given that the other predictors are in the variable selection method.

As an extreme simulation case, suppose $\hat{\beta}_{ij}^* = 0$ for $i = 1, ..., B$ and for each run in the simulation. Consider testing $H_0: \beta_j = 0$. Then regardless of the nominal coverage $1 - \delta$, the closed interval $[0, 0]$ will contain 0 for each run, and the observed coverage will be $1 > 1 - \delta$. Using the open interval $(0, 0)$ would give observed coverage 0. Also, intervals $[0, b]$ and $[a, 0]$ correctly suggest failing to reject $\beta_j = 0$, while intervals $(0, b)$ and $(a, 0)$ incorrectly suggest rejecting $H_0: \beta_j = 0$. Hence closed regions and intervals make sense.
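For $r = 1$, the prediction region interval and the shorth($k_B$) interval discussed above can be computed from the same scalar bootstrap sample; a minimal R sketch follows (the function name r1.intervals is illustrative, and shorth.ci is the sketch from Section 1). Efron's interval requires his delta method standard error estimate and is not computed here.

# r = 1: prediction region interval and shorth interval from a scalar bootstrap sample
r1.intervals <- function(Tstar, delta = 0.05) {
  B <- length(Tstar)
  dB <- ceiling(B * (1 - delta))
  Tbar <- mean(Tstar); ST <- sd(Tstar)
  DdB <- sort(abs(Tstar - Tbar) / ST)[dB]                  # D_(dB) when r = 1
  list(predregion = c(Tbar - ST * DdB, Tbar + ST * DdB),   # symmetric about Tbar*
       shorth = shorth.ci(Tstar, delta))                   # need not be symmetric
}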
5 Example and Simulations

Example. Cook and Weisberg (1999, pp. 351, 433, 447) give a data set on 82 mussels sampled off the coast of New Zealand. Let the response variable be the logarithm log(M) of the muscle mass M, and let the predictors be the length L and height H of the shell in mm, the logarithm log(W) of the shell width W, the logarithm log(S) of the shell mass S, and a constant. Inference for the full model is shown below, along with the shorth(c) nominal 95% confidence intervals for $\beta_i$ computed using the rowwise and residual bootstraps. As expected, the residual bootstrap intervals are close to the classical least squares confidence intervals $\approx \hat{\beta}_i \pm 2\,SE(\hat{\beta}_i)$.

The minimum $C_p$ model used a constant, H, and log(S). The shorth(c) nominal 95% confidence intervals for $\beta_i$ using the residual bootstrap are shown. Note that the intervals for H and log(W) are right skewed and contain 0 when closed intervals are used instead of open intervals. It was expected that log(S) may be the only predictor needed, along with a constant, since log(S) and log(M) are both log(mass) measurements and likely highly correlated.

Hence we want to test $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ with the $I_{min}$ model selected by all subsets variable selection. (Of course this test would be easy to do with the full model using least squares theory.) Then $H_0: A\beta = (\beta_2, \beta_3, \beta_4)^T = 0$. Using the prediction region method with least squares gave the interval [0, 2.937] with $D_0 = 1.594$. Note that $\sqrt{\chi^2_{3,0.95}} = 2.795$. So fail to reject $H_0$. The prediction region method using $I_{min}$ had $[0, D_{(d_B)}] = [0, 3.282]$ while $D_0 = 1.137$. So fail to reject $H_0$.

large sample full model inference
          Estimate  Std.Err  t-value  Pr(>|t|)  rowboot           resboot
constant   -1.2493   0.8388  -1.4894    0.1405  [-2.720,-0.015]   [-3.065, 0.110]
L          -0.0006   0.0023  -0.2829    0.7780  [-0.005, 0.003]   [-0.005, 0.003]
logW        0.1298   0.3738   0.3471    0.7295  [-0.390, 0.710]   [-0.549, 0.885]
H           0.0075   0.0050   1.5044    0.1366  [-0.001, 0.017]   [-0.002, 0.016]
logS        0.6404   0.1686   3.7989    0.0003  [ 0.209, 1.025]   [ 0.337, 0.947]

incorrect min Cp submodel inference
          Estimate  Std.Err  t-value  Pr(>|t|)  95% shorth bootstrap CI
constant   -0.9573   0.1519  -6.3018    0.0000  [-3.214,-0.593]
L           0                                   [-0.005, 0.003]
logW        0                                   [ 0.000, 0.977]
H           0.0072   0.0047   1.5490    0.1254  [ 0.000, 0.015]
logS        0.6530   0.1160   5.6297    0.0000  [ 0.358, 0.933]

The R code used to produce the above output is shown below.

library(leaps)
y <- log(mussels[,5])
x <- mussels[,1:4]
x[,4] <- log(x[,4])
x[,2] <- log(x[,2])
out <- regboot(x,y,B=1000)
tem <- rowboot(x,y,B=1000)
outvs <- vselboot(x,y,B=1000)
# get bootstrap CIs
apply(out$betas,2,shorth2)
apply(tem$betas,2,shorth2)
apply(outvs$betas,2,shorth2)
ls.print(outvs$full)
ls.print(outvs$sub)
# test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4]    # prediction region method with residual bootstrap
predreg(Abeta)
Abeta <- outvs$betas[,2:4]  # prediction region method with Imin
predreg(Abeta)

A small simulation study was done in R using $B = \max(1000, n)$ and 5000 runs. The regression model used $\beta = (1, 1, 0, 0)^T$ with $n = 100$, $p = 4$, and various zero mean iid error distributions. The design matrix $X$ consisted of iid N(0,1) random variables. Hence the full model least squares confidence intervals for $\beta_i$ should have length near $2\,t_{99,0.975}\,\sigma/\sqrt{n} \approx 2(1.96)\sigma/10 = 0.392\sigma$ when the iid zero mean errors have variance $\sigma^2$. The simulation computed the shorth($k_B$) interval for each $\beta_i$ and used the prediction region method to test $H_0: \beta_3 = \beta_4 = 0$. The nominal coverage was 0.95 with $\delta = 0.05$. Observed coverage between 0.94 and 0.96 would suggest coverage is close to the nominal value. Observed coverage near 0.94 would not be surprising since with $B = 1000$, about 1% undercoverage is expected. The regression models used the residual bootstrap on the full model least squares estimator and on the all subsets variable selection estimator for the model $I_{min}$. The residuals were from least squares applied to the full model in both cases.
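A minimal R sketch of one cell of this simulation for the full model is given below (similar in spirit to the lregpack function regbootsim). It uses the sketch functions resid.boot, shorth.ci, and predregion.test defined earlier, and nruns is kept small here for speed.

# simulate shorth CI coverage for each beta_j and the prediction region test of H0: beta_3 = beta_4 = 0
sim.reg <- function(n = 100, p = 4, nruns = 100, delta = 0.05) {
  beta <- c(1, 1, rep(0, p - 2))
  B <- max(1000, n)
  cicov <- numeric(p); testcov <- 0
  for (run in 1:nruns) {
    x <- matrix(rnorm(n * (p - 1)), ncol = p - 1)       # iid N(0,1) predictors
    y <- drop(cbind(1, x) %*% beta) + rnorm(n)          # N(0,1) errors
    betas <- resid.boot(x, y, B = B)
    for (j in 1:p) {
      ci <- shorth.ci(betas[, j], delta)
      if (ci[1] <= beta[j] && beta[j] <= ci[2]) cicov[j] <- cicov[j] + 1
    }
    w <- betas[, 3:p, drop = FALSE]                     # w_i = (betahat*_i3, betahat*_i4)^T - 0
    if (!predregion.test(w, delta)$reject) testcov <- testcov + 1
  }
  list(cicov = cicov / nruns, testcov = testcov / nruns)
}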
Results are shown in Table 1 for iid errors $e_i \sim N(0, 1)$. Table 1 shows two rows for each model, giving the observed confidence interval coverages and average lengths of the confidence intervals. The term "reg" is for the full model regression, and the term "vs" is for the all subsets variable selection. The column for the test gives the coverage = P(fail to reject $H_0$) and the average length of the interval $[0, D_{(d_B)}]$, where $D_{(d_B)}$ is the cutoff for the prediction region. The volume of the prediction region will decrease to 0 as $n \to \infty$, and the cutoff will often be near $\sqrt{\chi^2_{r,0.95}}$ if the statistic $T$ is asymptotically normal. Note that $\sqrt{\chi^2_{2,0.95}} = 2.448$ is very close to the value 2.4491 for the full model regression bootstrap test.

The coverages were near 0.94 for the regression bootstrap on the full model. For $I_{min}$ the coverages were near 0.94 for $\beta_1$ and $\beta_2$, but higher for the other three tests since zeroes often occurred for $\hat{\beta}_j^*$ for $j = 3, 4$. The average lengths and coverages were similar for the full model and the all subsets variable selection $I_{min}$ for $\beta_1$ and $\beta_2$, but the lengths are shorter for $I_{min}$ for $\beta_3$ and $\beta_4$. Volumes of the hyperellipsoids were not computed, but the average cutoff of 2.69 for the variable selection test suggests that the test statistic was not multivariate normal, which is not surprising since many zeroes were produced for $\hat{\beta}_j^*$ for $j = 3, 4$.

Table 1: Bootstrapping Regression and Variable Selection

model     reg              vs
          cov      len     cov      len
β1        0.9370   0.3825  0.9346   0.3819
β2        0.9328   0.3843  0.9352   0.3850
β3        0.9386   0.3855  0.9962   0.3037
β4        0.9328   0.3848  0.9958   0.3033
test      0.9360   2.4491  0.9932   2.6879

Larger sample sizes $n$ are needed as $r$ increases. Olive (2013) suggested that for iid data $x_i$, where $x_i$ is $p \times 1$, the coverage started to get close to the nominal value when $n > 20p$, but volume ratios needed $n > 50p$.

Consider testing whether the correlations in a correlation matrix are 0. There are $r = p(p-1)/2$ correlations $\rho_{ij} = \mathrm{cor}(X_i, X_j)$ where $i < j$. The simulation generated iid data $w$ and set $x = Aw$, where $A_{ij} = \psi$ for $i \neq j$ and $A_{ii} = 1$. Hence $\mathrm{cor}(X_i, X_j) = [2\psi + (p-2)\psi^2]/[1 + (p-1)\psi^2]$ for $i \neq j$. Let $\mu = (\rho_{12}, ..., \rho_{1p}, \rho_{23}, ..., \rho_{2p}, ..., \rho_{p-1,p})^T$. Table 2 shows the results for multivariate normal data with $p = 4$, so $r = 6$, for testing $H_0: \mu = 0$. The nominal coverage was 0.95. For $n = 100$ and $\psi = 0$, the test failed to reject $H_0$ 84.5% of the time, but 93% of the time for $n = 400$. Note that $\sqrt{\chi^2_{6,0.95}} = 3.548$. With $n = 400$ and $\psi > 0$, the test coverage = 1 − power. For $\psi = 0.03$ the simulated power was 0.55, and the power was 1.0 for $\psi = 0.1$.

Table 2: Bootstrapping the Correlation Matrix

n         100             400             400             400
ψ         0               0               0.03            0.1
          cov     len     cov     len     cov     len     cov     len
ρ12       0.929   0.378   0.940   0.192   0.942   0.191   0.935   0.183
ρ13       0.928   0.378   0.938   0.191   0.946   0.191   0.939   0.183
ρ14       0.923   0.378   0.942   0.192   0.940   0.191   0.944   0.183
ρ23       0.929   0.379   0.942   0.192   0.945   0.191   0.939   0.183
ρ24       0.932   0.378   0.938   0.192   0.941   0.191   0.946   0.183
ρ34       0.927   0.378   0.943   0.192   0.949   0.191   0.944   0.183
test      0.845   3.551   0.930   3.559   0.450   3.559   0.000   3.561
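A minimal R sketch of the nonparametric (rowwise) bootstrap of the correlation matrix follows (similar in spirit to the lregpack functions corboot and corbootsim; the function name cor.boot is illustrative). The output can be fed to the prediction region method test sketched in Section 1.

# rowwise bootstrap of the p(p-1)/2 off-diagonal correlations
cor.boot <- function(x, B = 1000) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x)
  out <- matrix(0, nrow = B, ncol = p * (p - 1) / 2)
  for (i in 1:B) {
    R <- cor(x[sample(1:n, n, replace = TRUE), ])
    out[i, ] <- t(R)[lower.tri(R)]   # (rho_12, ..., rho_1p, rho_23, ..., rho_{p-1,p})
  }
  out
}
# test H0: all correlations are 0 with the prediction region method:
# predregion.test(cor.boot(x))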
6 Conclusions

Applications of the prediction region method are numerous, but may need $n \geq 50r$ and $B \geq \max(1000, n)$. A similar technique can be used to estimate a $100(1-\delta)\%$ Bayesian credible region for $\theta$: generate $B = \max(100000, n)$ values of $\hat{\theta}$ from the posterior distribution and compute the prediction region (3). Olive (2014a, p. 364) used the shorth estimator to estimate Bayesian credible intervals.

The mussels data was obtained from (http://lagrange.math.siu.edu/Olive/lregdata.txt). Simulations were done in R; see R Development Core Team (2011). The collection of R functions lregpack, available at (http://lagrange.math.siu.edu/Olive/lregpack.txt), has some useful functions for the prediction region method. The function vselboot bootstraps the minimum $C_p$ model from all subsets variable selection. The function shorth2 can be used to find the shorth intervals for the $\hat{\mu}_i$. The function predreg computes the prediction region and the Mahalanobis distance of the zero vector corresponding to $A\theta - c = 0$. The functions rowboot and regboot do the rowwise and residual bootstrap for the full model. The functions regbootsim and vsbootsim can be used to simulate the bootstrap tests for multiple linear regression and for the all subsets variable selection model that minimizes $C_p$. The functions corboot and corbootsim can be used to bootstrap the correlation matrix.

7 References

Burnham, K.P., and Anderson, D.R. (2004), "Multimodel Inference: Understanding AIC and BIC in Model Selection," Sociological Methods & Research, 33, 261-304.

Cook, R.D., and Weisberg, S. (1999), Applied Regression Including Computing and Graphics, Wiley, New York, NY.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, SIAM, Philadelphia, PA.

Efron, B. (2014), "Estimation and Accuracy After Model Selection," (with discussion), Journal of the American Statistical Association, 109, 991-1007.

Efron, B., and Tibshirani, R.J. (1998), "The Problem of Regions," The Annals of Statistics, 26, 1687-1718.

Frey, J. (2013), "Data-Driven Nonparametric Prediction Intervals," Journal of Statistical Planning and Inference, 143, 1039-1048.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York, NY.

Janssen, A., and Pauls, T. (2003), "How Do Bootstrap and Permutation Tests Work?," The Annals of Statistics, 31, 768-806.

Lei, J., Robins, J., and Wasserman, L. (2013), "Distribution-Free Prediction Sets," Journal of the American Statistical Association, 108, 278-287.

Mammen, E. (1992), "Bootstrap, Wild Bootstrap, and Asymptotic Normality," Probability Theory and Related Fields, 93, 439-455.

Mammen, E. (1993), "Bootstrap and Wild Bootstrap for High Dimensional Linear Models," The Annals of Statistics, 21, 255-285.

Nishii, R. (1984), "Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression," The Annals of Statistics, 12, 758-765.

Olive, D.J. (2013), "Asymptotically Optimal Regression Prediction Intervals and Prediction Regions for Multivariate Data," International Journal of Statistics and Probability, 2, 90-100.

Olive, D.J. (2014a), Statistical Theory and Inference, Springer, New York, NY.

Olive, D.J. (2014b), "Highest Density Region Prediction," unpublished manuscript at (http://lagrange.math.siu.edu/Olive/pphdrpred.pdf).

Olive, D.J., and Hawkins, D.M. (2005), "Variable Selection for 1D Regression Models," Technometrics, 47, 43-50.

Polansky, A.M. (2008), Observed Confidence Levels: Theory and Applications, Chapman & Hall/CRC, Boca Raton, FL.

R Development Core Team (2011), "R: a Language and Environment for Statistical Computing," R Foundation for Statistical Computing, Vienna, Austria, (www.R-project.org).

Seber, G.A.F., and Lee, A.J. (2003), Linear Regression Analysis, 2nd ed., Wiley, New York, NY.