
Journal of Nonparametric Statistics, 2013
Vol. 25, No. 4, 829–853, http://dx.doi.org/10.1080/10485252.2013.810742
Large sample results for varying kernel regression estimates
Hira L. Koul^a and Weixing Song^b,*
^a Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA;
^b Department of Statistics, Kansas State University, Manhattan, KS, USA
(Received 16 December 2012; accepted 23 May 2013)
The varying kernel density estimates are particularly designed for positive random variables. Unlike the
commonly used symmetric kernel density estimates, the varying kernel density estimates do not suffer from
the boundary problem. This paper establishes asymptotic normality and uniform almost sure convergence
results for a varying kernel density estimate when the underlying random variable is positive. Similar
results are also obtained for a varying kernel nonparametric estimate of the regression function when the
covariate is positive. Pros and cons of the varying kernel regression estimate are also discussed via a
simulation study.
Keywords: varying kernel regression; inverse gamma distribution; almost sure convergence; central limit
theorem
AMS Subject Classifications: 62G08; 62G20
1. Introduction
In this paper, we investigate consistency and asymptotic normality of a varying kernel density
estimator when the density function is supported on (0, ∞). We also propose a varying kernel
regression function estimator when a covariate in the underlying regression model is non-negative
and investigate its similar asymptotic properties.
The problem of estimating density function of a random variable X taking values on the real
line has been of a long-lasting research interest among statisticians, and numerous interesting
and fundamental results have been obtained. Kernel density estimation method no doubt is the
most popular among all the proposed nonparametric procedures. In the commonly used kernel
estimation setup, the kernel function K is often chosen to be a density function symmetric around 0
satisfying some moment conditions.
With a random sample X1, X2, . . . , Xn of X and a bandwidth h depending on n, the kernel estimate of the density function f of X is \(\hat f(x) = (nh)^{-1}\sum_{i=1}^n K((x - X_i)/h)\). The contribution from each
sample point Xi to fˆ (x) is mainly controlled by how far Xi is from x on the h-scale. Moreover,
because of the symmetry of K, the sample points on either sides of x with the same distance
from x make the same contribution to the estimate. Consequently, symmetric kernel assigns
positive weights outside the density support set near the boundaries, which is also the very reason
why the commonly used symmetric kernel density estimates have the unpleasant boundary
problem. This boundary problem is also present in the Nadaraya–Watson (N–W) estimators of a
nonparametric regression function. Numerous ways have been proposed to remove the boundary
effect. In the context of density estimation, see Schuster (1985), Marron and Ruppert (1994), Jones
(1993), Fan and Gijbels (1992), Cowling and Hall (1996), etc.; for nonparametric regression, see
Gasser and Müller (1979), Müller (1991), Müller and Wang (1994), John (1984), and references
therein.
*Corresponding author. Email: [email protected]
© American Statistical Association and Taylor & Francis 2013
The research on estimating density functions not supported on the entire real line using
asymmetric density kernels started in the late 1990s. When the density has a compact support, motivated by the Bernstein polynomial approximation theorem in mathematical analysis,
Chen (1999) proposed Beta kernel density estimators and analysed bias and variance of these
estimators. By reversing the role of estimation point and data point in Chen’s (1999) estimation
procedure, and using the Gaussian copula kernel, Jones and Henderson (2007) proposed two
density estimators. When density functions are supported on (0, ∞), Chen (2000b) constructed
a Gamma kernel density estimate and Scaillet (2004) proposed an inverse Gaussian kernel and a
reciprocal inverse Gaussian kernel density estimate. Chaubey, Sen and Sen (2012) also proposed
a density estimator for non-negative random variables via smoothing of the empirical distribution
function using a generalisation of Hille’s lemma. A varying kernel density estimate, which is an
asymmetric kernel density estimate and based on a modification of Chen’s Gamma kernel density
estimate, was recently proposed by Mnatsakanov and Sarkisian (2012) (M–S).
Compared to the traditional symmetric kernel estimation procedures, there are two unique
features about all of the above asymmetric kernel methods: (1) the smoothness of the density
estimate is controlled by the shape or scale parameter of the asymmetric kernel and the location
where the estimation is made; and (2) the asymmetric kernels have the same support as the
density functions to be estimated, thus the kernels do not allocate any weight outside the support.
As a consequence, all of the above asymmetric kernel density estimators can effectively reduce
the boundary bias and they all achieve the optimal rate of convergence for the mean integrated
squared error.
Some asymmetric density estimators are bona fide densities, such as the ones proposed by Jones
and Henderson (2007). Most of them are not, but they become one after a slight modification, for
example, the M–S varying kernel density estimate. In principle, the commonly used symmetric
kernel density estimate, in which the kernel is supported over a symmetric interval around 0,
can still be used for estimating the density function of a random variable with some restricted
range, but the resulting estimate itself may not be a density function any more. For example,
using standard normal kernel to estimate the density function of a positive random variable, the
resulting kernel density estimate does not integrate to 1 over (0, ∞).
Most of the research on asymmetric kernel estimation methodology has been focused on the
density estimation, and the asymptotic theories are limited to the bias, variance or mean square
error (MSE) derivations. To the best of our knowledge, literature is scant on the investigation of
the consistency of asymmetric kernel density estimators except for Bouezmarni and Rolin (2003)
and Chaubey et al. (2012). Nothing is available on their asymptotic distributions. The situation in
the context of nonparametric regression is also surprising. Using Beta or Gamma kernel function,
Chen (2000a, 2002) proposed the local linear estimators for regression function and derived their
asymptotic bias and variance, but did not analyse their asymptotic distributions.
The present paper makes an attempt at filling this void by investigating the large sample properties of the M–S kernel procedure in the fields of both density and regression function estimations.
First, in the context of density estimation, we investigate the asymptotic normality and uniform
almost sure convergence of the M–S kernel density estimate. Second, in the context of nonparametric regression, we investigate the asymptotic behaviour of the M–S kernel regression function
estimate. We derive its asymptotic conditional bias and conditional variance, and establish its uniform almost sure consistency and the asymptotic normality. Third, bandwidth selection is explored
for the sake of implementing the methodology. As a byproduct, the paper provides a theoretical
framework for investigating the similar properties of other asymmetric kernel estimators.
2. M–S kernel regression estimation
Suppose X1 , X2 , . . . , Xn is a random sample from a population X supported on (0, ∞). Let
" αx #
1 " αx #α
exp −
,
t"(α) t
t
Kα∗ (x, t) =
α > 0, t > 0, x > 0.
(1)
For a sequence of positive real numbers αn , M–S proposed the following nonparametric estimate
for the density of X:
\[ f_{\alpha_n}^*(x) = \frac{1}{n}\sum_{i=1}^n K_{\alpha_n}^*(x, X_i) = \frac{1}{n}\sum_{i=1}^n \frac{1}{X_i\,\Gamma(\alpha_n)}\left(\frac{\alpha_n x}{X_i}\right)^{\alpha_n}\exp\left(-\frac{\alpha_n x}{X_i}\right). \tag{2} \]
The estimate (2) is constructed using the technique of recovering a function from its Mellin transform applied in the moment-identifiable problem. There is a close connection between Kα∗ (x, t)
and the Gamma and inverse Gamma density functions. For each fixed t, Kα∗(·, t) is a Gamma density function with scale parameter t/α and shape parameter α + 1; for each fixed x, Kα∗(x, ·) is the
density function of an inverse Gamma distribution with shape parameter α and scale parameter
αx. Unfortunately, as seen in M–S, the asymptotic bias of fα∗n(x) depends on the first derivative of
the underlying density function of X, which is due to the fact that αx/(α − 1), instead of x, is the
mean of the density Kα∗ (x, t) viewed as a function of t. To reduce the bias, M–S used a modified
version of Kα∗(x, t), viz.

\[ K_\alpha(x, t) = \frac{1}{t\,\Gamma(\alpha+1)}\left(\frac{\alpha x}{t}\right)^{\alpha+1}\exp\left(-\frac{\alpha x}{t}\right), \tag{3} \]
to construct the density estimate. For fixed x, Kα (x, ·) now is the density function of an inverse
Gamma distribution with shape parameter α + 1 and scale parameter αx, the mean of which is
exactly x; for fixed t, Kα (·, t) is not a Gamma density function anymore, but αKα (x, t)/(α + 1) is
a Gamma density function with shape parameter α + 2 and scale parameter t/α. These properties
indeed imply a very interesting connection of the M–S kernel Kα (x, t) and the normal kernel used
in the commonly used density estimate for large values of α. In fact, for a fixed x, let Tα be a
random variable having density function Kα (x, ·), and for a fixed t, let Xα be a random variable
having density function αKα (·, t)/(α + 1). Then, one can verify that
\[ \sqrt{\alpha}\left(\frac{T_\alpha}{x} - 1\right) \to_d N(0, 1), \qquad \sqrt{\alpha}\left(\frac{X_\alpha}{t} - 1\right) \to_d N(0, 1), \quad \text{as } \alpha \to \infty. \]
Here, and in the following, →d denotes convergence in distribution. If we let h = 1/√α, then
from the above facts it follows that as α → ∞,
\[ K_\alpha(x, t) \approx \frac{1}{h}\,\phi\!\left(\frac{x/t - 1}{h}\right) \quad \text{or} \quad K_\alpha(x, t) \approx \frac{1}{h}\,\phi\!\left(\frac{t/x - 1}{h}\right), \]
where φ(·) denotes the standard normal density function. Therefore, the M–S kernel Kα approximately behaves like the standard normal kernel, while the distance between x and t is not the usual
Euclidean distance |x − t|, but rather the relative distance |x − t|/t or |x − t|/x; for the commonly
used kernel function, x and t are symmetric in the sense of difference, while in the kernel function
Kα(x, t), x and t are asymptotically symmetric in the sense of division; the parameter 1/√α plays
the role of the bandwidth as in the commonly used kernel setup. To gain a better understanding of
the smoothing effect of the kernel function Kα(x, t), we plot the functions for a pseudo-data set
0.5, 1, 2, 3 and α = 5, 20 over the range x ∈ (0, 7); the resulting curves are shown in Figure 1. Clearly, all the
curves are skewed to the right, implying that more weight is put on the values to the right of the
observed data points; as α gets larger, all the curves shrink towards the data points, and the shape
of the kernels changes according to the values of the data points.

Figure 1. The kernel function Kα(x, t) for the four pseudo-data points listed in the text and two choices of α. The solid
curves are for α = 5, and the dotted curves for α = 20. The curve with the highest peak is for the data point 0.5, the
second highest is for the data point 1, and so on.
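The quality of this normal approximation is easy to check numerically. The sketch below (our own illustrative code; the function names are not from the paper) evaluates the exact kernel Kα of Equation (3) on the log scale and compares it with (1/h)φ((x/t − 1)/h):

```python
import math

def K_alpha(x, t, alpha):
    """Modified M-S kernel, Equation (3), evaluated on the log scale
    to avoid overflow in (alpha*x/t)^(alpha+1) and Gamma(alpha+1)."""
    u = alpha * x / t
    return math.exp((alpha + 1.0) * math.log(u) - u
                    - math.log(t) - math.lgamma(alpha + 1.0))

def normal_approx(x, t, alpha):
    """Large-alpha approximation (1/h) * phi((x/t - 1)/h) with h = 1/sqrt(alpha)."""
    h = 1.0 / math.sqrt(alpha)
    z = (x / t - 1.0) / h
    return math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))
```

For α = 400 (h = 0.05) and t = 1 the two expressions agree to within a small fraction of a per cent at x = t, and to within roughly 10% one bandwidth away; the agreement improves as α grows.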
For a random sample X1 , X2 , . . . , Xn from the population X supported on (0, ∞), the M–S kernel
density estimate based on the modified kernel Kα (x, t) is
\[ \hat f_n(x) = \frac{1}{n}\sum_{i=1}^n K_{\alpha_n}(x, X_i). \tag{4} \]
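Equation (4) admits a direct vectorised implementation; the sketch below is our own (hypothetical function name), with the kernel computed on the log scale. Note that the raw estimate integrates to (α + 1)/α rather than 1, since for fixed t it is αKα(·, t)/(α + 1) that is a Gamma density:

```python
import math
import numpy as np

def ms_density(x, X, alpha):
    """M-S varying kernel density estimate f_hat_n(x), Equation (4),
    with the kernel evaluated on the log scale for numerical stability."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    X = np.asarray(X, dtype=float)
    u = alpha * x[:, None] / X[None, :]                    # alpha * x / X_i
    logk = ((alpha + 1.0) * np.log(u) - u
            - np.log(X)[None, :] - math.lgamma(alpha + 1.0))
    return np.exp(logk).mean(axis=1)
```

Dividing the output by (α + 1)/α yields a bona fide density, in line with the slight modification of asymmetric kernel estimates mentioned earlier.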
The expression for the MSE of fˆn (x) is derived in M–S, as well as the L1 -consistency. Different
from the commonly used symmetric kernel density estimates, the M–S kernel density estimates
do not suffer from the boundary effect, which is confirmed both by the theories developed and
simulation studies conducted in M–S. Although f (x) is not defined at x = 0, it is clear that
fˆn(0) = 0 almost surely. This intrinsic constraint is desirable only if limx→0 f(x) = 0. Some other
asymmetric kernel estimates suffer from a similar drawback, such as the inverse and
reciprocal inverse Gaussian kernel estimates proposed in Scaillet (2004) and the copula-based
kernel estimate suggested by Jones and Henderson (2007). If limx→0 f(x) > 0, then to analyse
the boundary behaviour of fˆn (x) around 0, similar to the symmetric kernel case, we analyse the
limiting behaviour of the bias in fˆn (x) at x = u/αn , where 0 < u < 1. This is done in Section 6.
There is no discussion in the literature on the asymptotic normality of the M–S kernel density
estimate (4). This paper will try to fill this void, not just because this topic itself is very interesting,
but also because it has some very practical implications, for example, knowing the asymptotic
distribution of fˆn (x) enables us to construct confidence interval for the density function f (x).
Parallel to the commonly used symmetric kernel estimation methodology, and also as a further
development, we also investigate the large sample behaviour of the nonparametric estimators of
regression function using the M–S kernel, when the covariate is positive.
The relationship between a scalar response Y and a covariate X is often investigated through
the regression model Y = m(X) + ε, where ε is the random error and X is one dimensional
and a positive random variable. Furthermore, we assume that E(ε|X = x) = 0 and σ 2 (x) :=
E(ε2 |X = x) > 0, for almost all x > 0. Let {(Xi , Yi ), i = 1, 2, . . . , n} be a random sample from
this regression model. Inspired by the construction of the N–W kernel regression estimate, the
M–S kernel regression estimate of m(x) is defined to be
\[ \hat m_n(x) = \frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\, Y_i}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)}. \tag{5} \]
In spite of the similarity between this estimate and its N–W kernel counterpart, the very different
characteristics of the M–S kernel function relative to the commonly used symmetric kernels mean that
many of the technical challenges encountered in developing the asymptotic theory for the new
estimates are different. Under some regularity conditions on the underlying density function f(x)
and the regression function m(x), the asymptotic normality of the M–S kernel estimate m̂n(x), as well
as its uniform consistency, is established in this paper.
From the definition of Kαn, one can derive a much simpler expression for m̂n(x). In fact, after
some cancellation, we have

\[ \hat m_n(x) = \frac{\sum_{i=1}^n X_i^{-\alpha_n-2}\exp(-\alpha_n x/X_i)\, Y_i}{\sum_{i=1}^n X_i^{-\alpha_n-2}\exp(-\alpha_n x/X_i)}. \]

This formula is mainly useful for the computation of m̂n, while Equation (5) is convenient for
theoretical development.
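The simplified expression lends itself to a compact implementation; a minimal sketch (our own code, hypothetical function name), computing the weights on the log scale to avoid overflow in Xi^{−αn−2}:

```python
import numpy as np

def ms_regression(x, X, Y, alpha):
    """M-S varying kernel regression estimate m_hat_n(x), using the
    simplified weights X_i^(-alpha-2) * exp(-alpha * x / X_i)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    logw = (-(alpha + 2.0) * np.log(X)[None, :]
            - alpha * x[:, None] / X[None, :])
    logw -= logw.max(axis=1, keepdims=True)    # stabilise before exponentiating
    w = np.exp(logw)
    return (w @ Y) / w.sum(axis=1)
```

Since the weights sum to one over i, the estimate is a weighted average of the Yi; in particular, for constant responses it reproduces the constant exactly.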
Although they are asymmetric kernels, Kα∗(x, t) defined in Equation (1) and Kα(x, t) defined in Equation (3)
are rather different from the asymmetric kernels discussed in Cline (1988) and Abadir and Lawford
(2004). In Equations (1) and (3), at each x > 0, the data points Xi behave like a scale parameter
for x, while in Cline (1988) and Abadir and Lawford (2004), the Xi appear as a location parameter
for x. Therefore, the inadmissibility of the asymmetric kernel proved in Cline (1988) does not
apply to the varying kernels defined in Equations (1) and (3).
The proposed estimation procedure is mainly developed for a univariate X. It is desirable to
seek its extensions for higher dimensional positive covariates. Similar to the commonly used
symmetric kernel regression, one way to proceed is to use the product kernel in the definition
of the regression function estimate. Another way is to use a multivariate extension of Gamma or
inverse Gamma density function as the kernel function. The product kernel method is the most
straightforward and natural choice, and similar theoretical results as in one dimension can be easily
derived. However, using multivariate extensions of Gamma or inverse Gamma density as the kernel
may not be practical, since the multivariate Gamma density functions proposed in the literature
all have complicated forms, which makes the computation and theoretical development of the
corresponding varying kernel estimates much challenging. For some definitions of multivariate
Gamma distribution, see Kotz, Balakrishnan and Johnson (2000).
The paper is organised as follows. Section 3 discusses the large sample results about m̂n(x)
along with the needed technical assumptions. In particular, it contains an approximate expression
for the conditional MSE, a central limit theorem, and a uniform consistency result about m̂n.
Section 4 contains a discussion on the selection of the smoothing parameter αn. Findings of a
simulation study are presented in Section 5, and the proofs of the main results appear in Section 6.
Unless specified otherwise, all limits are taken as n → ∞.
3. Large sample results of m̂n(x)
We start with analysing the asymptotic properties of the conditional bias and conditional variance,
hence the conditional MSE, of m̂n(x) defined in Equation (5). Then, a typical application of the
Lindeberg–Feller central limit theorem will lead to the asymptotic normality of m̂n(x). As a
byproduct, the asymptotic normality of the M–S kernel estimate fˆn(x) is also a natural consequence.
Thus, confidence intervals for the true density function and regression function can be constructed.
Finally, uniform almost sure convergence results for fˆn(x) and m̂n(x) over any bounded sub-intervals
of (0, ∞) are developed by using the Borel–Cantelli lemma after verifying the Cramér condition
for the M–S kernel function.
The following is a list of technical assumptions used for deriving these results:
(A1) The second-order derivative of f (x) is continuous and bounded on (0, ∞).
(A2) The second-order derivative of f (x)m(x) is continuous and bounded on (0, ∞).
(A3) The second-order derivative of σ 2 (x) = E(ε2 |X = x) is continuous and bounded for all
x > 0.
(A4) For some δ > 0, the second-order derivative of E(|ε|2+δ |X = x) is continuous and bounded
in x ∈ (0, ∞).
(A5) αn → ∞, √αn/n → 0.
Condition (A1) on f (x) is the same as the one adopted by M–S when deriving the bias and
variance of fˆn (x). Condition (A3) is required for dealing with the large sample argument pertaining
to the random error and is not needed if one is willing to assume the homoscedasticity. Condition
(A4) is needed in proving the asymptotic normality of the proposed estimators, while (A5) is a
minimal condition needed for the smoothing parameter. Additional assumptions on αn as needed
are stated in various theorems presented below.
In the following, for any function g(x), g′(x) and g′′(x) denote the first and second derivatives
of g(x), respectively.
3.1. Bias and variance
The following theorem presents the asymptotic expansions of the conditional bias and
variance, hence the conditional MSE, of m̂n(x). Let

\[ b(x) := x^2\left(\frac{m'(x) f'(x)}{f(x)} + \frac{m''(x)}{2}\right), \qquad v(x) := \frac{\sigma^2(x)}{2x f(x)\sqrt{\pi}}, \tag{6} \]

and \(\mathcal{X} := \{X_1, X_2, \ldots, X_n\}\).
Theorem 3.1 Suppose the assumptions (A1), (A2), (A3), and (A5) hold. Then, for any x ∈ (0, ∞)
with f(x) > 0,

\[ \mathrm{Bias}(\hat m_n(x)\,|\,\mathcal{X}) = \frac{b(x)}{\alpha_n} + O_p\left(\frac{1}{\sqrt{n\sqrt{\alpha_n}}}\right) + o_p\left(\frac{1}{\alpha_n}\right), \tag{7} \]

\[ \mathrm{Var}(\hat m_n(x)\,|\,\mathcal{X}) = \frac{v(x)\sqrt{\alpha_n}}{n} + o_p\left(\frac{\sqrt{\alpha_n}}{n}\right). \tag{8} \]

Thus, the conditional MSE of m̂n(x) has the asymptotic expansion

\[ \mathrm{MSE}(\hat m_n(x)\,|\,\mathcal{X}) = \frac{b^2(x)}{\alpha_n^2} + \frac{v(x)\sqrt{\alpha_n}}{n} + o_p\left(\frac{\sqrt{\alpha_n}}{n}\right) + o_p\left(\frac{1}{\alpha_n^2}\right) + o_p\left(\frac{1}{\sqrt{n}\,\alpha_n^{5/4}}\right). \]
Remark The unconditional version of Theorem 3.1 is very hard to derive. This is also true for
the N–W kernel regression estimate. Although Härdle, Müller, Sperlich and Werwatz (2004) indicated that
the conditional MSE of the N–W kernel regression estimate could be derived from a linearisation
technique, with the result summarised in Theorem 4.1 of Härdle et al. (2004), a rigorous proof
is not provided. But we can show that the unconditional version of Theorem 3.1 remains valid for

\[ \hat m_n^*(x) = \frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\, Y_i}{n^{-2} + \sum_{i=1}^n K_{\alpha_n}(x, X_i)}, \]

a slightly modified version of m̂n(x). A similar idea was used in Fan (1993) when dealing with
local linear regression, and a proof of the unconditional MSE of m̂n∗(x) can follow the same
thread as the proof of Theorem 3 in Fan (1993).
Recalling the above discussion on the analogy between αn and the bandwidth in the commonly
used symmetric kernel density estimate, one can easily see the similarity of the bias and variance
expressions between the M–S kernel estimate and the N–W kernel estimate.
Similar to the N–W kernel regression case, one can choose the optimal smoothing parameter
αn,opt by minimising the leading term in the conditional MSE of m̂n with respect to αn. One can
verify that αn,opt has the order n^{2/5}, with the corresponding MSE of order n^{−4/5}. Recall that
the same order is obtained for the N–W kernel regression estimate based on the same
criterion.
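This calculation can be checked directly: minimising g(α) = b²/α² + v√α/n in α gives αopt = (4b²n/v)^{2/5} ∝ n^{2/5} and g(αopt) ∝ n^{−4/5}. A quick numeric sketch (our own code, with the arbitrary illustrative choice b = v = 1):

```python
import numpy as np

def alpha_opt(n, b=1.0, v=1.0):
    """Closed-form minimiser of g(alpha) = b^2/alpha^2 + v*sqrt(alpha)/n."""
    return (4.0 * b * b * n / v) ** 0.4

def g(alpha, n, b=1.0, v=1.0):
    """Leading term of the conditional MSE, as a function of alpha."""
    return b * b / alpha**2 + v * np.sqrt(alpha) / n

# the minimiser scales like n^(2/5) and the minimum value like n^(-4/5)
n1, n2 = 10_000, 1_000_000
r_alpha = alpha_opt(n2) / alpha_opt(n1)   # should equal (n2/n1)^(2/5)
r_mse = g(alpha_opt(n2), n2) / g(alpha_opt(n1), n1)   # should equal (n2/n1)^(-4/5)
```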
3.2. Asymptotic normality
First, we give the asymptotic normality of the M–S kernel density estimate.
Theorem 3.2 Suppose the assumptions (A1), (A4), and (A5) hold. Then, for any x ∈ (0, ∞)
with f(x) > 0,

\[ \left(\frac{f(x)\sqrt{\alpha_n}}{2xn\sqrt{\pi}}\right)^{-1/2}\left(\hat f_n(x) - f(x) - \frac{x^2 f''(x)}{2(\alpha_n - 1)}\right) \to_d N(0, 1). \]
The asymptotic normality of fˆn (x) implies that fˆn (x) converges to f (x) in probability, hence
1/fˆn (x) converges to 1/f (x) in probability, whenever f (x) > 0. This result is used in the proof of
the asymptotic normality of m
ˆ n (x), which is stated in the next theorem.
Theorem 3.3 Suppose the assumptions in Theorem 3.1 hold. Then, for any x ∈ (0, ∞) with
f(x) > 0,

\[ \left(\frac{v(x)\sqrt{\alpha_n}}{n}\right)^{-1/2}\left(\hat m_n(x) - m(x) - \frac{b(x)}{\alpha_n - 1}\right) \to_d N(0, 1), \]

where b(x) and v(x) are defined in Equation (6).
It is noted that there is a non-negligible asymptotic bias appearing in the above results, a characteristic shared with the N–W kernel regression estimate. This bias can be eliminated by
under-smoothing, which, in the current setup, amounts to selecting a larger αn such that √n/αn^{5/4} → 0
without violating the conditions αn → ∞, √αn/n → 0. The large sample confidence intervals for
m(x) thus can be constructed with the help of Theorem 3.3.
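For illustration, such a plug-in confidence interval can be sketched as follows. This is our own code, not a procedure from the paper: it assumes undersmoothing so that the bias term b(x)/(αn − 1) can be dropped, and it replaces σ²(x) and f(x) with crude local plug-in estimates.

```python
import math
import numpy as np

def ms_ci(x0, X, Y, alpha, z=1.96):
    """Sketch of an asymptotic 95% CI for m(x0) based on Theorem 3.3.
    Assumes undersmoothing (bias term ignored); sigma^2(x0) and f(x0)
    are replaced by simple plug-in estimates."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = len(X)
    # kernel weights for the regression estimate (simplified form)
    logw = -(alpha + 2.0) * np.log(X) - alpha * x0 / X
    logw -= logw.max()
    w = np.exp(logw)
    w /= w.sum()
    mhat = w @ Y
    sigma2_hat = w @ (Y - mhat) ** 2               # local residual variance
    # M-S density estimate f_hat(x0), Equation (4)
    u = alpha * x0 / X
    logk = (alpha + 1.0) * np.log(u) - u - np.log(X) - math.lgamma(alpha + 1.0)
    fhat = np.exp(logk).mean()
    v_hat = sigma2_hat / (2.0 * x0 * fhat * math.sqrt(math.pi))
    se = math.sqrt(v_hat * math.sqrt(alpha) / n)
    return mhat - z * se, mhat + z * se
```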
3.3. Almost sure uniform convergence
In this section, we develop an almost sure uniform convergence result for m̂n(x) over an arbitrary
bounded sub-interval of (0, ∞). In the N–W kernel regression estimation scenario, a similar
result is obtained by using the Borel–Cantelli lemma and the Bernstein inequality, but the Cramér
condition must be verified before applying these well-known results. That is, for any fixed x > 0
and k ≥ 2, we have to show that

\[ E|K_\alpha(x, X)|^k \le k!\,(c\sqrt{\alpha})^{k-2}\, E K_\alpha^2(x, X) \]

for some positive constant c when α is large.
The following two theorems give the almost sure uniform convergence of fˆn to f and of m̂n to m
over bounded sub-intervals of (0, ∞).
Theorem 3.4 In addition to (A1) and (A5), assume that αn^{1/2} log n/n → 0. Then, for any
constants a and b such that 0 < a < b < ∞,

\[ \sup_{x\in[a,b]} |\hat f_n(x) - f(x)| = O\left(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\right) + o\left(\frac{1}{\alpha_n}\right), \quad \text{a.s.} \]
Theorem 3.5 In addition to (A1)–(A5), assume that αn^{1/2} log n/n → 0. Then, for any constants
a and b such that 0 < a < b < ∞,

\[ \sup_{x\in[a,b]} |\hat m_n(x) - m(x)| = O\left(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\right) + o\left(\frac{1}{\alpha_n}\right), \quad \text{a.s.} \]
By assuming some stronger conditions on the tails of f and m at the boundaries, the above
uniform almost sure convergence results can be extended to hold over suitable intervals
increasing to (0, ∞). However, we do not pursue this here, simply because of the technical
details involved and the lack of a useful application.
4.
Selection of smoothing parameters
It is well known that the smoothing parameter plays a crucial role in nonparametric kernel
regression. Abundant research has been conducted for the N–W kernel-type regression estimation methodology (see, e.g., Wand and Jones 1994; Hart 1997 for data-driven choices of the
smoothing parameters in this setup). However, to the best of our knowledge, no such work has been
done for asymmetric kernel regression.
In this section, we propose several smoothing parameter selection procedures for implementing
the M–S kernel technique. First, we recall the least-square cross-validation (LSCV) procedure
from M–S and discuss its extension, k-fold LSCV. Second, we propose the smoothing parameter
selection procedures in the nonparametric regression setup. The k-fold LSCV and the generalised
cross-validation (GCV) will be discussed. These procedures are analogous to the commonly used
data-driven procedures used in the N–W kernel regression estimation context. The theoretical
properties, such as the consistency of these smoothing parameter selectors for some 'optimal'
smoothing parameter, might be discussed in a similar way as in John (1984), Härdle, Hall and
Marron (1988, 1992) and references therein. However, we will not investigate this important topic
in the current paper, as it deserves an independent in-depth study.
4.1. Density estimation: k-fold LSCV
The motivation for the LSCV comes from expanding the mean integrated square error (MISE) of
fˆ. Define

\[ \mathrm{LSCV}(\alpha) = \int \hat f^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n \hat f_{-i}(X_i), \]
where fˆ−i(Xi) is the leave-one-out M–S kernel density estimate of f(Xi) computed without the ith
observation. Then, the LSCV smoothing parameter is defined by α̂LSCV = argminα LSCV(α). For
the M–S kernel density estimate (4),
\[ \mathrm{LSCV}(\alpha) = \frac{\Gamma(2\alpha+3)}{n^2 \alpha\, \Gamma^2(\alpha+1)} \sum_{i=1}^n\sum_{j=1}^n \frac{(X_i X_j)^{\alpha+1}}{(X_i + X_j)^{2\alpha+3}} - \frac{2}{n(n-1)\,\Gamma(\alpha+1)} \sum_{i\ne j} \frac{1}{X_j}\left(\frac{\alpha X_i}{X_j}\right)^{\alpha+1} \exp\left(-\frac{\alpha X_i}{X_j}\right). \]
An extension of the above leave-one-out LSCV is the k-fold LSCV procedure. First, split the
data into k roughly equal-sized parts; then, for each part, calculate the prediction error based on
the M–S kernel density estimate constructed from all the data in the other k − 1 parts; and finally, take
the sum of the k prediction errors as the quantity to be minimised. In particular, for our current
setup, the k-fold LSCV has the same structure as the leave-one-out LSCV except that the second
term is now defined as

\[ \frac{2}{n\,\Gamma(\alpha+1)}\sum_{i=1}^n \frac{1}{n - n_i} \sum_{j \in D/D(i)} \frac{1}{X_j}\left(\frac{\alpha X_i}{X_j}\right)^{\alpha+1}\exp\left(-\frac{\alpha X_i}{X_j}\right), \]

where D(i) is the set of indices of the data part including Xi and D = {1, 2, . . . , n}. For convenience, if we use
D1, D2, . . . , Dk to denote the data subscripts in the first part, second part, and so on, then
D(i) = {j : i, j ∈ Dl, l = 1, 2, . . . , k}, and ni is the size of D(i). The k-fold LSCV reduces
to the leave-one-out LSCV when k = n.
4.2. M–S kernel regression: k-fold LSCV
The basic idea of LSCV in the regression setup is to select the smoothing parameter by minimising
the prediction error. For this purpose, let m̂D/D(i)(Xi) be the M–S kernel estimate of m(x) at x = Xi of
the same type as m̂n(x), except that it is computed without using the data part including the ith
observation (Xi, Yi), where D = {1, 2, . . . , n}. The LSCV smoothing parameter α̂LSCV is the value
of α that minimises the LSCV criterion

\[ \mathrm{CV}(\alpha) = \sum_{i=1}^n \left[Y_i - \hat m_{D/D(i)}(X_i)\right]^2 = \sum_{i=1}^n \left[ Y_i - \frac{\sum_{j\in D/D(i)} X_j^{-\alpha-2}\exp(-\alpha X_i/X_j)\, Y_j}{\sum_{j\in D/D(i)} X_j^{-\alpha-2}\exp(-\alpha X_i/X_j)} \right]^2. \]

The independence between (Xi, Yi) and m̂D/D(i)(Xi) indicates that CV(α) gives an accurate
assessment of how well the estimate m̂n(x) will predict future observations.
4.3. M–S kernel regression: GCV
The GCV procedure from the N–W kernel regression can also be adapted to the current setup.
Define

\[ w_{ij} = \frac{X_j^{-\alpha-2}\exp(-\alpha X_i/X_j)}{\sum_{k=1}^n X_k^{-\alpha-2}\exp(-\alpha X_i/X_k)}, \qquad i, j = 1, 2, \ldots, n. \]

Then, the GCV smoothing parameter α̂GCV is the value of α that minimises the GCV criterion
GCV(α) defined as

\[ \mathrm{GCV}(\alpha) = \frac{n\sum_{i=1}^n \left[Y_i - \sum_{j=1}^n w_{ij} Y_j\right]^2}{\left[n - \sum_{i=1}^n w_{ii}\right]^2}. \]
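The GCV criterion is a ratio of two easily vectorised quantities; a sketch of its computation (our own code, hypothetical function name):

```python
import numpy as np

def gcv(alpha, X, Y):
    """GCV criterion for the M-S kernel regression estimate."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = len(X)
    # weight matrix w_ij, rows normalised to sum to one
    logw = -(alpha + 2.0) * np.log(X)[None, :] - alpha * X[:, None] / X[None, :]
    logw -= logw.max(axis=1, keepdims=True)
    W = np.exp(logw)
    W /= W.sum(axis=1, keepdims=True)
    resid = Y - W @ Y
    return n * np.sum(resid**2) / (n - np.trace(W)) ** 2
```

Since each row of the weight matrix sums to one, a constant response is fitted exactly and the criterion degenerates to zero in that case, as expected.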
No single smoothing parameter selection procedure is uniformly superior to the others,
in the sense that its selected smoothing values always produce estimates with the smallest MSE.
The simulation study conducted in the next section shows that for some data sets, a selection
procedure might not even work. A common practice is to try several procedures and make an
overall evaluation to decide on a proper smoothing value.
5. Simulation study
To evaluate the finite sample performance of the proposed M–S kernel regression estimates, we
conducted a simulation study. In the simulation, the underlying density function of the design
variable is chosen to be log-normal with µ = 0, σ = 1, and the random error ε to be normal
with mean 0 and standard deviation 0.5. Two simple regression functions, m(x) = 1/x 2 , m(x) =
(x − 1.5)2 , are considered. For m(x) = 1/x 2 , the estimate will be evaluated at 1024 equally spaced
values over the interval (0.1, 1); for m(x) = (x − 1.5)2 , the estimate will be evaluated at 1024
equally spaced values over the interval (0, 3), and the sample sizes used are 100 and 200. Then,
the MSEs between the estimated values and true values of the regression function will be used
for comparison.
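One replication of this simulation design can be sketched as follows (our own code; a single replication with a fixed seed rather than the full Monte Carlo study, using the n^{2/5} and n^{−1/5} smoothing choices discussed earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.lognormal(0.0, 1.0, n)                 # log-normal design, mu=0, sigma=1
Y = (X - 1.5) ** 2 + rng.normal(0.0, 0.5, n)   # m(x) = (x - 1.5)^2, sd(eps) = 0.5
grid = np.linspace(1e-3, 3.0, 1024)            # evaluation points for (0, 3)

# M-S varying kernel estimate, alpha = n^(2/5) (optimal order)
alpha = n ** 0.4
lw = -(alpha + 2.0) * np.log(X)[None, :] - alpha * grid[:, None] / X[None, :]
w = np.exp(lw - lw.max(axis=1, keepdims=True))
ms_hat = (w @ Y) / w.sum(axis=1)

# N-W estimate with standard normal kernel, h = n^(-1/5) (optimal order)
h = n ** -0.2
u = (grid[:, None] - X[None, :]) / h
k = np.exp(-0.5 * u ** 2)
nw_hat = (k @ Y) / k.sum(axis=1)

truth = (grid - 1.5) ** 2
ms_mse = np.mean((ms_hat - truth) ** 2)
nw_mse = np.mean((nw_hat - truth) ** 2)
```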
Comparing two different nonparametric smoothing procedures is always controversial,
especially when one or both procedures involve smoothing parameters, which play a crucial
role in determining the smoothness of the fitted regression function, since by selecting a proper
smoothing parameter, one method can often be made to outperform the other. Therefore, for the
sake of fairness, one should use the same criterion to select the smoothing parameters whenever possible. Unfortunately, sometimes the chosen criterion works for one procedure
but not for another. In this case, one might try different criteria for both procedures
and make an overall comparison. The five-fold LSCV and GCV criteria are used to select the
bandwidth for both the M–S kernel and N–W kernel estimates. The standard normal kernel is used
to construct the N–W kernel estimate.
Table 1. MSE comparison: m(x) = 1/x², x ∈ (0.1, 1).

              M–S kernel                               N–W kernel
n     LSCV         GCV          n^{2/5}       LSCV    GCV            h = n^{−1/5}
100   (22) 7.944   (12) 54.522  119.113       ×       ×              (0.398) 148.185
200   (29) 5.116   (10) 10.530  14.497        ×       (0.009) 2.359  (0.347) 174.172

Figure 2. Estimates of m(x) = 1/x². Smoothing values are selected by LSCV and optimal order of MSE.

Figure 3. Estimates of m(x) = 1/x². Smoothing values are selected by GCV.

Table 1 presents the simulation study when m(x) = 1/x². The numbers within the parentheses
are the smoothing values selected by various criteria, and the numbers outside the parentheses
are the MSEs. For n = 100, the five-fold LSCV criterion and GCV do not work for the N–W
procedure, and a cross sign is used in the table to indicate this case. For n = 200, LSCV still
does not work for the N–W estimator, while GCV works. Also, h = n^{−1/5}, the bandwidth based
on the optimal order of the conditional MSE, is used to calculate the N–W estimate. The five-fold
LSCV criterion works for the M–S kernel estimate. We also try α = n^{2/5}, the smoothing value
based on the optimal order of the conditional MSE for the M–S kernel estimate, to calculate
the M–S kernel estimate. The values in the parentheses are the values of the smoothing parameters.
Figure 2 provides a visual comparison between these two procedures. To keep the figure neat, we
only plot the M–S kernel estimate with an LSCV bandwidth, and the N–W kernel estimate with
h = n^{−1/5}. Here, and in the subsequent figures, the thick solid curve denotes the true regression
function, the thin solid line denotes the M–S kernel estimate, and the dashed line is for the N–W
estimate. Clearly, with respect to the boundary area, the M–S kernel estimate does better than the
N–W kernel estimate. We also tried the GCV criterion to choose the smoothing parameters. Figure 3
provides a visual comparison between the M–S and N–W procedures with smoothing values selected
840
H.L. Koul and W. Song
by GCV. The MSEs reported in Table 1 and Figure 3 clearly indicate that the GCV favours the
N–W kernel estimate more than the M–S kernel estimate, although the N–W kernel estimate
possesses a larger variability.
We also tried fitting the regression function using the boundary kernel suggested by Gasser and Müller (1979). The resulting MSEs are generally smaller than those of the N–W kernel estimate, but still much larger than those of the M–S kernel estimate. For example, for h = n^{−1/5}, the MSEs using the boundary kernel are 162.133 when n = 100 and 38.62 when n = 200.
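An end-to-end sketch of this comparison (illustrative Python; the design density, error variance, grid, and smoothing values are our own choices, and the varying kernel K_α(x, X) = (1/X)(αx/X)^{α+1}e^{−αx/X}/Γ(α+1) is the one used in the proofs of Section 6):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.1, 1.0, n)
Y = 1.0 / X**2 + rng.normal(0.0, 1.0, n)     # m(x) = 1/x^2 with N(0, 1) errors

def ms_estimate(x, alpha):
    # varying (M-S) kernel weights, computed on the log scale for stability
    t = alpha * x / X
    w = np.exp((alpha + 1) * np.log(t) - t - lgamma(alpha + 1)) / X
    return np.sum(w * Y) / np.sum(w)

def nw_estimate(x, h):
    w = np.exp(-0.5 * ((x - X) / h) ** 2)    # standard normal kernel
    return np.sum(w * Y) / np.sum(w)

grid = np.linspace(0.1, 1.0, 46)
mse_ms = np.mean([(ms_estimate(x, n**0.4) - 1.0 / x**2) ** 2 for x in grid])
mse_nw = np.mean([(nw_estimate(x, n**-0.2) - 1.0 / x**2) ** 2 for x in grid])
print(mse_ms, mse_nw)
```

With these (arbitrary) choices the varying-kernel fit typically suffers far less near the left boundary x = 0.1, mirroring the pattern in Table 1.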
Table 2 reports the MSEs from the simulation study when m(x) = (x − 1.5)². Now the five-fold LSCV and GCV criteria work for both procedures.

Table 2. MSE comparison: m(x) = (x − 1.5)², x ∈ (0, 3).

                     M–S kernel                              N–W kernel
  n      LSCV         GCV          α = n^{2/5}       LSCV            GCV             h = n^{−1/5}
  100    (71) 0.025   (30) 0.016   (6.310) 0.049     (0.429) 0.086   (0.089) 0.065   (0.398) 0.030
  200    (54) 0.014   (37) 0.012   (8.326) 0.055     (0.917) 0.250   (0.090) 0.067   (0.347) 0.030

[Figure 4. Estimates of m(x) = (x − 1.5)². Smoothing values are selected by LSCV and the optimal order of MSE. Two panels: n = 100 and n = 200.]

[Figure 5. Estimates of m(x) = (x − 1.5)². Smoothing values are selected by GCV. Two panels: n = 100 and n = 200.]

The M–S kernel estimate with both the LSCV and GCV bandwidths outperforms the N–W kernel estimate with all the selected bandwidths, but the contrary is true when both procedures use the bandwidth obtained by minimising the asymptotic integrated mean square error (AIMSE) expression.
Figure 4 shows the fitted curves from the M–S kernel estimate with the LSCV-selected smoothing value and the N–W kernel estimate with h = n^{−1/5}. It is clear that the M–S kernel estimator is a very promising competitor to the N–W kernel estimator. This point is reconfirmed by Figure 5, which shows the fitted curves from both kernel estimators with smoothing values selected by the GCV criterion.
6. Proofs of main results
This section contains the proofs of all the large sample results presented in Section 2. Inverse
Gamma density function and its moments will be repeatedly referred to in the following proofs.
For convenience, we list all the needed results here. The density function of an inverse Gamma distribution with shape parameter p and rate parameter λ is
\[ g(u; p, \lambda) = \frac{\lambda^p}{\Gamma(p)}\Big(\frac{1}{u}\Big)^{p+1}\exp\Big(-\frac{\lambda}{u}\Big), \quad u > 0. \]
Its mean μ, variance τ², and fourth central moment ν₄, respectively, are
\[ \mu = \frac{\lambda}{p-1}, \qquad \tau^2 = \frac{\lambda^2}{(p-1)^2(p-2)}, \qquad \nu_4 = \frac{\lambda^4(3p+15)}{(p-1)^4(p-2)(p-3)(p-4)}. \]
Let
\[ p_k = k(\alpha_n + 2) - 1, \qquad \lambda_k = k\alpha_n x, \qquad k = 1, 2, \ldots, \; x > 0. \]
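These moment formulas can be sanity-checked numerically (an illustrative Python check, not part of the paper; the parameter values are arbitrary):

```python
import numpy as np
from math import lgamma

def inv_gamma_pdf(u, p, lam):
    # inverse Gamma density, evaluated on the log scale for stability
    return np.exp(p * np.log(lam) - lgamma(p) - (p + 1.0) * np.log(u) - lam / u)

def trapezoid(y, x):
    # simple trapezoidal quadrature
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

p, lam = 6.0, 2.0
u = np.linspace(1e-3, 80.0, 400_000)
dens = inv_gamma_pdf(u, p, lam)
mu = trapezoid(u * dens, u)
tau2 = trapezoid((u - lam / (p - 1.0)) ** 2 * dens, u)
print(mu, tau2)   # compare with lambda/(p-1) and lambda^2/((p-1)^2 (p-2))
```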
Write µk , τk , and ν4k for µ, τ , and ν4 when λ and p are replaced by λk and pk , respectively. The
following lemma on the inverse Gamma distribution is crucial for the subsequent arguments.
Lemma 6.1 Let l(u) be a function such that the second-order derivative of l(u) is continuous and bounded on (0, ∞). Then, for α_n large enough, and for all x > 0 and k ≥ 1,
\[ \int_0^\infty g(u; p_k, \lambda_k)\, l(u)\,du = l(x) + \frac{(2-2k)x\,l'(x)}{p_k-1} + \frac{[(2-2k)^2(p_k-2) + k^2\alpha_n^2]\,x^2 l''(x)}{2(p_k-1)^2(p_k-2)} + o\Big(\frac{1}{\alpha_n}\Big). \]
Proof of Lemma 6.1 Fix an x > 0. Note that μ_k := λ_k/(p_k − 1) = x + (2 − 2k)x/(p_k − 1). A Taylor expansion of l(μ_k) around x up to the second order yields
\[ l(\mu_k) = l(x) + \frac{(2-2k)x\,l'(x)}{p_k-1} + \frac{(2-2k)^2 x^2 l''(\xi)}{2(p_k-1)^2}, \tag{9} \]
where ξ is some value between x + (2 − 2k)x/(p_k − 1) and x. Recall that μ_k is the mean of g(u; p_k, λ_k). A Taylor expansion of l(u) around μ_k yields
\[ \int_0^\infty l(u)\, g(u; p_k, \lambda_k)\,du = l(\mu_k) + \frac{1}{2} l''(\mu_k)\int_0^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du + \frac{1}{2}\int_0^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du \tag{10} \]
for some ũ between u and μ_k. From Equation (9) and the continuity of l'', we can verify that the two leading terms on the right-hand side of Equation (10) match the expansion in the lemma. Therefore, it is sufficient to show that the third term on the right-hand side of Equation (10) is of the order o(1/α_n).
Since l'' is continuous, it is uniformly continuous over any closed sub-interval of (0, ∞). For any ε > 0, select 0 < γ < x such that, for any y with |y − x| ≤ γ, |l''(x) − l''(y)| < ε. Let δ₁ = x − γ/2. The boundedness of l'' implies
\[ \Big|\int_0^{\delta_1} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du\Big| \le c\int_0^{\delta_1} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du. \]
Note that the inverse Gamma density function g(u, pk , λk ) is unimodal, and the mode is
αn x/(αn + 2), which approaches x when αn → ∞. Therefore, for αn large enough, δ1 <
αn x/(αn + 2), and for all u ∈ (0, δ1 ), g(u, pk , λk ) ≤ g(δ1 , pk , λk ). Hence,
\[ \int_0^{\delta_1} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du \le g(\delta_1; p_k, \lambda_k)\int_0^{\delta_1}\Big(u - x - \frac{(2-2k)x}{k(\alpha_n+2)-2}\Big)^2 du. \]
Clearly the integral on the right-hand side is finite. From the definitions of p_k and λ_k,
\[ g(\delta_1; p_k, \lambda_k) = \frac{(k\alpha_n x)^{k(\alpha_n+2)-1}}{\Gamma(k(\alpha_n+2)-1)}\,\delta_1^{-k(\alpha_n+2)}\, e^{-k\alpha_n x\delta_1^{-1}}. \]
By the Stirling approximation, as α_n → ∞,
\[ \frac{(k\alpha_n x)^{k(\alpha_n+2)-1}}{\Gamma(k(\alpha_n+2)-1)} = \Big(\frac{k\alpha_n x}{k(\alpha_n+2)-2}\Big)^{k(\alpha_n+2)-2}\frac{k\alpha_n x\, e^{k(\alpha_n+2)-2}}{\sqrt{2\pi[k(\alpha_n+2)-2]}}\,[1+o(1)] = O\big(x^{k\alpha_n} e^{k\alpha_n}\sqrt{\alpha_n}\big). \tag{11} \]
Therefore,
\[ g(\delta_1; p_k, \lambda_k) = O\big(x^{k\alpha_n} e^{k\alpha_n}\delta_1^{-k\alpha_n} e^{-k\alpha_n x\delta_1^{-1}}\sqrt{\alpha_n}\big) = O\Big(\Big[\frac{x}{\delta_1}\exp\Big(1 - \frac{x}{\delta_1}\Big)\Big]^{k\alpha_n}\sqrt{\alpha_n}\Big). \]
This relation and δ₁ < x now readily imply that g(δ₁; p_k, λ_k) = o(1/α_n), which in turn implies that
\[ \int_0^{\delta_1} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du = o\Big(\frac{1}{\alpha_n}\Big). \tag{12} \]
Now take δ₂ = x + γ/2. Then,
\[ \Big|\int_{\delta_2}^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du\Big| \le c\int_{\delta_2}^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du. \]
But,
\[ \int_{\delta_2}^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du = \frac{(k\alpha_n x)^{k(\alpha_n+2)-1}}{\Gamma(k(\alpha_n+2)-1)}\int_{\delta_2}^\infty (u-\mu_k)^2\Big(\frac{1}{u}\Big)^{k(\alpha_n+2)}\exp\Big(-\frac{k\alpha_n x}{u}\Big)\,du. \]
The integral on the right-hand side is bounded above by
\[ 4\int_{\delta_2}^\infty\Big(\frac{1}{u}\Big)^{k(\alpha_n+2)-2}\exp\Big(-\frac{k\alpha_n x}{u}\Big)\,du. \]
By the change of variable v = kα_n x/u, we obtain
\[ \int_{\delta_2}^\infty\Big(\frac{1}{u}\Big)^{k(\alpha_n+2)-2}\exp\Big(-\frac{k\alpha_n x}{u}\Big)\,du = \Big(\frac{1}{k\alpha_n x}\Big)^{k(\alpha_n+2)-3}\int_0^{k\alpha_n x/\delta_2} v^{k(\alpha_n+2)-4}\exp(-v)\,dv. \tag{13} \]
As a function of v, v^{k(α_n+2)−4} exp(−v) is increasing in v ≤ k(α_n+2) − 4 and decreasing in v ≥ k(α_n+2) − 4. Since δ₂ > x, kα_n x/δ₂ < k(α_n+2) − 4 for α_n sufficiently large. Therefore, for all v ∈ [0, kα_n x/δ₂],
\[ v^{k(\alpha_n+2)-4}\exp(-v) \le \Big(\frac{k\alpha_n x}{\delta_2}\Big)^{k(\alpha_n+2)-4}\exp\Big(-\frac{k\alpha_n x}{\delta_2}\Big). \]
Plugging the above inequality into Equation (13), we obtain that
\[ \int_{\delta_2}^\infty\Big(\frac{1}{u}\Big)^{k(\alpha_n+2)-2}\exp\Big(-\frac{k\alpha_n x}{u}\Big)\,du \le \Big(\frac{1}{k\alpha_n x}\Big)^{k(\alpha_n+2)-3}\Big(\frac{k\alpha_n x}{\delta_2}\Big)^{k(\alpha_n+2)-4}\frac{k\alpha_n x}{\delta_2}\exp\Big(-\frac{k\alpha_n x}{\delta_2}\Big) = \Big(\frac{1}{\delta_2}\Big)^{k(\alpha_n+2)-3}\exp\Big(-\frac{k\alpha_n x}{\delta_2}\Big). \]
From Equation (11), we have
\[ \int_{\delta_2}^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du \le O\big(x^{k\alpha_n} e^{k\alpha_n}\sqrt{\alpha_n}\big)\cdot\Big(\frac{1}{\delta_2}\Big)^{k(\alpha_n+2)-3}\exp\Big(-\frac{k\alpha_n x}{\delta_2}\Big) = O\Big(\Big[\frac{x}{\delta_2}\exp\Big(1-\frac{x}{\delta_2}\Big)\Big]^{k\alpha_n}\sqrt{\alpha_n}\Big) = o\Big(\frac{1}{\alpha_n}\Big), \]
because 0 < x < δ₂ implies 0 < (x/δ₂) exp(1 − x/δ₂) < 1. Hence,
\[ \int_{\delta_2}^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du = o\Big(\frac{1}{\alpha_n}\Big). \tag{14} \]
Finally, we shall show that
\[ \int_{\delta_1}^{\delta_2} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du = o\Big(\frac{1}{\alpha_n}\Big). \]
By the uniform continuity of l'',
\[ \int_{\delta_1}^{\delta_2} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,|l''(\tilde u) - l''(\mu_k)|\,du \le \varepsilon\int_0^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,du, \]
by the fact that |ũ − μ_k| ≤ |u − μ_k| < γ for u ∈ [δ₁, δ₂] and α_n sufficiently large. Because ∫₀^∞ (u − μ_k)² g(u; p_k, λ_k) du = O(1/α_n), we obtain
\[ \int_{\delta_1}^{\delta_2} (u-\mu_k)^2 g(u; p_k, \lambda_k)\,|l''(\tilde u) - l''(\mu_k)|\,du = \varepsilon\cdot O\Big(\frac{1}{\alpha_n}\Big). \tag{15} \]
The arbitrariness of ε combined with Equations (12), (14), and (15) finally yields
\[ \int_0^\infty (u-\mu_k)^2 g(u; p_k, \lambda_k)\,[l''(\tilde u) - l''(\mu_k)]\,du = o\Big(\frac{1}{\alpha_n}\Big). \]
Hence the desired result in the lemma follows. □
In particular, if k = 1, then
\[ \int_0^\infty g(u; p_1, \lambda_1)\, l(u)\,du = l(x) + \frac{x^2 l''(x)}{2(\alpha_n-1)} + o\Big(\frac{1}{\alpha_n}\Big). \tag{16} \]
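Expansion (16) can be checked directly (illustrative Python, not part of the paper; for l(u) = u² the left-hand side is the second moment of an inverse Gamma(α_n + 1, α_n x) law, and the expansion holds with no remainder at all):

```python
import numpy as np
from math import lgamma

# for l(u) = u^2 the integral in (16) is the second moment of an inverse
# Gamma(p1 = alpha + 1, lam1 = alpha*x) law, namely
# lam1^2/((p1-1)(p1-2)) = alpha x^2/(alpha - 1), which equals
# l(x) + x^2 l''(x)/(2(alpha - 1)) exactly
def trapezoid(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

alpha, x = 50.0, 0.7
p1, lam1 = alpha + 1.0, alpha * x
u = np.linspace(0.2, 2.0, 400_000)
dens = np.exp(p1 * np.log(lam1) - lgamma(p1) - (p1 + 1.0) * np.log(u) - lam1 / u)

lhs = trapezoid(u**2 * dens, u)                   # integral of u^2 g(u; p1, lam1)
rhs = x**2 + x**2 * 2.0 / (2.0 * (alpha - 1.0))   # l(x) + x^2 l''(x)/(2(alpha-1))
print(lhs, rhs)
```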
To analyse the limiting behaviour of f̂_n(x) as x → 0, similarly to the symmetric kernel case, we analyse the limiting bias of f̂_n(x) at x = u/α_n, where 0 < u < 1. It is easy to see that
\[ \hat f_n\Big(\frac{u}{\alpha_n}\Big) = \frac{1}{n}\sum_{i=1}^n\frac{1}{X_i\,\Gamma(\alpha_n+1)}\Big(\frac{u}{X_i}\Big)^{\alpha_n+1} e^{-u/X_i}. \]
Letting p = α_n + 1 and λ = u, we can show that
\[ E\hat f_n\Big(\frac{u}{\alpha_n}\Big) = \int_0^\infty g(x; p, \lambda)\, f(x)\,dx = f\Big(\frac{u}{\alpha_n}\Big) + O\Big(\frac{1}{\alpha_n}\Big). \]
Therefore, f̂_n(x) does not suffer from the boundary effect.
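The absence of boundary bias can be illustrated numerically (a sketch with arbitrary sample size and smoothing values, not the paper's experiment; the symmetric-kernel comparator uses a standard normal kernel):

```python
import numpy as np
from math import lgamma, pi, sqrt

# sample from f(x) = exp(-x) on (0, inf) and compare the varying kernel
# density estimate with a symmetric (normal) kernel estimate near x = 0,
# where the symmetric kernel leaks mass across the boundary
rng = np.random.default_rng(1)
X = rng.exponential(1.0, 2000)
alpha, h = 25.0, 0.2                           # illustrative smoothing values

def f_varying(x):
    t = alpha * x / X
    return np.mean(np.exp((alpha + 1) * np.log(t) - t - lgamma(alpha + 1)) / X)

def f_normal(x):
    return np.mean(np.exp(-0.5 * ((x - X) / h) ** 2)) / (h * sqrt(2 * pi))

x0 = 0.05                                      # a point close to the boundary
print(f_varying(x0), f_normal(x0), np.exp(-x0))
```

The normal-kernel estimate markedly underestimates f near 0, while the varying kernel estimate does not.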
The following decomposition of m̂_n(x) will be used repeatedly in the proofs below:
\[ \hat m_n(x) - m(x) = \frac{B_n(x) + V_n(x)}{f(x)} + \Big[\frac{1}{\hat f_n(x)} - \frac{1}{f(x)}\Big][B_n(x) + V_n(x)], \]
where
\[ B_n(x) = \frac{1}{n}\sum_{i=1}^n K_{\alpha_n}(x, X_i)[m(X_i) - m(x)], \qquad V_n(x) = \frac{1}{n}\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_i, \]
with K_{α_n}(x, X_i) defined in Equation (4). Now we are ready to prove Theorem 3.1.
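Before turning to the proofs, note that this decomposition is an exact algebraic identity, which can be verified numerically (illustrative Python; the uniform design, error law, and smoothing value are our own choices):

```python
import numpy as np
from math import lgamma

# m_hat(x) - m(x) = (B_n + V_n)/f_hat
#                 = (B_n + V_n)/f + (1/f_hat - 1/f)(B_n + V_n)
rng = np.random.default_rng(3)
n, alpha, x = 500, 30.0, 0.8
X = rng.uniform(0.2, 2.0, n)                 # design density f = 1/1.8 on (0.2, 2)
eps = rng.normal(0.0, 0.5, n)
m = lambda u: (u - 1.5) ** 2
Y = m(X) + eps

t = alpha * x / X
K = np.exp((alpha + 1) * np.log(t) - t - lgamma(alpha + 1)) / X  # K_alpha(x, X_i)
f_hat = K.mean()
m_hat = np.sum(K * Y) / np.sum(K)
B_n = np.mean(K * (m(X) - m(x)))
V_n = np.mean(K * eps)
f = 1.0 / 1.8

lhs = m_hat - m(x)
rhs = (B_n + V_n) / f + (1.0 / f_hat - 1.0 / f) * (B_n + V_n)
print(abs(lhs - rhs))    # zero up to floating-point error
```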
Proof of Theorem 3.1 First, we shall compute the conditional bias of m̂_n(x). Direct calculation shows that E[m̂_n(x)|X] − m(x) = B_n(x)/f̂_n(x). Since f̂_n(x) = f(x) + o_p(1), it suffices to discuss the asymptotic behaviour of B_n(x). Note that EB_n(x) = EK_{α_n}(x, X)m(X) − m(x)EK_{α_n}(x, X). But
\[ E(K_{\alpha_n}(x, X)m(X)) = \int_0^\infty\frac{1}{u}\Big(\frac{\alpha_n x}{u}\Big)^{\alpha_n+1}\frac{\exp(-\alpha_n x/u)}{\Gamma(\alpha_n+1)}\, m(u) f(u)\,du = \int_0^\infty g(u; p_1, \lambda_1)\, m(u) f(u)\,du, \]
where p₁ = α_n + 1, λ₁ = α_n x. Let H(u) = m(u)f(u). Applying Equation (16) with l(u) = H(u) and with l(u) = f(u), respectively, yields
\[ E(K_{\alpha_n}(x, X)m(X)) = m(x)f(x) + \frac{x^2 H''(x)}{2(\alpha_n-1)} + o\Big(\frac{1}{\alpha_n}\Big), \qquad m(x)EK_{\alpha_n}(x, X) = m(x)\Big[f(x) + \frac{x^2 f''(x)}{2(\alpha_n-1)} + o\Big(\frac{1}{\alpha_n}\Big)\Big]. \]
Therefore,
\[ EB_n(x) = \frac{x^2[H''(x) - m(x)f''(x)]}{2(\alpha_n-1)} + o\Big(\frac{1}{\alpha_n}\Big). \tag{17} \]
Direct calculation shows that x²[H''(x) − m(x)f''(x)]/2 = b(x)f(x), where b(x) is defined in Equation (6).
Next, consider
\[ \operatorname{Var}(B_n(x)) = \frac{1}{n} E K_{\alpha_n}^2(x, X)[m(X) - m(x)]^2 - \frac{1}{n}\big[E K_{\alpha_n}(x, X)(m(X) - m(x))\big]^2. \]
Note that EK²_{α_n}(x, X)(m(X) − m(x))² equals
\[ \int_0^\infty\frac{1}{u^2}\Big(\frac{\alpha_n x}{u}\Big)^{2(\alpha_n+1)}\frac{1}{\Gamma^2(\alpha_n+1)}\exp\Big(-\frac{2\alpha_n x}{u}\Big)(m(u)-m(x))^2 f(u)\,du = \frac{\Gamma(2\alpha_n+3)}{x\alpha_n 2^{2\alpha_n+3}\Gamma^2(\alpha_n+1)}\int_0^\infty g(u; p_2, \lambda_2)(m(u)-m(x))^2 f(u)\,du, \]
where p₂ = 2α_n + 3, λ₂ = 2α_n x. By the Stirling approximation, for α_n sufficiently large,
\[ \frac{\Gamma(2\alpha_n+3)}{\alpha_n 2^{2\alpha_n+3}\Gamma^2(\alpha_n+1)} = \frac{\sqrt{\alpha_n}}{2\sqrt{\pi}}[1+o(1)]. \]
A Taylor expansion of m(u) and f(u) around α_n x/(α_n + 1) up to the first order gives the following expansion for ∫₀^∞ g(u; p₂, λ₂)(m(u) − m(x))² f(u) du:
\[ (m'(x))^2 f(x)\int_0^\infty\Big(u - \frac{\alpha_n x}{\alpha_n+1}\Big)^2 g(u; p_2, \lambda_2)\,du + o\Big(\frac{1}{\alpha_n}\Big), \]
by the assumptions (A1) and (A2), and the fact
\[ \int_0^\infty\Big(u - \frac{\alpha_n x}{\alpha_n+1}\Big)^2 g(u; p_2, \lambda_2)\,du = \frac{x^2\alpha_n^2}{(\alpha_n+1)^2(2\alpha_n+1)} = O\Big(\frac{1}{\alpha_n}\Big). \]
Therefore,
\[ \frac{1}{n} E K_{\alpha_n}^2(x, X)[m(X)-m(x)]^2 = O\Big(\frac{1}{n\sqrt{\alpha_n}}\Big). \]
From Equation (17), EB_n(x) = O(1/α_n). Hence,
\[ \operatorname{Var}(B_n(x)) = O\Big(\frac{1}{n\sqrt{\alpha_n}}\Big) + O\Big(\frac{1}{n\alpha_n^2}\Big). \tag{18} \]
Therefore, Equations (17) and (18), and the fact that x²[H''(x) − m(x)f''(x)]/2 = b(x)f(x), together yield
\[ \frac{B_n(x)}{f(x)} = \frac{b(x)}{\alpha_n-1} + O_p\Big(\frac{1}{\sqrt{n\sqrt{\alpha_n}}}\Big) + o_p\Big(\frac{1}{\alpha_n}\Big). \tag{19} \]
Moreover,
\[ E[\hat m_n(x)\,|\,\mathbf{X}] - m(x) = \Big[\frac{1}{f(x)} + o_p(1)\Big]\cdot[EB_n(x) + B_n(x) - EB_n(x)] = \Big[\frac{1}{f(x)} + o_p(1)\Big]\cdot\Big[\frac{b(x)f(x)}{\alpha_n} + O_p\Big(\frac{1}{\sqrt{n\sqrt{\alpha_n}}}\Big) + o_p\Big(\frac{1}{\alpha_n}\Big)\Big], \]
which implies the claim (7) about the conditional bias of m̂_n(x).
Next, we verify the claim (8) about the conditional variance of m̂_n(x). In fact, with σ²(x) = E(ε²|X = x),
\[ \operatorname{Var}[\hat m_n(x)\,|\,\mathbf{X}] = \frac{1}{\hat f_n^2(x)}\cdot\frac{1}{n^2}\sum_{i=1}^n K_{\alpha_n}^2(x, X_i)\,\sigma^2(X_i). \tag{20} \]
Verify that under condition (A3) about σ²(x),
\[ E\Big[\frac{1}{n^2}\sum_{i=1}^n K_{\alpha_n}^2(x, X_i)\sigma^2(X_i)\Big] = \frac{\sigma^2(x) f(x)\sqrt{\alpha_n}}{2nx\sqrt{\pi}} + o\Big(\frac{\sqrt{\alpha_n}}{n}\Big), \]
which, together with Equation (20) and the fact f̂_n(x) = f(x) + o_p(1), implies the claim (8). □
Proof of Theorem 3.2 Let ξ_in(x) = n⁻¹[K_{α_n}(x, X_i) − EK_{α_n}(x, X)]. Then,
\[ \hat f_n(x) = \sum_{i=1}^n \xi_{in}(x) + EK_{\alpha_n}(x, X). \]
Since EK_{α_n}(x, X) = f(x) + x²f''(x)/[2(α_n − 1)] + o(1/α_n),
\[ \hat f_n(x) - f(x) - \frac{x^2 f''(x)}{2(\alpha_n-1)} + o\Big(\frac{1}{\alpha_n}\Big) = \sum_{i=1}^n \xi_{in}(x). \]
The Lindeberg–Feller central limit theorem (CLT) will be used to show the asymptotic normality of Σᵢ₌₁ⁿ ξ_in(x). For any a > 0, b > 0, and r > 1, using the well-known inequality (a + b)^r ≤ 2^{r−1}(a^r + b^r), we have
\[ E|\xi_{in}(x)|^{2+\delta} \le n^{-(2+\delta)} 2^{1+\delta}\big[E(K_{\alpha_n}(x, X))^{2+\delta} + (EK_{\alpha_n}(x, X))^{2+\delta}\big]. \]
Let λ_δ = (2 + δ)α_n x and p_δ = (2 + δ)(2 + α_n) − 1. A tedious calculation shows that E(K_{α_n}(x, X))^{2+δ} can be written as
\[ \frac{1}{(\alpha_n x)^{1+\delta}(2+\delta)^{(2+\delta)(2+\alpha_n)-1}}\cdot\frac{\Gamma((2+\delta)(2+\alpha_n)-1)}{\Gamma^{2+\delta}(\alpha_n+1)}\int_0^\infty g(u; p_\delta, \lambda_\delta)\, f(u)\,du. \]
For n and α_n large enough, using the Stirling approximation, we have
\[ \frac{\Gamma((2+\delta)(2+\alpha_n)-1)}{\Gamma^{2+\delta}(\alpha_n+1)} = O\big((2+\delta)^{(2+\delta)(2+\alpha_n)}\,\alpha_n^{2(2+\delta)-(5+\delta)/2}\big). \]
Also, we have ∫₀^∞ g(u; p_δ, λ_δ) f(u) du = f(x) + o(1). Hence,
\[ EK_{\alpha_n}^{2+\delta}(x, X) = O\big(\alpha_n^{(\delta+1)/2}\big). \]
Note that
\[ EK_{\alpha_n}(x, X) = \int_0^\infty g(u; p_1, \lambda_1)\, f(u)\,du, \qquad EK_{\alpha_n}^2(x, X) = \frac{\Gamma(2\alpha_n+3)}{x\alpha_n 2^{2\alpha_n+3}\Gamma^2(\alpha_n+1)}\int_0^\infty g(u; p_2, \lambda_2)\, f(u)\,du. \]
Hence, by Lemma 6.1, we obtain
\[ v_n^2 = \operatorname{Var}\Big(\sum_{i=1}^n \xi_{in}(x)\Big) = \operatorname{Var}(\hat f_n(x)) = n^{-1}\big[EK_{\alpha_n}^2(x, X) - (EK_{\alpha_n}(x, X))^2\big] = \frac{\sqrt{\alpha_n}\, f(x)}{2nx\sqrt{\pi}} + o\Big(\frac{\sqrt{\alpha_n}}{n}\Big). \tag{21} \]
This fact, together with EK_{α_n}(x, X) = f(x) + o(1), implies
\[ v_n^{-(2+\delta)}\sum_{i=1}^n E\xi_{in}^{2+\delta}(x) = n v_n^{-(2+\delta)} E\xi_{1n}^{2+\delta} = O\Big(\Big(\frac{\sqrt{\alpha_n}}{n}\Big)^{\delta/2}\Big), \]
which converges to 0 by assumption (A4). Hence, the Lindeberg–Feller condition holds. This completes the proof of Theorem 3.2. □
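The leading variance term just derived can be checked by a small Monte Carlo experiment (illustrative Python with arbitrary n, α_n, and an Exp(1) design; only rough agreement is expected at finite n):

```python
import numpy as np
from math import lgamma

# Var(f_hat(x)) should be close to sqrt(alpha) f(x) / (2 n x sqrt(pi))
rng = np.random.default_rng(2)
n, alpha, x = 2000, 50.0, 1.0

def f_hat(X):
    # varying kernel density estimate at the point x
    t = alpha * x / X
    return np.mean(np.exp((alpha + 1) * np.log(t) - t - lgamma(alpha + 1)) / X)

reps = np.array([f_hat(rng.exponential(1.0, n)) for _ in range(2000)])
mc_var = reps.var()                                  # Monte Carlo variance
asy_var = np.sqrt(alpha) * np.exp(-x) / (2 * n * x * np.sqrt(np.pi))
print(mc_var / asy_var)   # ratio of order 1 is expected
```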
Proof of Theorem 3.3 Fix an x > 0. To show the asymptotic normality of m̂_n(x), we again use the decomposition of m̂_n(x) − m(x) given before the proof of Theorem 3.1.
We shall first show that V_n(x) is asymptotically normal. For this purpose, let η_in = n⁻¹K_{α_n}(x, X_i)ε_i, so that V_n(x) = Σᵢ₌₁ⁿ η_in. Clearly, Eη_in = 0. By assumption (A3) on σ²(x), a routine argument leads to Eη²_in = [√α_n f(x)σ²(x)/(2n²x√π)][1 + o(1)]. Therefore,
\[ s_n^2 = \operatorname{Var}\Big(\sum_{i=1}^n \eta_{in}\Big) = nE\eta_{in}^2 = \frac{f(x)\sigma^2(x)\sqrt{\alpha_n}}{2nx\sqrt{\pi}}[1+o(1)]. \]
Using a similar argument as in dealing with E|ξ_in(x)|^{2+δ} in the proof of Theorem 3.2, verify that for any δ > 0,
\[ E|\eta_{in}|^{2+\delta} = n^{-(2+\delta)} EK_{\alpha_n}^{2+\delta}(x, X)\, E(|\varepsilon|^{2+\delta}|X = x) = O\big(n^{-(2+\delta)}\alpha_n^{(1+\delta)/2}\big). \]
Hence,
\[ s_n^{-(2+\delta)}\sum_{i=1}^n E|\eta_{in}|^{2+\delta} = O\Big(\Big(\frac{\sqrt{\alpha_n}}{n}\Big)^{\delta/2}\Big) = o(1). \]
Hence, by the Lindeberg–Feller Central Limit Theorem (CLT), s_n⁻¹V_n(x) →_d N(0, 1).
From the asymptotic results on f̂_n(x) and V_n(x) in Theorem 3.2 and fact (19) about B_n(x), we obtain that
\[ s_n^{-1}\Big[\frac{1}{\hat f_n(x)} - \frac{1}{f(x)}\Big][B_n(x) + V_n(x)] = o_p(1). \]
This, together with the result that \(\sqrt{n/\sqrt{\alpha_n}}\cdot O_p(1/\sqrt{n\sqrt{\alpha_n}}) = o_p(1)\), implies
\[ f(x)s_n^{-1}\Big(\hat m_n(x) - m(x) - \frac{b(x)}{\alpha_n-1} + o\Big(\frac{1}{\alpha_n}\Big)\Big) = s_n^{-1} V_n(x) \to_d N(0, 1). \]
The proof is completed by noting that f(x)s_n⁻¹ = (v(x)√α_n/n)^{−1/2}. □
Proof of Theorem 3.4 Recall that E f̂_n(x) = ∫₀^∞ g(u; p₁, λ₁) f(u) du. By Equation (16) and the boundedness of x²f''(x) on [a, b], we obtain
\[ E\hat f_n(x) - f(x) = O\Big(\frac{1}{\alpha_n}\Big), \quad \text{for any } x \in [a, b]. \]
Hence sup_{a≤x≤b} |E f̂_n(x) − f(x)| = O(1/α_n). Therefore, we only need to show that f̂_n(x) − E f̂_n(x) = o(α_n^{1/4}√(log n)/√n). For this purpose, let ξ_in(x) = n⁻¹[K_{α_n}(x, X_i) − EK_{α_n}(x, X_i)]; hence f̂_n(x) − E f̂_n(x) = Σᵢ₌₁ⁿ ξ_in(x). In order to apply the Bernstein inequality, we have to verify the Cramér condition for ξ_in; that is, we need to show that, for k ≥ 3, E|ξ_1n|^k ≤ c_n^{k−2} k! Eξ²_1n for some c_n depending only on n.
Note that K_{α_n}(x, X) can be written as
\[ K_{\alpha_n}(x, X) = \frac{\alpha_n^{\alpha_n+1}}{x\Gamma(\alpha_n+1)}\Big(\frac{x}{X}\Big)^{\alpha_n+2}\exp\Big(-\frac{\alpha_n x}{X}\Big). \]
As a function of u, u^{α_n+2} exp(−α_n u) attains its maximum at u = (α_n + 2)/α_n. Therefore, for any x and X, by the Stirling formula,
\[ K_{\alpha_n}(x, X) \le \frac{\alpha_n^{\alpha_n+1}}{x\Gamma(\alpha_n+1)}\Big(\frac{\alpha_n+2}{\alpha_n}\Big)^{\alpha_n+2}\exp(-(\alpha_n+2)) \le \frac{(\alpha_n+2)^2}{x\alpha_n}\,\frac{(\alpha_n+2)^{\alpha_n}}{\Gamma(\alpha_n+1)}\exp(-(\alpha_n+2)) = \frac{(\alpha_n+2)^2}{x\alpha_n}\,\frac{(\alpha_n+2)^{\alpha_n}\exp(-(\alpha_n+2))}{\sqrt{2\pi\alpha_n}\,\alpha_n^{\alpha_n}e^{-\alpha_n}(1+o(1))} \le \frac{c\sqrt{\alpha_n}}{x}, \tag{22} \]
for some positive constant c. Therefore, for any k ≥ 3, and α_n large enough,
\[ E|\xi_{in}|^k = n^{-k} E|K_{\alpha_n}(x, X_i) - EK_{\alpha_n}(x, X_i)|^k \le \Big(\frac{c\sqrt{\alpha_n}}{xn}\Big)^{k-2} n^{-2} E|K_{\alpha_n}(x, X_i) - EK_{\alpha_n}(x, X_i)|^2 = \Big(\frac{c\sqrt{\alpha_n}}{xn}\Big)^{k-2} E\xi_{in}^2. \]
With v_n := (Σᵢ₌₁ⁿ Eξ²_in)^{1/2}, this immediately implies
\[ E|\xi_{in}|^k \le k!\Big(\frac{c\sqrt{\alpha_n}}{nx}\Big)^{k-2} E\xi_{in}^2 \quad \forall\, 1 \le i \le n, \]
or
\[ E\Big|\frac{\xi_{in}}{v_n}\Big|^k \le k!\Big(\frac{c\sqrt{\alpha_n}}{nxv_n}\Big)^{k-2} E\Big(\frac{\xi_{in}}{v_n}\Big)^2 \quad \forall\, 1 \le i \le n. \]
By Equation (21), v_n² = √α_n f(x)/(2nx√π) + o(√α_n/n). This, together with the fact that xf(x) is bounded away from 0 and ∞ on [a, b], implies
\[ E\Big|\frac{\xi_{in}}{v_n}\Big|^k \le k!\Big(\frac{c\alpha_n^{1/4}}{\sqrt{n}}\Big)^{k-2} E\Big(\frac{\xi_{in}}{v_n}\Big)^2. \tag{23} \]
Then, by Equation (23) and the Bernstein inequality, for any positive number c,
\[ P\Big(\Big|\frac{\sum_{i=1}^n \xi_{in}}{v_n}\Big| \ge c\sqrt{\log n}\Big) \le 2\exp\Big(-\frac{c^2\log n}{4(1 + c\alpha_n^{1/4}\sqrt{\log n}/\sqrt{n})}\Big). \]
Since α_n^{1/2} log n/n → 0, for n large enough,
\[ P\Big(\Big|\frac{\sum_{i=1}^n \xi_{in}}{v_n}\Big| \ge c\sqrt{\log n}\Big) \le 2\exp\Big(-\frac{c^2\log n}{8}\Big). \]
Upon taking c = 8, we have
\[ P\Big(\Big|\sum_{i=1}^n \xi_{in}\Big| \ge c\sqrt{\log n}\, v_n\Big) \le \frac{2}{n^8}. \]
Since Σ_{n=1}^∞ n⁻⁸ < ∞, by the Borel–Cantelli lemma and the fact that v_n² = O(√α_n/n), we obtain
\[ \hat f_n(x) - E\hat f_n(x) = \sum_{i=1}^n \xi_{in} = o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big). \]
To bound Σᵢ₌₁ⁿ ξ_in uniformly for all x ∈ [a, b], we partition the interval [a, b] by the equally spaced points x_j, j = 0, 1, 2, …, N_n, such that a = x₀ < x₁ < x₂ < ⋯ < x_{N_n} = b, with N_n = n³. It is easily seen that
\[ P\Big(\max_{0\le j\le N_n}\Big|\sum_{i=1}^n \xi_{in}(x_j)\Big| > c\,\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big) \le \frac{2N_n}{n^8} = \frac{2}{n^5}. \]
The Borel–Cantelli lemma implies that
\[ \max_{0\le j\le N_n}\Big|\sum_{i=1}^n \xi_{in}(x_j)\Big| = o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big). \tag{24} \]
For any x ∈ [x_j, x_{j+1}],
\[ \xi_{in}(x) - \xi_{in}(x_j) = n^{-1}[K_{\alpha_n}(x, X_i) - EK_{\alpha_n}(x, X_i)] - n^{-1}[K_{\alpha_n}(x_j, X_i) - EK_{\alpha_n}(x_j, X_i)]. \]
Then, a Taylor expansion of K_{α_n}(x, X_i) at x = x_j up to the first order leads to the following expression for the difference K_{α_n}(x, X_i) − K_{α_n}(x_j, X_i):
\[ \frac{x - x_j}{\Gamma(\alpha_n+1)\,\alpha_n\tilde{x}^2}\bigg[(\alpha_n+1)\Big(\frac{\alpha_n\tilde{x}}{X_i}\Big)^{\alpha_n+2}\exp\Big(-\frac{\alpha_n\tilde{x}}{X_i}\Big) - \Big(\frac{\alpha_n\tilde{x}}{X_i}\Big)^{\alpha_n+3}\exp\Big(-\frac{\alpha_n\tilde{x}}{X_i}\Big)\bigg], \]
where |x − x̃| ≤ x_{j+1} − x_j ≤ (b − a)/N_n. Note that for p > 0, the maximum of x^p e^{−x} for x > 0 is attained at x = p and equals p^p e^{−p}. Hence,
\[ \Big(\frac{\alpha_n\tilde{x}}{X_i}\Big)^{\alpha_n+2}\exp\Big(-\frac{\alpha_n\tilde{x}}{X_i}\Big) \le (\alpha_n+2)^{\alpha_n+2} e^{-\alpha_n-2}, \qquad \Big(\frac{\alpha_n\tilde{x}}{X_i}\Big)^{\alpha_n+3}\exp\Big(-\frac{\alpha_n\tilde{x}}{X_i}\Big) \le (\alpha_n+3)^{\alpha_n+3} e^{-\alpha_n-3}. \]
Therefore, for all 1 ≤ i ≤ n,
\[ |K_{\alpha_n}(x, X_i) - K_{\alpha_n}(x_j, X_i)| \le \frac{(x - x_j)\,\alpha_n^{\alpha_n+2}\exp(-\alpha_n)}{\Gamma(\alpha_n+1)\,\tilde{x}^2}\Big[\Big(1+\frac{2}{\alpha_n}\Big)^{\alpha_n+3} e^{-2} + \Big(1+\frac{3}{\alpha_n}\Big)^{\alpha_n+3} e^{-3}\Big]. \]
With this upper bound together with the Stirling approximation for the Gamma function, one concludes that for n and α_n large enough,
\[ |K_{\alpha_n}(x, X_i) - K_{\alpha_n}(x_j, X_i)| \le \frac{c(x - x_j)\,\alpha_n^{3/2}}{\tilde{x}^2}, \]
for some positive constant c. Because 0 ≤ x − x_j ≤ (b − a)/N_n and 1/x̃ ≤ 1/a,
\[ |K_{\alpha_n}(x, X_i) - K_{\alpha_n}(x_j, X_i)| \le \frac{c\alpha_n^{3/2}}{N_n}, \tag{25} \]
which implies that when n is large enough, for some constant c,
\[ |\xi_{in}(x) - \xi_{in}(x_j)| \le \frac{c\alpha_n^{3/2}}{nN_n}, \quad 1 \le i \le n. \]
These bounds imply that for all x ∈ [x_j, x_{j+1}] and 0 ≤ j ≤ N_n − 1,
\[ \Big|\sum_{i=1}^n \xi_{in}(x) - \sum_{i=1}^n \xi_{in}(x_j)\Big| \le \frac{c\alpha_n^{3/2}}{n^3} = o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big). \tag{26} \]
Finally, from Equations (24) and (26), we obtain
\[ \sup_{a\le x\le b}|\hat f_n(x) - E\hat f_n(x)| = \sup_{a\le x\le b}\Big|\sum_{i=1}^n \xi_{in}(x)\Big| \le \max_{0\le j\le N_n}\Big|\sum_{i=1}^n \xi_{in}(x_j)\Big| + \max_{0\le j\le N_n-1}\sup_{x\in[x_j, x_{j+1}]}\Big|\sum_{i=1}^n \xi_{in}(x) - \sum_{i=1}^n \xi_{in}(x_j)\Big| = o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big). \]
This, together with the result sup_{a≤x≤b} |E f̂_n(x) − f(x)| = O(1/α_n), completes the proof of Theorem 3.4. □
Proof of Theorem 3.5 By the decomposition of m̂_n(x) − m(x) given before the proof of Theorem 3.1 and by Theorem 3.4, it suffices to prove the following two facts:
\[ \sup_{x\in[a,b]}\Big|\frac{B_n(x)}{\hat f_n(x)}\Big| = O\Big(\frac{1}{\alpha_n}\Big) + o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big), \tag{27} \]
\[ \sup_{x\in[a,b]}\Big|\frac{V_n(x)}{\hat f_n(x)}\Big| = O\Big(\frac{1}{\alpha_n}\Big) + o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big). \tag{28} \]
We shall prove Equation (28) only, the proof of Equation (27) being similar.
Let β, η be such that β < 2/5, β(2 + η) > 1, and β(1 + η) > 2/5, and define d_n = n^β. For each i, write ε_i = ε_{i1}^{d_n} + ε_{i2}^{d_n} + μ_i^{d_n}, with
\[ \varepsilon_{i1}^{d_n} = \varepsilon_i I(|\varepsilon_i| > d_n), \qquad \varepsilon_{i2}^{d_n} = \varepsilon_i I(|\varepsilon_i| \le d_n) - \mu_i^{d_n}, \qquad \mu_i^{d_n} = E[\varepsilon_i I(|\varepsilon_i| \le d_n)\,|\,X_i]. \]
Hence,
\[ \frac{V_n(x)}{\hat f_n(x)} = \frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i1}^{d_n}}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)} + \frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i2}^{d_n}}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)} + \frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\mu_i^{d_n}}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)}. \]
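The truncation scheme can be illustrated in code (a sketch with a heavy-tailed error law of our own choosing; here the errors are independent of X_i, so μ_i^{d_n} reduces to a constant, estimated by a sample mean):

```python
import numpy as np

# for any constant mu, the three pieces reassemble eps exactly
rng = np.random.default_rng(4)
n, beta = 10_000, 0.35                        # beta < 2/5, as required above
d_n = n ** beta
eps = rng.standard_t(df=4, size=n)            # E|eps|^{2+eta} < inf for eta < 2
mu = np.mean(eps * (np.abs(eps) <= d_n))      # empirical stand-in for mu_i^{d_n}
eps1 = eps * (np.abs(eps) > d_n)
eps2 = eps * (np.abs(eps) <= d_n) - mu
print(d_n, mu)
```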
Since E(ε_i|X_i) = 0, we have μ_i^{d_n} = −E[ε_i I(|ε_i| > d_n)|X_i]; then from assumption (A4), we have |μ_i^{d_n}| ≤ cd_n^{−(1+η)}. Hence,
\[ \sup_{x\in[a,b]}\Big|\frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\mu_i^{d_n}}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)}\Big| \le c d_n^{-(1+\eta)} = o\Big(\frac{\alpha_n^{1/4}}{\sqrt{n}}\Big). \]
Now, consider the part involving ε_{i1}^{d_n}. By the Markov inequality,
\[ \sum_{n=1}^\infty P(|\varepsilon_n| > d_n) \le E|\varepsilon|^{2+\eta}\sum_{n=1}^\infty\frac{1}{d_n^{2+\eta}} < \infty. \]
The Borel–Cantelli lemma implies that
\[ P\{\exists N,\ |\varepsilon_n| \le d_n \text{ for } n > N\} = 1 \;\Rightarrow\; P\{\exists N,\ |\varepsilon_i| \le d_n,\ i = 1, 2, \ldots, n, \text{ for } n > N\} = 1 \;\Rightarrow\; P\{\exists N,\ \varepsilon_{i,1}^{d_n} = 0,\ i = 1, 2, \ldots, n, \text{ for } n > N\} = 1. \]
Hence,
\[ \sup_{x\in[a,b]}\Big|\frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i,1}^{d_n}}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)}\Big| = O(n^{-k}) \quad \forall\, k > 0. \]
For the term ε_{i,2}^{d_n}, we have E[ε_{i,2}^{d_n}|X_i] = 0, and it is easy to show that
\[ \operatorname{Var}(\varepsilon_{i,2}^{d_n}\,|\,X_i) = \sigma^2(X_i) + O[d_n^{-\eta} + d_n^{-2(1+\eta)}], \]
and for k ≥ 2, E(|ε_{i,2}^{d_n}|^k|X_i) ≤ 2^{k−2} d_n^{k−2} E(|ε_{i,2}^{d_n}|²|X_i). Then, from Equation (22) and the boundedness of σ²(x) over (0, ∞), we have
\[ E|n^{-1}K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}|^k \le n^{-k} E\big[K_{\alpha_n}^k(x, X)\, E(|\varepsilon_{i,2}^{d_n}|^k\,|\,X_i)\big] \le c n^{-k} 2^{k-2} d_n^{k-2} E K_{\alpha_n}^k(x, X)\sigma^2(X) \le \Big(\frac{c d_n\sqrt{\alpha_n}}{n}\Big)^{k-2} E|n^{-1}K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}|^2. \]
Because
\[ E|n^{-1}K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}|^2 = \frac{1}{n^2} E[K_{\alpha_n}^2(x, X)\sigma^2(X)][1+o(1)] = \frac{\sqrt{\alpha_n}\, f(x)\sigma^2(x)}{2n^2\sqrt{\pi}x}[1+o(1)], \]
the random variable n⁻¹K_{α_n}(x, X_i)ε_{i,2}^{d_n} satisfies the Cramér condition. Therefore, using the Bernstein inequality as in proving Theorem 3.4, one establishes the fact that for all c > 0,
\[ P\Big(\Big|\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}\Big| \ge c\sqrt{\log n}\,\Big(\sum_{i=1}^n E[K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}]^2\Big)^{1/2}\Big) \le 2\exp\Big(-\frac{c^2\log n}{8}\Big). \]
Take c = 4 and C(x) = c√(f(x)σ²(x)/(2x√π)) in the above inequality to obtain
\[ P\Big(\Big|\frac{1}{n}\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}\Big| \ge C(x)\sqrt{\frac{\alpha_n^{1/2}\log n}{n}}\Big) \le \frac{2}{n^2}; \]
by the Borel–Cantelli lemma and the boundedness of f(x)σ²(x)/x over x ∈ [a, b], this implies, for each x ∈ [a, b],
\[ \Big|\frac{1}{n}\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}\Big| = o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big). \]
To show that the above bound is indeed uniform, we can use the same technique as in showing the uniform convergence of f̂_n(x) in the proof of Theorem 3.4. In fact, the only major difference is that, instead of using Equation (25), we should use the inequality
\[ |K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n} - K_{\alpha_n}(x_j, X_i)\varepsilon_{i,2}^{d_n}| \le \frac{c\alpha_n^{3/2} d_n}{N_n}, \quad x \in [x_j, x_{j+1}],\ 1 \le i \le n. \]
The above result, together with the facts that f(x) is bounded away from 0 on [a, b] and sup_{x∈[a,b]} |f̂_n(x) − f(x)| = o(1), implies
\[ \sup_{x\in[a,b]}\Big|\frac{\sum_{i=1}^n K_{\alpha_n}(x, X_i)\varepsilon_{i,2}^{d_n}}{\sum_{i=1}^n K_{\alpha_n}(x, X_i)}\Big| = o\Big(\frac{\alpha_n^{1/4}\sqrt{\log n}}{\sqrt{n}}\Big), \quad \text{a.s.} \]
This concludes the proof of Theorem 3.5. □
Acknowledgements
The authors gratefully acknowledge the editors and two referees for their helpful comments which improved the
presentation of the paper. Research supported in part by the NSF DMS Collaborative Grants 1205271 and 1205276.
References
Abadir, K.M., and Lawford, S. (2004), ‘Optimal Asymmetric Kernels’, Economics Letters, 83, 61–68.
Bouezmarni, T., and Rolin, J. (2003), ‘Consistency of the Beta Kernel Density Function Estimator’, Canadian Journal of
Statistics, 31, 89–98.
Chaubey, Y.P., Sen, A., and Sen, P.K. (2012), ‘A New Smooth Density Estimator for Non-Negative Random Variables’,
Journal of Indian Statistical Association, 50, 83–104.
Chen, S.X. (1999), ‘Beta Kernel Estimators for Density Functions’, Computational Statistics & Data Analysis, 31,
131–145.
Chen, S.X. (2000a), ‘Beta Kernel Smoothers for Regression Curves’, Statistica Sinica, 10, 73–91.
Chen, S.X. (2000b), ‘Probability Density Function Estimation Using Gamma Kernels’, Annals of the Institute of Statistical
Mathematics, 52, 471–480.
Chen, S.X. (2002), 'Local Linear Smoothers Using Asymmetric Kernels', Annals of the Institute of Statistical Mathematics, 54, 312–323.
Cline, D.B. (1988), ‘Admissible Kernel Estimators of a Multivariate Density’, The Annals of Statistics, 16, 1421–1427.
Cowling, A., and Hall, P. (1996), ‘On Pseudodata Methods for Removing Boundary Effects in Kernel Density Estimation’,
Journal of the Royal Statistical Society. Series B (Methodological), 58, 551–563.
Fan, J. (1993), ‘Local Linear Regression Smoothers and Their Minimax Efficiencies’, The Annals of Statistics, 21, 196–216.
Fan, J., and Gijbels, I. (1992), ‘Variable Bandwidth and Local Linear Regression Smoothers’, The Annals of Statistics,
20, 2008–2036.
Gasser, T., and Müller, H.G. (1979), ‘Kernel Estimation of Regression Functions’, in Smoothing Techniques for Curve
Estimation (Vol. 757), Lecture Notes in Mathematics, eds. T. Gasser and M. Rosenblatt, Berlin: Springer Berlin
Heidelberg, pp. 23–68.
Härdle, W., Hall, P., and Marron, J.S. (1988), ‘How Far Are Automatically Chosen Regression Smoothing Parameters
from Their Optimum?’, Journal of the American Statistical Association, 83, 86–95.
Härdle, W., Hall, P., and Marron, J. (1992), ‘Regression Smoothing Parameters that Are Not Far from Their Optimum’,
Journal of the American Statistical Association, 87, 227–233.
Härdle, W., Müller, M., Sperlich, S., and Werwatz, A. (2004), Nonparametric and Semiparametric Models, Berlin
Heidelberg: Springer Verlag.
Hart, J.D. (1997), Nonparametric Smoothing and Lack-of-Fit Tests, New York: Springer.
John, R. (1984), ‘Boundary Modification for Kernel Regression’, Communications in Statistics – Theory and Methods,
13, 893–900.
Jones, M. (1993), ‘Simple Boundary Correction for Kernel Density Estimation’, Statistics and Computing, 3, 135–146.
Jones, M., and Henderson, D. (2007), ‘Miscellanea Kernel-Type Density Estimation on the Unit Interval’, Biometrika,
94, 977–984.
Kotz, S., Balakrishnan, N., and Johnson, N.L. (2000), Continuous Multivariate Distributions, Models and Applications
(Vol. 1), New York: John Wiley & Sons, Inc.
Marron, J.S., and Ruppert, D. (1994), ‘Transformations to Reduce Boundary Bias in Kernel Density Estimation’, Journal
of the Royal Statistical Society. Series B (Methodological), 56, 653–671.
Mnatsakanov, R., and Sarkisian, K. (2012), 'Varying Kernel Density Estimation on ℝ₊', Statistics & Probability Letters, 82, 1337–1345.
Müller, H.G. (1991), ‘Smooth Optimum Kernel Estimators Near Endpoints’, Biometrika, 78, 521–530.
Müller, H.G., and Wang, J.L. (1994), ‘Hazard Rate Estimation Under Random Censoring with Varying Kernels and
Bandwidths’, Biometrics, 50, 61–76.
Scaillet, O. (2004), ‘Density Estimation Using Inverse and Reciprocal Inverse Gaussian Kernels’, Nonparametric Statistics,
16, 217–226.
Schuster, E.F. (1985), ‘Incorporating Support Constraints into Nonparametric Estimators of Densities’, Communications
in Statistics – Theory and Methods, 14, 1123–1136.
Wand, M.P., and Jones, M.C. (1994), Kernel Smoothing (Vol. 60), Boca Raton, FL: Chapman & Hall, CRC Press.