Download Report

OUTLIERS IN SAMPLE SURVEYS
P.D. Ghangurde, Statistics Canada, Ottawa, KIA 0T6, Canada
KEYWORDS:
estimation
1.
Variance-inflation,
outlier
2.
robust
Variance-inflation Model
Consider a linear model
Introduction
Y = X
8 + e ,
n×1
n×p p×l
n×1
In the literature on regression analysis several
approaches for detection and treatment of outliers
have been developed. In addition to methods based on
the
mean-shift
and
variance-inflation
models,
estimators based on order statistics such as trimmed
and Winsorized means and M-estimators based on
robust regression methods are available. Regression
diagnostics provide methods for critical examination
of models and measures of influence of individual
outliers and groups of outliers on estimates of
parameters (see Beckman and Cook (1983); Cook and
Weisberg (1982)).
The objective of this paper is to develop outlier
robust estimators for sample surveys, based on
variance-inflation m o d e l . T h i s model is a simple
extension of superpopulation model often implicitly
assumed for the traditional design-based estimators
and more explicitly used in the prediction approach.
Although these estimators are obtained as optimal
estimators of parameters of this model, the results on
the bias and variance of these estimators and optimal
weight reduction for outliers, presented in this paper,
are in the framework of finite population sampling.
These outlier robust estimators are not model
dependent and have not been evaluated by prediction
approach.
Outlier robust estimators in finite
population sampling based on robust regression and
prediction approach have b e e n investigated by
Chambers (1986).
The problem of outliers has been considered in
the literature on finite population sampling in the
context of estimation of mean or total, usually
assuming no auxiliary information in estimation.
Estimators obtained by methods based on order
statistics, s u c h as Winsorization and trimming, and
weight reduction, have been investigated by assuming
simple random sampling (see e.g. Fuller (1970); Ernst
(1980)).
It seems that it is not possible to extend
methods based on order statistics to sample designs
involving stratification and different sampling ratios
and non-response rates between strata and unequal
probabilities of selection, which result in unequal
design weights.
Sample surveys are often periodic
with rotation samples designed for estimation of
changes. Moreover, estimates are needed at several
levels such as stratum, group of strata and domains.
Because of these features of sample surveys,
estimators based on reduction of weights of outliers
are more convenient for use in practice.
In Section 2 we introduce the model in which
variance is a function of an auxiliary variable x
and
assume that outliers have inflated variances.
The
optimal estimators of parameter 8
are derived by
assuming k outliers (l.<k<n) in a random sample of
size n. In Section 3 conditional mean square error of
these estimators and optimal weight reduction have
been derived by assuming simple random sampling
from a finite population.
In concluding remarks in
Section 4, comments have b e e n made on possible
extensions of outlier robust estimation and the
problem of estimation of unit
variances and
covariances of x and y for outliers and non-outliers.
(2.1)
2
2
where e ~ (0, o W),
unknown,W is a diagonal
variance-covariance matrix with elements wi
depending on x i , i = i , 2 . . . . .
n, Y is a n-vector of
responses of y, X is a design matrix of p
auxiliary
variables each with n observations assumed fixed, 8
is
a p-vector of regression coefficients and e is an error
term .
Under the model assuming p=1, wi--x g
i'
of sample means y/x
n
1( z Yi/xi ) are
and
the
?=~i=i
mean
best
ratio
of
linear
ratios
unbiased
estimators of 8 for g=l and 2 respectively.
The
model is appropriate for categorical variables in
socio-economic surveys. It is known that values of g
for many variables lie in the interval [1,2] and more
often closer to i than to 2.
In practice in multipurpose sample surveys with several y-variables and
an auxiliary variable x possibly used for stratification,
ratio estimation, although less than optimal for
variables with g>1, is often used for all y-variables
for convenience of uniform weighting method.
We now consider the variance-inflation model for
k outliers (1.<k<n) which are the last k s a m p l e units,
without loss of generality. Thus
Y = X e + e ,
n×l
n×p p×l
n×l
(2.2)
where e ~ (0, 02 W(k)) and W(k) is
a diagonal
variance-covariance matrix with elements wi , i = i ,
2.....
n-k and wi/w, i = ( n - k + l ) . . . . .
n; w
is
unknown constant (0<w.<1).
This model is a simple
extention of the variance-inflation model considered
by Pregibon (1981), Cook, Holschuh and Weisberg
(1982) and Thompson (1985).
We consider ^expression for the best linear
unbiased estimator 8 i ( w ) o f 8 under (2.2) for the ease
of one outlier, the i th sample unit. Thus
I
^
^
Bi (w) = ~
where for
^
(X' W-Ix) - I Xi ( y i - Y i ) ( l - w )
wi[l-
p = 1 , (X'N-Ix)-I
,
(Z-w) V i i ]
and
I
Xi
(2.3)
are scalars
= (X'W-IX)-I(x'N-Iy) is the estimator of 8
under(2.1) and Vii = w-i i Xi(X,W-ix )- IX,i is the ith
diagonal
element
of
variance-covariance matrix
V = V(Xs), called leverage of Xi.
Also, 0.<Vii.<l
whenp=1. For large values of Xi, Vii is close to i,
which makes contribution of i to 8i (w)
very
large.
The second term on the right hand side• of (2.3) shows
736
change due to variance-inflation of i th
sample
Thus influence of both residual (yi-Yi)
and
unit.
leverage
Vii is r e d u c e d due to f a c t o r
(l-w).
^
For k (> 1)
outliers and wi = x i, the e s t i m a t o r B(i ) (w),
(3.1)
kwx k + (n-k) Xn-k
( i ) r e p r e s e n t s group of k outliers can also be given in
the form which shows weight r e d u c t i o n of outliers, by
: wkyk + (n-k) Yn-k.
kwy k + (n-k) Yn-k
R =
where
is e s t i m a t o r
population
(2.4)
of the population ratio R = ?/X,
means ? and X. The r a t i o
R
of
can
be
e x p r e s s e d as
wkxk + (n-k) Xn-k
P ?i + (l-P) 72
R
When wi = x#, ~ ( i ) ( w )
^
r =
_.
is given by
wkr k + ( n - k )
P 21 + (I-P) X2
where X1 and ?i
rn-k
wk + (n-k)
'
and when x i = i for i = I, 2 . . . . .
given by
(3.2)
9
(2.5)
N, ~ ( i ) ( w )
^
wky k+ (n-k) Yn-k
7 =
wk + (n-k)
'
are unknown population means of
outliers, X2 and 72
non-outliers, P
is
are unknown population means of
is unknown proportion of outliers in
the finite population of (x,y).
2
We also assume that
2
2
2
O2x, O2y are unit variances of non-outliers ~Ix' °ly
2
2
2
are unit variances of outliers, Olx > O2x and Oly >
(2.6)
and Xn-k are sample means of non-outliers, rk
and
2
A l s o , oi
°2y"
xy and °2xy are unit covariances of
outliers and non-outliers respectively. By substituting
rn-k are sample means of ratios ri = Yi/Xi
for
71 = 6172and 21 = 6222 we have
where Yk and Xk are sample means of outliers, Yn-k
outliers and non-outliers respectively. When w÷0,
in
limit these estimators are ~ = :Yn_k/Xn_k, r = rn-k
R=
and ? = Yn-k' which can also be obtained as optimal
estimators of ~3 under corresponding mean-shift model.
The estimator of the population total can be obtained
from (2.6)as NC7 and
72
[1 + P(cs2-1 )]
X2
.
(3.3)
Since in general, 61 # 62 , R ~ 72/22 .
We obtain the
conditional bias B(RIk) by Taylor series linearization
shows reduction of simple
N
random sampling weight n of k outliers by factor w
method as
and resulting adjustment of weights of all n units by
factor n/[n-k(1-w) ]. F r o m (2.5) and (2.6) it can be
B(P, l k ) "
it
seen that the results for r can be obtained from those
for ? by substituting r i for yi for all i.
kw? + (n-k) 72
[ I
]- R
kwX1 + (n-k) X2
(62 - 61)
P(n-k)
kw(l-P) } (__).
?2
= [1_--~(1_62)] {
n_k(l_w62)
X2
These estimators have been investigated in the
following Section 3 using conditional inference given
the number of outliers k assuming simple random
sampling from a finite population. Although it may
be possible to use prediction approach and to obtain
model dependent estimators incorporating design
weights, the results in this paper have been obtained
in the framework of finite population sampling. For
detection of outliers in the case of ratio estimation
and sample mean Cook's distance seems to have good
potential (Cook and Weisberg (1982)).
3.
[1 + P(61-1 )]
(3.4)
When w=O, R = Yn-k/Xn-k and
(62-61) P
B(RIR) = (i_-~i162))
?2
(X)"
2
When w=l, R = .Vn/Xn with no weight reduction and
(62-61)
B(Rlk) = ( l - P ( l _ ~ 2 ) )
Outlier Robust E s t i m a t i o n
We assume t h a t a finite population of size N
contains an unknown proportion P of outliers.
The
population mean X of an auxiliary variable x
is
assumed known. A simple random sample of size n
is
drawn without r e p l a c e m e n t from the population and
outliers are identified on the basis of values of some
test statistic.
Sample units may be identified as
outliers if the 5th unit values ( x i , Y 5 )
lie in some
region of the sample space d e t e r m i n e d by the t e s t .
The t e s t could be based on diagnostics such as Cook's
d i s t a n c e (see Ghangurde (1989)).
The
outlier
robust
ratio
estimator
of the
population mean of y's is given by ^
?R = R2, where
It can be seen that f(w)
=
nP-k } 72
{n_k~l_-~2 ) ~2"
P(n-k)-
kw(1-P)is
n_k(l_w62)
a
monotonic d e c r e a s i n g function of w in ( 0 , l ) and f(w)
kw
X 0 a c c o r d i n g as P ~ [ n - k ( 1 - w ) ] " Hence, if P > k,
^
conditional bias of R for w=0
is in absolute value
^
k
g r e a t e r than t h a t of R for w in (0 1) I f P < "
absolute conditional bias increases as w
[P(n-k)
in
1] and may not be in absolute value lesser
(l-P)k'
than that for w=O.
737
n
increases
This analysis of conditional bias
of R is only of t h e o r e t i c a l i n t e r e s t ; P is not
practice.
By
linearization
of
(3.4)
outliers, more e f f i c i e n t e s t i m a t i o n can be done by an
e x t e n s i o n of Minimum Norm Q u a d r a t i c Unbiased
E s t i m a t i o n (MINQUE) m e t h o d (see Rao (1970)) to
estimation
of
heteroscedastic
variances
and
c o v a r i a n c e s when x and y are random.
known in
and
taking
e x p e c t a t i o n over k, unconditional bias of R is given by
P((s2-(s1) ~'2
B(R) : i+P(~2_1) (~2)(1-P)
i
[(I+P) - w(l+P+P62) + w2p(s2] + 0(~).
The o p t i m u m w, although obtained
under the
v a r i a n c e - i n f l a t i o n model, can still be used in finite
population sampling and mean square e r r o r can be
e s t i m a t e d for values of w close to the o p t i m u m given
by (3.7) to decide on choice of a n o t h e r value of w.
H o w e v e r , due to instability of e s t i m a t e s of bias, it
may be p r e f e r a b l e to use w given in (3.7).
(3.5)
When w=O,
P(~2-~I ) (I-P2)
B(R) =
?2
In the case of s a m p l e mean ~ the optimum wo
can be o b t a i n e d by minimizing mean square e r r o r and
is given by
>
(l+P(~2-~) (X) `~ 0 a c c o r d i n g as ~2<~i .
2
Also when w=l, B(R) = 0 to 0 ( 1 ) .
outlier robust sample
mean ~
2 .....
o
given in (2.6) can be
obtained from above results by substitution
i=l,
n-k
w : ( i - N(l_p) ) o Y + (n-k)
The results for
xi=l
P(aI - I ) 2
722
(3.8)
kl
for
In the l i t e r a t u r e s e v e r a l e s t i m a t o r s obtained by
weight r e d u c t i o n and which can be considered as
linear combinations of Yk and Yn-k
have
been
i n v e s t i g a t e d (see e.g. Hidiroglou and Srinath (1981)).
N, and Xl = ~2 = 62 = I.
_
By Taylor series l i n e a r i z a t i o n method
V(RIk) " (Rkw)2 V(~klk ) + (n-k)R)2~ V(~n_klk )
X
X
Their
optimal r=r o
2Rk2w 2
~2
C°V(Xk'Yk I k)
- 2P'(~2k)2 Cov(Xn_k, Yn_klk),
whereX = [kwXl+(n-k ) X2], ~( = [kwYl+(n-k)
and R = ~(/X.
(3.6)
^
w =
o
?2]
w=
2
2
by
minimizing mean square
°2y
n
^2
( i - ~) Oly
+ k(l
-
+ k(1
k
n)
(Y
- Yn- )
k
2
k
k
- n ) ( y k - .Yn_k)
it is possible to obtain e s t i m a t e
(3.9)
2"
^2
Oly
from
sample outliers, the e s t i m a t o r is unstable for small
values of k.
A l t e r n a t i v e m e t h o d s for e s t i m a t i o n of
h e t e r o s c e d a s t i c v a r i a n c e s have been proposed in the
l i t e r a t u r e (see K l e f f e (1977)).
The MINQUE was
d e v e l o p e d for p>.1 and n unequal v a r i a n c e s (see Rao
(1970)).
Assuming t h a t the last k units
are
e s t i m a t o r s of unit v a r i a n c e s obtained by
method are given by
n
n z
(yi-Yn)2
°ly = i=n-k+l
k (n-2)
n-k
n z (yi-Yn)
"2
i=i
°2y = (n-k) (n-2)
(3.7)
2
n
e s t i m a t o r s are more
2
2
e s t i m a t o r s of Oly and O2y
efficient
obtained
estimators
by
of 6
weighted
S u b r a h m a n i a m (1971)).
738
2
z (yi-Yn)
i=1
- (n-l) (n-2) "
These
This expression for o p t i m u m w o b t a i n e d by assuming
infinite N is simple and needs e s t i m a t i o n of f e w e r
p a r a m e t e r s . The means Ux and uy can be e s t i m a t e d
by sample means Xn and Yn
respectively.
Although
2
2
Olx, Oly and Ol×y
can be e s t i m a t e d from sample
outliers,
MINQUE
n
z (yi-Yn)2
i=1
(n-l) (n-2) '
"2
2
°2x~y + °2y~x - 2°2xy~x~Y
2 2
2 2
Olx~y + OlyUx - 2OlxyUx~y
obtained
n ^2
(i - ~)
Although,
In order to obtain an o p t i m a l value of w
the
conditional mean square e r r o r of ~ given k
can be
minimized as a function of w. H o w e v e r , if conditional
bias ratio is small an optimum w
obtained
by
minimizing conditional v a r i a n c e o f ~ can be used.
By
equating derivative
of conditional v a r i a n c e
with
r e s p e c t to w to zero we have an equation of third
d e g r e e in w
involving
several
parameters.
An
analytical
solution
cannot
be
obtained
unless
e s t i m a t e s are s u b s t i t u t e d for s e v e r a l p a r a m e t e r s
involved. An a l t e r n a t i v e is to minimize limiting value
of V(~lk) for infinite N,
which is the s a m e as
minimizing superpopulation v a r i a n c e of R
under the
v a r i a n c e - i n f l a t i o n model with Ux and Uy, means of x
and y r e s p e c t i v e l y .
It may be noted t h a t under the
model, bias of R is zero. The optimum w is given by
2
with
~ = - ~ Yk + (1 - -~) Yn-k
e r r o r of 7 is the s a m e a s ? with optimalw=wo,
since
k w o / ( ( n - k ) + kwo) = rok/N.
The
important
a d v a n t a g e of e s t i m a t o r s s u g g e s t e d above is t h a t the
weight r e d u c t i o n can be e s t i m a t e d from survey d a t a
a n d extensions to o t h e r sample designs seem possible.
Thus wo can be e s t i m a t e d by
+ (k_.~w)2 V(~klk ) + (n_.~)2 V(~n_klk )
X
X
-
estimator
efficient
and
(3.10)
than
also
the
usual
give
more
as
compared
to
those
least
squares
(Rao
and
4.
References
ConcludingRemarks
Beckman, R.J. and Cook, R.D. (1983), "Outliers,"
Technometrics, Vol. 25, No. 2, pp. 119-149.
Chambers, R . L . (1986), "Outlier
robust
finite
population estimation," JASA, Vol. 81, No. 396,
pp. 1063-1069.
Cook, R.D. Holschuh, N. and Weisberg, S. (1982), "A
note on an alternative outlier model," J.R.S.S., B,
44, No. 3, pp. 370-376.
Cook, R.D. and Weisberg,S. (1982), Residuals and
Influence in Regression, Chapman and Hall.
Ernst, L.R. (1980), "Comparisons of estimators of the
mean w h i c h adjust for large observations,"
Sankhya, C, 42, pp. 1-16.
Fuller, W.A. (1970), "Simple estimators of the mean of
skewed
populations,"
Technical
Report,
Department of Statistics, Iowa State University.
Gather, U. and Kale B.K. (1988), "Maximum likelihood
estimation
in
the
presence
of
outliers,"
Communications
in
Statistics,
Theory
and
Methods, 17 (11), pp. 3767-3784.
Ghangurde P.D. (1989), "Outlier robust estimation in
finite population sampling," presented at the
Statistical Society of Canada meeting, Ottawa.
Hidiroglou, M.A. and Srinath, K.P. (1981), "Some
estimators of a population total f r o m simple
random samples containing large units," JASA,
Vol. 78, No. 375, pp. 690-695.
Kleffe, J. (1977), "Optimal estimation of variance
components," Sankhya, Vol. 39, B, pp. 211-244.
Pregibon, D. (1981)' "Logistic regression diagnostics,"
The Annals of Statistics, Vol. 9, No. 4, pp. 705724.
Rao, C . R . (1970), "Estimation of heteroscedastic
variances in linear models," JASA, Vol.65,
No. 329, pp. 161-172.
Rao, J.N.K. and Subrahmaniam, K. (1971), "Combining
independent estimators and estimation in linear
regression w i t h unequal variances," Biometrics,
27, pp. 971-993.
Thompson, R. (1985), "A note on restricted maximum
likelihood with an alternative outlier model,"
J.R.S.S. (B), 47, No. I, pp. 53-55.
The variance-inflation model seems to be an
appropriate model for outliers in sample surveys. In
the optimal estimator based on the model influence of
outliers with large residuals and high leverage is
reduced due to factor (l-w).
U n d e r appropriate
assumptions about d e p e n d e n c e of v a r i a n c e on x,
it
reduces to o u t l i e r robust e s t i m a t o r s R, r and Y,
in
which weights of outliers are reduced. In the case of
ratio
estimator,
although
the
problem
of
d e t e r m i n a t i o n of o p t i m a l weight was simplified by
assuming an infinite population, it could still be used
in the case of finite population sampling. Although it
may be possible to use p r e d i c t i o n a p p r o a c h and to
obtain
model d e p e n d e n t
estimators
incorporating
design
weights,
conditional
inference
in
finite
population sampling f r a m e w o r k shows t h a t t h e s e
o u t l i e r robust e s t i m a t o r s have desirable properties.
Extension of these e s t i m a t o r s to s t r a t i f i e d sampling
i n c o r p o r a t i n g design weights is being i n v e s t i g a t e d .
Implicit in the v a r i a n c e - i n f l a t i o n model for
outliers is the assumption t h a t superpopulation is a
mixture of distributions with the same mean but
d i f f e r e n t variances.
In the case of sampling from
mixture distributions belonging to the e x p o n e n t i a l
family, maximum likelihood e s t i m a t o r s of p a r a m e t e r s
have been obtained and sufficient conditions have
been established for outliers to occur in samples from
these distributions (see e.g. G a t h e r and Kale (1988)).
It would be of i n t e r e s t t o i n v e s t i g a t e into the
possibility of establishing similar conditions for
occurrence
of outliers
in sampling from
finite
populations which have mixture distributions.
Although unit v a r i a n c e s and c o v a r i a n c e s for
outliers can be e s t i m a t e d from outliers in the sample,
the e s t i m a t o r could be unstable for small values of k.
In the case of r a t i o e s t i m a t i o n an extension of
MINQUE method for e s t i m a t i o n of h e t e r o s c e d a s t i c
v a r i a n c e s and c o v a r i a n c e s of x and y is needed.
The
problem of v a r i a n c e e s t i m a t i o n was not discussed
since it does not involve any new methodology, once w
is t r e a t e d as a c o m p o n e n t of weights of outliers.
_
739