Amputation versus Imputation of Missing Values
through Ratio Method in Sample Surveys

Sonderforschungsbereich 386, Paper 155 (1999)
Online unter: http://epub.ub.uni-muenchen.de/

H. Toutenburg
V.K. Srivastava

May 27, 1999
Abstract

In this article, we consider the estimation of the population mean when some observations on the study characteristic are missing in the bivariate sample data. In all, five estimators are presented and their efficiency properties are discussed. One estimator arises from the amputation of incomplete observations, while the remaining four estimators are formulated using imputed values obtained by the ratio method of estimation.
1 Introduction

The infeasibility of obtaining all the observations in the sample is not an uncommon aspect of data collection in many instances of sample surveys. This may occur due to a variety of reasons. For example, a breakdown or some snag may arise in the instrument and/or measuring device, rendering it unusable for completing the process of data collection. Subjects like patients, animals and plants may fail to survive due to factors that are unrelated to the experiment. Often typical practical difficulties are faced in the collection of data for a part of the sample. Sometimes the respondents may supply information which is inconsistent due to some inner contradictions or otherwise, and the investigator is forced to delete it.
When some observations in the sample are missing, the simplest solution is perhaps to amputate the incomplete observations and to restrict attention to the complete observations only for the purpose of statistical analysis. Alternatively, one may employ some imputation method for finding substitutes for the missing observations; see, e.g., Little and Rubin (1987), Rao and Toutenburg (1995) and Rubin (1987) for an interesting account. Treating these imputed values as true observations, one may then conduct the statistical analysis using the standard procedures developed for data without any missing observation. Such a practice, it is well recognized, may tend to invalidate the inferences and may often have serious consequences.
Institute of Statistics, University of Munich, 80799 Munich, Germany
Department of Statistics, University of Lucknow, Lucknow 226007, India
In this article, we consider the estimation of the population mean on the basis of a random sample drawn according to the procedure of simple random sampling without replacement. It is assumed, following Rao and Sitter (1995) and Tracy and Osahan (1994), that some units in the sample fail to respond and the observations on the study characteristic are not available, while this is not the case with the auxiliary characteristic, on which all the observations in the sample are available. For the missing values of the study characteristic, the method of ratio imputation is a commonly employed procedure in sample surveys. Using it, we have considered four estimators for the population mean of the study characteristic besides the conventional estimator (i.e., the mean of the available observations), which amputates the incomplete observations. Comparing their efficiency properties, it is observed that outright amputation is not a good proposition and the use of ratio imputation is worthwhile. It helps in improving the efficiency of estimation under some mild constraints.
The plan of this article is as follows. In Section 2, we describe the imputation procedure and present estimators for the population mean. Bias properties of these estimators are studied in Section 3. Similarly, their mean squared errors are analyzed in Section 4 and conditions for the superiority of one estimator over the other are found. Lastly, the derivation of the results is provided in the Appendix.
2 Estimators For Mean

Let us consider a finite population of size N with values Y_1, Y_2, \ldots, Y_N of the study characteristic and values X_1, X_2, \ldots, X_N of the auxiliary characteristic. For the estimation of the population mean \bar{Y}, a random sample of size n is drawn according to the procedure of simple random sampling without replacement. Assuming the nonresponse to be random, suppose that there are (n-p) complete observations (y_1, x_1), (y_2, x_2), \ldots, (y_{n-p}, x_{n-p}) and p incomplete observations x_1, x_2, \ldots, x_p for which only the auxiliary characteristic is available. Thus the sample comprises two respondent sets: one of size (n-p) denoted by s and the other of size p denoted by \bar{s}.

When the incomplete observations are discarded, it is customary to estimate \bar{Y} by

\bar{y} = \frac{1}{n-p} \sum_{i=1}^{n-p} y_i .   (2.1)

When the incomplete observations are not discarded and some imputation method is followed, the completed data set is specified by

z_i = \begin{cases} y_i & \text{if } i \in s \\ \tilde{y}_i & \text{if } i \in \bar{s} \end{cases}   (2.2)

and the population mean is estimated by

t = \frac{1}{n} \sum_{i=1}^{n} z_i = \frac{1}{n} \Big( \sum_{i \in s} y_i + \sum_{i \in \bar{s}} \tilde{y}_i \Big)   (2.3)

where \tilde{y}_i denotes the imputed value of the study characteristic corresponding to the observation x_i.

If the method of ratio imputation is employed, there are two simple choices of \tilde{y}_i, viz.,

\tilde{y}_i = \frac{\bar{X}}{\bar{x}} \, \bar{y}   (2.4)

\tilde{y}_i = \frac{(n-p)\bar{X} + p\bar{x}}{n\bar{x}} \, \bar{y}   (2.5)

where \bar{x} = \frac{1}{n-p} \sum_{i=1}^{n-p} x_i, \bar{x}_p = \frac{1}{p} \sum_{i \in \bar{s}} x_i and \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i.

In the above two formulations, it is assumed that \bar{X} is known. If it is not known, we may define the imputed values as

\tilde{y}_i = \frac{x_i}{\bar{x}} \, \bar{y}   (2.6)

following Rao and Sitter (1995, p. 459). On the same lines, we propose another set of imputed values as follows:

\tilde{y}_i = \frac{(n-p)x_i + p\bar{x}_p}{(n-p)\bar{x} + p\bar{x}_p} \, \bar{y} .   (2.7)

Utilizing (2.4)–(2.7) in (2.3), we obtain the following four estimators of \bar{Y}:

t_1 = \frac{(n-p)\bar{x} + p\bar{X}}{n\bar{x}} \, \bar{y}   (2.8)

t_2 = \frac{[(n-p)^2 + np]\bar{x} + p(n-p)\bar{X}}{n^2 \bar{x}} \, \bar{y}   (2.9)

t_3 = \frac{(n-p)\bar{x} + p\bar{x}_p}{n\bar{x}} \, \bar{y}   (2.10)

t_4 = \frac{(n-p)^2 \bar{x} + p(2n-p)\bar{x}_p}{n[(n-p)\bar{x} + p\bar{x}_p]} \, \bar{y} .   (2.11)

Thus we have five estimators for estimating the population mean \bar{Y}. The estimator \bar{y} is based on amputation of the incomplete data while the estimators t_1, t_2, t_3 and t_4 are based on ratio imputation of the missing observations. Out of these four, two estimators require the knowledge of the population mean \bar{X} of the auxiliary characteristic while the remaining two estimators are free from it.
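For illustration, the five point estimators may be computed as follows. This is a minimal numerical sketch in Python; the function name and the NaN coding of missing study values are our own, and the formulas follow the estimators (2.1) and (2.8)–(2.11) as reconstructed above.

```python
import numpy as np

def estimators(y, x, X_bar):
    """Five estimators of the population mean: the amputation estimator
    (2.1) and the ratio-imputation estimators (2.8)-(2.11).
    Missing study values in y are coded as NaN; x is fully observed."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    miss = np.isnan(y)
    n, p = len(y), int(miss.sum())
    y_r, x_r, x_m = y[~miss], x[~miss], x[miss]
    yb = y_r.mean()                    # amputation estimator (2.1)
    xb = x_r.mean()                    # x-mean over the complete pairs
    xb_p = x_m.mean() if p else xb     # x-mean over the incomplete units
    t1 = yb * ((n - p) * xb + p * X_bar) / (n * xb)
    t2 = yb * (((n - p) ** 2 + n * p) * xb + p * (n - p) * X_bar) / (n ** 2 * xb)
    t3 = yb * ((n - p) * xb + p * xb_p) / (n * xb)
    t4 = yb * ((n - p) ** 2 * xb + p * (2 * n - p) * xb_p) \
         / (n * ((n - p) * xb + p * xb_p))
    return yb, t1, t2, t3, t4
```

A quick sanity check: when all auxiliary values are equal, every ratio of x-means collapses to one and all five estimators reduce to \bar{y}.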
3 Comparison Of Biases

Let us write

S_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 , \qquad S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 ,

\rho = \frac{1}{(N-1) S_x S_y} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) , \qquad \theta = \frac{\bar{Y} S_x}{\bar{X} S_y} ,

f = \Big( \frac{1}{n} - \frac{1}{N} \Big) E_p \Big( \frac{p}{n} \Big) , \qquad g = E_p \Big[ \frac{p}{n} \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big]   (3.1)

where E_p denotes the expectation with respect to the nonnegative integer valued random variable p. Further, we assume that the correlation coefficient \rho is nonnegative, which is a basic requirement for the application of the ratio method.
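Since f and g in (3.1), as reconstructed here, are plain expectations over the distribution of p, they can be evaluated directly once that distribution is specified. The sketch below (the function name and the example distribution are ours) also exposes the inequality f \le g that is used repeatedly later.

```python
def f_g(n, N, p_dist):
    """f and g of (3.1), reconstructed as
    f = (1/n - 1/N) E_p(p/n),  g = E_p[(p/n)(1/(n-p) - 1/N)].
    p_dist maps each possible value of p (with p < n) to its probability."""
    f = (1.0 / n - 1.0 / N) * sum(q * p / n for p, q in p_dist.items())
    g = sum(q * (p / n) * (1.0 / (n - p) - 1.0 / N) for p, q in p_dist.items())
    return f, g
```

Termwise 1/n \le 1/(n-p), so f \le g for every distribution of p; for example, f_g(20, 1000, {1: 0.5, 2: 0.3, 3: 0.2}) gives f \approx 0.0042 against g \approx 0.0047.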
It is easy to see that the mean \bar{y} ignoring the incomplete observations is an unbiased estimator of \bar{Y} while the estimators t_1, t_2, t_3 and t_4 using the ratio method of imputation for missing values are generally biased. In order to study the magnitudes and directions of their biases, we assume that p is small and (n-p) is large, which implies that n is large. Now let us consider the large sample approximations which are derived in the Appendix following Sukhatme, Sukhatme, Sukhatme and Asok (1984).
Theorem 1 The order O(n^{-2}) approximations for the biases of the estimators t_1, t_2, t_3 and t_4 are given by

B(t_1) = E(t_1 - \bar{Y}) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}   (3.2)

B(t_2) = E(t_2 - \bar{Y}) = f (\theta - \rho) \frac{S_x S_y}{\bar{X}}   (3.3)

B(t_3) = E(t_3 - \bar{Y}) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}   (3.4)

B(t_4) = E(t_4 - \bar{Y}) = [ (g - f) \theta - g \rho ] \frac{S_x S_y}{\bar{X}} .   (3.5)
From the above expressions, it is interesting to observe that all the four estimators are unbiased to order O(n^{-1}) and the bias precipitates in the terms of order O(n^{-2}). However, the estimators t_1, t_2 and t_3 are also unbiased to order O(n^{-2}) when \theta = \rho. If \theta is not less than 1, these three estimators are biased in the positive direction. The bias continues to remain positive so long as \rho < \theta < 1. It changes its sign only when \theta < \rho. So far as the estimator t_4 is concerned, it is also unbiased to order O(n^{-2}) when (g-f)\theta = g\rho. Its bias is positive or negative according as (g-f)\theta is larger or smaller than g\rho.
Comparing the estimators with respect to the criterion of magnitude of bias, we find that the estimators t_1 and t_3 have an equal amount of bias, at least to the order of our approximation. Further, t_2 has always a smaller bias than t_1 and t_3 as f cannot exceed g. Similarly, the estimator t_2 has a smaller magnitude of bias in comparison to the estimator t_4 when

[ (g-f)^2 - f^2 ] (\theta - \rho)^2 + f^2 \rho^2 - 2 f (g-f) \rho (\theta - \rho) > 0   (3.6)

while the reverse is true, i.e., t_4 is less biased in magnitude than t_2, when the inequality (3.6) holds with an opposite sign.
Observing that

[B(t_1)]^2 - [B(t_4)]^2 = [B(t_3)]^2 - [B(t_4)]^2 = f \theta [ (2g-f)\theta - 2g\rho ] \Big( \frac{S_x S_y}{\bar{X}} \Big)^2   (3.7)

we see that t_4 has a smaller magnitude of bias than t_1 and t_3 when (2g-f)\theta is larger than 2g\rho. On the contrary, the estimators t_1 and t_3 are less biased in magnitude in comparison to t_4 when (2g-f)\theta is less than 2g\rho.
4 Comparison Of Mean Squared Errors

Recalling that \bar{y} is an unbiased estimator of \bar{Y}, its variance is given by

V(\bar{y}) = E(\bar{y} - \bar{Y})^2 = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 .   (4.1)

As the estimators t_1, t_2, t_3 and t_4 are generally not unbiased, we consider their mean squared errors for the purpose of comparison. These are derived in the Appendix and presented below.

Theorem 2 To order O(n^{-2}), the differences between the variance of \bar{y} and the mean squared errors of the estimators t_1, t_2, t_3 and t_4 are given by

\Delta(\bar{y}, t_1) = E(\bar{y} - \bar{Y})^2 - E(t_1 - \bar{Y})^2 = 2 g \rho S_x S_y \frac{\bar{Y}}{\bar{X}}   (4.2)

\Delta(\bar{y}, t_2) = E(\bar{y} - \bar{Y})^2 - E(t_2 - \bar{Y})^2 = 2 f \rho S_x S_y \frac{\bar{Y}}{\bar{X}}   (4.3)

\Delta(\bar{y}, t_3) = E(\bar{y} - \bar{Y})^2 - E(t_3 - \bar{Y})^2 = (2 g \rho - f \theta) S_x S_y \frac{\bar{Y}}{\bar{X}}   (4.4)

\Delta(\bar{y}, t_4) = E(\bar{y} - \bar{Y})^2 - E(t_4 - \bar{Y})^2 = (2 g \rho - f \theta) S_x S_y \frac{\bar{Y}}{\bar{X}} .   (4.5)
As \rho is assumed to be positive, it is clear from (4.2) and (4.3) that both the estimators t_1 and t_2 are better than \bar{y}, implying the superiority of imputation over amputation.

Looking at the expressions (4.4) and (4.5), we find that the estimators t_3 and t_4 are better than \bar{y} when

\rho > \frac{f}{2g} \, \theta .   (4.6)

As f < g, this condition is satisfied as long as \rho > \theta/2, which is the well-known condition for the superiority of the ratio estimator over the sample mean when no observation is missing; see, e.g., Sukhatme et al. (1984, Chap. 5). Thus, so long as the favourable environment for the application of the ratio method prevails (i.e., \rho > \theta/2), the missingness of some observations on the study characteristic and their imputation by the ratio method do not exert any adverse effect. In fact, the ratio imputation succeeds in widening the range of admissible values of \rho; see (4.6).
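The superiority of imputation over amputation can also be seen in a small Monte Carlo sketch. The setup below is entirely our own illustration, not the paper's: an effectively infinite superpopulation (so the finite population correction is ignored) with \rho \approx 0.96 and \theta \approx 0.96, so that \rho > \theta/2 holds comfortably.

```python
import numpy as np

# Monte Carlo sketch: amputation (y-bar) versus ratio imputation (t1).
# Illustrative superpopulation model only; missingness is random because
# the draws are i.i.d., and the last p units are treated as nonrespondents.
rng = np.random.default_rng(0)
n, p, reps = 30, 5, 20000
Y_bar, X_bar = 20.0, 10.0
se_amp = se_t1 = 0.0
for _ in range(reps):
    x = rng.normal(X_bar, 1.0, n)
    y = Y_bar + 2.0 * (x - X_bar) + rng.normal(0.0, 0.6, n)
    y_r, x_r = y[:n - p], x[:n - p]     # the (n - p) complete pairs
    yb, xb = y_r.mean(), x_r.mean()
    t1 = yb * ((n - p) * xb + p * X_bar) / (n * xb)
    se_amp += (yb - Y_bar) ** 2
    se_t1 += (t1 - Y_bar) ** 2
print(se_t1 / se_amp)   # empirical MSE ratio, well below 1 for this rho
```

The printed ratio estimates MSE(t_1)/V(\bar{y}) and falls clearly below one, in line with (4.2).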
Next, let us compare the biased estimators.

When \bar{X} is known, we have two estimators t_1 and t_2, out of which t_1 ignores the incomplete observations while t_2 incorporates them. Further, t_2 has always a smaller magnitude of bias in comparison to t_1; see (3.2) and (3.3). If we compare their mean squared errors, it is seen from (4.2) and (4.3) that the estimator t_1 has a smaller mean squared error than t_2.

When \bar{X} is not known, we have again two estimators t_3 and t_4 which utilize the entire set of available observations. Further, t_3 has a smaller (larger) magnitude of bias than t_4 when 2g\rho is greater (less) than (2g-f)\theta. Comparing them with respect to the criterion of mean squared error to order O(n^{-2}), we observe from (4.4) and (4.5) that both are equally efficient to the given order of approximation. However, a difference precipitates if we consider higher order approximations.
Theorem 3 To order O(n^{-3}), we have

\Delta(t_3, t_4) = E(t_3 - \bar{Y})^2 - E(t_4 - \bar{Y})^2 = \frac{\bar{Y}^2}{\bar{X}^3} E_p \Big[ \Big( \frac{p}{n} \Big)^2 \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big] Q   (4.7)

where

Q = \frac{1}{N-1} \sum_{i=1}^{N} X_i (X_i - \bar{X})^2 .   (4.8)

We thus find that the estimator t_4 is better than t_3 when Q is positive, which may generally hold good in many practical situations.
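The sign condition on Q is easy to inspect. In the form of (4.8) reconstructed here, Q = m_3 + \bar{X} S_x^2, where m_3 is the third central moment with the (N-1) divisor, so Q can be negative only for a markedly left-skewed auxiliary characteristic with a small mean. A minimal sketch (the function name is ours):

```python
import numpy as np

def Q(X):
    """Q of (4.8), reconstructed as (1/(N-1)) * sum_i X_i (X_i - X_bar)^2."""
    X = np.asarray(X, dtype=float)
    Xb = X.mean()
    return float((X * (X - Xb) ** 2).sum() / (len(X) - 1))
```

For the right-skewed population X = (1, 2, 3, 10) one gets Q = 380/3 > 0, so t_4 would be preferred over t_3 there.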
Finally, let us examine the role of the knowledge of \bar{X} through a comparison of the estimators t_1 and t_2 with t_3 and t_4.

Let us first recall that t_1 has the same bias as t_3 but it is less biased in magnitude than t_4 when 2g\rho exceeds (2g-f)\theta. Further, it is observed from (4.2), (4.4) and (4.5) that the estimator t_1 has invariably a smaller mean squared error than t_3 and t_4. Similarly, we observe from (4.3), (4.4) and (4.5) that the estimator t_2 is more efficient than both the estimators t_3 and t_4. This means that the knowledge of \bar{X} plays an important role in improving the efficiency of estimation when some observations are missing and the method of ratio imputation is employed for them.
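The value of knowing \bar{X} can be illustrated with the same kind of Monte Carlo sketch as before (again our own superpopulation setup, with the finite population correction ignored): t_1, which uses \bar{X}, shows a smaller empirical mean squared error than t_3, which replaces that knowledge by the sample information \bar{x}_p.

```python
import numpy as np

# Sketch: ratio imputation with known X-bar (t1) versus unknown X-bar (t3).
# Illustrative model only; not a numerical study from the paper.
rng = np.random.default_rng(1)
n, p, reps = 30, 5, 20000
Y_bar, X_bar = 20.0, 10.0
se_t1 = se_t3 = 0.0
for _ in range(reps):
    x = rng.normal(X_bar, 1.0, n)
    y = Y_bar + 2.0 * (x - X_bar) + rng.normal(0.0, 0.6, n)
    y_r, x_r, x_m = y[:n - p], x[:n - p], x[n - p:]
    yb, xb, xb_p = y_r.mean(), x_r.mean(), x_m.mean()
    t1 = yb * ((n - p) * xb + p * X_bar) / (n * xb)
    t3 = yb * ((n - p) * xb + p * xb_p) / (n * xb)
    se_t1 += (t1 - Y_bar) ** 2
    se_t3 += (t3 - Y_bar) ** 2
print(se_t1 / se_t3)   # below 1: knowing X-bar improves efficiency
```

The gap between the two estimators corresponds to the term f\theta S_x S_y \bar{Y}/\bar{X} by which (4.2) exceeds (4.4).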
APPENDIX

If we write

\delta_x = (\bar{x} - \bar{X}) , \qquad \delta_y = (\bar{y} - \bar{Y}) , \qquad \delta_x^* = (\bar{x}_p - \bar{X}) ,

we observe that \delta_x and \delta_y are of order O_p(n^{-1/2}) while \delta_x^* is of order O_p(1). Further, we have

E(\delta_x) = E(\delta_x^*) = E(\delta_y) = 0 .

Now we can express

t_1 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p}{n\bar{X}} \bar{y} (\bar{x} - \bar{X}) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1} = \delta_y - \frac{p}{n\bar{X}} \delta_x (\bar{Y} + \delta_y) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1} .

Expanding and retaining terms to order O_p(n^{-2}), we find

t_1 - \bar{Y} = \delta_y - \frac{p\bar{Y}}{n\bar{X}} \delta_x - \frac{p}{n\bar{X}} \delta_x \delta_y + \frac{p\bar{Y}}{n\bar{X}^2} \delta_x^2

whence

E(t_1 - \bar{Y}) = E_p \Big[ \frac{p}{n} \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big] \Big( \frac{\bar{Y} S_x^2}{\bar{X}^2} - \frac{\rho S_x S_y}{\bar{X}} \Big) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}

which leads to the result (3.2) of Theorem 1.

Similarly, to order O(n^{-2}), we have

E(t_1 - \bar{Y})^2 = E(\delta_y^2) - \frac{2p\bar{Y}}{n\bar{X}} E(\delta_x \delta_y) = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 - 2 g \rho S_x S_y \frac{\bar{Y}}{\bar{X}}

giving the expression (4.2) of Theorem 2.

Next, we observe that

t_2 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p(n-p)}{n^2 \bar{X}} \bar{y} (\bar{x} - \bar{X}) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1}
= \delta_y - \frac{p(n-p)\bar{Y}}{n^2 \bar{X}} \delta_x - \frac{p(n-p)}{n^2 \bar{X}} \delta_x \delta_y + \frac{p(n-p)\bar{Y}}{n^2 \bar{X}^2} \delta_x^2 + O_p(n^{-5/2}) .

Thus the bias to order O(n^{-2}) is

E(t_2 - \bar{Y}) = f (\theta - \rho) \frac{S_x S_y}{\bar{X}}

and the mean squared error to the same order of approximation is

E(t_2 - \bar{Y})^2 = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 - 2 f \rho S_x S_y \frac{\bar{Y}}{\bar{X}} .

These provide the result (3.3) of Theorem 1 and the result (4.3) of Theorem 2.

Similarly, for the estimator t_3, we have

t_3 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p}{n\bar{X}} \bar{y} (\bar{x} - \bar{x}_p) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1}
= \delta_y - \frac{p}{n\bar{X}} (\bar{Y} + \delta_y)(\delta_x - \delta_x^*) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1}
= \delta_y - \frac{p}{n\bar{X}} \Big[ \bar{Y} (\delta_x - \delta_x^*) + \delta_y (\delta_x - \delta_x^*) - \frac{\bar{Y}}{\bar{X}} \delta_x (\delta_x - \delta_x^*) \Big] + O_p(n^{-5/2}) .

Taking expectation and retaining terms up to order O(n^{-2}), we get

E(t_3 - \bar{Y}) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}

which gives the result (3.4) of Theorem 1. Similarly, to the same order of approximation, we have

E(t_3 - \bar{Y})^2 = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 - (2 g \rho - f \theta) S_x S_y \frac{\bar{Y}}{\bar{X}}

which leads to the result (4.4) of Theorem 2.

Proceeding in the same manner, we can express

t_4 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p(n-p)(\bar{x} - \bar{x}_p)}{n[(n-p)\bar{x} + p\bar{x}_p]} \bar{y}
= \delta_y - \frac{p(n-p)}{n^2 \bar{X}} (\bar{Y} + \delta_y)(\delta_x - \delta_x^*) \Big[ 1 + \frac{(n-p)\delta_x + p\delta_x^*}{n\bar{X}} \Big]^{-1} .

We thus find

E(t_4 - \bar{Y}) = [ (g - f) \theta - g \rho ] \frac{S_x S_y}{\bar{X}}

which is the result (3.5) of Theorem 1. The result (4.5) of Theorem 2 can be obtained in a similar way.

Lastly, let us consider the result stated in Theorem 3. It is observed that

\Delta(t_3, t_4) = E(t_3 - \bar{Y})^2 - E(t_4 - \bar{Y})^2 = E \big\{ [ (t_3 - \bar{Y}) + (t_4 - \bar{Y}) ] [ (t_3 - \bar{Y}) - (t_4 - \bar{Y}) ] \big\} .

Expanding the two factors as above, retaining terms to order O(n^{-3}) and neglecting terms of order O(n^{-7/2}), we obtain

\Delta(t_3, t_4) = \frac{\bar{Y}^2}{\bar{X}^3} E_p \Big[ \Big( \frac{p}{n} \Big)^2 \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big] \frac{1}{N-1} \sum_{i=1}^{N} X_i (X_i - \bar{X})^2

which yields the desired result (4.7).
References
Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data,
Wiley, New York.
Rao, C. R. and Toutenburg, H. (1995). Linear Models: Least Squares and
Alternatives, Springer, New York.
Rao, J. and Sitter, R. (1995). Variance estimation under two-phase sampling with application to imputation for missing data, Biometrika 82: 453–460.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Sample Surveys,
Wiley, New York.
Sukhatme, P. V., Sukhatme, B. V., Sukhatme, S. and Asok, C. (1984). Sampling Theory of Surveys with Applications, Iowa State University Press, Iowa.
Tracy, D. S. and Osahan, S. S. (1994). Random non-response on study variable versus on study as well as auxiliary variables, Statistica 54: 163–168.