Amputation versus Imputation of Missing Values
through Ratio Method in Sample Surveys

Sonderforschungsbereich 386, Paper 155 (1999)
Online unter: http://epub.ub.uni-muenchen.de/

H. Toutenburg
V.K. Srivastava

May 27, 1999
Abstract

In this article, we consider the estimation of the population mean when some observations on the study characteristic are missing in the bivariate sample data. In all, five estimators are presented and their efficiency properties are discussed. One estimator arises from the amputation of incomplete observations, while the remaining four estimators are formulated using imputed values obtained by the ratio method of estimation.
1 Introduction

The infeasibility of obtaining all the observations in the sample is not an uncommon aspect of data collection in many instances of sample surveys. This may occur due to a variety of reasons. For example, a breakdown or some snag may arise in the instrument and/or measuring device, rendering it unusable for completing the process of data collection. Subjects like patients, animals and plants may fail to survive due to factors that are unrelated to the experiment. Often typical practical difficulties are faced in the collection of data for a part of the sample. Sometimes the respondents may supply information which is inconsistent due to some inner contradictions or otherwise, and the investigator is forced to delete it.
When some observations in the sample are missing, the simplest solution is perhaps to amputate the incomplete observations and to restrict attention to the complete observations only for the purpose of statistical analysis. Alternatively, one may employ some imputation method for finding substitutes for the missing observations; see, e.g., Little and Rubin (1987), Rao and Toutenburg (1995) and Rubin (1987) for an interesting account. Treating these imputed values as true observations, one may then conduct the statistical analysis using the standard procedures developed for data without any missing observation. Such a practice, it is well recognized, may tend to invalidate the inferences and may often have serious consequences.
Institute of Statistics, University of Munich, 80799 Munich, Germany
Department of Statistics, University of Lucknow, Lucknow 226007, India
In this article, we consider the estimation of the population mean on the basis of a random sample drawn according to the procedure of simple random sampling without replacement. It is assumed, following Rao and Sitter (1995) and Tracy and Osahan (1994), that some units in the sample fail to respond and the observations on the study characteristic are not available, while this is not the case with the auxiliary characteristic, on which all the observations in the sample are available. For the missing values of the study characteristic, the method of ratio imputation is a commonly employed procedure in sample surveys. Using it, we have considered four estimators for the population mean of the study characteristic besides the conventional estimator (i.e., the mean of the available observations), which amputates the incomplete observations. Comparing their efficiency properties, it is observed that outright amputation is not a good proposition and the use of ratio imputation is worthwhile. It helps in improving the efficiency of estimation under some mild constraints.
The plan of this article is as follows. In Section 2, we describe the imputation procedure and present estimators for the population mean. Bias properties of these estimators are studied in Section 3. Similarly, their mean squared errors are analyzed in Section 4 and conditions for the superiority of one estimator over the other are found. Lastly, the derivation of the results is provided in the Appendix.
2 Estimators For Mean

Let us consider a finite population of size N with values Y_1, Y_2, \ldots, Y_N of the study characteristic and values X_1, X_2, \ldots, X_N of the auxiliary characteristic. For the estimation of the population mean \bar{Y}, a random sample of size n is drawn according to the procedure of simple random sampling without replacement. Assuming the nonresponse to be random, suppose that there are (n-p) complete observations (y_1, x_1), (y_2, x_2), \ldots, (y_{n-p}, x_{n-p}) and p incomplete observations x_1, x_2, \ldots, x_p for which only the auxiliary characteristic is available. Thus the sample comprises two respondent sets: one of size (n-p) denoted by s and the other of size p denoted by \bar{s}.

When the incomplete observations are discarded, it is customary to estimate \bar{Y} by

\bar{y} = \frac{1}{n-p} \sum_{i=1}^{n-p} y_i .   (2.1)

When the incomplete observations are not discarded and some imputation method is followed, the completed data set is specified by

z_i = \begin{cases} y_i & \text{if } i \in s \\ \tilde{y}_i & \text{if } i \in \bar{s} \end{cases}   (2.2)

and the population mean is estimated by

t = \frac{1}{n} \sum_{i=1}^{n} z_i = \frac{1}{n} \Big( \sum_{i \in s} y_i + \sum_{i \in \bar{s}} \tilde{y}_i \Big)   (2.3)

where \tilde{y}_i denotes the imputed value of the study characteristic corresponding to the observation x_i.

If the method of ratio imputation is employed, there are two simple choices of \tilde{y}_i, viz.,

\tilde{y}_i = \frac{\bar{X}}{\bar{x}} \, \bar{y}   (2.4)

\tilde{y}_i = \frac{(n-p)\bar{X} + p\bar{x}}{n\bar{x}} \, \bar{y}   (2.5)

where \bar{x} = \frac{1}{n-p} \sum_{i=1}^{n-p} x_i, \bar{x}_p = \frac{1}{p} \sum_{i \in \bar{s}} x_i and \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i.

In the above two formulations, it is assumed that \bar{X} is known. If it is not known, we may define the imputed values as

\tilde{y}_i = \frac{x_i}{\bar{x}} \, \bar{y}   (2.6)

following Rao and Sitter (1995, p. 459). On the same lines, we propose another set of imputed values as follows:

\tilde{y}_i = \frac{(n-p)x_i + p\bar{x}_p}{(n-p)\bar{x} + p\bar{x}_p} \, \bar{y} .   (2.7)

Utilizing (2.4)–(2.7) in (2.3), we obtain the following four estimators of \bar{Y}:

t_1 = \frac{(n-p)\bar{x} + p\bar{X}}{n\bar{x}} \, \bar{y}   (2.8)

t_2 = \frac{[(n-p)^2 + np]\bar{x} + p(n-p)\bar{X}}{n^2 \bar{x}} \, \bar{y}   (2.9)

t_3 = \frac{(n-p)\bar{x} + p\bar{x}_p}{n\bar{x}} \, \bar{y}   (2.10)

t_4 = \frac{(n-p)^2 \bar{x} + p(2n-p)\bar{x}_p}{n[(n-p)\bar{x} + p\bar{x}_p]} \, \bar{y} .   (2.11)

Thus we have five estimators for estimating the population mean \bar{Y}. The estimator \bar{y} is based on amputation of the incomplete data while the estimators t_1, t_2, t_3 and t_4 are based on ratio imputation of the missing observations. Out of these four, two estimators require the knowledge of the population mean \bar{X} of the auxiliary characteristic while the remaining two estimators are free from it.
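For illustration, the five point estimators may be computed as follows. This is a minimal numerical sketch in Python; the function name and the NaN coding of missing study values are our own, and the formulas follow the estimators (2.1) and (2.8)–(2.11) as reconstructed above.

```python
import numpy as np

def estimators(y, x, X_bar):
    """Five estimators of the population mean: the amputation estimator
    (2.1) and the ratio-imputation estimators (2.8)-(2.11).
    Missing study values in y are coded as NaN; x is fully observed."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    miss = np.isnan(y)
    n, p = len(y), int(miss.sum())
    y_r, x_r, x_m = y[~miss], x[~miss], x[miss]
    yb = y_r.mean()                    # amputation estimator (2.1)
    xb = x_r.mean()                    # x-mean over the complete pairs
    xb_p = x_m.mean() if p else xb     # x-mean over the incomplete units
    t1 = yb * ((n - p) * xb + p * X_bar) / (n * xb)
    t2 = yb * (((n - p) ** 2 + n * p) * xb + p * (n - p) * X_bar) / (n ** 2 * xb)
    t3 = yb * ((n - p) * xb + p * xb_p) / (n * xb)
    t4 = yb * ((n - p) ** 2 * xb + p * (2 * n - p) * xb_p) \
         / (n * ((n - p) * xb + p * xb_p))
    return yb, t1, t2, t3, t4
```

A quick sanity check: when all auxiliary values are equal, every ratio of x-means collapses to one and all five estimators reduce to \bar{y}.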
3 Comparison Of Biases

Let us write

S_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 , \qquad S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 ,

\rho = \frac{1}{(N-1) S_x S_y} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) , \qquad \theta = \frac{\bar{Y} S_x}{\bar{X} S_y} ,

f = \Big( \frac{1}{n} - \frac{1}{N} \Big) E_p \Big( \frac{p}{n} \Big) , \qquad g = E_p \Big[ \frac{p}{n} \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big]   (3.1)

where E_p denotes the expectation with respect to the nonnegative integer valued random variable p. Further, we assume that the correlation coefficient \rho is nonnegative, which is a basic requirement for the application of the ratio method.
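Since f and g in (3.1), as reconstructed here, are plain expectations over the distribution of p, they can be evaluated directly once that distribution is specified. The sketch below (the function name and the example distribution are ours) also exposes the inequality f \le g that is used repeatedly later.

```python
def f_g(n, N, p_dist):
    """f and g of (3.1), reconstructed as
    f = (1/n - 1/N) E_p(p/n),  g = E_p[(p/n)(1/(n-p) - 1/N)].
    p_dist maps each possible value of p (with p < n) to its probability."""
    f = (1.0 / n - 1.0 / N) * sum(q * p / n for p, q in p_dist.items())
    g = sum(q * (p / n) * (1.0 / (n - p) - 1.0 / N) for p, q in p_dist.items())
    return f, g
```

Termwise 1/n \le 1/(n-p), so f \le g for every distribution of p; for example, f_g(20, 1000, {1: 0.5, 2: 0.3, 3: 0.2}) gives f \approx 0.0042 against g \approx 0.0047.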
It is easy to see that the mean \bar{y} ignoring the incomplete observations is an unbiased estimator of \bar{Y} while the estimators t_1, t_2, t_3 and t_4 using the ratio method of imputation for missing values are generally biased. In order to study the magnitudes and directions of their biases, we assume that p is small and (n-p) is large, which implies that n is large. Now let us consider the large sample approximations which are derived in the Appendix following Sukhatme, Sukhatme, Sukhatme and Asok (1984).
Theorem 1 The order O(n^{-2}) approximations for the biases of the estimators t_1, t_2, t_3 and t_4 are given by

B(t_1) = E(t_1 - \bar{Y}) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}   (3.2)

B(t_2) = E(t_2 - \bar{Y}) = f (\theta - \rho) \frac{S_x S_y}{\bar{X}}   (3.3)

B(t_3) = E(t_3 - \bar{Y}) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}   (3.4)

B(t_4) = E(t_4 - \bar{Y}) = [ (g - f) \theta - g \rho ] \frac{S_x S_y}{\bar{X}} .   (3.5)
From the above expressions, it is interesting to observe that all the four estimators are unbiased to order O(n^{-1}) and the bias precipitates in the terms of order O(n^{-2}). However, the estimators t_1, t_2 and t_3 are also unbiased to order O(n^{-2}) when \theta = \rho. If \theta is not less than 1, these three estimators are biased in the positive direction. The bias continues to remain positive so long as \rho < \theta < 1. It changes its sign only when \theta < \rho. So far as the estimator t_4 is concerned, it is also unbiased to order O(n^{-2}) when (g-f)\theta = g\rho. Its bias is positive or negative according as (g-f)\theta is larger or smaller than g\rho.
Comparing the estimators with respect to the criterion of magnitude of bias, we find that the estimators t_1 and t_3 have an equal amount of bias, at least to the order of our approximation. Further, t_2 has always a smaller bias than t_1 and t_3 as f cannot exceed g. Similarly, the estimator t_2 has a smaller magnitude of bias in comparison to the estimator t_4 when

[ (g-f)^2 - f^2 ] (\theta - \rho)^2 + f^2 \rho^2 - 2 f (g-f) \rho (\theta - \rho) > 0   (3.6)

while the reverse is true, i.e., t_4 is less biased in magnitude than t_2, when the inequality (3.6) holds with an opposite sign.
Observing that

[B(t_1)]^2 - [B(t_4)]^2 = [B(t_3)]^2 - [B(t_4)]^2 = f \theta [ (2g-f)\theta - 2g\rho ] \Big( \frac{S_x S_y}{\bar{X}} \Big)^2   (3.7)

we see that t_4 has a smaller magnitude of bias than t_1 and t_3 when (2g-f)\theta is larger than 2g\rho. On the contrary, the estimators t_1 and t_3 are less biased in magnitude in comparison to t_4 when (2g-f)\theta is less than 2g\rho.
4 Comparison Of Mean Squared Errors

Recalling that \bar{y} is an unbiased estimator of \bar{Y}, its variance is given by

V(\bar{y}) = E(\bar{y} - \bar{Y})^2 = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 .   (4.1)

As the estimators t_1, t_2, t_3 and t_4 are generally not unbiased, we consider their mean squared errors for the purpose of comparison. These are derived in the Appendix and presented below.

Theorem 2 To order O(n^{-2}), the differences between the variance of \bar{y} and the mean squared errors of the estimators t_1, t_2, t_3 and t_4 are given by

\Delta(\bar{y}, t_1) = E(\bar{y} - \bar{Y})^2 - E(t_1 - \bar{Y})^2 = 2 g \rho S_x S_y \frac{\bar{Y}}{\bar{X}}   (4.2)

\Delta(\bar{y}, t_2) = E(\bar{y} - \bar{Y})^2 - E(t_2 - \bar{Y})^2 = 2 f \rho S_x S_y \frac{\bar{Y}}{\bar{X}}   (4.3)

\Delta(\bar{y}, t_3) = E(\bar{y} - \bar{Y})^2 - E(t_3 - \bar{Y})^2 = (2 g \rho - f \theta) S_x S_y \frac{\bar{Y}}{\bar{X}}   (4.4)

\Delta(\bar{y}, t_4) = E(\bar{y} - \bar{Y})^2 - E(t_4 - \bar{Y})^2 = (2 g \rho - f \theta) S_x S_y \frac{\bar{Y}}{\bar{X}} .   (4.5)
As \rho is assumed to be positive, it is clear from (4.2) and (4.3) that both the estimators t_1 and t_2 are better than \bar{y}, implying the superiority of imputation over amputation.

Looking at the expressions (4.4) and (4.5), we find that the estimators t_3 and t_4 are better than \bar{y} when

\rho > \frac{f}{2g} \, \theta .   (4.6)

As f < g, this condition is satisfied as long as \rho > \theta/2, which is the well-known condition for the superiority of the ratio estimator over the sample mean when no observation is missing; see, e.g., Sukhatme et al. (1984, Chap. 5). Thus, so long as the favourable environment for the application of the ratio method prevails (i.e., \rho > \theta/2), the missingness of some observations on the study characteristic and their imputation by the ratio method do not exert any adverse effect. In fact, the ratio imputation succeeds in widening the range of admissible values of \rho; see (4.6).
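The superiority of imputation over amputation can also be seen in a small Monte Carlo sketch. The setup below is entirely our own illustration, not the paper's: an effectively infinite superpopulation (so the finite population correction is ignored) with \rho \approx 0.96 and \theta \approx 0.96, so that \rho > \theta/2 holds comfortably.

```python
import numpy as np

# Monte Carlo sketch: amputation (y-bar) versus ratio imputation (t1).
# Illustrative superpopulation model only; missingness is random because
# the draws are i.i.d., and the last p units are treated as nonrespondents.
rng = np.random.default_rng(0)
n, p, reps = 30, 5, 20000
Y_bar, X_bar = 20.0, 10.0
se_amp = se_t1 = 0.0
for _ in range(reps):
    x = rng.normal(X_bar, 1.0, n)
    y = Y_bar + 2.0 * (x - X_bar) + rng.normal(0.0, 0.6, n)
    y_r, x_r = y[:n - p], x[:n - p]     # the (n - p) complete pairs
    yb, xb = y_r.mean(), x_r.mean()
    t1 = yb * ((n - p) * xb + p * X_bar) / (n * xb)
    se_amp += (yb - Y_bar) ** 2
    se_t1 += (t1 - Y_bar) ** 2
print(se_t1 / se_amp)   # empirical MSE ratio, well below 1 for this rho
```

The printed ratio estimates MSE(t_1)/V(\bar{y}) and falls clearly below one, in line with (4.2).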
Next, let us compare the biased estimators.

When \bar{X} is known, we have two estimators t_1 and t_2, out of which t_1 ignores the incomplete observations while t_2 incorporates them. Further, t_2 has always a smaller magnitude of bias in comparison to t_1; see (3.2) and (3.3). If we compare their mean squared errors, it is seen from (4.2) and (4.3) that the estimator t_1 has a smaller mean squared error than t_2.

When \bar{X} is not known, we have again two estimators t_3 and t_4 which utilize the entire set of available observations. Further, t_3 has a smaller (larger) magnitude of bias than t_4 when 2g\rho is greater (less) than (2g-f)\theta. Comparing them with respect to the criterion of mean squared error to order O(n^{-2}), we observe from (4.4) and (4.5) that both are equally efficient to the given order of approximation. However, a difference precipitates if we consider higher order approximations.
Theorem 3 To order O(n^{-3}), we have

\Delta(t_3, t_4) = E(t_3 - \bar{Y})^2 - E(t_4 - \bar{Y})^2 = \frac{\bar{Y}^2}{\bar{X}^3} E_p \Big[ \Big( \frac{p}{n} \Big)^2 \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big] Q   (4.7)

where

Q = \frac{1}{N-1} \sum_{i=1}^{N} X_i (X_i - \bar{X})^2 .   (4.8)

We thus find that the estimator t_4 is better than t_3 when Q is positive, which may generally hold good in many practical situations.
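The sign condition on Q is easy to inspect. In the form of (4.8) reconstructed here, Q = m_3 + \bar{X} S_x^2, where m_3 is the third central moment with the (N-1) divisor, so Q can be negative only for a markedly left-skewed auxiliary characteristic with a small mean. A minimal sketch (the function name is ours):

```python
import numpy as np

def Q(X):
    """Q of (4.8), reconstructed as (1/(N-1)) * sum_i X_i (X_i - X_bar)^2."""
    X = np.asarray(X, dtype=float)
    Xb = X.mean()
    return float((X * (X - Xb) ** 2).sum() / (len(X) - 1))
```

For the right-skewed population X = (1, 2, 3, 10) one gets Q = 380/3 > 0, so t_4 would be preferred over t_3 there.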
Finally, let us examine the role of the knowledge of \bar{X} through a comparison of the estimators t_1 and t_2 with t_3 and t_4.

Let us first recall that t_1 has the same bias as t_3 but it is less biased in magnitude than t_4 when 2g\rho exceeds (2g-f)\theta. Further, it is observed from (4.2), (4.4) and (4.5) that the estimator t_1 has invariably a smaller mean squared error than t_3 and t_4. Similarly, we observe from (4.3), (4.4) and (4.5) that the estimator t_2 is more efficient than both the estimators t_3 and t_4. This means that the knowledge of \bar{X} plays an important role in improving the efficiency of estimation when some observations are missing and the method of ratio imputation is employed for them.
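The value of knowing \bar{X} can be illustrated with the same kind of Monte Carlo sketch as before (again our own superpopulation setup, with the finite population correction ignored): t_1, which uses \bar{X}, shows a smaller empirical mean squared error than t_3, which replaces that knowledge by the sample information \bar{x}_p.

```python
import numpy as np

# Sketch: ratio imputation with known X-bar (t1) versus unknown X-bar (t3).
# Illustrative model only; not a numerical study from the paper.
rng = np.random.default_rng(1)
n, p, reps = 30, 5, 20000
Y_bar, X_bar = 20.0, 10.0
se_t1 = se_t3 = 0.0
for _ in range(reps):
    x = rng.normal(X_bar, 1.0, n)
    y = Y_bar + 2.0 * (x - X_bar) + rng.normal(0.0, 0.6, n)
    y_r, x_r, x_m = y[:n - p], x[:n - p], x[n - p:]
    yb, xb, xb_p = y_r.mean(), x_r.mean(), x_m.mean()
    t1 = yb * ((n - p) * xb + p * X_bar) / (n * xb)
    t3 = yb * ((n - p) * xb + p * xb_p) / (n * xb)
    se_t1 += (t1 - Y_bar) ** 2
    se_t3 += (t3 - Y_bar) ** 2
print(se_t1 / se_t3)   # below 1: knowing X-bar improves efficiency
```

The gap between the two estimators corresponds to the term f\theta S_x S_y \bar{Y}/\bar{X} by which (4.2) exceeds (4.4).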
APPENDIX

If we write

\delta_x = (\bar{x} - \bar{X}) , \qquad \delta_y = (\bar{y} - \bar{Y}) , \qquad \delta_x^* = (\bar{x}_p - \bar{X}) ,

we observe that \delta_x and \delta_y are of order O_p(n^{-1/2}) while \delta_x^* is of order O_p(1). Further, we have

E(\delta_x) = E(\delta_x^*) = E(\delta_y) = 0 .

Now we can express

t_1 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p}{n\bar{X}} \bar{y} (\bar{x} - \bar{X}) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1} = \delta_y - \frac{p}{n\bar{X}} \delta_x (\bar{Y} + \delta_y) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1} .

Expanding and retaining terms to order O_p(n^{-2}), we find

t_1 - \bar{Y} = \delta_y - \frac{p\bar{Y}}{n\bar{X}} \delta_x - \frac{p}{n\bar{X}} \delta_x \delta_y + \frac{p\bar{Y}}{n\bar{X}^2} \delta_x^2

whence

E(t_1 - \bar{Y}) = E_p \Big[ \frac{p}{n} \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big] \Big( \frac{\bar{Y} S_x^2}{\bar{X}^2} - \frac{\rho S_x S_y}{\bar{X}} \Big) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}

which leads to the result (3.2) of Theorem 1.

Similarly, to order O(n^{-2}), we have

E(t_1 - \bar{Y})^2 = E(\delta_y^2) - \frac{2p\bar{Y}}{n\bar{X}} E(\delta_x \delta_y) = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 - 2 g \rho S_x S_y \frac{\bar{Y}}{\bar{X}}

giving the expression (4.2) of Theorem 2.

Next, we observe that

t_2 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p(n-p)}{n^2 \bar{X}} \bar{y} (\bar{x} - \bar{X}) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1}
= \delta_y - \frac{p(n-p)\bar{Y}}{n^2 \bar{X}} \delta_x - \frac{p(n-p)}{n^2 \bar{X}} \delta_x \delta_y + \frac{p(n-p)\bar{Y}}{n^2 \bar{X}^2} \delta_x^2 + O_p(n^{-5/2}) .

Thus the bias to order O(n^{-2}) is

E(t_2 - \bar{Y}) = f (\theta - \rho) \frac{S_x S_y}{\bar{X}}

and the mean squared error to the same order of approximation is

E(t_2 - \bar{Y})^2 = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 - 2 f \rho S_x S_y \frac{\bar{Y}}{\bar{X}} .

These provide the result (3.3) of Theorem 1 and the result (4.3) of Theorem 2.

Similarly, for the estimator t_3, we have

t_3 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p}{n\bar{X}} \bar{y} (\bar{x} - \bar{x}_p) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1}
= \delta_y - \frac{p}{n\bar{X}} (\bar{Y} + \delta_y)(\delta_x - \delta_x^*) \Big( 1 + \frac{\delta_x}{\bar{X}} \Big)^{-1}
= \delta_y - \frac{p}{n\bar{X}} \Big[ \bar{Y} (\delta_x - \delta_x^*) + \delta_y (\delta_x - \delta_x^*) - \frac{\bar{Y}}{\bar{X}} \delta_x (\delta_x - \delta_x^*) \Big] + O_p(n^{-5/2}) .

Taking expectation and retaining terms up to order O(n^{-2}), we get

E(t_3 - \bar{Y}) = g (\theta - \rho) \frac{S_x S_y}{\bar{X}}

which gives the result (3.4) of Theorem 1. Similarly, to the same order of approximation, we have

E(t_3 - \bar{Y})^2 = E_p \Big( \frac{1}{n-p} - \frac{1}{N} \Big) S_y^2 - (2 g \rho - f \theta) S_x S_y \frac{\bar{Y}}{\bar{X}}

which leads to the result (4.4) of Theorem 2.

Proceeding in the same manner, we can express

t_4 - \bar{Y} = (\bar{y} - \bar{Y}) - \frac{p(n-p)(\bar{x} - \bar{x}_p)}{n[(n-p)\bar{x} + p\bar{x}_p]} \bar{y}
= \delta_y - \frac{p(n-p)}{n^2 \bar{X}} (\bar{Y} + \delta_y)(\delta_x - \delta_x^*) \Big[ 1 + \frac{(n-p)\delta_x + p\delta_x^*}{n\bar{X}} \Big]^{-1} .

We thus find

E(t_4 - \bar{Y}) = [ (g - f) \theta - g \rho ] \frac{S_x S_y}{\bar{X}}

which is the result (3.5) of Theorem 1. The result (4.5) of Theorem 2 can be obtained in a similar way.

Lastly, let us consider the result stated in Theorem 3. It is observed that

\Delta(t_3, t_4) = E(t_3 - \bar{Y})^2 - E(t_4 - \bar{Y})^2 = E \big\{ [ (t_3 - \bar{Y}) + (t_4 - \bar{Y}) ] [ (t_3 - \bar{Y}) - (t_4 - \bar{Y}) ] \big\} .

Expanding the two factors as above, retaining terms to order O(n^{-3}) and neglecting terms of order O(n^{-7/2}), we obtain

\Delta(t_3, t_4) = \frac{\bar{Y}^2}{\bar{X}^3} E_p \Big[ \Big( \frac{p}{n} \Big)^2 \Big( \frac{1}{n-p} - \frac{1}{N} \Big) \Big] \frac{1}{N-1} \sum_{i=1}^{N} X_i (X_i - \bar{X})^2

which yields the desired result (4.7).
References
Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data,
Wiley, New York.
Rao, C. R. and Toutenburg, H. (1995). Linear Models: Least Squares and
Alternatives, Springer, New York.
Rao, J. and Sitter, R. (1995). Variance estimation under two-phase sampling with application to imputation for missing data, Biometrika 82: 453–460.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Sample Surveys,
Wiley, New York.
Sukhatme, P. V., Sukhatme, B. V., Sukhatme, S. and Asok, C. (1984). Sampling Theory of Surveys with Applications, Iowa State University Press, Iowa.
Tracy, D. S. and Osahan, S. S. (1994). Random non-response on study variable versus on study as well as auxiliary variables, Statistica 54: 163–168.