ERRORS IN SAMPLE SURVEYS

TAUQUEER AHMAD
Indian Agricultural Statistics Research Institute
Library Avenue, New Delhi-110 012

1. Introduction
In probability sampling, when the observation yi on the ith unit is the correct value for that unit, the error of the estimate arises purely from random sampling variation, i.e. from measuring only a fraction of the units instead of the complete population. This deviation of the sample statistic from the population parameter is usually called sampling error. It is well known that if all units of the population are measured, the estimate will be free from sampling error. In practice, however, it may not always be possible to obtain the true observation yi on the ith unit. Consequently, the estimate based on the sample will also involve errors other than sampling errors. All errors in estimation that are not the result of sampling are called non-sampling errors, i.e. they form a residual category. Sampling errors arise because the estimate is based on a ‘part’ of the ‘whole’, while non-sampling errors mainly arise from some departure from the prescribed rules of the survey, such as those concerning survey design, field work, tabulation or analysis of data. This is why census results, though free from sampling errors, are subject to various types of non-sampling errors. Sometimes these non-sampling errors may be more important than the sampling errors and may affect the results substantially.

2. Classification of Non Sampling Errors
Non-sampling errors arise from numerous factors and at almost every stage of a survey, from planning to report writing. To study different aspects of non-sampling errors effectively, it is desirable to classify them according to the source, the stage of the survey, or the type of error. One approach is to classify non-sampling errors by the stage of the survey at which they occur. The three major stages in the survey are 1.
Survey design and preparation, 2. Data collection, and 3. Data processing and analysis. This classification is more useful for discussing measures for the control of non-sampling errors. A survey activity checklist for the control of non-sampling errors is given in the Appendix.

A second approach is to classify non-sampling errors by source or type of error. The three major categories under this classification are (i) Coverage errors, (ii) Non-response errors, and (iii) Measurement or response errors. This type of classification is more useful for discussing the implications of non-sampling errors and for suggesting methods of obtaining unbiased estimators in the presence of such errors. These categories are discussed in detail in the following sections.

3. Coverage Errors
The objective of any survey is to make inferences about a desired or Target Population. For this purpose, the sample is selected by applying an appropriate randomized procedure to a sampling frame in which all units of the Target Population are supposed to be represented uniquely. Coverage errors arise mainly from the use of a faulty frame of sampling units. For example, in a household survey, if an old list of households prepared for the population census a few years earlier is used to select the sample, some newly added households will not form part of the sampling frame, whereas a number of households that have since migrated will remain in it. The use of such frames may thus lead either to the inclusion of some units not belonging to the Target Population or to the omission of some units that do belong to it. Coverage errors may also arise from incorrect specifications or ignorance of the correct procedure by field workers, failure to identify the units actually selected, and intentional or unintentional enumeration of wrong units by the enumerator.
Some dishonest enumerators complete questionnaires for imaginary, made-up households and submit them in place of the actual households. In the USA this practice is known as curbstoning. Rules of association also often cause non-sampling errors. For example, in household surveys the de jure and de facto status of individuals (de jure: usual residence; de facto: actual presence of the individual at the time of interview) may be a cause of some non-sampling errors.

Hansen, Hurwitz and Jabine (1964) presented a detailed account of dealing with imperfect frames and proposed a useful technique, known as the predecessor-successor method, to obtain information on omissions in the frame. Seal (1962) discussed the use of outdated frames in large scale surveys, assuming the changes in the population to be a continuous stochastic process. Hartley (1962) proposed the use of two or more frames to overcome the problem of incomplete frames. Singh (1983,1986,1989) presented a mathematical formulation of the predecessor-successor method for estimating the total number of units missing from the frame and the total of the character under study for the Target Population.

4. Non Response Errors
Non-response errors arise from various causes at every stage of the survey: design, planning, execution, etc. Most non-response errors, however, arise mainly because of
• Not-at-home: the respondent may not be at home when the enumerator calls, and
• Refusal: the respondent may refuse to provide information to the enumerator for one reason or another. (In most surveys there is no legal obligation to respond.)

A Panel on Incomplete Data was established by the Committee on National Statistics, National Research Council, Washington in 1977. The panel prepared three volumes on ‘Incomplete Data in Sample Surveys’, published by Academic Press in 1983.
These publications provide detailed information on several case studies, theory and bibliographies.

The first attempt to deal with the problem of non-response was perhaps made by Hansen and Hurwitz (1946) through the call-back method. They assumed the population to be divided into two classes: (i) a response class, where respondents respond at the first attempt, and (ii) a non-response class, where respondents do not respond at the first attempt. Another method, which obtains unbiased estimators from the information collected from respondents at the first attempt only, was proposed by Politz and Simmons (1949). Kish and Hess (1959) proposed adding a sample of non-responding units from previous surveys to obtain information about the non-respondents. Another method often used to deal with non-response is substitution, i.e. replacing non-respondent units by other similar units. However, substitution does not eliminate the bias due to non-response at all.

4.1 Method of Imputation for Missing Data
The extent of non-response varies greatly between questions. Items such as race and sex usually have little non-response; on the other hand, receipts of income from various sources may have high non-response (Kalton, Kasprzyk and Santos 1981). The multivariate nature of surveys, with all variables potentially subject to missing data, suggests the need for a general-purpose strategy for handling item non-response. Imputation, defined as the process of estimating individual missing values in a data set, has become quite popular for dealing with item non-response. Kalton (1982) has given three important desirable features of imputation procedures:
• As with weighting adjustments for total non-response, it aims to reduce biases in survey estimates arising from missing data.
• By assigning values at the micro-level and thus allowing analysis to be conducted as if the data set were complete, it makes analysis easier to conduct and results easier to present. Complex algorithms to estimate population parameters in the presence of missing data are not required.
• The results obtained from different analyses are bound to be consistent, a feature which need not apply with an incomplete data set.

Imputation of missing data does, however, have its drawbacks. It is a last-resort activity, which may be justifiable for statistical data, and it is certainly not a cure for, but often a symptom of, poor data quality. It does not necessarily lead to estimates that are less biased than those obtained from the incomplete data set; indeed the biases could be much greater, depending on the imputation procedure and the form of estimate. There is also the risk that the analyst may treat the completed data set as if all the data were actual responses, thereby overstating the precision of the survey estimates. Therefore, analysts working with a data set containing imputed values should proceed with caution and be aware of the extent of imputation for the variables in their analysis as well as the details of the procedures used.

Platek and Gray (1983) discussed the total survey error model to deal with imputation methodology and obtained the contribution of different components to the total variance. Singh and Rai (1983) examined the effect of various imputation procedures on survey results and studied some important imputation procedures empirically.

4.1.1 Traditional Methods of Imputation
An imputation procedure is defined as a procedure that imputes, for each missing value, a value assumed to be quite close to the true one. A wide variety of imputation methods have been developed for assigning values for missing item responses (Kalton and Kasprzyk, 1986). Imputation may be quite useful when the imputation for any missing value is done within homogeneous imputation classes.

Deductive Imputation: Sometimes the missing answer to an item can be deduced with certainty from the pattern of responses to other items. Edit checks should check for consistency between responses to related items. When the edit checks constrain a missing response to only one possible value, deductive imputation can be employed. Deductive imputation is the ideal form of imputation.

Mean Imputation: Missing values are replaced by the mean of all responding values for the variable. This can be done based on the whole dataset or separately for different categories of respondents defined by combinations of selected classification variables.

Zero Imputation: It is a method of imputation in which zero is substituted for the missing data when a unit fails to respond.

Regression Imputation: This method uses respondent data to regress the variable for which imputations are required on a set of auxiliary variables. The regression equation is then used to predict the values for the missing responses. The imputed value may either be the predicted value or the predicted value plus some residual. There are several ways in which the residual may be obtained.

Cold-deck Imputation: Missing values are replaced by values of older data, e.g. from a previous survey, which could furthermore be adjusted for trend.

Hot-deck Imputation: In general, a hot-deck procedure is a duplication process - when a value is missing from a sample, a reported value is duplicated to represent this missing value. The adjective “hot” refers to imputing with values from the current sample. This procedure usually has some classification process associated with it. All of the sample units are classified into disjoint groups so that the units are as homogeneous as possible within each group. For each missing value, a reported value is imputed which is in the same classification group.
Thus, the assumption is made that within each classification group the non-respondents follow the same distribution as the respondents.

Current survey practice uses many variations of hot-deck procedures. A sequential hot-deck procedure is one in which the sample is put in some type of order within each classification group, and for each missing value the previous reported value is duplicated. For example, the ordering might be based on a geographic variable. The result of a geographic ordering is that the reported value duplicated for a missing value is from a unit which is geographically close to the unit with the missing value. The sequential hot-deck suffers the disadvantage that it may easily make multiple uses of donors, a feature that leads to a loss of precision in survey estimates.

This disadvantage of the sequential hot-deck is avoided in the hierarchical hot-deck method. The procedure sorts respondents and non-respondents into a large number of imputation classes from a detailed categorization of a sizeable set of auxiliary variables. Non-respondents are then matched with respondents on a hierarchical basis, in the sense that if a match cannot be made in the initial imputation class, classes are collapsed and the match is made at a lower level of detail.

Another form of hot-deck method is distance function matching, which assigns a non-respondent the value of the ‘nearest’ respondent, where ‘nearest’ is defined in terms of a distance function for the auxiliary variables. Various forms of distance function have been proposed, and the function can be constructed to reduce the multiple uses of donors by incorporating a penalty for each use.
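
The sequential hot-deck described above can be illustrated with a minimal Python sketch. The function name `sequential_hot_deck`, the use of `None` to mark a missing value, and the sample records are all assumptions made for illustration; they do not come from the source. The records are assumed to be already sorted (for example geographically) within each classification group, and a missing value with no earlier donor in its group is simply left missing.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def sequential_hot_deck(
    values: Sequence[Optional[float]],
    classes: Sequence[Hashable],
) -> List[Optional[float]]:
    """Fill each missing value (None) with the previous reported value
    from the same classification group, scanning records in the given order."""
    if len(values) != len(classes):
        raise ValueError("values and classes must have the same length")
    last_reported: Dict[Hashable, float] = {}
    completed: List[Optional[float]] = []
    for value, cls in zip(values, classes):
        if value is None:
            # Duplicate the most recent donor in this group, if one exists;
            # otherwise the value stays missing (None).
            completed.append(last_reported.get(cls))
        else:
            last_reported[cls] = value
            completed.append(value)
    return completed


# Hypothetical records, assumed already ordered within each group.
holding_area = [2.0, None, 3.5, None, None, 1.2]
village_class = ["A", "A", "B", "A", "B", "B"]
print(sequential_hot_deck(holding_area, village_class))
# prints [2.0, 2.0, 3.5, 2.0, 3.5, 1.2]
```

In this toy run the donor value 2.0 in group "A" is used twice, which is the multiple use of donors mentioned above as the weakness of the sequential procedure.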

Multiple Imputation: Because many imputation methods do not preserve distributional properties, multiple imputation is advocated as a way of improving the ability to make inferences from data where imputation has been undertaken, particularly when the proportion of missing values is high. Multiple imputation retains the advantages of single imputation, such as completing the data set and using expert knowledge for imputation, and rectifies its major disadvantages (Rubin, 1986). As its name suggests, multiple imputation replaces each missing value by a vector composed of M ≥ 2 possible values. The M values are ordered in the sense that the first components of the vectors for the missing values are used to create one completed data set, the second components are used to create the second completed data set, and so on. There are some practical difficulties with multiple imputation: there is generally a desire to produce one definitive micro data set for public use rather than several that will give slightly different results, and the typical data user may not be willing to analyse several data sets in order to obtain each answer.

4.1.2 New Methods of Imputation
Recent advances in methods and computing capabilities have made possible the application of more complex statistical modelling techniques, such as non-parametric regression, neural networks (including the multilayer perceptron and self-organizing maps), support vector machines, etc., for the purpose of imputation.

Measures to Study the Effectiveness of Imputations
To study the effectiveness of different imputation methods, the following three measures have been computed:

Mean Departure (MD): The mean departure denotes the mean of the differences between the imputed value and the true value for the missing units,

\[ MD = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^k - y_i \right) = \bar{y}^k - \bar{y} \]

where $y_i$ and $y_i^k$ denote the actual value and the value imputed by the k-th imputation method for the i-th unit.
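
As a small numerical illustration of the mean departure just defined, the following Python sketch computes MD for two sets of imputed values. It assumes that n is the number of units whose values were imputed and that their true values are known, as they would be in a simulation study. The function name `mean_departure` and the data are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_departure(actual: Sequence[float], imputed: Sequence[float]) -> float:
    """MD = (1/n) * sum(imputed_i - actual_i) over the n imputed units."""
    if len(actual) != len(imputed) or len(actual) == 0:
        raise ValueError("actual and imputed must be non-empty and equally long")
    return mean([y_k - y for y, y_k in zip(actual, imputed)])


# Hypothetical true values of four units treated as missing.
actual = [4.0, 6.0, 5.0, 9.0]
# Hypothetical imputed values from two methods: a constant 6.0, and zero.
constant_imputed = [6.0, 6.0, 6.0, 6.0]
zero_imputed = [0.0, 0.0, 0.0, 0.0]

print(mean_departure(actual, constant_imputed))  # prints 0.0
print(mean_departure(actual, zero_imputed))      # prints -6.0
```

The first result is zero even though no single imputed value need match its true value, because positive and negative departures cancel. The second shows the systematic downward departure produced by substituting zero.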
When computed for a large number of samples, MD may provide a measure of the bias of the method of imputation.

Mean Absolute Departure (MAD): The mean absolute departure denotes the mean of the absolute deviations of the imputed values from the actual values, i.e.,

\[ MAD = \frac{1}{n} \sum_{i=1}^{n} \left| y_i^k - y_i \right| \]

When computed for a large number of samples, MAD provides a measure of the closeness with which the imputation method reconstructs the missing values.

Standard Deviation Departure (SDD): To study the extent to which the various imputation methods disturb the distribution of the character under study, the standard deviation departure (SDD) is used. It is defined as the difference between the standard deviation of the actual values and the standard deviation of the imputed values.

A simulation study showed that, except for the zero substitution method, all the imputation methods performed almost equally well, and all three measures worked well. Also, as expected, the departures increase for all the imputation methods as the non-response rate increases.

5. Measurement or Response Errors
Response errors arise in collecting data or taking observations and are mainly contributed by the respondent, the enumerator, or both. Response errors refer to the differences between the individual true value and the corresponding observed sample value, irrespective of the reasons for the discrepancies. For example, in an agricultural survey a householder may report a total area for his holding that differs from the cadastral data. Sometimes the measurement devices or techniques may be defective and may cause observational errors. Response errors are often accidental, but they may also be introduced purposely or may arise from lack of information. Deliberate misreporting may be due to fear or prestige, or simply to a wish to conform to what respondents think is appropriate. Women generally declare themselves younger.
People raise their level of education or their occupation: an Assistant declares himself a Manager, a Compounder a Medical Practitioner, etc. Similarly, people exaggerate their salary, rent, money spent on food, clothing, etc. Mc Ford (1951) showed how people try to appear well informed. Respondents were asked if they had heard of particular magazines, writers, pieces of legislation, etc. that in fact never existed. A very large proportion of respondents answered ‘Yes’.

Given the importance of measurement errors in survey sampling, an International Conference on ‘Measurement Errors in Surveys’ was held during Nov 11-14, 1990 in Tucson, Arizona, sponsored by the Survey Research Methods Section of the American Statistical Association. Thirty-two invited papers presented at the conference have been published in book form as ‘Measurement Errors in Surveys’, edited by Paul P. Biemer, Robert M. Groves, Lars E. Lyberg, Nancy A. Mathiowetz and Seymour Sudman, Wiley Interscience, John Wiley & Sons Inc., 1991.

5.1 Study of Measurement Errors
In recent years much research on sampling practice has been devoted to the study of measurement errors. The objectives are to discover the components that make large contributions and to find ways of eliminating or reducing them. Ideally, the best method is to obtain the correct value yi. This approach is, however, limited to items that can be measured correctly by some alternative method. Belloc (1954) compared data on hospitalisation as reported in household interviews with the hospital records for the individuals. Checks of this type, called ‘record checks’, are possible for items such as age, occupation, price paid, etc. An alternative is to remeasure by an independent method that is more accurate. Kish and Lansing (1954) engaged professional appraisers to estimate the selling price of homes whose value had already been reported by the homeowners.
Another possibility is to reinterview a sub-sample of respondents with more qualified enumerators and with more accurate measuring devices.

5.2 Interpenetrating Subsampling
This important technique, proposed by Mahalanobis (1946) mainly for the estimation of variance, is particularly useful for the study of correlated errors. In the simplest terms, a random sample of n units is divided at random into k sub-samples, each containing $m = n/k$ units. The fieldwork and processing of the samples are planned in such a manner that there is no correlation between the errors of measurement of units in different sub-samples. The most important factor introducing correlation is the bias of enumerators. Thus, if each of k enumerators is assigned to a different sub-sample and there is no correlation between the errors of measurement for different interviewers, we can easily estimate the contribution of interviewer bias to the variance and also test the null hypothesis of no interviewer bias.

For the mathematical treatment of observational errors, models based on the assumption that repeated observations can be made on a unit have been proposed by Sukhatme and Seth (1952) and Hansen et al. (1953, 1961, 1964).

APPENDIX
CHECK LIST FOR CONTROL OF NON SAMPLING ERRORS

Survey Activity: Action
1. General planning: Has any such survey been conducted earlier?
2. Selection of topics and items: Number and length of questions, reference period, concepts, frame, sampling design, sampling units and rules of association, methods of data collection, development of questionnaires, pre-testing for refining and estimating cost factors, outline of tabulation, interviewer selection and training.
3. Data collection: Schedule of field supervision, editing of completed questionnaires, re-interview of sub-sample, suggestions for improvement in subsequent surveys.
4. Data processing and analysis: Correct and unique identification of each questionnaire, instruction for manual edit and coding, verification of coders' work, computer processing, method of estimation and tabulation.
5. Report writing: Description of survey design, concepts and definitions, sampling and non-sampling errors and suggestions for future surveys.

References and Suggested Reading
Belloc, B.B. (1954). Validation of morbidity survey data by comparison with hospital records. J. Amer. Stat. Assoc., 49, 832-846.
Hansen, M.H. and Hurwitz, W.N. (1946). The problem of non response in sample surveys. J. Amer. Stat. Assoc., 41, 517-529.
Hanson, R.H. and Marks, E.S. (1958). Influence of the interviewer on the accuracy of survey results. J. Amer. Stat. Assoc., 53, 635-655.
Hansen, M.H., Hurwitz, W.N. and Bershad, M. (1961). Measurement errors in census and surveys. Bull. Int. Stat. Inst., 38, 2, 359-374.
Hansen, M.H., Hurwitz, W.N. and Jabine, T.B. (1964). The use of imperfect lists for probability sampling at the U.S. Bureau of the Census. Bull. Int. Stat. Inst., 40.
Kish, L. and Lansing, J.B. (1954). Response errors in estimating the value of homes. J. Amer. Stat. Assoc., 49, 520-538.
Mahalanobis, P.C. (1946). Recent experiments in statistical sampling in the Indian Statistical Institute. J. Roy. Stat. Soc., 109, 325-370.
Politz, A.N. and Simmons, W.R. (1949). An attempt to get the ‘not at homes’ into the sample without call backs. J. Amer. Stat. Assoc., 44, 9-31, and 45, 136-137.
Seal, K.C. (1962). Use of outdated frames in large scale sample surveys. Calcutta Statist. Assoc. Bull., 11.
Singh, R. (1983). On the use of incomplete frames in sample surveys. Biom. J., 25, 545-549.
Singh, R. (1985). Estimation from incomplete data in longitudinal surveys. JSPI, 7, 163-170.
Singh, R. (1986). Predecessor-Successor Method. Encyclopedia of Statistical Sciences, V.7, 137-139. John Wiley & Sons Inc.
Singh, R. and Rai, T. (1983). Use of Imputations for Missing Data in Census and Surveys. Project Report, Indian Agricultural Statistics Research Institute (ICAR), New Delhi.
Sukhatme, P.V. and Seth, G.R. (1952). Non sampling errors in surveys. J. Indian Soc. Agril. Statist., 4, 5-41.
Zarkovich, S.S. (1966). Quality of Statistical Data. F.A.O., Rome.