Workshop on Complex Sample Survey Design
National Statistical Office, and International Health Policy, Thailand: 5-6 September 2013
Featuring Professor Thomas Lumley
Venue: Training Room, Government Complex, Building B (south), Laksi, Bangkok

Schedule

Thursday 5 September
8.30 am: Registration
9.00 am: Session 1: Basic Survey Design Concepts
10.00 am: Refreshment break
10.30 am: Session 2: Simulation of Sampling Designs
12.30 pm: Lunch break
1.30 pm: Session 3: Using the Survey Library
3.00 pm: Refreshment break
3.30 pm: Session 4: Discussion

Friday 6 September
9.00 am: Session 5: Computer Exercises
10.30 am: Refreshment break
11.00 am: Session 6: Further Computer Exercises
12.30 pm: Lunch break
1.30 pm: Session 7: Conclusions and Recommendations
3.00 pm: Prize giving

Teaching Faculty

Pilot: Professor Thomas Lumley
Co-pilot: Dr Alan Geater
Cabin Crew: Aj Edward McNeil, Dr Bungon Kumphon, Aj Wandee Wanishsakpong
“In cluster sampling birds of a feather flock together.”
Senior Flight Attendant: Dr Apiradee Lim
“Duck or fish?”
Ground control: Aj Don McNeil
“Trust nobody, believe nothing, and check everything.”

Session 1: Basic Survey Design Concepts

• Introduction
• Sample Size: Theory
• Sampling: Practice
• Stratified Samples
• The 2005 Verbal Autopsy Study
• Proportional Allocation Sampling
• Thailand Statistical Divisions
• Perinatal Deaths in PHA 12
• Sampling Frames
• Coverage
• Choice of Geographical Units
• Illustration of Stratification: PHA 12 Zone
• Clustering

Introduction

In this two-day workshop we will consider methods for taking samples from populations using designed surveys. These populations comprise subjects in a specified region of the world at a particular time, such as all persons living in Scotland in 2010, or all children under 5 who died in Thailand in 2005, or all elementary, middle, and high schools in California in 2012.
We’ll focus on specific characteristics of interest for these subjects, such as the mental health status of persons based on a standard questionnaire, or their reported cause of death classified by International Classification of Diseases (ICD-10) code, or the academic performance index of a school. The reason for taking a sample is to estimate the distributions of such population characteristics without studying every member of the population, and thus to obtain this information with sufficient accuracy at substantially reduced cost.

Sample Size: Theory

A basic question is the size of the sample. Larger samples give more accurate results, but at greater cost.

What about the population size? How does this affect the sample size? One might expect that a larger population requires a larger sample, but often the population size doesn’t matter.

To see this, suppose we want to know the average number of hours per day a person in Thailand aged 5 or more used the Internet in 2012, based on a random sample of 900 persons. In statistical theory, a 95% confidence interval for a population mean based on a random sample of size $n$ is approximately $\bar{y} \pm 2s/\sqrt{n}$, where $\bar{y}$ is the sample mean and $s$ is the sample standard deviation. If $n$ is 900, and $\bar{y}$ and $s$ are 3.4 and 3.0 hours/day, respectively, the half-width is $2 \times 3.0/\sqrt{900} = 0.2$ and the result is the interval 3.2 to 3.6 hours. But note that the population size does not figure in this result!

Note also that the formula for the confidence interval of a population mean assumes that the sample is random.¹ A simple random sample (SRS) (Lumley 2010 page 2) is one where every subset [of specified size] from the population is equally likely to be selected.²

Simple random sampling has an advantage over other sampling methods: it produces samples that are unbiased on average. This is because it gives equal weight to all samples of the same size, so the selection cannot systematically favour any group of subjects.
Such biases arise from population subjects that are more likely to be sampled because they are easier to get hold of (like students in a class who sit close to their teacher). But although the sample is chosen at random, it’s still possible to get a biased sample purely by chance. This risk can be reduced by appropriate stratification (Lumley 2010 Section 2.2).

________
¹ There are other assumptions. What are they?
² How many samples of size 2 are there in a population of size 10?

Sampling: Practice

Let’s consider how a simple random sample of size n (say 900) could be taken from a population, say from the population of Thai residents aged 5 or more. If we had a database containing data from a recent census of the population that included answers to the Internet-use question, it would be a simple matter to take a sample of n records from this database using a pseudo-random number generator.

In practice such databases are rarely available, and it is usually necessary to go out and collect the data, using a sample design that will work in practice. One possible design might involve standing on street corners in Bangkok and asking passers-by how many hours per day on average they spend on the Internet. (Many will refuse to respond, but eventually it might be possible to obtain 900 responses.³)

________
³ Is this a feasible sample design? Explain.

Sampling from Regions

Or we could sample residents from each of Thailand’s 13 Public Health Areas.⁴ How many should be sampled from each region? Each region is just a smaller population, and if the sample size is the major determinant of estimation accuracy, the samples should be approximately the same size.
To see this, note that the difference between the means of two samples of sizes $n_1$ and $n_2$ from a large population has standard error $s\sqrt{1/n_1 + 1/n_2}$, where $s$ is the pooled standard deviation. For a fixed total sample size $n_1 + n_2$, this is minimized when $n_1 = n_2$. The result extends to any number of components. It’s similar to the fact that a chain cannot be stronger than its weakest link.

________
⁴ How is this better than the corresponding SRS design? Explain.

The 2005 Verbal Autopsy Study

A verbal autopsy (VA) study to determine true cause of death sampled 3316 in-hospital and 6328 outside-hospital deaths in 2005 from 28 of the 926 districts in nine of Thailand’s 76 provinces. Rao et al (2010) described the design as follows.

The nine provinces comprised Bangkok and two from each of the four regions, chosen by ranking provinces by number of reported deaths and selecting one province above and one below the median. The 28 districts were selected similarly. Finally, approximately 50% of death certificates were selected from all villages and urban areas within these districts using the SRS method.

Proportional Allocation Sampling

The VA study had sample sizes varying from 316 in Chumpon to 2418 in Ubol Ratchathani. This variation arose because SRS was used at the final stage, that is, samples were allocated in proportion to the population. This is not quite the same as PPS (Probability-Proportional-to-Size) sampling (Lumley 2010 p 46).

Proportional allocation gives sample sizes that vary in proportion to the populations of the regions. But as we have seen, the accuracy of an estimate obtained by random sampling depends mainly on the size of the sample and not the size of the population. Unless the populations of the regions are of similar size, proportional allocation can give rise to small samples with relatively large standard errors and resulting loss of accuracy.
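As a rough illustration of this point, the sketch below compares the two extreme VA sample sizes quoted above (316 and 2418) with an equal split of the same total. It assumes a common standard deviation of 1 in both provinces, which is an assumption made purely for illustration and not a VA study result, and it uses s√(1/n1 + 1/n2) for the standard error of a difference between two means. The function names are hypothetical.

```python
import math


def se_mean(s, n):
    """Standard error of a sample mean under simple random sampling."""
    return s / math.sqrt(n)


def se_difference(s, n1, n2):
    """Standard error of the difference between two sample means."""
    return s * math.sqrt(1 / n1 + 1 / n2)


S = 1.0                      # assumed common standard deviation (illustrative)
N_SMALL, N_LARGE = 316, 2418  # smallest and largest VA province samples
n_equal = (N_SMALL + N_LARGE) // 2  # same total, split equally

# How much less accurate is the small-sample mean than the large-sample mean?
ratio = se_mean(S, N_SMALL) / se_mean(S, N_LARGE)

# Standard error of the difference under each allocation of the same total.
se_unequal = se_difference(S, N_SMALL, N_LARGE)
se_equal = se_difference(S, n_equal, n_equal)

print(f"SE ratio, small vs large sample: {ratio:.2f}")
print(f"SE of difference, unequal split: {se_unequal:.4f}")
print(f"SE of difference, equal split:   {se_equal:.4f}")
```

Under these assumptions the province with the smaller sample has a standard error nearly three times that of the larger one, and the equal split gives the smaller standard error for the comparison between the two.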
Instead, it might be better to balance sample sizes across regions. As an illustration, consider the VA survey results for the seven southern-most provinces comprising PHA 12 (Public Health Area 12).

Thailand Statistical Divisions

PHA 12 comprises seven provinces, 76 districts, and 535 tambons. (Thailand has 13 such PHAs.)

The VA sample design included just a single province (Songkla) and two of its 16 districts (Krasae Sin and Hat Yai), with respective sample sizes 60 (from 116 deaths) and 787 (from 1642 deaths). Of the 787 subjects sampled from Hat Yai, 331 (42%) were in tambon 901101 (the Central Business District of Hat Yai city).

As a consequence, the results were dominated by data from just one highly urbanized sub-district containing several large hospitals. This sampling method gives an inaccurate estimate of the total number of perinatal deaths in the population, as the next slide shows.

[Figure panels: Krasae Sin District (population deaths 116); Hat Yai District (population deaths 1642)]

Perinatal Deaths in PHA 12

Suppose we restrict the population to deaths of children aged under 5 in 2005 in PHA 12. This population size is N = 970. We’ll take the outcome variable to be the total number of deaths in the population due to conditions originating in the perinatal period. For the purpose of this exercise assume that we know only the numbers of reported deaths by location (in or outside hospital) and district.

The VA sample for this region contained just one subject aged under 5 in Krasae Sin district and 22 in Hat Yai district, with corresponding reported numbers of perinatal deaths 0 and 10. The estimated total number of perinatal deaths in the population based on this sample is obtained by multiplying the sample proportion (10/23) by the population size (970). The result is 421.7. Using the formula⁵ $\mathrm{SE} = N\sqrt{p(1-p)(1-n/N)/n}$ for its standard error, with $p = 10/23$ and $n = 23$, a 95% confidence interval is (223.6, 619.8).
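The calculation above can be checked with a short script. This is a minimal sketch: the function name is hypothetical, and the multiplier of 2 for the 95% interval is an assumption carried over from the confidence interval formula used earlier in the session.

```python
import math


def estimate_total(deaths_in_sample, n, N, z=2.0):
    """Estimate a population total from a sample proportion.

    deaths_in_sample: number of sampled subjects with the outcome
    n: sample size, N: population size
    z: interval multiplier (assumed to be 2 for an approximate 95% interval)
    """
    p = deaths_in_sample / n
    total = N * p
    # Standard error of the total, with finite population correction 1 - n/N.
    se = N * math.sqrt(p * (1 - p) * (1 - n / N) / n)
    return total, se, (total - z * se, total + z * se)


# 10 perinatal deaths among the 23 sampled under-5 deaths; N = 970.
total, se, (lower, upper) = estimate_total(10, 23, 970)
print(f"estimated total: {total:.1f}")
print(f"standard error:  {se:.1f}")
print(f"95% interval:    ({lower:.1f}, {upper:.1f})")
```

The output agrees with the figures quoted above to within rounding in the first decimal place. The interval is nearly 400 deaths wide, which reflects how little information a sample of 23 carries.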
________
⁵ In contrast to the statement on Slide 1.3, the accuracy of the proportion p now depends on the population size N through the term 1-n/N. Why?

When is Proportionate Sampling Justified?

We have argued for taking samples of similar size from each region. But whether this is a good rule or not depends a lot on what is being estimated.

If you want to estimate means for each region as accurately as possible, take equal sample sizes in each region. In this case proportionate sampling will lead to some regions having very small sample sizes, and inaccurate estimates.

On the other hand, if you want to estimate means or totals for the whole population as accurately as possible, take larger sample sizes from larger regions, and proportionate sampling works well. This is because a population mean is a weighted sum of region means, with weights proportional to the populations of the regions. Errors in an estimated mean for a large region will be magnified by the weighting, and so need to be smaller to begin with. In this case PPS is good for estimating an overall population mean or total.

Sampling Frames

The sampling frame for a survey is a list of all the population elements. For a survey of Internet use in Thailand it would comprise all residents aged 5 years or more in 2012 (about 60 million). The Verbal Autopsy study frame was a list of deaths reported in the VR database in 2005.

For an Internet use survey it is unlikely that a sampling frame would be available containing addresses of persons to be sampled. In such situations, particularly for household surveys, a feasible method involves collecting information from subjects sampled randomly from regions within larger regions in stages. One of us (DM) used this method in 1985 to obtain data on health expenditure from 10,000 households in Pakistan.
The sampling frame was a voluminous list, obtained from Pakistan’s National Statistical Office, of the population sizes of all villages and city blocks in the four major regions of the nation. It was used to sample up to 20 households from each of 600 selected villages and city blocks.

The four major regions of Pakistan are Punjab, Sindh, Balochistan and NWFP (the North-West Frontier Province). In 1985, there were more than 200,000 villages and city blocks in the nation, which had a population of 80 million.

Coverage

For a survey of a large and diverse area such as a whole country, a sample survey should ideally provide appropriate coverage of all elements in the population. This means that both the sampling frame and the sample itself should represent all factors that are associated with the outcomes of interest.

In socio-economic and health surveys these factors may include ethnicity, religion, gender, age, population density, wealth and occupation, and where these are known to be strong predictors of an outcome, stratification can improve coverage.

However, there may be unknown factors that are strongly associated with an outcome, or factors known to be associated but for which no data are available. To provide adequate coverage in these situations, it is also important to ensure that the sample design covers the geographical area as widely as possible within budget limitations.

Stratified Sampling

If the outcome of interest is correlated with a known factor, the efficiency of a sample survey can be improved by stratifying on that variable. This involves subdividing the population into strata according to different levels of the factor, taking separate random samples from these strata, and combining them.

For example, if the proportion of perinatal deaths is substantially higher in hospitals than outside hospitals, efficiency is gained by stratifying by location of death in or outside hospital.
Similarly, if the outcome of interest is the true number of transport accident deaths in Thailand (as distinct from the reported number, which is much too low due to the large percentage reported with ill-defined cause), and given that reported cause of death is a strong predictor of the true cause, efficiency would be gained by stratifying on the reported cause.

Choice of Geographical Units

For many nationwide surveys in Thailand (including the 2005 Verbal Autopsy survey) its 926 districts comprise the geographical units from which further samples are taken. These districts vary a lot in population size. At the 2000 Census, district KoKut in Trat had just 2042 residents, whereas the Samut Prakan city district had 435,122.

Super-districts (Lim and Choonpradub 2007) provide balanced strata for sample surveys. Odton et al (2010) defined 235 Thai super-districts, including 16 in PHA 12.

Clustering

Taking a simple random sample from a population spread over a region containing many different places (usually villages and city blocks) can be very costly, particularly if it involves getting information from residents (Lumley 2010 Figure 3.1).

For example, if this had been done for the survey of health expenditure in Pakistan in 1985, it would have meant going to households in nearly 10,000 different places. To minimize such travel costs, households to be surveyed were selected from 600 clusters within the sampling frame, which were stratified by population density and literacy variation (available from the 1980 National Census).

In each of these 600 clusters, a place (village or city block) was randomly selected, and up to 20 households were then selected from a list enumerating all n households in the place after consultation with a local authority. Households were selected by taking every kth one in the geographically ordered list, where k = ceiling(n/20).
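The household selection rule above can be sketched as follows. The text does not say where in the list the selection started, so the random starting position used here is an assumption, as are the function and variable names.

```python
import math
import random


def systematic_sample(households, max_size=20, rng=None):
    """Take every kth household from an ordered list, with k = ceiling(n / max_size).

    The starting position is drawn at random from the first k entries
    (an assumption; the original survey's starting rule is not stated).
    """
    rng = rng or random.Random()
    n = len(households)
    if n == 0:
        return []
    k = math.ceil(n / max_size)   # sampling interval
    start = rng.randrange(k)      # assumed random start in 0 .. k-1
    return households[start::k]


# Hypothetical place with 45 households listed in geographical order.
households = list(range(1, 46))
sample = systematic_sample(households, rng=random.Random(1))
print(len(sample), sample)
```

With 45 households the interval is k = 3, so 15 households are selected. Because k is rounded up, the sample never exceeds 20 households, and a place with 20 or fewer households is surveyed in full.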
References

Lim A and Choonpradub C: A statistical method for forecasting demographic time series counts, with application to HIV/AIDS and other infectious disease mortality in Southern Thailand, South-East Asian J Trop Med Public Health 2007, 38(6):1029-1040.

Lumley T: Complex Surveys: A Guide to Analysis Using R, Wiley 2010.

Odton P, Bundhamcharoen K and Ueranantasun A: District-level Variations in the Quality of Mortality Data in Thailand, Asia-Pacific Population Journal 2010, 25(1):79-90.

Rao C, Porapakkham Y, Pattaraarchachai J, Polprasert W, Swampunyalert W and Lopez AD: Verifying causes of death in Thailand: rationale and methods for empirical investigation, Population Health Metrics 2010, 8:11.

Session 2: Monte Carlo Simulations of Sampling Designs
Session 3: Creating Survey Objects using the R Survey Library
Session 4: Discussion of Topics from Sessions 1-3
Sessions 5-6: Further Computer Exercises
Session 7: Discussion of Topics from Sessions 1-6 & Conclusion