Workshop on Complex Sample Survey Design
National Statistical Office, and International Health
Policy, Thailand: 5-6 September 2013
Featuring Professor Thomas Lumley
Venue: Training Room, Government Complex,
Building B (south), Laksi, Bangkok
Schedule
Thursday 5 September
8.30 am: Registration
9.00 am: Session 1: Basic Survey Design Concepts
10.00 am: Refreshment break
10.30 am: Session 2: Simulation of Sampling Designs
12.30 pm: Lunch break
1.30 pm: Session 3: Using the Survey Library
3.00 pm: Refreshment break
3.30 pm: Session 4: Discussion
Friday 6 September
9.00 am: Session 5: Computer Exercises
10.30 am: Refreshment break
11.00 am: Session 6: Further Computer Exercises
12.30 pm: Lunch break
1.30 pm: Session 7: Conclusions and Recommendations
3.00 pm: Prize giving
Teaching Faculty
Pilot: Professor Thomas Lumley
Co-pilot: Dr Alan Geater
Cabin Crew: Aj Edward McNeil, Dr Bungon Kumphon, Aj Wandee Wanishsakpong
“In cluster sampling birds of a feather flock together.”
Senior Flight Attendant: Dr Apiradee Lim
“Duck or fish?”
Ground control: Aj Don McNeil
“Trust nobody, believe nothing, and check everything.”
Session 1: Basic Survey Design Concepts
• Introduction
• Sample Size: Theory
• Sampling: Practice
• Stratified Samples
• The 2005 Verbal Autopsy Study
• Proportional Allocation Sampling
• Thailand Statistical Divisions
• Perinatal Deaths in PHA 12
• Sampling Frames
• Coverage
• Choice of Geographical Units
• Illustration of Stratification: PHA 12 Zone
• Clustering
Introduction
In this two-day workshop we will consider methods for taking
samples from populations using designed surveys.
These populations comprise subjects in a specified region of the
world at a particular time, such as all persons living in Scotland
in 2010, or all children under 5 who died in Thailand in 2005, or
all elementary, middle, and high schools in California in 2012.
We’ll focus on specific characteristics of interest for these
subjects, such as the mental health status of persons based on
a standard questionnaire, or their reported cause of death
classified by International Classification of Diseases (ICD-10)
code, or the academic performance index of a school.
The reason for taking a sample is to estimate distributions of
such population characteristics without studying every member
of the population, and thus obtain this information with sufficient
accuracy at substantially reduced cost.
Sample Size: Theory
A basic question is the size of the sample. Larger samples give
more accurate results, but at greater cost.
What about the population size? How does this affect the
sample size?
One might expect that a larger population requires a larger
sample, but often the population size doesn’t matter.
To see this, suppose we want to know the average number of
hours per day a person in Thailand aged 5 or more used the
Internet in 2012, based on a random sample of 900 persons.
In statistical theory, a 95% confidence interval for a population
mean based on a random sample of size n is approximately
ȳ ± 2s/√n, where ȳ is the sample mean and s is the sample
standard deviation. If n is 900, and ȳ and s are 3.4 and 3.0
hours/day, respectively, the result is the interval 3.2 to 3.6 hours.
But note that the population size does not figure in this result!
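The arithmetic above can be checked with a few lines of Python. This is only a sketch of the approximate formula ȳ ± 2s/√n quoted on this slide; the function name and the default multiplier of 2 are our own choices for illustration.

```python
import math

def mean_confidence_interval(sample_mean, sample_sd, n, multiplier=2.0):
    """Approximate 95% interval: sample_mean ± multiplier * sample_sd / sqrt(n)."""
    half_width = multiplier * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Values from the Internet-use example: n = 900, mean 3.4, SD 3.0 hours/day.
low, high = mean_confidence_interval(3.4, 3.0, 900)
print(f"{low:.1f} to {high:.1f} hours/day")
```

Note that the function takes no population-size argument at all, which is the point made above.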
Note also that the formula for the confidence interval of a
population mean assumes that the sample is random¹.
A simple random sample (SRS) (Lumley 2010 page 2) is one
where every subset [of specified size] from the population is
equally likely to be selected².
Simple random sampling has an advantage over other sampling
methods: it produces samples that are unbiased on average.
This is because it gives equal weight to all samples of the same
size, and consequently ignores any way that a bias could occur.
Such biases arise from population subjects that are more likely
to be sampled because they are easier to get hold of (like
students in a class who sit close to their teacher).
But although the sample is chosen at random, it’s still possible to
get a biased sample purely by chance. This risk can be reduced
by appropriate stratification (Lumley 2010 Section 2.2).
________________________________________________________________________________________________________________________
¹ There are other assumptions. What are they?
² How many samples of size 2 are there in a population of size 10?
Sampling: Practice
Let’s consider how a simple random sample of size n (say 900)
could be taken from a population, say from the population of Thai
residents aged 5 or more. If we had a database containing data
from a recent census of the population with answers to this
question, it would be a simple matter to just take a sample of n
records from this database using a pseudo-random number
generator.
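As a minimal sketch of this step, assume the census records are available as a Python list. The record IDs, the population size of 100,000 and the seed below are hypothetical and serve only to make the example reproducible.

```python
import random

def simple_random_sample(records, n, seed=None):
    """Draw n distinct records so that every subset of size n is equally likely."""
    rng = random.Random(seed)      # pseudo-random number generator
    return rng.sample(records, n)  # sampling without replacement

# Hypothetical database: one ID per census record.
population = list(range(1, 100_001))
sample = simple_random_sample(population, 900, seed=2013)
print(len(sample), len(set(sample)))
```

Fixing the seed makes the draw repeatable, which is useful when a sample has to be audited later.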
In practice such databases are rarely available, and it is usually
necessary to go out and collect the data, using a sample design
that will work in practice.
A possible such design might involve standing on street corners
in Bangkok and asking passers-by how many hours per day on
average they spend on the Internet. (Many will refuse to respond,
but eventually it might be possible to obtain 900 responses³.)
___________________________________________________________________________________________________________________________
³ Is this a feasible sample design? Explain.
Sampling from Regions
Or we could sample residents from each of Thailand’s 13 Public Health
Areas⁴. How many should be sampled from each region?
Each region is just a smaller population, and if the sample size is the
major determinant of estimation accuracy, the samples should be
approximately the same size. To see this, note that the difference
between the means of two samples of sizes n1 and n2 from a large
population has standard error s√(1/n1 + 1/n2), where s is the pooled
standard deviation. For a fixed total n1 + n2, this is minimized when
n1 = n2.
The result extends to any number of components. It’s similar to
the fact that a chain cannot be stronger than its weakest link.
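The claim that equal sample sizes minimize the standard error of a difference can be checked numerically. The sketch below assumes a fixed total of 900 subjects split between two regions and a pooled standard deviation of 3.0; both values are borrowed from the Internet-use example purely for illustration.

```python
import math

def se_difference(s, n1, n2):
    """Standard error of the difference between two sample means."""
    return s * math.sqrt(1.0 / n1 + 1.0 / n2)

total = 900  # assumed fixed total sample size
s = 3.0      # assumed pooled standard deviation

for n1 in (100, 300, 450, 600, 800):
    n2 = total - n1
    print(n1, n2, round(se_difference(s, n1, n2), 3))

# Search every possible split for the one with the smallest standard error.
best_n1 = min(range(1, total), key=lambda n1: se_difference(s, n1, total - n1))
print("best split:", best_n1, total - best_n1)
```

The printed table shows the standard error growing as the split becomes more unequal, so the smaller of the two samples acts as the weakest link.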
_________________________________________________________________________________________________________________
⁴ How is this better than the corresponding SRS design? Explain.
The 2005 Verbal Autopsy Study
A verbal autopsy (VA) study to determine true cause of death sampled
3316 in-hospital and 6328 outside-hospital deaths in 2005 from 28 of the
926 districts in nine of Thailand’s 76 provinces. Rao et al (2010)
described the design as follows.
The nine provinces were selected to be Bangkok and two from each of the
four regions by ranking provinces by numbers of reported deaths and
selecting one province above and one below the median. The 28 districts
were selected similarly. Finally, approximately 50% of death certificates
were selected from all villages and urban areas within these districts
using the SRS method.
Proportional Allocation Sampling
The VA study had sample sizes varying from 316 in Chumpon to
2418 in Ubol Ratchathani. This variation arose because SRS
was used at the final stage; that is, samples were allocated in
proportion to the population. This is not quite the same as PPS
(Probability-Proportional-to-Size) sampling (Lumley 2010 p 46).
This method gives sample sizes that vary in proportion to the
populations of the regions. But as we have seen, the accuracy
of an estimate obtained by random sampling depends mainly on
the size of the sample and not the size of the population. Unless
the populations of the regions are of similar size, proportional
allocation can give rise to small samples with relatively large
standard errors and resulting loss of accuracy. Instead, it might
be better to balance sample sizes across regions.
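To make the contrast concrete, the following sketch compares proportional and equal allocation for three regions. The regional populations, the total sample of 900 and the common standard deviation of 3.0 are all hypothetical, and each regional standard error is approximated by s/√n.

```python
import math

def proportional_allocation(populations, total_sample):
    """Allocate the sample to regions in proportion to their populations."""
    total_pop = sum(populations.values())
    return {r: round(total_sample * p / total_pop) for r, p in populations.items()}

populations = {"A": 50_000, "B": 200_000, "C": 750_000}  # hypothetical regions
total_sample = 900
s = 3.0  # assumed common standard deviation

prop = proportional_allocation(populations, total_sample)
equal = {r: total_sample // len(populations) for r in populations}

for r in populations:
    se_prop = s / math.sqrt(prop[r])
    se_equal = s / math.sqrt(equal[r])
    print(r, prop[r], round(se_prop, 3), equal[r], round(se_equal, 3))
```

Under proportional allocation the smallest region receives only 45 subjects, so its estimate has a much larger standard error than under equal allocation.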
As an illustration, consider the VA survey results for the seven
southern-most provinces comprising PHA 12 (Public Health
Area 12).
Thailand Statistical Divisions
PHA 12 comprises seven provinces, 76 districts, and 535 tambons.
(Thailand has 13 such PHAs.)
The VA sample design included just a single province (Songkla) and two
of its 16 districts (Krasae Sin and Hat Yai), with respective sample
sizes 60 (from 116 deaths) and 787 (from 1642 deaths).
And 331 of the 787 (42%) subjects sampled from Hat Yai were in tambon
901101 (the Central Business District of Hat Yai city).
As a consequence, the results were dominated by data from just one
highly urbanized sub-district containing several large hospitals.
This sampling method gives an inaccurate estimate of the total number
of perinatal deaths in the population, as the next slide shows.
[Map labels: Krasae Sin District (population deaths 116); Hat Yai District (population deaths 1642)]
Perinatal Deaths in PHA 12
Suppose we restrict the population to deaths for children aged
under 5 in 2005 in PHA 12. This population size is N = 970.
We’ll take the outcome variable to be the total number of deaths
due to perinatal originating conditions in the population. For the
purpose of this exercise assume that we know only the numbers
of reported deaths by location (in or outside hospital) and district.
The VA sample for this region contained just one subject aged
under 5 in Krasae Sin district and 22 in Hat Yai district, with
corresponding reported numbers of perinatal deaths 0 and 10.
The estimated total number of perinatal deaths in the population
based on this sample is obtained by multiplying the sample
proportion (10/23) by the population size (970).
The result is 421.7. Using the formula⁵ SE = N√{p(1-p)(1-n/N)/n}
for its standard error, a 95% confidence interval is (223.6, 619.8).
______________________________________________________________________________________________________
⁵ In contrast to the statement on Slide 1.3, the accuracy of the proportion
p now depends on the population size N through the term 1-n/N. Why?
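The estimate and its interval can be reproduced as follows. This is a sketch of the formula quoted on this slide, assuming the interval is the estimate ± 2 SE as on the earlier slide; the function name is ours. The results should agree with the figures above to within rounding of the last digit.

```python
import math

def estimated_total(cases, n, N, multiplier=2.0):
    """Estimate a population total from a sample proportion, with its SE."""
    p = cases / n                                       # sample proportion
    total = N * p                                       # estimated total
    se = N * math.sqrt(p * (1 - p) * (1 - n / N) / n)   # includes 1 - n/N
    return total, se, (total - multiplier * se, total + multiplier * se)

# VA sample for PHA 12: 10 perinatal deaths among 23 sampled, N = 970.
total, se, (low, high) = estimated_total(10, 23, 970)
print(f"total = {total:.1f}, SE = {se:.1f}, interval = ({low:.1f}, {high:.1f})")
```

With only 23 subjects the interval spans roughly 200 deaths either side of the estimate, which illustrates how little this sample says about the region.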
When is Proportionate Sampling Justified?
We have argued for taking samples of similar size from each
region. But whether this is a good rule or not depends a lot on
what is being estimated.
If you want to estimate means for each region as accurately as
possible, take equal sample sizes in each region. In this case
proportionate sampling will lead to some regions having very
small sample sizes, and inaccurate estimates.
On the other hand, if you want to estimate means or totals for
the whole population as accurately as possible, take larger
sample sizes from larger regions, and proportionate sampling
works well. This is because a population mean is a weighted
sum of region means, with weights proportional to populations in
regions. Errors in an estimated mean for a large region will be
expanded by the weighting, and so need to be smaller to begin
with. In this case PPS is good for estimating an overall
population mean or total.
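The trade-off can be illustrated numerically. The sketch below assumes three hypothetical regions with a common standard deviation of 3.0 and a total sample of 900, and ignores finite population corrections. It computes the standard error of the overall mean, taken as the population-weighted sum of region means, together with the worst regional standard error.

```python
import math

def overall_se(populations, sample_sizes, s):
    """SE of the population-weighted mean of independent region means."""
    total_pop = sum(populations.values())
    variance = sum((populations[r] / total_pop) ** 2 * s ** 2 / sample_sizes[r]
                   for r in populations)
    return math.sqrt(variance)

def worst_region_se(sample_sizes, s):
    """Largest standard error among the separate region means."""
    return max(s / math.sqrt(n) for n in sample_sizes.values())

populations = {"A": 50_000, "B": 200_000, "C": 750_000}  # hypothetical regions
s = 3.0                                                  # assumed common SD
proportional = {"A": 45, "B": 180, "C": 675}             # 900 in proportion
equal = {"A": 300, "B": 300, "C": 300}                   # 900 split equally

for name, sizes in (("proportional", proportional), ("equal", equal)):
    print(name, round(overall_se(populations, sizes, s), 3),
          round(worst_region_se(sizes, s), 3))
```

Under these assumptions proportional allocation gives the smaller standard error for the overall mean, while equal allocation gives the smaller worst-case standard error for an individual region.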
Sampling Frames
The sampling frame for a survey is a list of all the population
elements. For a survey of Internet use in Thailand it would
comprise all residents aged 5 years or more in 2012 (about 60
million). The Verbal Autopsy study frame was a list of deaths
reported in the VR database in 2005.
For an Internet use survey it is unlikely that a sampling frame
would be available containing addresses of persons to be
sampled. In such situations, particularly for household surveys, a
feasible method involves collecting information from subjects
sampled randomly from regions within larger regions in stages.
One of us (DM) used this method in 1985 to obtain data on health
expenditure from 10,000 households in Pakistan. The sampling
frame was a voluminous list, from its National Statistical Office, of
population sizes of all villages and city blocks in the four major
regions of the nation, which was used to sample up to 20
households from each of 600 selected villages and city blocks.
The four major regions of Pakistan are Punjab, Sindh, Balochistan and
NWFP (the North-West Frontier Province).
In 1985, there were more than 200,000 villages and city blocks in the
nation, which had a population of 80 million.
Coverage
For a survey of a large and diverse area such as a whole country,
a sample survey should ideally provide appropriate coverage of
all elements in the population. This means that both the sampling
frame and the sample itself should include all such factors that
are associated with the outcomes of interest.
In socio-economic and health surveys these factors may include
ethnicity, religion, gender, age, population density, wealth and
occupation, and where these are known to be strong predictors of
an outcome, stratification can improve coverage.
However, there may be unknown factors that are strongly
associated with an outcome, or factors known to be associated
but for which no data are available. To provide adequate
coverage in these situations, it is also important to ensure that
the sample design covers the geographical area as widely as
possible within budget limitations.
Stratified Sampling
If the outcome of interest is correlated with a known factor, the
efficiency of a sample survey can be improved by stratifying on
that variable.
This involves subdividing the population into strata according to
different levels of the factor, taking separate random samples
from these strata, and combining them.
For example, if the proportion of perinatal deaths is substantially
higher in hospitals than outside hospitals, efficiency is gained by
stratifying by location of death in or outside hospital.
Similarly, if the outcome of interest is the true number of transport
accident deaths in Thailand (as distinct from the reported number,
which is much too low due to the large percentage reported with
ill-defined cause), and given that reported cause of death is a
strong predictor of the true cause, efficiency would be gained by
stratifying on the reported cause.
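The three steps (subdivide, sample separately, combine) can be sketched as follows. The population is hypothetical: 400 in-hospital deaths of which 240 are perinatal, and 600 outside-hospital deaths of which 120 are perinatal. The function names, stratum labels, sample sizes and seed are also our own.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, sizes, seed=None):
    """Split records into strata, then draw an SRS of the given size from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    return {h: rng.sample(strata[h], size) for h, size in sizes.items()}

def stratified_proportion(sample, stratum_totals, is_case):
    """Combine stratum proportions, weighting each by its population share."""
    population_total = sum(stratum_totals.values())
    estimate = 0.0
    for h, drawn in sample.items():
        weight = stratum_totals[h] / population_total
        estimate += weight * sum(1 for r in drawn if is_case(r)) / len(drawn)
    return estimate

# Hypothetical records: (location of death, perinatal cause?).
records = ([("in", True)] * 240 + [("in", False)] * 160
           + [("out", True)] * 120 + [("out", False)] * 480)
stratum_totals = {"in": 400, "out": 600}

sample = stratified_sample(records, lambda r: r[0], {"in": 40, "out": 60}, seed=2005)
estimate = stratified_proportion(sample, stratum_totals, lambda r: r[1])
print(round(estimate, 3))
```

Because each stratum is sampled separately, a chance shortage of in-hospital deaths in the sample cannot occur, which is where the gain in efficiency comes from.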
Choice of Geographical Units
For many nationwide surveys in Thailand (including the 2005
Verbal Autopsy survey) its 926 districts comprise the geographical
units from which further samples are taken.
These districts vary a lot in population size. At the 2000 Census,
district Ko Kut in Trat had just 2042 residents, whereas the Samut
Prakan city district had 435,122.
Super-districts (Lim et al 2007) provide balanced strata for sample
surveys. Odton et al (2010) defined 235 Thai super-districts, including
16 in PHA 12.
Clustering
Taking a simple random sample from a population spread over
a region containing many different places (usually villages and
city blocks) can be very costly, particularly if it involves getting
information from residents (Lumley 2010 Figure 3.1).
For example, if this had been done for the survey of health
expenditure in Pakistan in 1985, it would have meant going to
households in nearly 10,000 different places.
To minimize such travel costs, households to be surveyed were
selected from 600 clusters within the sampling frame, which
were stratified by population density and literacy variation
(available from the 1980 National Census). In each of these 600
clusters, a place (village or city block) was randomly selected,
and up to 20 households were then selected from a list
enumerating all n households in the place after consultation with
a local authority. Households were selected by taking every kth
one in the geographically ordered list, where k = ceiling(n/20).
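The household selection rule can be sketched as a systematic sample. The source gives only the interval k = ceiling(n/20); the random starting position within the first k households is our assumption, as are the function name and the example list of 95 households.

```python
import math
import random

def systematic_sample(households, max_size=20, seed=None):
    """Take every kth household from an ordered list, k = ceiling(n / max_size)."""
    n = len(households)
    if n == 0:
        return []
    k = math.ceil(n / max_size)
    start = random.Random(seed).randrange(k)  # assumed random start in 0..k-1
    return households[start::k]

# Hypothetical place with 95 households listed in geographical order.
households = [f"household_{i}" for i in range(1, 96)]
chosen = systematic_sample(households, seed=1985)
print(len(chosen), chosen[:3])
```

Taking every kth entry of a geographically ordered list spreads the selected households across the whole place, and the rule never selects more than 20.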
References
Lim A and Choonpradub C: A statistical method for forecasting demographic
time series counts, with application to HIV/AIDS and other infectious
disease mortality in Southern Thailand, South-East Asian J Trop Med
Public Health 2007, 38(6):1029-1040.
Lumley T: Complex Surveys: A Guide to Analysis Using R, Wiley 2010.
Odton P, Bundhamcharoen K and Ueranantasun A: District-level Variations in
the Quality of Mortality Data in Thailand, Asia-Pacific Population Journal
2010, 25(1): 79-90.
Rao C, Porapakkham Y, Pattaraarchachai J, Polprasert W, Swampunyalert
W and Lopez AD: Verifying causes of death in Thailand: rationale and
methods for empirical investigation, Population Health Metrics 2010, 8:11.
Session 2: Monte Carlo Simulations of Sampling Designs
Session 3: Creating Survey Objects using the R Survey Library
Session 4: Discussion of Topics from Sessions 1-3
Sessions 5-6: Further Computer Exercises
Session 7: Discussion of Topics from Sessions 1-6 & Conclusion