Workshop on Complex Sample Survey Design • Introduction

Workshop on Complex Sample Survey Design
Session 3: Using the Survey Library
•
•
Introduction
Describing the Sample to R:
(1) SRS
(2) A Stratified Sample
(3) A Clustered Sample
•
Estimating Transport Accident Deaths
(1) SRS
(2) A Stratified Sample
(3) A Clustered Sample
(4) A Clustered Stratified Sample
•
•
Estimating Stroke Deaths
Summary
Introduction
In Session 2 we used Monte Carlo simulation, starting with a
population with known outcomes, and taking repeated samples
using simple, stratified, and clustered random samples.
In this session, we’ll learn how to use R’s survey library to
create a survey design object for data collected in a survey.
This library uses statistical theory to obtain information about
the population without the need to do any simulations to
compute the standard error.
We will also focus on a particular nationwide survey that was
undertaken in Thailand, the 2005 Verbal Autopsy Study, with
particular reference to determining the true burden of transport
accident and stroke deaths in the various regions of Thailand.
2
Describing the Sample to R: (1) SRS
3
After entering a version of R in which the survey library has been
installed, this library is loaded using the command
library(survey)
After specifying the folder containing your data files, you can read
the srs1.txt data file created in Session 2 using the command
srs1 <- read.table("srs1.txt", h=TRUE)
To describe this data table as a SRS sample, the first step is to
specify the finite population correction (fpc), which takes
account of the reduced error due to the finiteness of the
population. This is achieved simply by inserting a variable in the
table containing the population size, as follows.
srs1$fpc <- 970
To create a design object named srs1.design (say), type
srs1.design <- svydesign(id=~1, fpc=~fpc, data=srs1)
The argument id=~1 says that subjects are being sampled as 4
individuals, and not in stages (such as in cluster sampling).
For basic identifying information for the design, just type its name.
To estimate numbers of outcomes in each cause group (defined
as the variable vrGrp), use the function svytotal() by typing
This estimate of total perinatal deaths agrees largely with that
given by the formula on Slide 2.8. Recall that the number of
perinatal deaths in the population is 358, so the estimate is 15%
too low. R commands are in ex31.Rcm.
............(2) A Stratified Sample
5
These commands (in ex32.Rcm) create a survey object from the
sample strats1.txt stratified by location (see Slide 2.11).
Note that it requires knowing the numbers in each stratum in the
population 1 (482 & 488 for in and outside hospital, respectively).
strats1 <- read.table("strats1.txt", h=TRUE)
strats1$fpc <- ifelse(strats1$ghos==0, 488, 482)
strats1.design <- svydesign(id=~1, strata=~ghos,
fpc=~fpc, data=strats1)
To estimate numbers of outcomes in each cause group, type
svytotal(~vrGrp, design=strats1.design)
and get results as follows.
In contrast to the estimate given
by the SRS, which was 15% too
low, this estimate of the number
of perinatal deaths in the
population is too high by 12%.
______________________________________________________________________________________________________
1 The
R command table(p12$ghos) gives these counts.
......(3) A Clustered Sample
6
These commands (in ex33.Rcm) create a survey object from the
sample clusts1.txt (see Slide 2.17).
Read
sample
clusts1 <- read.table("clusts1.txt", header=TRUE)
data
At each stage you need to tell R how many units (primary,
secondary, etc) were in the population sampled from, so the
function can calculate the probability weights.
In this case fpc1 is 16 (number of super-districts in the population
sampled from) and fpc2 is the number of individuals in each
selected super-district in the population.
FPC <- as.data.frame(table(pop$sDistID))
Create
names(FPC) <- c("sDistID","fpc2")
design
clusts1 <- merge(clusts1, FPC, by="sDistID") variables
clusts1$fpc1 <- length(unique(p12$sDistID))
clusts1$ID <- 1:nrow(clusts1)
Define
design
clusts1.design <- svydesign(id=~sDistID+ID,
fpc=~fpc1+fpc2, data=clusts1) object
Results
To estimate numbers of outcomes in each cause group, type
svytotal(~vrGrp, design=clusts1.design)
The result is
This estimate of total perinatal deaths in the population is
too low by 37%, but the 95% confidence interval contains
the true value (358).
7
Estimating Transport Accident Deaths
8
For a more realistic application, we now consider the problem of
estimating the true number of deaths from transport accidents in
Thailand, using the Vital Registration (VR) database on reported
deaths in 2005. Since cause of death patterns are quite different
for children under 5 and older persons, we’ll focus on subjects
aged 5 or more years.
This population of 384,689 contains 10,761 reported transport
accident deaths), but there are many more due to misreporting.
Using data from the VA study, Nuntaporn Klinjun (2013) 2 fitted a
logistic regression model that gives the total as 21,716, and
created a data table containing the probability that a transport
accident was the true cause for each record.
We’ll use the subset of these data for PHA 12 (N = 24,084) to
illustrate various sample designs, with sample sizes 824, the
same as was used in the Verbal Autopsy study.
__________________________________________________________________________________________________________________-________________
2
http://www.sat.psu.ac.th/mcs/seminar/slides/parnSlides.pdf
9
The VA sample for this region contains subjects aged 5+ (59 in
Krasae Sin district and 765 in Hat Yai district), among whom 43
were assessed to have died from transport accidents. When this
count is scaled up by the reciprocal of the sampling fraction n/N
(824/24084), it gives the estimate 1256.7 for the total number of
transport accident deaths in PHA 12 in 2005, which is quite close
to the value 1248.2 obtained by Klinjun (2013) based on her
analysis of the VA study data. (This value is obtained simply by
summing the estimated probabilities that each subject died due
to a transport accident, for all subjects in the population.)
If the outcomes were independent, the formula for the standard
error of a proportion estimated from a finite population (Slide
2.8), might be used to obtain a 95% confidence interval for this
estimate 3. It is 890.1 to 1623.6.
___________________________________________________________
3
Use R to check this result.
............(1) from a Simple Random Sample
10
Get a SRS (n = 824) from the PHA 12 population, and then use
the survey package to estimate the total (see ex34.Rcm).
p12 <- read.table("ta12pop.txt", h=TRUE, as.is=TRUE)
N <- nrow(p12)
# population size
iseed <- 410701; set.seed(iseed)
n <- 824
# sample size
s12 <- p12[sample(N, size=n),]
# create sample
library(survey)
# load survey library
s12$fpc <- N
s12.design <- svydesign(id=~1, fpc=~fpc, data=s12)
ta12tot <- svytotal(~p.transAcc, design=s12.design)
This estimate is close to the 1256.7
from scaling up the VA sample total.
The 95% confidence interval is much narrower than that given
by the formula for the standard error of a proportion.
............(2) from a Stratified Random Sample
11
Can we improve this estimate by stratifying on known variables?
In her analysis of the VA study Klinjun found the reported cause
to be a strong predictor of transport accident cause of death.
ROC curve
Prob=0.333
To simplify, use just 3 levels of reported ICD-10 code: (1) transport
accident (V), (2) other injury (W or X00-X59), and (3) other cause
of death. We’ll use this variable to stratify the sample to estimate
the true number of transport accident deaths.
The VA survey used simple random sampling from districts,
percentages in these three groups being 5.6, 3.4 and 91.0,
respectively, so we’ll use sample sizes of 40 in the two smaller
strata to match these percentages, leaving 744 in the remaining
stratum.
The following R commands (continued on next slide) create a
stratified random sample and define the corresponding design
object (see ex35.Rcm).
12
p12 <- read.table("ta12pop.txt", h=TRUE, as.is=TRUE)
str(p12)
# list variable names and data types
rc <- p12$ICD10
# reported ICD10-code
ch <- substr(rc, 1, 1)
# its chapter
k <- as.integer(substr(rc, 2, 3)) # and integer
p12$strat <- ifelse(ch=="V", "1:transAcc",
ifelse(ch=="W" | (ch=="X" & k<60), "2:othInj", "3:other"))
n <- c(40, 40, 744)
# sample sizes in strata
p12.1 <- subset(p12, strat=="1:transAcc")
p12.2 <- subset(p12, strat=="2:othInj") # define strata
p12.3 <- subset(p12, strat=="3:other")
# in population
13
iseed <- 410701
set.seed(iseed)
ss12.1 <- p12[sample(nrow(p12.1), size=n[1]),] # sample
ss12.2 <- p12[sample(nrow(p12.2), size=n[2]),] # from
ss12.3 <- p12[sample(nrow(p12.3), size=n[3]),] # strata
ss12 <- rbind(ss12.1, ss12.2, ss12.3) # combine samples
library(survey)
# load survey library
N <- table(p12$strat)
# strata sizes in population
ss12$fpc <- ifelse(ss12$strat=="1:transAcc", N[1],
ifelse(ss12$strat=="2:othInj", N[2], N[3]))
ss12.design <- svydesign(id=~1, fpc=~fpc, strata=~strat,
data=ss12)
ta12tot <- svytotal(~p.transAcc, design=ss12.design)
This estimate is now lower than
the population value (1248.2).
The confidence interval is very much shorter, and still includes the
population value, so the stratified sample has done very well.
............(3) from a Clustered Sample
14
Use a function to create a clustered random sample with up to
round(n/k) (where n = 824) subjects in each of k = 3 randomly
selected super-districts of PHA 12, and then define the design
object for this cluster sample (see ex36.Rcm).
clusters <- function(data, k, n) {
clus <- subset(p12, sDistID %in%
sample(unique(sDistID), size=k))
samp <- subset(clus, ID %in% unlist(tapply(ID, sDistID,
FUN=function(x) x[sample.int(length(x),
size=min(length(x), n))])))
samp$fpc1 <- length(unique(p12$sDistID))
FPC <- as.data.frame(table(p12$sDistID))
names(FPC) <- c("sDistID","fpc2")
samp <- merge(samp, FPC, by="sDistID")
clus.design2 <- svydesign(id=~sDistID+ID,
fpc=~fpc1+fpc2, data=samp)
svyTotal <- svytotal(~p.transAcc, design=clus.design2)
c(total=coef(svyTotal), se=SE(svyTotal))
}
Results
15
The estimate is 24% higher than the true population value
(1248.2), but still within a 95% confidence interval.
The standard error is 33% higher than the value 142.82 from
simple random sampling with the same total sample size.
To simulate this sampling process 500 times, use the commands
nit <- 500; k <- 4; n <- 200; iseed <- 410701
set.seed(iseed)
nDeaths <- replicate(nit, clusters(p12, k, n))
apply(nDeaths, 1, FUN=mean)
The result shows that the mean is quite close to the true mean,
but the clustering has increased the standard error from the SRS.
............(4) from a Clustered Stratified Sample
16
Given that precision is reduced by only 25% by clustering into 3
of the 16 super-districts, and stratifying on three cause of death
groups reduces the standard error from 142.8 (for the SRS) to
63.1, it seems desirable to combine these two methods. The
following program (ex37.Rcm) does this.
p12 <- read.table("ta12pop.txt", h=TRUE, as.is=TRUE)
sD <- read.table("superDistricts.txt", header=TRUE)
p12 <- merge(p12, sD, by="distID")
rc <- p12$ICD10
ch <- substr(rc, 1, 1); k <- as.integer(substr(rc, 2, 3))
p12$strat <- ifelse(ch=="V","1:transAcc",
ifelse(ch=="W" | (ch=="X" & k<60), "2:othInj", "3:other"))
unique.sdists <- unique(pha12$sDistID)
iseed <- 410701
set.seed(iseed)
sampled.IDs <- sort(sample(unique.sdists, 3))
# randomly select 3 super-districts
stage1 <- subset(pha12, sDistID %in% sampled.IDs)
stage1$fpc1 <- length(unique.sdists)
# fpc for first stage 17
n <- c(40,40,744)
# total sample sizes in strata
n.each <- round(n/3)
# allocated to each cluster
available <- table(stage1$strat, stage1$sDistID)
stage1$stage1rows <- 1:nrow(stage1)
# now take samples from all cluster.stratum combinations
i11 <- sample(subset(stage1, sDistID==sampled.IDs[1] &
strat=="1:transAcc")$stage1rows,n.each[1])
i12 <- sample(subset(stage1, sDistID==sampled.IDs[1] &
strat=="2:othInj")$stage1rows,n.each[2])
i13 <- sample(subset(stage1, sDistID==sampled.IDs[1] &
strat=="3:other")$stage1rows,n.each[3])
i21 <- sample(subset(stage1, sDistID==sampled.IDs[2] &
strat=="1:transAcc")$stage1rows,n.each[1])
i22 <- sample(subset(stage1, sDistID==sampled.IDs[2] &
strat=="2:othInj")$stage1rows,n.each[2])
i23 <- sample(subset(stage1, sDistID==sampled.IDs[2] &
strat=="3:other")$stage1rows,n.each[3])
i31 <- sample(subset(stage1, sDistID==sampled.IDs[3] &
strat=="1:transAcc")$stage1rows,n.each[1])
i32 <- sample(subset(stage1, sDistID==sampled.IDs[3] &
strat=="2:othInj")$stage1rows,n.each[2])
i33 <- sample(subset(stage1, sDistID==sampled.IDs[3] &
strat=="3:other")$stage1rows,n.each[3])
# and combine them to get the sample at the second stage 18
stage2.rows <- c(i11,i12,i13,i21,i22,i23,i31,i32,i33)
stage2.data <- stage1[stage2.rows,]
# Define the stage 2 design variables
stage2.data$no.strata <- 1
stage2.data$fpc2 <- rep(as.vector(available), rep(n.each,3))
The clustered-stratified sample is now ready to be processed by
the survey library. Use the following commands to do this.
library(survey)
design <- svydesign(id=~sDistID+stage1rows,
strata=~no.strata+strat, fpc=~fpc1+fpc2,
data=stage2.data)
svytotal(~p.transAcc, design)
But the standard error is larger than for the simple cluster
design, so stratifying seems to give no benefit at all! Why?
We’ll discuss this later. For now, let’s look at another example.
Estimating Stroke Deaths
19
Stroke (ICD-10 code I60-69) is the main cause of death for Thai
people. The doughnut plots below compare age-adjusted deaths
by gender and 21 cause groups (defined in Nattakit Pipatjaturon’s
analysis 3 of the VA data) for Thailand in 2005 & Japan in 2006.
Areas of inner circles denote numbers of deaths in the region
with lower mortality, and outer rings denote corresponding deaths
in the region with higher mortality (Thailand in red, Japan blue).
Suicide & stomach cancer are lower in Thailand, but most other
causes (transport accident, HIV & endocrine disease) are higher.
__________________________________________________________________________________________________________________-________________
3
http://www.sat.psu.ac.th/mcs/seminar/slides/nattakitSlides.pdf
20
Pipatjaturon’s analysis involved fitting a logistic regression
model to the risk of stroke death P, expressed as
log(P/(1-P)) = const + factor(province) + factor(gender-age group)
+ factor(VR cause-location group)
Results show higher
… and if the reported cause is stroke the
risk of stroke death
in the Bangkok area. true cause is very likely to be stroke.
We’ll use data from Bangkok (PHA 13) to illustrate estimation
of stroke deaths. This region reported N = 28,850 deaths aged
5+ in 2005. The VA sample for Bangkok had n = 857 aged 5+
(391, 216 & 250 in districts Nongkam, Ratavee and Kannayaw,
1043: 39/250
1037: 37/216
respectively) among whom
137 (61, 37 & 39) were
assessed as stroke deaths.
Sample proportions are shown. 1023:
21
61/391
Using the formula on Slide
2.18, the estimate for the total
number of stroke deaths in
PHA 13 in 2005 is 4627.8.
This number is 1.2% lower
than the value 4686.2
estimated by Pipatjaturon.
Using the survey package we’ll start with a simple random
sample of size 857 from the population for PHA 13 (ex38.Rcm).
Results
This estimate is lower than that obtained by scaling up the
sample total to the population total.
The 95% confidence interval has a margin for error around 400
on each side of the estimate. Now let’s see if stratifying can
improve its accuracy. We’ll run the R commands in ex39.Rcm.
22
Just two strata of size
50 (stroke), 807 (other
reported cause)
Results
The estimated total is closer
to the population total (4686.2)
…and the confidence interval is halved.
23
Summary
24
In this session we learnt how to create survey design objects
using the functions svydesign & svytotal in R’s survey library.
This procedure requires additional variables to be created in the
data table comprising the sample, including an ID field, one or
more fields (depending on whether the design is single stage or
multi-stage comprising finite population correction (fpc)
factors, and sampling unit definitions.
Using three different outcome variables (one binary and the
others measured), we applied this method to estimate interval
estimates of total counts for numbers of deaths from various
causes (reported perinatal and VA-assessed transport accident
and stroke) in two PHAs in Thailand.
For the two examples involving VA-assessed outcomes, we
found that stratification by reported cause improved the accuracy
substantially, but there did not appear to be any benefit in using
stratification for a cluster design with the transport accidents.
Computer Exercises
25
1. Use Notepad to edit the R command file ex31.Rcm in your
directory d:/IThai so that iseed is your personal ID.
Open the R command window and copy & paste the contents
of ex31.Rcm into this window.
2. Repeat this exercise for ex32.Rcm and write your results for
vrGrp1:peri here: total_______ SE________
3. Repeat the above exercise for ex33.Rcm and write your
results here: total_______ SE________
4. Repeat the above exercise for ex34.Rcm and write your
results for ta12Tot here: total_______ SE________
5. Repeat the above exercise for ex35.Rcm and write your
results for ta12Tot here: total_______ SE________
26
6. Repeat the exercise for ex36.Rcm & write your results below:
total.p.transAcc________
SE.p.transAcc________
7. Repeat the exercise for ex37.Rcm and write your results
here: total_______ SE________
8. Repeat the exercise for ex38.Rcm and write your results for
p.stroke here: total_______ SE________
9. Repeat the above exercise after randomly shuffling outcomes
and write your results for p.stroke here:
total_______ SE________