Estimating Sample Size
Computer Laboratories
Epidemiology and Biostatistics Department
Faculty of Medicine Universitas Padjadjaran
2013
Reference
• Dahlan MS. Besar Sampel dan Cara Pengambilan Sampel dalam Penelitian Kedokteran dan Kesehatan. Edisi 3. Jakarta: Salemba Medika; 2008.
• Hulley SB, et al. Designing Clinical Research. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2007.
Introduction
Whom? What? Which design?
↓
How many subjects to sample?
Introduction
• If the sample size is too small → the study may fail to answer its research question
• If the sample size is too large → the study is more difficult and costly than necessary
Introduction
• Goal → to estimate an appropriate number of subjects for a given study design
• Should be estimated early in the design phase, when major changes are still possible, because the estimate may show that:
– The research design is not feasible
– Different predictor or outcome variables are needed
Reasons for sampling
• Unable to perform total sampling
• Results from representative sample
(appropriate number of subjects and
sampling technique) can be generalized to
population
• More efficient and ethical
Generalization
Study subjects
↓ Internal validity
Intended sample
↓ External validity I
Accessible population
↓ External validity II
Study/Target population
Internal validity
• The actual sample (study subjects) is representative of the intended sample
– Same characteristics as the intended sample
– Problems: non-response, drop-out, loss to follow-up
External validity I
• The intended sample is representative of the accessible population
– Appropriate sample size
– Probabilistic sampling method
External validity II
• The accessible population is representative of the target/study population
How to get an appropriate sample size?
• Appropriate sample size formula
– Can be decided from our research questions/research problems/problem identification
• Correct sample size calculation
Type of research:
Specific design
• Diagnostic
– Sensitivity, specificity, PPV, NPV, LR (+), LR (-)
• Prognostic
– Example: What are the prognostic factors of
shock in DHF patients?
• Survival analysis
– Example: Is there a difference in mortality rate between HIV patients who start HAART at a CD4 count below 200 and those who start above 200?
Type of research:
Non-specific design
• Descriptive
– To estimate population proportion
• What is the prevalence of diarrhea in Kecamatan X?
– To estimate population mean
• What is the mean of FBG level among adults in
Kecamatan X?
• Analytic
– To find relationship/association between
dependent and independent variable
– To find a (proportion, mean) difference between
two or more groups
– To find correlation between variables
Notes
• In one study, it is possible to use more than one sample size formula, due to:
– More than one research question
– Different study designs
• e.g. cohort and nested case-control
Notes
• State the primary and secondary research questions/hypotheses in advance
• The sample size calculation is always focused on the primary research question/hypothesis
Power of the study
(1 – β)
• The achieved power may differ from the planned power
• It needs to be calculated again when:
– Actual sample/study subjects ≠ intended sample
– The correlation coefficient (r) in a correlation study is different
– Effect size (p1 – p2, x1 – x2) is different
– Sample size is predetermined
Zα and Zβ*
Value of α or β    Zα (descriptive or two-sided)    Zα (one-sided)    Zβ
1%                 2.58                             2.33              2.33
5%                 1.96                             1.64              1.64
10%                1.64                             1.28              1.28
15%                1.44                             1.04              1.04
20%                1.28                             0.84              0.84
*From Dahlan, MS, 2008
For two-tailed hypothesis → Z1 – α/2
For one-tailed hypothesis → Z1 – α
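The Zα and Zβ values can be recomputed from the standard normal distribution, which is a useful check on any printed table. A minimal Python sketch using only the standard library; `z_value` is an illustrative helper name, not something from the slides, and the slides truncate 1.645 to 1.64:

```python
from statistics import NormalDist

def z_value(error_rate: float, two_sided: bool = False) -> float:
    """Standard normal deviate for alpha or beta given as a fraction.

    two_sided=True returns Z(1 - alpha/2); otherwise Z(1 - alpha),
    which is also the value used for Z-beta.
    """
    tail = error_rate / 2 if two_sided else error_rate
    return NormalDist().inv_cdf(1 - tail)

# Print the rows for the error rates listed in the table.
for pct in (1, 5, 10, 15, 20):
    rate = pct / 100
    print(f"{pct:>3}%  two-sided: {z_value(rate, True):.3f}  "
          f"one-sided / Z-beta: {z_value(rate):.3f}")
```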
Strategies for minimizing sample size
and maximizing power
• Use continuous variable (for outcome variable)
– Permits smaller sample size for a given power
– Permits greater power for a given sample size
Strategies for minimizing sample size
and maximizing power
• Use paired measurements or matching
– By comparing each subject with herself, it
removes the baseline between-subjects part
of the variability of the outcome variable
– Example:
• Change in weight on a diet has less variability than
the final weight
• Final weight is highly correlated with initial weight
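Why pairing helps can be seen from the variance of a difference, Var(final – initial) = s1² + s2² – 2·r·s1·s2: the higher the correlation r, the smaller the variability of the change. A small Python sketch with hypothetical numbers (an SD of 10 kg at both time points and r = 0.9; none of these values come from the slides):

```python
import math

def sd_of_change(sd_initial: float, sd_final: float, r: float) -> float:
    """SD of (final - initial) for paired measurements with correlation r."""
    variance = sd_initial**2 + sd_final**2 - 2 * r * sd_initial * sd_final
    return math.sqrt(variance)

# Hypothetical values: SD of 10 kg before and after the diet, correlation 0.9.
print(round(sd_of_change(10, 10, 0.9), 2))  # 4.47, versus 10 for final weight alone
```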
Strategies for minimizing sample size
and maximizing power
• Increase the precision
– Standardizing the measurement methods
– Training and certifying observer
– Refining the instrument
– Automating the instrument
– Repeating the measurement
Strategies for minimizing sample size
and maximizing power
• Use unequal group sizes
– In general, the gain in power when the size of one group is increased to twice the size of the other is considerable
– Tripling or quadrupling one of the groups provides progressively smaller gains
– Example: in a case-control study → 1 case : 2 controls
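The diminishing gains can be illustrated with a commonly quoted approximation (given, for example, in Hulley et al.) for c controls per case: n' = [(c + 1)/(2c)] × n, where n is the number of cases needed with one control per case. A Python sketch; n = 100 is a hypothetical starting point and `cases_needed` is an illustrative name:

```python
import math

def cases_needed(n_equal: int, controls_per_case: int) -> int:
    """Approximate number of cases when each case has c controls."""
    c = controls_per_case
    return math.ceil((c + 1) / (2 * c) * n_equal)

# Hypothetical study that needs 100 cases and 100 controls at 1:1.
for c in (1, 2, 3, 4):
    cases = cases_needed(100, c)
    print(f"1:{c}  cases = {cases}, controls = {cases * c}")
```

Under this approximation the number of cases falls from 100 to 75 with two controls per case, but only to 67 and 63 with three and four.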
Strategies for minimizing sample size
and maximizing power
• Use more common outcome (with caution!)
– More frequent outcome
– Enroll subjects at greater risk of developing
that outcome
– Extend the follow-up period
– Loosen the definition of what constitutes an
outcome
Common Errors to Avoid
• Estimating sample size late during the design of the study → most common
• A percentage or rate (a dichotomous outcome) misinterpreted as a numerical variable
• No planning for dropouts or subjects with missing data
• Using a formula for equal group sizes when the group sizes are unequal
• Planning a two-sided alternative hypothesis or statistical analysis (Z1 – α/2), but using a one-sided value (Z1 – α) during sample size determination
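Planning for dropouts is often handled by inflating the calculated sample size. A Python sketch, assuming the common adjustment n / (1 – dropout rate); the required size of 100 and the 10% dropout rate are hypothetical, and the slides prescribe neither the rate nor this particular adjustment:

```python
import math

def adjust_for_dropout(n_required: int, dropout_rate: float) -> int:
    """Inflate a calculated sample size for an anticipated dropout fraction."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(n_required / (1 - dropout_rate))

# Hypothetical: 100 subjects required, 10% expected to drop out.
print(adjust_for_dropout(100, 0.10))  # 112
```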
Literature vs Judgement*
Judgement
– Categorical variable
• Descriptive: probability of type I error = α; precision = d
• Analytic: probability of type I error = α (one/two-sided); probability of type II error = β; p1 – p2
– Numerical variable
• Descriptive: probability of type I error = α; precision = d
• Analytic: probability of type I error = α (one/two-sided); probability of type II error = β; x1 – x2
Literature or pilot study
– Categorical variable
• Descriptive: proportion
• Analytic: proportion in control/non-exposed/standard group = P2
– Numerical variable
• Descriptive: standard deviation
• Analytic: combined standard deviation = S; correlation coefficient = r
*From Dahlan, MS, 2008
Case I
• Students have a variety of reasons for doing research while in medical school. As part of the Jatinangor program you are interested in reproductive health. The aim of your study is to determine the prevalence of puberty (defined by menarche or wet dreams) among primary school children in Kecamatan Jatinangor. There is no previous study on the prevalence of puberty in that community.
Answer
a. The most appropriate study design: cross-sectional study
Outcome variable: prevalence of puberty (history of menarche or wet dreams → Yes/No, nominal)
Predictor variable: -
b. The most appropriate statistical analysis for the study: descriptive statistics
Answer
c. The target population: all primary school children in Kecamatan Jatinangor
The accessible population: primary school children in Kecamatan Jatinangor
Study unit of the study: students aged 7 – 12 years
d. The appropriate sampling technique for the study: stratified random sampling, cluster sampling
Answer
e. Using a 95% confidence interval (α = 0.05) and a precision of 10% (within 10% of the true value), the sample size needed is:
• For α = 0.05, Z0.975 = 1.96
• Check that npq ≥ 5 → 97(0.5)(0.5) = 24.25 ≥ 5
• The researcher will need at least 97 students aged 7 – 12 years
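The figure of 97 is consistent with the usual single-proportion formula n = Z²pq/d², taking p = q = 0.5 because no previous study is available. A Python sketch under that assumption (the formula is inferred from the numbers above, and `n_single_proportion` is an illustrative name):

```python
import math

def n_single_proportion(p: float, d: float, z_alpha: float = 1.96) -> int:
    """Sample size to estimate a proportion: n = Z^2 * p * q / d^2, rounded up."""
    q = 1 - p
    return math.ceil(z_alpha**2 * p * q / d**2)

n = n_single_proportion(p=0.5, d=0.10)
print(n)              # 97
print(n * 0.5 * 0.5)  # 24.25, so the n*p*q >= 5 check holds
```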
Case II
• Suppose we wish to know the mean random blood glucose level (mg/dl) among medical students in Faculty of Medicine X
Answer
a. The most appropriate study design: cross-sectional study
Outcome variable: random blood glucose level (numeric)
Predictor variable: -
b. The most appropriate statistical analysis for the study: descriptive statistics
Answer
c. The target population: all medical students in Faculty of Medicine X
The accessible population: all medical students in Faculty of Medicine X
The study unit of the study: medical student
d. The appropriate sampling technique for the study: simple random sampling, stratified random sampling
Answer
Aspects that can be determined by the researcher from the beginning:
• d (precision)
Aspects that must be obtained by the researcher from the literature or a pilot study:
• s (standard deviation)
f. In a pilot study, ten students were selected and their random blood glucose levels were measured. Using α = 0.05 and a precision of 2.5 mg/dl, the estimated sample size needed for the study is:
Answer
• For α = 0.05 then Z0.975 = 1.96 ; d = 2.5 mg/dl ; s =
13.47 mg/dl
• The researcher will need at least 112 medical
students
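The result of 112 matches the usual formula for estimating a mean, n = (Z·s/d)². A Python sketch under that assumption (the formula is inferred from the numbers above; `n_single_mean` is an illustrative name):

```python
import math

def n_single_mean(s: float, d: float, z_alpha: float = 1.96) -> int:
    """Sample size to estimate a mean: n = (Z * s / d)^2, rounded up."""
    return math.ceil((z_alpha * s / d) ** 2)

# s = 13.47 mg/dl from the pilot study, d = 2.5 mg/dl.
print(n_single_mean(s=13.47, d=2.5))  # 112
```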
Case III
• A medical student from the 2010 batch is preparing to conduct a study (for his minor thesis) on risk factors of diarrhea. Suppose the hypothesis is that exclusively breastfed babies (first six months of life) will be less dehydrated (mild to moderate vs severe) during diarrhea at the age of 7 to 11 months. The researcher wishes to conduct the study in Hasan Sadikin Hospital, Bandung, for the period January – December 2011.
Answer
a. The most appropriate study design? Case-control,
cross-sectional study
Outcome variable : dehydration during diarrhea (mild to
moderate or severe, nominal)
Predictor variable : history of exclusive breastfeeding (yes
or no, nominal)
b. The most appropriate statistical analysis for the
study: Chi-square test (assuming there are no
confounding variables)
Answer
c. The target population: babies aged 7 to 11 months diagnosed with diarrhea and treated in the Pediatric Emergency Unit, Hasan Sadikin Hospital, Bandung, in the period January – December 2011
The accessible population: babies aged 7 to 11 months diagnosed with diarrhea and treated in the Pediatric Emergency Unit, Hasan Sadikin Hospital, Bandung, in the period January – December 2011
The study unit of the study: medical record
d. The appropriate sampling technique for the study: simple random sampling
Answer
Aspects that can be determined by the researcher from the beginning:
• α
• β
• p1 – p2
Aspects that must be obtained by the researcher from the literature or a pilot study:
• p2 (depends on the study design)
Answer
Using α = 0.05, β = 0.2, and a difference in proportions considered clinically significant by the researcher = 0.2, the estimated sample size needed for the study is as follows.
For α = 0.05, Z0.95 = 1.64 (one-sided); for β = 0.2, Z0.8 = 0.84; p1 – p2 = 0.2
Cross-sectional:
p2 = 18/35 = 0.51
p1 = 0.2 + p2 = 0.2 + 0.51 = 0.71
q1 = 1 – p1 = 1 – 0.71 = 0.29
q2 = 1 – p2 = 1 – 0.51 = 0.49
p = (p1 + p2)/2 = (0.71 + 0.51)/2 = 0.61
q = 1 – p = 1 – 0.61 = 0.39
Case-control:
p2 = 17/32 = 0.53
p1 = 0.2 + p2 = 0.2 + 0.53 = 0.73
q1 = 1 – p1 = 1 – 0.73 = 0.27
q2 = 1 – p2 = 1 – 0.53 = 0.47
p = (p1 + p2)/2 = (0.73 + 0.53)/2 = 0.63
q = 1 – p = 1 – 0.63 = 0.37
Answer
Cross-sectional study
The researcher will need at least 73 exclusively breastfed babies and 73 non-exclusively breastfed babies diagnosed with diarrhea
Answer
Case-control study
• For the case group, the researcher will need at least 71 babies diagnosed with diarrhea plus severe dehydration
• For the control group, the researcher will need at least 71 babies diagnosed with diarrhea plus mild to moderate dehydration
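Both results (73 and 71 per group) are reproduced by the standard formula for comparing two proportions, n = [Zα√(2pq) + Zβ√(p1q1 + p2q2)]² / (p1 – p2)². A Python sketch under that assumption; the formula is inferred from the numbers above, `n_two_proportions` is an illustrative name, and the inputs are the rounded proportions used in this case (unrounded fractions would give slightly different answers):

```python
import math

def n_two_proportions(p1: float, p2: float, z_alpha: float, z_beta: float) -> int:
    """Per-group sample size for comparing two unpaired proportions."""
    q1, q2 = 1 - p1, 1 - p2
    p = (p1 + p2) / 2          # average proportion
    q = 1 - p
    numerator = (z_alpha * math.sqrt(2 * p * q)
                 + z_beta * math.sqrt(p1 * q1 + p2 * q2)) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# One-sided alpha = 0.05 (1.64) and beta = 0.2 (0.84), as in the case.
print(n_two_proportions(0.71, 0.51, 1.64, 0.84))  # cross-sectional: 73
print(n_two_proportions(0.73, 0.53, 1.64, 0.84))  # case-control: 71
```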
Case IV
• The researcher wishes to compare fasting
blood glucose level (mg/dl) between medical
students of Faculty of Medicine X with and
without family history of DM type II. The
subjects were matched according to age and
sex.
Answer
a. The most appropriate study design: cross-sectional
study
Outcome variable : fasting blood glucose level (numeric)
Predictor variable : -
b. The most appropriate statistical analysis for the
study: Paired t-test with Wilcoxon signed-rank test
as an alternative
Answer
c. The target population: All medical students in
Faculty of Medicine X
The accessible population: All medical students in
Faculty of Medicine X
The study unit of the study: Medical student
d. The appropriate sampling technique for the study?
Matching technique
Answer
Aspects that can be determined by the researcher from the beginning:
• α
• β
• x1 – x2
Aspects that must be obtained by the researcher from the literature or a pilot study:
• S (combined standard deviation from two observations)
Answer
Based on a pilot study, six pairs of students with and without a family history of DM type II were selected. Using α = 0.05, β = 0.2, and a difference in means considered clinically significant by the researcher = 2.5 mg/dl, the estimated sample size needed for the study is:
Answer
• For α = 0.05 then Z0.975 = 1.96 (two-sided) and β = 0.2 then
Z0.8 = 0.84
• x1 – x2 = 2.5 ; s1 = 4.88 mg/dl, n1 = 6 ; s2 = 3.74 mg/dl, n2 = 6
• The researcher will need at least 24 medical students with a family history of DM type II and 24 medical students without a family history of DM type II (matched according to age and sex)
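The figure of 24 pairs is reproduced by pooling the two pilot standard deviations into a combined S and applying the paired-means formula n = [(Zα + Zβ)·S/(x1 – x2)]². A Python sketch under that assumption; both the pooling step and the formula are inferred from the numbers above rather than stated in the slides, and the function names are illustrative:

```python
import math

def pooled_sd(s1: float, n1: int, s2: float, n2: int) -> float:
    """Combined SD of two groups, weighted by their degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def n_paired_means(s: float, diff: float, z_alpha: float, z_beta: float) -> int:
    """Number of pairs: n = ((Z_alpha + Z_beta) * S / diff)^2, rounded up."""
    return math.ceil(((z_alpha + z_beta) * s / diff) ** 2)

s = pooled_sd(4.88, 6, 3.74, 6)
print(round(s, 2))                          # 4.35
print(n_paired_means(s, 2.5, 1.96, 0.84))   # 24
```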
Case V
• The investigator wants to conduct a cross-sectional study to determine whether DM has a negative effect on the treatment outcome of TB. Data will be collected from a hospital. The register shows that 50 people meet the inclusion criteria for this study. In a previous study, after 6 months of therapy, 9.6% of cultured sputum specimens from non-diabetic patients were still positive for Mycobacterium tuberculosis (RR = 2.65).
Answer
a. Outcome variable : response for treatment
(Yes-No, nominal)
Predictor variable : random blood glucose
level (numeric)
b. The most appropriate statistical analysis for
the study: Chi-square test
Answer
c. The target population: All TB patients with DM in
Hospital X
The accessible population: adult TB patients aged 20 to 65 years diagnosed with DM and treated in Hospital X
The study unit of the study: Medical record
d. The appropriate sampling technique for the study?
Simple random sampling
• What is the power of the study when the sample is taken by total sampling? (Using α = 0.05): rearrange the sample size formula to solve for Zβ and substitute the available sample size
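Solving the two-proportion sample size formula for Zβ gives Zβ = [√n·|p1 – p2| – Zα√(2pq)] / √(p1q1 + p2q2), and the power is the standard normal probability below Zβ. A Python sketch with loudly stated assumptions: the case does not say how the 50 eligible patients split between diabetic and non-diabetic groups, so 25 per group is hypothetical; a one-sided Zα of 1.64 is assumed; and p1 is taken as RR × p2:

```python
import math
from statistics import NormalDist

def power_two_proportions(n_per_group: int, p1: float, p2: float,
                          z_alpha: float) -> float:
    """Power (1 - beta) for comparing two proportions with n per group."""
    q1, q2 = 1 - p1, 1 - p2
    p = (p1 + p2) / 2
    q = 1 - p
    z_beta = ((math.sqrt(n_per_group) * abs(p1 - p2)
               - z_alpha * math.sqrt(2 * p * q))
              / math.sqrt(p1 * q1 + p2 * q2))
    return NormalDist().cdf(z_beta)

p2 = 0.096          # culture-positive proportion in non-diabetic patients
p1 = 2.65 * p2      # assumed: RR applied to the non-diabetic proportion
# Hypothetical split of the 50 eligible patients into 25 per group.
print(round(power_two_proportions(25, p1, p2, z_alpha=1.64), 2))
```

Under these assumptions the power is well below the conventional 80%, which is the kind of finding this recalculation is meant to reveal.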
Case VI
• Let’s say the researcher has a hypothesis that serum 25(OH)-vitamin D level (ng/ml) is positively correlated with bone mineral density, estimated using the quantitative ultrasound index (QUI), among postmenopausal women in Kecamatan Jatinangor
Answer
a. The most appropriate study design: case-control, cross-sectional study
Variables: serum 25(OH)-vitamin D level (numeric) and quantitative ultrasound index (numeric)
b. The most appropriate statistical analysis for the study: correlation methods (Pearson’s or Spearman’s rho correlation coefficient)
Answer
c. The target population: Postmenopausal
women in Kecamatan Jatinangor
The accessible population: Women who
come to Posbindu Lansia in all villages
The study unit of the study: Postmenopausal
woman
d. The appropriate sampling technique for the
study: Consecutive sampling
Answer
Aspects that can be determined by the researcher from the beginning:
• α
• β
Aspects that must be obtained by the researcher from the literature or a pilot study:
• r (Pearson’s correlation coefficient)
Based on a pilot study with 10 participants:
For α = 0.05, Z0.95 = 1.64 (one-sided); for β = 0.2, Z0.8 = 0.84
r = 0.78 (using SPSS or Excel)
Answer
• The researcher will need at least 9 postmenopausal
women
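The result of 9 is reproduced by the usual sample size formula for a correlation coefficient, n = [(Zα + Zβ) / (0.5·ln((1 + r)/(1 – r)))]² + 3. A Python sketch under that assumption (the formula is inferred from the numbers above; `n_correlation` is an illustrative name):

```python
import math

def n_correlation(r: float, z_alpha: float, z_beta: float) -> int:
    """Sample size to detect a correlation r, using Fisher's z transform."""
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)

# r = 0.78 from the pilot study, one-sided alpha = 0.05, beta = 0.2.
print(n_correlation(0.78, 1.64, 0.84))  # 9
```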
Review
• Study design
– Non-specific or specific?
– Observational (cross-sectional, case-control, cohort) or experimental?
• Variables
– Predictor/independent and outcome/dependent
– Scale of measurement: categorical (nominal or ordinal) or numerical
• Paired vs unpaired observation
• Hypothesis
– Type I and type II error (α, β)
– Power of the study (1 – β)
– One or two-sided alternative hypothesis
• Statistical analysis
• Sampling technique
– Probabilistic sampling technique
– Non-probabilistic sampling technique