Statistical Thinking and Smart Experimental Design Luc Wouters November 2014

Statistical Thinking and
Smart Experimental Design
Luc Wouters
November 2014
Statistical Thinking and
Smart Experimental Design
Luc Wouters
November 2014
Statistical Thinking and Reasoning in
Experimental Research
9h00 – 9h30:
11h00 – 11h15:
12h30 -13h15:
15h00 – 15h15:
16h45 – 17h15:
3
Welcome - Coffee
Comments & Question - Coffee
break
Lunch
Comments & Questions – Coffee
Break
Summary – Concluding Remarks –
Comments & Questions
Statistical Thinking and Reasoning in
Experimental Research
¾
¾
¾
¾
¾
¾
¾
¾
¾
¾
4
Introduction
Smart Research Design by Statistical Thinking
Planning the Experiment
Principles of Experimental Design
Common Designs in Biosciences
The Required Number of Replicates - Sample Size
The Statistical Analysis
y
The Study Protocol
Interpretation & Reporting
Concluding Remarks
Statistical Thinking and
Smart Experimental Design
I. Introduction
5
There is a problem with research in the
biosciences - Replicability
6
¾
Scholl, et al. (Cell, 2009):
“Synthetic lethal interaction between oncogenic KRAS
dependency
p
y and STK33 suppression
pp
in human cancer cells.”
¾
Conclusion: cancer tumors could be destroyed by targeting
STK33 protein
¾
Amgen Inc.:
24 researchers, 6 months labwork
¾
Unable to replicate
p
results of Scholl,, et al.
There is a problem with research in the
bi
biosciences
i
- Acceptance
A
t
7
There is a problem with research in the
bi
biosciences
i
- Acceptance
A
t
¾ Séralini, et al. (Food Chem Toxicol, 2012):
“Long term toxicity of a roundup herbicide and a rounduptolerant genetically modified maize”
¾ Conclusion: GM maize and low concentrations of
Roundup herbicide causes toxic effects (tumors) in rats
¾ Press conference
e.g. Natural News: Shock findings in new GMO study: Rats fed lifetime
of GM corn grow horrifying tumors, 70% of females die early
¾ Severe impact on general public and on interest of industry
y Referendum labeling of GM food in California,
y Bans on importation of GMOs in Russia and Kenya
8
There is a problem with research in the
bi
biosciences
i
- Acceptance
A
t
The critiques:
¾ Heavily criticized (VIB)
¾ European
p
Food Safety
y Authority
y ((2012)) review:
• inadequate design, analysis and reporting
• wrong strain of rats
• number of animals too small
¾ Journal had paper withdrawn late 2013
¾ Republished 2014 in journal with less impact (Environmental
Sciences Europe)
9
There is a problem with research in the
bi
biosciences
i
- Efficiency
Effi i
10
The problem of doing “good science”
S
Some
critical
iti l reviews
i
Ioannidis, 2005
Kilkenny, 2009
Loskalzo, 2012
11
Things get even worse
R t ti rates
Retraction
t
¾ 21% of retractions were
due to error
¾ 67% misconduct,
including fraud or
suspected
t d fraud
f d
12
The integrity of our profession is put into
question
ti
13
The integrity of our profession is put into
question
ti
“The way we do our research (with our animals) is
stone age”
Ulrich Dirnagl,
Dirnagl
Charité University Medicine, Berlin
Science 2013
14
What’s wrong with research in the
bi
biosciences?
i
?
15
Co rse objectives
Course
objecti es
¾ Increase quality of research reports by:
¾ Statistical Thinking:
t provide
to
id elements
l
t off a generic
i methodology
th d l
tto
design insightful experiments
¾ Statistical Reasoning:
to provide guidance for interpreting and reporting
results
16
Thi course prepares you ffor a dialogue
This
di l
17
Statistical thinking leads to highly
productive
d ti research
h
¾ Set of statistical precepts
(rules) accepted by his
scientists
¾ Kept research to proceed in
orderly and planned fashion
¾ Open mind for the
unexpected
t d
¾ World record: 77 approved
medicines over a p
period of
40 years
¾ 200 research papers/year
18
Statistical Thinking and
Smart Experimental Design
II. Smart Research Design
g by
y Statistical Thinking
g
19
The architecture of experimental research
The two types of studies
20
The architecture of experimental research
R
Research
h is
i a phased
h
d process
1. Definition
proposal
2. Design
protocol
3. Data Collection
data set
4. Analysis
conclusions
5. Reporting
report
Phase
21
Deliverable
The architecture of experimental research
R
Research
h is
i an iterative
it ti process
22
Modulating between
th concrete
the
t and
d th
the abstract
b t t
23
Research st
styles
les ma
may vary…
ar
The ‘novelist’
D
D
C
A
R
The ‘data salvager’
D
D
C
A
R
The ‘lab freak’
D
24
D
C
A
R
The ‘smart researcher’
D
D
C
A
R
• Definition
• Design
g
• Data Collection
• Analysis
• Reporting
The smart researcher
¾ time
ti
spentt planning
l
i and
dd
designing
i i an experiment
i
t att
the outset saves time and money in the long run
¾ optimizes the lab work and reduces the time spent in
the lab
¾ the design of the experiment is most important and
governs how
h
d
data
t are analyzed
l
d
¾ minimizes time spent at the data analysis stage
¾ can look
l k ahead
h d tto th
the reporting
ti phase
h
with
ith peace off
mind since early phases of study are carefully
planned and formalized (proposal, protocol)
25
The design of insightful experiments is a
process informed
i f
d by
b statistical
t ti ti l thi
thinking
ki
26
The Seven Principles of Statistical
Thi ki
Thinking
1 Time spent thinking on the conceptualization and design of an
1.
experiment is time wisely spent
2. The design of an experiment reflects the contributions from different
sources of variability
3. The design of an experiment balances between its internal validity and
external validity
4. Good experimental practice provides the clue to bias minimization
5. Good experimental design is the clue to the control of variability
6 Experimental design integrates various disciplines
6.
7. A priori consideration of statistical power is an indispensable pillar of
an effective experiment
27
Statistical Thinking and
Smart Experimental Design
III. Planning
g the Experiment
p
28
The planning process
29
The planning process
30
The planning process
31
Limit number
n mber of research objecti
objectives
es
Number of research objectives should be limited
(≤ 3 objectives)
Example Séralini study:
¾ 10 treatment groups, females and males = 20
groups
¾ Only
O l 10 animals/group
i l /
¾ 50 animals/group “golden standard” in toxicology
32
Importance of a
auxiliary
iliar h
hypotheses
potheses
Reliance on auxiliary hypotheses is the rule rather
than the exception in testing scientific hypotheses
(Hempel 1966)
(Hempel,
Example Séralini study:
¾ Use of Sprague-Dawley breed of rats as model for
“human” cancer
¾ Known to have higher mortality in 2 year studies
¾ Prone to develop spontaneous tumors
33
T pes of e
Types
experiments
periments
34
Ab se of exploratory
Abuse
e plorator e
experiments
periments
35
¾
Most published in biomedical research exploratory
¾
Too often published as if confirmatory experiments
¾
Do not unequivocally
q
y answer research q
question
¾
Use of same data that generated research hypotheses to
prove these hypotheses involves circular reasoning
¾
I f
Inferential
ti l statistics
t ti ti should
h ld be
b interpreted
i t
t d and
d published
bli h d
as “for exploratory purposes only”
¾
Séralini studyy was actuallyy conceived as exploratory
p
y
¾
See also Nature 506: 150-152, 13 February 2014
bit.ly/1tJWs4a
Role of the pilot e
experiment
periment
36
Types of experiments
objective
bj ti - usage
37
Statistical Thinking and
Smart Experimental Design
IV. Principles
p
of Statistical Design
g
38
Some terminolog
terminology
39
¾
Factor:
the condition that we manipulate in the experiment (e.g. concentration
of a drug, temperature)
¾
Factor level:
the value of that condition (e.g. 10-5M, 40
40°C)
C)
¾
Treatment:
combination of factor levels (e.g. 10-5M and 40°C)
¾
Response
p
or dependent
p
variable:
characteristic that is measured
¾
Experimental unit:
the smallest physical entity to which a treatment is independently
applied
¾
Observational unit:
physical entity on which the response is measured
¾
See Appendix A
A, Glossary of Statistical Terms
The str
structure
ct re of the response
Assumptions
40
•
Additi model
Additive
d l
•
Treatment effects constant
•
Treatment effects in one unit do
not affect other units
Defining the e
experimental
perimental unit
nit
41
The choice of the e
experimental
perimental unit
nit
y Temme et al. ((2001))
y Comparison of two genetic strains of mice,
wild-type and connexin32-defficient
y Diameters of bile caniliculi in livers
y Statistically significant difference between
Means ± SEM from 3 livers
two strains (P < 0.005
0 005 = 1/200)
y Experimental unit = Cell
y Is this correct?
42
The choice of the e
experimental
perimental unit
nit
43
¾ Experimental unit = animal
¾ Results not conclusive at all
¾ Wrong choice made out of ignorance
ignorance, or out of
convenience?
¾ This was published in peer reviewed journal !!!
The choice of the experimental unit
Pl t R
Plant
Research
h
44
The choice of the experimental unit
Ri
Rivenson
study
t d on N-nitrosamines
N it
i
y Rats housed 3 / cage
y Treatment supplied in drinking water
y Impossible to treat any 2 rats differently
y Rats within a cage not independent
y Rat not experimental unit
y Experimental unit = cage
y Same remarks for Séralini study
y Rats housed 2/cage
y Rats in same cage are not independent of one another
y Number of experimental units is not 10/treatment, but 5/treatment
45
The choice of the experimental unit
Th reverse case
The
y Skin irritation test in volunteers
y 2 compounds
y Backs of volunteers divided in 8 test areas
y Each compound randomly assigned to 4 different test areas
y Experimental unit = Test Area, not volunteer
46
The choice of the experimental unit
Summary
Choice of experimental unit of particular concern when:
¾microscopy studies (the cell is not treated)
¾animals
¾
i l are nott iindividually
di id ll caged
d and/or
d/
independence in question
¾plant research when independence is not
guaranteed. Then the plot, tray, or entire growth
chamber (environmental factor) is experimental
unit (Fernandez 2007).
47
Why engineers don’t care very much
about
b t statistics
t ti ti
48
The basic dilemma:
b l
balancing
i iinternal
t
l and
d external
t
l validity
lidit
49
Failure to minimize bias ruins an
experiment
i
t
F il
Failure
to
t minimize
i i i bias
bi results
lt in
i llostt experiments
i
t
Failure to control variability can sometimes be remediated after the facts
Allocation to treatment is most important source of bias
50
Req irements for a good e
Requirements
experiment
periment
51
Basic strategies for
maximizing
i i i Si
Signal/Noise
l/N i ratio
ti
52
A control
t l is
i a standard
t d d ttreatment
t
t condition
diti against
i t which
hi h
all others may be compared
53
¾ Single blinding – the condition
¾ Neutralizes investigator bias
under which investigators are
uninformed as to the treatment
condition of experimental units
¾ Double blinding – the condition
under which both experimenter
and
d observer
b
are uninformed
i f
d as
to the treatment condition of
experimental
p
units
54
¾ Neutralizes investigator bias
and bias in the observer’s
scoring system
In toxicological histopathology, both
blinded and unblinded evaluation are
recommended
¾ Group blinding
¾ Entire treatment groups are
coded: A, B, C
¾ Individual blinding
¾ Each experimental unit is coded
individually
¾ Codes are kept in randomization
list
¾ Involves independent person that
maintains list and prepares
treatments
55
¾ Describes
D
ib practical
ti l actions
ti
¾ Guidelines for lab technicians
¾ Manipulation of experimental
units (animals, etc.)
¾ Materials used
¾ Logistics
¾ Definitions of measurements
and scoring methods
¾ Data collection & processing
¾ Personal responsibilities
56
¾ A detailed protocol is
imperative to minimize
potential bias
Calibration is an operation
p
that compares
p
the output
p of a measurement
device to standards of known value, leading to correction of the values
indicated by the measurement device
Neutralizes bias in the
investigator’s measurement
system
57
¾ Randomization ensures that the effect of uncontrolled sources
of variability has equal probability in all treatment groups
¾ Randomization provides a rigorous method to assess the
uncertainty (standard error) in treatment comparisons
¾ Formal randomization involves a chance dependent
mechanism, e.g. flip of a coin, by computer (Appendix B)
¾ Formal randomization is not haphazard allocation
¾ Moment of randomization such that the randomization covers
all substantial sources of variation,,
i.e. immediately before treatment administration
¾ Additional randomization may be required at different stages
58
Randomized versus Systematic Allocation
¾ First unit treatment A,, second treatment B, etc. yyielding
g sequence
q
AB,, AB,, AB
¾ Other sequences as AB BA, BA AB, etc. possible
.
59
•
any systematic arrangement might coincide with a specific
pattern in the variability
•
biased estimate of treatment effect
•
randomization
d i ti is
i a necessary condition
diti ffor a statistical
t ti ti l
analysis, without randomization a statistical analysis is not
justified
Improvements of Randomization Schemes
.
¾ Balancing
B l
i means causes iimbalance
b l
iin variability
i bilit
¾ Invalidates subsequent statistical analysis
60
Haphazard Allocation versus Formal Randomization
•
•
•
•
Effect of 2 diets on body weight of rats
12 animals arrive in 1 cage
g
technician takes animals out of cage
first 6 animals Îdiet A, remaining 6 Îdiet B
Isn’t this
Random ?
•
heavy animals react slower and are easier to catch
than smaller animals
•
first 6 animals weigh more than remaining 6
¾ Haphazard treatment allocation is NOT the same as formal randomization
¾ Haphazard allocation is subjective and can introduce bias
¾ Proper randomization requires a physical “randomizing
randomizing device”
device
61
Moment of Randomization
•
Rat brain cells harvested and seeded on
Petri-dishes (1 animal = 1 dish)
•
Dishes randomly divided in two groups
•
Two sets of Petri dishes placed in incubator
•
After 72 hours group 1 treated with drug A
and group 2 with drug B
¾ Effect of treatment confounded with systematic errors due to incubation
¾ Randomization should be done as late as possible just before treatment application
¾ Randomization sequence should be maintained throughout the experiment
62
Basic strategies for
maximizing
i i i Si
Signal/Noise
l/N i ratio
ti
63
¾ Simple random sample
a selection of units from a
defined target population such
that each unit has an equal
chance of being selected
A random sample is a stringent
requirement that is very difficult to
fulfill in biomedical research
64
¾ Neutralizes sampling bias
¾ Provides the foundation for the
population model of inference
¾ Particularly important in genetic
research (Nahon & Shoemaker)
Switch to randomization model
of statistical inference in which
inference is restricted to the
actual experiment only
Reducing the intrinsic variability and bias
by:
¾ use of genetically uniform animals
¾ use of phenotypically uniform animals
¾ environmental control
¾ nutritional control
¾ acclimatization
¾ measurement system
65
Beware off
B
external validity
Basic strategies for
maximizing
i i i Si
Signal/Noise
l/N i ratio
ti
66
Replication can be an effective
strategy to control variability
(precision)
Replication is an expensive
strategy to control variability
Precision of experiment is quantified
by standard error of difference
between two treatment means
Replication
R li ti is
i also
l
needed to estimate SD
67
subsampling:
Standard error of difference:
¾ multiple observations over time
¾ duplicate, triplicate measurement
¾ different levels of subsampling
¾ Replication is in most cases only effective at the level of the true experimental unit
¾ Replication at the subsample level (observational unit) makes only sense when the
variability at this level is substantial as compared to the variability between the
experimental
p
units
68
A blocking factor is a known and unavoidable source of variability that adds
or subtracts an offset to every experimental unit in the block
Typical
blocking factors
• age group
• microtiter plate
• run of assay
• batch of material
• subject (animal)
• date
• operator
69
A covariate is an uncontrollable but measurable attribute of the experimental
unit or its environment that is unaffected by the treatments but may have an
influence on the measured response
Typical
covariates
Covariates filter out one particular source
of variation. Rather than blocking it
represents a quantifiable (by coefficient β)
attribute of the system studied.
• baseline
measurement
• weight
• age
• ambient
temperature
Covariate adjustment requires
slopes
p to be parallel
p
70
Req irements for a good e
Requirements
experiment
periment
71
Simplicit of design
Simplicity
Protocol adherence?
Complex Design
Statistical analysis?
Simple designs require
a simple analysis
72
The calc
calculation
lation of uncertainty
ncertaint (error)
“It iis possible
ibl and
d iindeed
d d it iis allll ttoo ffrequent,
t ffor an experiment
i
t tto b
be so
conducted that no valid estimate of error is available” (Fisher, 1935)
Validity of error
estimation
Respond independently
to treatment
Experimental units
Differ only in random way
from experimental units in
other treatment groups
Without a valid estimate of error,
error a valid
statistical analysis is impossible
73
Statistical Thinking and
Smart Experimental Design
V. Common Designs
g in Biosciences
74
The jjungle
ngle of e
experimental
perimental designs
Split Plot
Greco-Latin Square Design
Latin Square Design
Completely Randomized Design
Randomized Complete Block Design
Plackett-Burman Design
75
Youden Square Design
Factorial Design
Experimental design is
an integrative
i t
ti conceptt
An experimental design is a synthetic approach to
minimize bias and control variability
76
Three aspects of experimental design
d t
determine
i complexity
l it and
d resources
77
Completel Randomi
Completely
Randomized
ed Design
¾ Studies the effect of a single independent variable (treatment)
¾ One-way treatment design
Completely randomized error–control design
¾ Most common design (default design)
design), easy to implement
Lack of precision in
comparisons is
major drawback
78
Completely Randomized Design
E
Example
l 1
¾ Effect of treatment on proliferation of gastric epithelial cells in rats
¾ 2 drugs & 2 controls (vehicle) = 4 experimental groups
g p, individual housing
g in cages
g
¾10 animals/group,
¾ Cages numbered and treatments assigned randomly to cage numbers
¾ Cages sequentially put in rack
¾ Blinding of laboratory staff and of
histological evaluation,
¾ Treatment code = cage number
(individual blinding)
79
Completely Randomized Design
E
Example
l 2
Bias due to plate location effects in microtiter plates (Burrows (1984)
(1984),
causes parabolic patterns:
Underlying
y g causes
unkown
Solution byy Faessel ((1999):
)
80
•
Random allocation of
treatments to the wells
•
Bias of plate location is
now introduced as
random variation
Randomization can be used to transform bias into noise
A convenient way to randomize microtiter
plates
l t
y dilutions made in tube rack 1
y tubes are numbered
y MS Excel used to make
randomization map
y map taped on top of rack 2
y tubes from rack 1 are pushed to
corresponding number in rack 2
y multipipette used to apply drugs
to 96-well plate
y results entered in MS Excel and
sorted
Alternative methods:
• design (Burrows et al
al. 1984
1984, Schlain et al.
al 2001)
• statistical model (Straetemans et al. 2005)
81
Factorial Design (full factorial)
¾ Studies the jjoint effect of two or more independent variables ((factors))
¾ Considers main effects and interactions
¾ Factorial treatment design
C
Completely
l t l R
Randomized
d i d error-control
t l design
d i
82
Factorial Design
E
Example
l
No interaction, additive effects
83
Interaction present, superadditive effects
Factorial Designs
I t
Interaction
ti and
d Drug
D
Synergy
S
A hypothetical
y
factorial
experiment of a drug combined
with itself at 0 and 0.2 mg/l
0
0.2 mg/L
0
0%
1%
0.2 mg/L
1%
22%
S
Synergy
off a drug
d
with
ith itself?
it lf?
The pharmacological concept of drug synergy requires more than just
statistical interaction
84
Factorial Design
F th examples
Further
l
¾
Effects of 2 different culture media and 2 different incubation times on
growth of a viral culture
¾
Microarray experiments interaction between mutants and time of
observation
b
ti
¾
For cDNA microarrays specialized procedures required (see Glonek
and Solomon
Solomon, 2004; Banerjee and Mukerjee,
Mukerjee 2007)
¾ The need for a factorial experiment should be recognized at the
design phase
¾ The analysis should make use of the proper methods
i.e. ANOVA
85
Factorial Design
Hi h di
Higher
dimensional
i
ld
designs
i
¾
3 x 3 x 3 design = 27 treatments , 2 replicates = 54 experimental units
is this manageable?
¾
Interpretation
p
of higher
g
((3-way)
y) interactions?
Neglect higher order interactions
Do not test ALL combinations
Fractional Factorial Design
(e g Process optimization)
(e.g.
86
Factorial Design
Effi i
Efficiency
off factorial
f t i l design
d i
Only valid method to study interaction between factors
B absent
B Present
A Absent
18
15
A Present
13
10
Interaction =
(18 + 10) – (15 +13) = 0
No important interaction from statistical
AND scientific point of view
Effect of factor A:
[(13 – 18) + (10 -15)]/2 = -5
d.f. = 4 x (n – 1)
87
Effect of factor B:
[(15 – 18) + (10 -13)]/2 = -3
d.f. = 4 x (n – 1)
Increase external
validity
Randomi ed Complete Block Design
Randomized
¾ Studies the effect of a single
g independent
p
variable ((treatment)) in
presence of a single isolated extraneous source of variability (blocks)
¾ One way treatment design
One
O
e way
ay b
block
oc e
error-control
o co t o des
design
g
• Increase of precision
• Eliminates possible
imbalance in
blocking factor
• Assumes treatment effect the same among all blocks
• Blocking characteristic must be related to response or else
loss of efficiency due to loss in df
88
Randomized Complete Block Design
E
Example
l
¾ Effect of treatment on proliferation of gastric epithelial cells in rats
¾ 2 drugs & 2 controls (vehicle) = 4 experimental groups
¾10 animals/group, individual housing in cages
¾ Cages numbered and treatments assigned randomly to cage numbers
¾ Blinding of laboratory staff
Shelf height will probably influence the results
¾
¾
¾
¾
89
1 shelf = 1 block
8 animals (cages) / shelf, 5 shelves
Randomize treatments per shelf
2 animals / shelf / treatment
Not restricted to shelf height, other characteristics
can also be included (e.g. body weight)
Randomized Complete Block Design
P i dD
Paired
Design
i
¾ Special case of RCB design with only 2 treatments and 2 EU/block
¾ Example Cardiomyocyte experiment
90
Randomized Complete Block Design
P i dD
Paired
Design
i – Example
E
l
Experimental Design
Results
• Left panel
• ignores pairing
• groups overlap,
• standard error
drug-vehicle = 7.83
• Right panel
• pairs are connected by lines
• consistent increase in drug
treated dishes
• standard error
drug-vehicle
drug
vehicle = 2.51
2 51
91
The CRD requires 8 times more EU’s then the paired design
Latin Sq
Square
are Design
¾ Studies the effect of a single independent variable (treatment) in
presence of two isolated, extraneous sources of variability (blocks)
¾ One-way treatment design
Two-way block error-control design
•
•
•
92
In 1 k x k Latin square only k EU’s/treatment
More EU’s
EU s by stacking Latin squares upon each other or next to each
other
Randomizing the rows (columns) of the stacked LS design can be
advantageous
Latin squares to eliminate row-column effects in 96-well microtiter plates, 1 plate = 2 x (5 x 5 LS)
also of interest for experiments in green houses and growth chambers
Split Plot Design
¾ A split-plot design recognizes two types of experimental units:
•
• Plots
• Subplots
Two-way crossed (factorial) treatment design
Split plot error-control
error control design
¾ Plots randomly assigned to
primary factor (color)
¾ Subplots randomly assigned
to secondary factor (symbol)
93
Split Plot design
E
Example
l
¾ Effects of 2 diets and vitamin supplement on weight
g g
gain in rats
¾ Rats housed 2 / cage (independence?)
¾ Cages randomly allocated to diet A or diet B
¾ Individual rats color marked randomly selected to receive vitamin or not
¾ Main plot factor = diet
¾ Subplot factor = vitamin
94
Repeated Meas
Measures
res Design
¾ Two independent variables ‘time’
time and ‘treatment’
treatment
Each is associated with different experimental units
¾ Special case of split plot design
95
Statistical Thinking and
Smart Experimental Design
VI. The Required
q
Number of Replicates
p
– Sample
p Size
96
Determining sample size is a
risk-cost
i k
t assessmentt
Replicates
Cost
Uncertainty
Confidence
97
Choose sample size just big enough to
give confidence that any biologically
meaningful effect can be detected
Th context
The
t t off biomedical
bi
di l experiments
i
t
Sample size depends on
Study
Obj ti
Objectives
Statistical
Context
Point Estimation
Interval Estimation
Study
Design
Hypothesis Testing
Study Specifications
Assumptions
98
The h
hypothesis
pothesis testing conte
contextt
Population Model
Null Hypothesis
Decision made
Do not reject null hypothesis
Reject null hypothesis
Alternative Hypothesis
State of Nature
Null hypothesis Alternative
true
hypothesis
true
Correct decision False negative
(1 – α)
β
False positive
α
Correct decision
(1 – β)
• Alternative hypothesis,
effect size:
• Size of difference
• Variability
Specify
p
y allowable
false positive &
false negative rate
S
Sample
l size
i
Level of significance
0.01, 0.05, 0.10
99
Power
80% or 90%
Number of treatments
Number of blocks
Major determinants of sample si
size
e
100
Approaches for determining sample size
Formal power analysis approaches
Requires input of
• significance level
• power
• alternative hypothesis (i.e. effect size Δ)
¾ R-package pwr (Champely, 2009)
¾ Lehr’s equation:
q
•
•
•
•
smallest difference
variability (SD)
n per group
single comparison (2 groups)
significance level 0.05
power 80%, other powers possible
Rule-of-thumb: Mead’s Resource Requirement Equation
E=N-T-B
101
•
•
•
•
E = df for error, 12 ≤ E ≤ 20, blocked exps. ≥ 6
N = df total sample size (no. EUs – 1)
T = df for treatment combination
B = df for blocks and covariates
df = number of occurences – 1
Approaches for determining sample size
E
Examples
l
Effect size Δ = difference in means / standard deviation
Δ = 0.2
0 2 small; Δ = 0.5
0 5 medium; Δ = 0.8
0 8 large (Cohen)
Cardiomyocytes completely randomized design
Treated - Control = 10; SD = 12.5
Δ = 10/12.5 = 0.8
standard deviation of either group
or pooled SD
> require(pwr)
> pwr.t.test(d=0.8,power=0.8,sig.level=0.05,type="two.sample",
pwr t test(d 0 8 power 0 8 sig level 0 05 type "two sample"
+
alternative="two.sided")
Two-sample t test power calculation
n = 25.52457
d = 0.8
sig.level = 0.05
power = 0.8
alte nati e = two.sided
alternative
t o sided
NOTE: n is number in *each* group
102
Lehr’s equation
16/0.64 = 25
Approaches for determining sample size
E
Examples
l ((cont.)
t)
Rule-of-thumb: Mead’s Resource Requirement Equation
E=N-T-B
•
•
•
•
E = df for error, 12 ≤ E ≤ 20 , ≥ 6 blocked exps.
N = df total sample size (no. EUs – 1)
T=d
df for
o treatment
ea e co
combination
b a o
B = df for blocks and covariates
Cardiomyocytes completely randomized:
N=9
9, T = 1
1, B = 0
E = 10, too small (min. = 12)
At least 14 E.U.: 13 (N) – 1 (T) = 12 (E)
Cardiomyocytes paired experiment:
N = 9, T = 1, B = 4
E = 4, too small (min. = 6)
At least: 14 EU (7 animals): 13(N) – 1 (T) – 6 (B)
103
Ho
man subsamples
s bsamples ?
How many
Standard error of difference:
Vertical line corresponds to upper bound:
Taking differential costs into consideration:
104
How many subsamples ?
E
Example
l
Morphologic study diameter of cardiomyocytes in sheep
7 sheep with intervention
6 sheep control
Diameter of 100 epicardial cells/sheep
Variance components:
Upper bound (neglecting differential costs):
Assume cost of 1 animal is equivalent to 100 subsamples then:
1 animal equivalent to 1000 subsamples then 55 cells sufficient
Taking a priori 100 subsamples was definitely a waste of time in this
case
105
M ltiplicit and sample si
Multiplicity
size
e
106
M ltiplicit and sample si
Multiplicity
size
e
•
•
•
•
¾ For single comparison, large sample sizes required for
medium (Δ = 0.5) to large (Δ = 0.8) effects
107
¾ Effects in early research usually extremely large
2 tests: 20% increase RSS
3 tests: 30% increase RSS
4 tests: 40% increase RSS
10 tests: 70% increase RSS
Statistical Thinking and
Smart Experimental Design
VII. The Statistical Analysis
y
108
The Statistical Triangle
Design, Model, and Analysis are Interdependent
109
Every experimental design is underpinned
b a statistical
by
t ti ti l model
d l
Statistical analysis:
•
•
110
Fit model to data
Compare treatment effect(s) with error
Measurement theory defines allowable
operations
ti
and
d statements
t t
t
111
Nominal
¾ Naming
¾ Ex.
E C
Color,
l gender,
d etc.
t
¾ Frequency tables
Ordinal
¾ Ranking and ordering
¾ Ex.
E S
Scoring
i as none, minor,
i
moderate,….
d t
¾ Differences between scores do not reflect same
distance
¾ Frequency tables, median values, IQR
Interval
¾
¾
¾
¾
Ratio
¾ Differences and ratios make sense
¾ True zero point, negative numbers impossible
¾ Ex.
Ex Concentrations
Concentrations, duration
duration, temperature in degrees
Kelvin, age, counts,…
¾ Geometric means, coefficient of variation
Adding, subtracting (no ratios)
Arbitrary zero, 30°C ≠ 2 x 15°C
Ex. Temperature in degrees Fahrenheit or Celsius
Means standard deviations
Means,
deviations, etc
etc.
Significance tests
The concept of the p-value is often misunderstood
Randomization Model
Exp.
Data
Null hypothesis (H0): no difference in response
between treatment groups
Consider H0 as true
Observed
Test Statistic
t
P-value
P
value =
Probability of obtaining a value for
the test statistic that is at least as
extreme as t assuming H0 is true
Statistical Model
Reject H0
112
Yes
p≤α
No
Do not reject H0
Significance tests
E
Example
l Cardiomyocytes
C di
t
Mean diff. = 7% viable MC
standard error = 2.51
Assume H0 is true
Test statistic
t = 7/2.51 = 2.79
Distribution of possible
values of t with mean 0
P(t≥2.79) = 0.024
113
Probabilityy of obtaining
g an
increase of 2.79 or more is
0.024, provided H0 is true
Significance tests
T
Two-sided
id d ttests
t
Mean diff. = 7% viable MC
standard error = 2.51
Assume H0 is true
Test statistic
t = 7/2.51 = 2.79
Distribution of possible
values of t with mean 0
R
Results
lt iin opposite
it
direction are also of
interest
P(|t|≥2.79) = 0.048
114
Two-sided tests are the rule!
Probabilityy of obtaining
ga
difference of ±2.79 or more
is 0.048, provided H0 is true
St ti ti always
Statistics
l
makes
k assumptions
ti
Parametric Methods:
Nonparametric Methods:
•
•
•
•
•
•
Distribution of error term
Randomization
Independence
Equality of variance
Randomization
Independence
Always verify assumptions
•
•
115
Planning (historical data)
Analysis time before actual
analysis
M ltiplicit
Multiplicity
Multiple comparisons
Multiplicity
Multiple variables
Multiple
p measurements ((in
time)
116
•
20 doses of drug tested against its control each at
g
level of 0.05
significance
•
Assume all null hypotheses are true, no dose differs from
control
•
P(correct decision of no difference) = 1 - 0.05 = 0.95
•
P(ALL 20 decisions correct) = 0.9520 = 0.36
•
P(at least 1 mistake) =
P(NOT ALL 20 correct) = 1 – 0.36 = 0.64
Decision made
Do not reject null hypothesis
Reject null hypothesis
State of Nature
Null Alternative
hypothesis
hypothesis
true
true
Correct decision
(1 – α)
False positive
α
False negative
β
Correct decision
(1 – β)
(1 –
•
The “Curse of Multiplicity” is of
particular importance in microarray
data (30
(30,000
000 genes → 1
1,500
500 FP)
•
Multiplicity and how to deal with it
must be recognized at the
planning stage
Statistical Thinking and
Smart Experimental Design
VIII. The Studyy Protocol
117
The writing of the study protocol finalizes
th research
the
hd
design
i phase
h
Research protocol
Technical protocol
¾ Rehearses research logic
¾ Forms the basis for reporting
¾ Describes practical actions
¾ Guidelines for lab technicians
• General hypothesis
• Working hypothesis
• Experimental design rationale
(treatment, error control, sampling
design)
• Measures to minimize bias
(randomization, etc.)
• Measurement methods
• Statistical analysis
118
•
•
•
•
•
Manipulation of animals
Materials
Logistics
Data collection & processing
Personal responsibilities
Writing the Study Protocol is
already working on the Study
Report
Statistical Thinking and
Smart Experimental Design
IX. Interpretation
p
& Reporting
p
g
119
P i t to
Points
t consider
id in
i the
th Methods
M th d S
Section
ti
Experimental
p
design
g
Weaknesses & Strengths :
•
•
•
•
•
•
•
120
Size and number of experimental groups
How were experimental units assigned to
treatments?
Was proper randomization used or not?
Was it a blinded study and how was
g introduced?
blinding
Was blocking used and how?
How were outcomes assessed?
In case of ambiguity, which was the
experimental unit and which the
observational unit and why?
Statistical Methods:
•
•
•
•
Report and justify statistical methods
with enough detail
Supply level of significance and when
applicable
li bl direction
di ti off ttestt ((one-sided,
id d
two-sided)
Recognize multiplicity and give
justification of strategy to deal with it
Reference to software and version
Points to consider in the Results Section
S
Summarizing
i i th
the D
Data
t
Always report number of experimental units used in the analysis
All units
it mustt b
be accounted
t d ffor, no di
discrepancies
i with
ith M
Methods
th d
For - and only for - normally
di ib d data
distributed
d
mean and
d standard
d d
deviation (SD) are sufficient statistics
• SD is descriptive statistic about spread,
i.e. variability
• Report as mean (SD) NOT as mean ± SD
• SEM is measure of precision of the mean
(makes not much sense)
Mean – 2 x SD can lead to ridiculous values
if data not normally distributed
(concentrations, durations, counts, etc.)
Not normally distributed
Median values and interquartile range
Extremely small datasets (n < 6)
Raw data
Spurious
p
p
precision detracts a p
paper’s
p
readability
y
and credibility
121
Points to consider in the Results Section
G hi l Di
Graphical
Displays
l
¾ Complement tabular presentations
¾ Better suited for identifying patterns:
“A picture says more than a thousand words”
¾ Whenever possible display individual data (e.g. use of beeswarm package in R)
20
Occurence
es
15
10
5
0
Fa
Fb
Ma
Individual data, median values and 95%
confidence
fid
iintervals
l
122
Mb
Longitudinal data (measurements over time)
Individual subject profiles
Points to consider in the Results Section
Si ifi
Significance
Tests
T t
¾ Specify method of analysis do not leave ambiguity
e.g. “statistical methods included ANOVA, regression analysis as well as tests
of significance” in the methods section is not specific enough
¾ Tests of significance should be two-sided
• Two group tests allow direction of alternative hypothesis, e.g. treated >
control, this is a one-sided alternative
• Use of one-sided tests must be justified and the direction of the test must be
stated beforehand ((in protocol)
p
)
• For tests that allow a one-sided alternative it must be stated whether twosided or one-sided was used
¾ Report exact p-values rather than p<0.05 or NS
•
•
•
•
•
123
0.049 is significant, 0.051 not ?
Allows readers to choose their own critical value
Avoid reporting p = 0.000; but report p<0.001 instead
Report p-values to the third decimal
For one-sided tests if result is in “wrong” direction then p’ = 1 - p
Points to consider in the Results Section
Si ifi
Significance
Tests
T t – Confidence
C fid
IIntervals
t
l
Provide confidence intervals to interpret size of
effect
ff t
•
•
•
124
Null hypothesis is never proved
Lack of evidence is no evidence for lack of
effect
ff t
Confidence intervals provide a region of
plausible values for the treatment effect
Points to consider in the Results Section
Si ifi
Significance
Tests
T t – Confidence
C fid
IIntervals
t
l
125
X is significant, while Y is not
F l claims
False
l i
off iinteraction
t
ti
“The percentage of neurons showing cue-related activity increased with
training in the mutant mice (P<0.05), but not in the control mice (P>0.05)”
Invalid reasoning from nonsignificant result (P>0.05)
The difference between “significant”
g
and “not significant”
g
is not itself
significant
Test for equalitiy of effect of training in 2 groups
Only a factorial experiment with proper method of analysis
(ANOVA) and test for interaction can make such a claim
126
Concluding Remarks
Statistical thinking is the key to succesful experiments
1. Time spent thinking on the conceptualization and design of an
experiment is time wisely spent
2. The design of an experiment reflects the contributions from different
sources of variability
3 The design of an experiment balances between its internal validity and
3.
external validity
4. Good experimental practice provides the clue to bias minimization
5. Good experimental design is the clue to the control of variability
6. Experimental design integrates various disciplines
7 A priori consideration of statistical power is an indispensable pillar of
7.
an effective experiment
127
Concl ding Remarks
Concluding
R l off St
Role
Statistician?
ti ti i ?
128
•
Professional skilled in solving research problems
•
Team member and collaborator/partner in research project often granted
co-authorship
•
Should be consulted when there is doubt with regard to design, sample
size, or statistical analysis
•
It can be wise to include a statistician from the very beginning of the
project
Fi h about
Fisher
b t th
the R
Role
l off th
the St
Statistician
ti ti i
“To
To consult the statistician after an
experiment is finished is often merely to ask
him to conduct a post mortem examination.
perhaps
p say
y what the experiment
p
He can p
died of.”
(Presidential Address to the First Indian
Statistical Congress, 1938; Fisher, 1938).
129
Statistical Thinking and Smart
Experimental Design
Slides, Course notes at:
http://www.datascope.be
Teşekkür ederim
131