How to Reduce the Number of Animals by Improving Experimental Design and Statistics in Drug Development
Michael FW Festing
c/o Understanding Animal Research, 25 Shaftesbury Av., London, UK.
[email protected]

"...the standard of design of experimental investigations is poor and the basic principles of design are widely ignored..."
Mead (1990) The Design of Experiments, Cambridge Univ. Press

Poor agreement between animal and human responses

Intervention | Human results | Animal results (meta-analysis) | Agree?
Corticosteroids for head injury | No improvement | Improved neurological outcome (n=17) | No
Antifibrinolytics for surgery | Reduces blood loss | Too little good quality data (n=8) | No
Thrombolysis with TPA for acute ischaemic stroke | Reduces death | Reduces death but publication bias and overstatement (n=113) | Yes
Tirilazad for stroke | Increases risk of death | Reduced infarct volume and improved behavioural score (n=18) | No
Corticosteroids for premature birth | Reduces mortality | Reduces mortality (n=56) | Yes
Bisphosphonates for osteoporosis | Increase bone density | Increase bone density (n=16) | Yes

Perel et al (2007) BMJ 334:197-200

Good experimental design
• Saves animals, so easier to justify ethically
• Saves time and money
• Badly designed experiments are wasteful and may give invalid results

Types of experiment
• Pilot study
  • Logistics and preliminary information
• Exploratory experiment
  • Aim is to provide data to generate hypotheses
  • May “work” or “not work”
  • Often many outcomes
  • Statistical analysis may be problematical (many characters measured, data snooping). p-values may not be correct
• Confirmatory experiment
  • Clear specification of aims of the experiment
  • Simple formal hypothesis stated a priori
  • Choice of model, treatments and dependent variables

A well designed experiment
• Absence of bias
  • Correct experimental unit, randomisation, blinding
• High power
  • Low noise (uniform material, blocking, covariance)
  • High signal (sensitive subjects, high dose)
  • Large sample size
• Wide range of applicability
  • Replicate over other factors (e.g. sex, strain): factorial designs
• Simplicity
• Amenable to a statistical analysis

Experimental Unit
The smallest division of the experimental material such that any two experimental units can receive different treatments.
• Unit of randomisation
• Unit of statistical analysis

The animal as the experimental unit
Animals individually treated. May be individually housed or grouped.
N=8

A cage as the experimental unit
Treatment in water or diet. Animals within a cage cannot receive different treatments.
N=4

An animal for a period of time: repeated measures or crossover design
[Figure: animals 1, 2, 3, ... each receiving Treatment 1 and Treatment 2 in different periods. N=16]

Teratology: mother treated, young measured
Mother is the experimental unit.
N=2

Aim: to detect strain differences in diurnal pattern of blood alcohol
[Figure: two panels, each labelled "ELD group"]
Single cage of 8 mice killed at each time point (288 mice in total)

Randomisation
Minimises the chance of a systematic difference between groups causing bias
Method: physical (using cards) or spreadsheet

Original | Randomised | Animal number
1 | 2 | 1
1 | 3 | 2
1 | 3 | 3
1 | 1 | 4
2 | 2 | 5
2 | 1 | 6
2 | 2 | 7
2 | 1 | 8
3 | 3 | 9
3 | 2 | 10
3 | 3 | 11
3 | 1 | 12

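The spreadsheet method described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `randomise_treatments` and the fixed seed are assumptions made here for reproducibility, not part of the original talk.

```python
import random


def randomise_treatments(n_animals, treatments, seed=None):
    """Assign treatments to animals 1..n_animals in random order with equal group sizes."""
    if n_animals % len(treatments) != 0:
        raise ValueError("n_animals must be a multiple of the number of treatments")
    per_group = n_animals // len(treatments)
    # Start from the "original" ordered list (1,1,1,1,2,2,2,2,3,3,3,3) ...
    labels = [t for t in treatments for _ in range(per_group)]
    # ... and shuffle it to obtain the "randomised" column.
    rng = random.Random(seed)
    rng.shuffle(labels)
    return dict(zip(range(1, n_animals + 1), labels))


# Hypothetical seed, used only so that the allocation can be reproduced.
allocation = randomise_treatments(12, [1, 2, 3], seed=2024)
for animal, treatment in allocation.items():
    print(f"animal {animal:2d} -> treatment {treatment}")
```

Recording the seed alongside the allocation makes the randomisation auditable, which also helps when a second person holds the key for blinding.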
Randomisation, blinding and cage assignment
Randomised treatment order for animals 1 to 12: 2, 3, 3, 1, 2, 1, 2, 1, 3, 2, 3, 1

Housing option | Treatments by cage (cage 1; 2; 3; 4; etc.)
Individually housed | 2; 3; 3; 1; etc.
Individual + companion | 2,X; 3,X; 3,X; etc.
Grouped at random | 2,3,3,1; 2,1,2,1; etc.
Randomised block | 1,2,3; 1,2,3; etc.
Two/box, box is ExpU | 1,1; 1,1; 2,2; 2,2; etc.
By treatment, box is ExpU | 1,1,1,1,1; 2,2,2,2; etc.

Failure to randomise and/or blind leads to more “positive” results

Comparison | Odds ratio (95% CI)
Blind / not blind | 3.4 (1.7-6.9)
Random / not random | 3.2 (1.3-7.7)
Blind random / not blind random | 5.2 (2.0-13.5)

290 animal studies scored for blinding, randomisation and positive/negative outcome, as defined by authors
Bebarta et al 2003 Acad. emerg. med. 10:684-687

A well designed experiment
• Absence of bias
  • Identify the experimental unit, randomisation, blinding
• High power
  • Large sample size
  • Low noise
  • Good signal
• Wide range of applicability
  • Replicate over other factors (e.g. sex, strain)
• Simplicity
• Amenable to a statistical analysis

Sample size determination
• Power analysis: mathematical combination of six variables
  • Use for clinical trials (e.g. simple but expensive)
  • Difficult to use for complex designs
  • Needs estimate of standard deviation
  • Subjective estimate of effect size of clinical interest (signal)
• Resource equation: law of diminishing returns
  • Quick, easy, approximate
  • Good for inexpensive complex non-clinical designs

Power analysis: the variables
• Effect size of scientific interest (signal)
• Variability of the experimental material (noise)
• Chance of a false positive result: significance level (0.05)
• Power of the experiment (80-90%?)
• Sidedness of statistical test (usually 2-sided)
• Sample size

Group size and signal/noise ratio
[Figure: group size (0 to 140) against signal/noise ratio (effect size in Std. Devs., 0 to 3), with curves for 80% and 90% power and regions marked Bad, Neutral and Good]
Assuming 2-sample, 2-sided t-test and 5% significance level

Comparison of two anaesthetics for dogs under clinical conditions (Vet. Anaesthes. Analges.)
Unsexed healthy clinic dogs:
• Weight 3.8 to 42.6 kg
• Systolic BP 141 (SD 36) mm Hg
Assume:
• a 20 mmHg difference between groups is of clinical importance
• a significance level of α=0.05
• a power of 90%
• a 2-sided t-test
Signal/noise ratio (standardised effect size): δ = |μ1−μ2|/σ = 20/36 = 0.56
Required sample size: 68/group

Power and sample size
calculations using nQuery Advisor
A second paper described:
• Male Beagles, weight 17-23 kg
• Mean BP 108 (SD 9) mm Hg
• Want to detect a 20 mmHg difference between groups (as before)
With the same assumptions as previous slide:
Signal/noise ratio = 20/9 = 2.22
Required sample size: 6/group

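The link between signal/noise ratio and group size can be illustrated with a standard normal-approximation formula for a two-sample, two-sided t-test. This is a rough sketch and not the algorithm used by nQuery Advisor: the helper name `n_per_group` and the small-sample correction term (z²/4) are assumptions introduced here.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(signal_to_noise, alpha=0.05, power=0.90):
    """Approximate animals per group for a two-sample, two-sided t-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the required power
    # Normal approximation plus a small correction for using t rather than z.
    n = 2 * (z_alpha + z_beta) ** 2 / signal_to_noise ** 2 + z_alpha ** 2 / 4
    return ceil(n)


clinic_dogs = n_per_group(0.56)  # signal/noise ratio 20/36, rounded as on the slide
beagles = n_per_group(2.22)      # signal/noise ratio 20/9
print(clinic_dogs, beagles)
```

Because the signal/noise ratio is squared in the denominator, a four-fold reduction in standard deviation cuts the required group size roughly sixteen-fold until the small-sample correction starts to dominate.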
Summary for two sources of dogs: aim is to be able to detect a 20 mmHg change in blood pressure

Type of dog | SDev | Signal/noise | Sample size/gp (1) | % Power, n=8 (2)
Random dogs | 36 | 0.56 | 68 | 18
Male beagles | 9 | 2.22 | 6 | 98

(1) Sample size: 90% power
(2) Power, sample size 8/group
Assumes α=5%, 2-sided t-test and effect size 20 mmHg

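The power column can be approximated in the same spirit. The sketch below uses a normal approximation, which is an assumption of this example: at 8 animals per group it runs slightly higher than exact t-based values, so it lands within a few percentage points of the tabulated 18% and 98% rather than reproducing them. The name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(signal_to_noise, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sample, two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect equals the signal.
    shift = signal_to_noise * sqrt(n_per_group / 2)
    # Probability of exceeding the critical value in either tail.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


power_random_dogs = approx_power(0.56, 8)
power_beagles = approx_power(2.22, 8)
print(f"random dogs: {power_random_dogs:.0%}, male beagles: {power_beagles:.0%}")
```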
Inbred strains are more uniform. Does a new drug cause anaemia?
Specify effect size (signal): anaemia if RBC count* reduced by 0.50
Assume 5% significance and 2-sided test
Previous data on outbred CD-1 mice: mean RBC count 9.00, Std. Dev. 0.68 (noise). Signal/noise ratio is 0.5/0.68 = 0.73
Previous data on inbred C57BL/6 mice: mean RBC count 9.60, Std. Dev. 0.25 (noise). Signal/noise ratio is 0.5/0.25 = 2.00
* x 10^12/l

Group size and signal/noise ratio
[Figure: group size (0 to 140) against signal/noise ratio (effect size in Std. Devs., 0 to 3), with curves for 80% and 90% power and the positions of CD-1 mice and C57BL/6 mice marked]
Assuming 2-sample, 2-sided t-test and 5% significance level

Variation in kidney weight in 58 groups of rats
[Figure: variability (0 to 90) by sample number (1 to 57), with groups labelled Mycoplasma, Outbred, F1 and F2]
Gartner, K. (1990), Laboratory Animals, 24:71-77.

Required sample sizes

Factor | Type | Std. Dev. | Signal*/noise | Sample size | Power** (%)
Genetics | F1 hybrid | 13.5 | 0.74 | 30 | 80
Genetics | F2 hybrid | 18.4 | 0.54 | 55 | 53
Genetics | Outbred | 20.1 | 0.49 | 67 | 46
Disease | Mycoplasma free | 18.6 | 0.54 | 55 | 53
Disease | With Mycoplasma | 43.3 | 0.23 | 298 | 14

* Signal is 10 units, two-sided t-test, α=0.05, power = 80%
** Assuming fixed sample size of 30/group

Isogenic strains to control within-strain variation

Isogenic strains (inbred, F1)
• Isogenic (animals identical)
• Homozygous, breed true (not F1)
• Phenotypically uniform
• Defined (quality control)
• Genetically stable
• Extensive background data with genetic profile
• Internationally distributed
Like immortal clones of genetically identical individuals. Several hundred strains available. Most common rat strain F344.

Outbred stocks
• Each individual different
• Do not breed true
• Phenotypically variable
• Not defined (no QC)
• Genetic drift can be rapid
• Validity of background data questionable. No genetic profile
• Not internationally distributed
Stocks with same name will be different due to genetic drift and selection. Most common rat stock is “Sprague-Dawley”.

The randomised block design: another method of controlling noise
[Figure: treatments A, B and C arranged in random order within each of five blocks, B1 to B5]
• Randomisation is within-block
• Can be multiple differences between blocks
  • Heterogeneous age/weight
  • Different shelves/rooms
  • Natural structure (litters)
  • Split experiment in time
Common with in-vitro studies where the “experiment” (block) is repeated on several days. Should be more widely used in animal research.

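Within-block randomisation means that every block receives each treatment once, in an independently shuffled order. A minimal sketch, with the function name `randomise_within_blocks` and the seed chosen here purely for illustration:

```python
import random


def randomise_within_blocks(blocks, treatments, seed=None):
    """Return a random treatment order for each block (one unit per treatment per block)."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)  # every block gets the full set of treatments
        rng.shuffle(order)        # the order is shuffled separately in each block
        layout[block] = order
    return layout


# Hypothetical seed so that the layout can be regenerated.
layout = randomise_within_blocks(["B1", "B2", "B3", "B4", "B5"], ["A", "B", "C"], seed=1)
for block, order in layout.items():
    print(block, " ".join(order))
```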
A randomised block experiment
Apoptosis score, analysed using a 2-way ANOVA without interaction

Week | Control | CGP | STAU
1 | 365 | 398 | 421
2 | 423 | 432 | 459
3 | 308 | 320 | 329

Treatment effect p=0.023

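The two-way ANOVA without interaction can be worked through by hand for this small data set. The sketch below takes the nine apoptosis scores from the slide (weeks as blocks) and uses a closed-form tail probability that is valid only when the treatment term has exactly 2 degrees of freedom, as it does here with three treatments. Function names are illustrative assumptions.

```python
# Rows are blocks (weeks 1 to 3); columns are Control, CGP, STAU.
scores = {1: [365, 398, 421], 2: [423, 432, 459], 3: [308, 320, 329]}


def anova_randomised_block(table):
    """Return (F for treatments, treatment df, error df) for a randomised block design."""
    rows = list(table.values())
    n_blocks, n_treat = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n_blocks * n_treat)
    block_means = [sum(r) / n_treat for r in rows]
    treat_means = [sum(r[j] for r in rows) / n_blocks for j in range(n_treat)]
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_block = n_treat * sum((m - grand) ** 2 for m in block_means)
    ss_treat = n_blocks * sum((m - grand) ** 2 for m in treat_means)
    ss_error = ss_total - ss_block - ss_treat  # what is left after blocks and treatments
    df_treat, df_error = n_treat - 1, (n_blocks - 1) * (n_treat - 1)
    f_ratio = (ss_treat / df_treat) / (ss_error / df_error)
    return f_ratio, df_treat, df_error


def p_value_for_two_numerator_df(f_ratio, df_error):
    """Upper-tail probability of F(2, df_error); this closed form holds only for 2 numerator df."""
    return (1 + 2 * f_ratio / df_error) ** (-df_error / 2)


f_ratio, df_treat, df_error = anova_randomised_block(scores)
p_value = p_value_for_two_numerator_df(f_ratio, df_error)
print(f"F({df_treat},{df_error}) = {f_ratio:.2f}, p = {p_value:.3f}")
```

Removing the week-to-week variation is what makes the treatment effect detectable: the block sum of squares is roughly ten times the treatment sum of squares in these data.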
The Resource Equation method of determining sample size
E = (total number of animals) - (number of groups)
10 < E < 20
[Figure: "The Resource Equation & Sample Size": 5% critical value of Student's t (2.0 to 12.0) against degrees of freedom (0 to 35)]

A factorial design incorrectly analysed as four separate experiments
E = (total number of animals) - (number of groups), 10 < E < 20
As run: 8 mice per group, 8 treatment groups, 64 mice total. E = 64 - 8 = 56
Alternative: 3 mice per group, 8 groups, 24 mice total. E = 24 - 8 = 16
Saving: 40 mice

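The resource equation arithmetic is simple enough to script. The sketch below assumes equal group sizes; the helper names are hypothetical.

```python
def resource_equation_e(total_animals, n_groups):
    """E = (total number of animals) - (number of groups)."""
    return total_animals - n_groups


def group_sizes_in_range(n_groups, low=10, high=20, max_per_group=50):
    """Per-group sizes for which low < E < high, assuming equal group sizes."""
    return [
        n for n in range(2, max_per_group + 1)
        if low < resource_equation_e(n * n_groups, n_groups) < high
    ]


e_as_run = resource_equation_e(64, 8)       # 8 groups of 8 mice
e_alternative = resource_equation_e(24, 8)  # 8 groups of 3 mice
print(e_as_run, e_alternative, group_sizes_in_range(8))
```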
Factorial designs: can increase signal
Factorial design: Treated vs Control, E = 16 - 4 = 12
Single factor design, one variable at a time (OVAT): Treated vs Control, E = 16 - 2 = 14 for each of three separate experiments

Factorial designs
(By using a factorial design) “... an experimental investigation, at the same time as it is made more comprehensive, may also be made more efficient if by more efficient we mean that more knowledge and a higher degree of precision are obtainable by the same number of observations.”
R.A. Fisher, 1960

Factorial designs
• Any number of factors:
  • Drug treatments, prior treatments, sexes, strains
• Any number of levels of each factor
• Can screen many variables for effect on character of interest
• Sub-group size can be quite small

Factorial: what do we mean by group size?
[Figure: Treated (Trt.) vs Control (Ctrl.) comparisons under five designs]
Single factor, inbred strain: 8
2x2 factorial: 8 or 4?
2x4 factorial: 8 or 2?
Randomised block: 8 or 1?
Outbred stock: 8 or ??

Factorial designs for drug interactions

 | Drug A: Control | Drug A: Treated
Drug B: Control | (1) | a
Drug B: Treated | b | ab

Estimates: a, b, axb

Comparison: single outbred stock vs factorial with inbred strains
Number of mice at each dose of chloramphenicol (mg/kg)

Type | Strain | 0 | 500 | 1000 | 1500 | 2000 | 2500
Outbred | CD-1 | 8 | 8 | 8 | 8 | 8 | 8
Inbred | CBA | 2 | 2 | 2 | 2 | 2 | 2
Inbred | C3H | 2 | 2 | 2 | 2 | 2 | 2
Inbred | BALB/c | 2 | 2 | 2 | 2 | 2 | 2
Inbred | C57BL | 2 | 2 | 2 | 2 | 2 | 2

Festing, M.F.W., et al. (2001) Strain differences in haematological response to chloramphenicol succinate in mice: implications for toxicological research. Food and Chemical Toxicology, 39, 375-383.

Red blood cell counts

Four inbred strains
Strain | Control | 1500mg/kg
CBA | 10.57 | 8.33
CBA | 9.88 | 8.51
C3H | 8.49 | 7.40
C3H | 7.87 | 7.51
BALB/c | 10.10 | 8.95
BALB/c | 10.08 | 9.29
C57BL | 9.60 | 9.81
C57BL | 9.56 | 9.83

One outbred stock
Strain | Control | 1500mg/kg
CD-1 | 9.10 | 8.90
CD-1 | 10.27 | 8.26
CD-1 | 9.01 | 7.45
CD-1 | 7.76 | 8.50
CD-1 | 8.42 | 8.71
CD-1 | 8.83 | 7.79
CD-1 | 10.01 | 8.67
CD-1 | 8.65 | 8.19

Red blood cell counts following chloramphenicol at 1500mg/kg

One outbred stock
Strain | N | 0 | 1500 | Signal (difference) | Noise (SD) | Signal/noise | p
CD-1 | 16 | 9.01 | 8.31 | 0.70 | 0.68 | 1.03 | 0.058

Four inbred strains
Strain | N | 0 | 1500 | Signal (difference) | Noise (SD) | Signal/noise | p
BALB/c | 4 | 10.09 | 9.12 | 0.97 | 0.25 | 3.88 |
C3H | 4 | 8.18 | 7.46 | 0.72 | 0.25 | 2.88 |
C57BL | 4 | 9.58 | 9.82 | (0.24) | 0.25 | (0.96) |
CBA | 4 | 10.23 | 8.42 | 1.81 | 0.25 | 7.24 |
Mean | 16 | 9.51 | 8.70 | 0.81 | 0.25 | 3.24 | <0.001
Dose * strain | | | | | | | <0.001

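The CD-1 row can be checked from the individual red blood cell counts listed for the outbred stock. The sketch below assumes the 16 CD-1 values are correctly read as eight control and eight treated animals, and uses the pooled within-group standard deviation as the noise, which is an assumption of this example; it gives a value close to, though not exactly, the 0.68 on the slide.

```python
from math import sqrt
from statistics import mean, variance

# CD-1 red blood cell counts as read from the raw data table (assumed column mapping).
control = [9.10, 10.27, 9.01, 7.76, 8.42, 8.83, 10.01, 8.65]
treated = [8.90, 8.26, 7.45, 8.50, 8.71, 7.79, 8.67, 8.19]

# Signal: difference between group means.
signal = mean(control) - mean(treated)

# Noise: pooled within-group standard deviation.
df = len(control) + len(treated) - 2
pooled_variance = (
    (len(control) - 1) * variance(control) + (len(treated) - 1) * variance(treated)
) / df
noise = sqrt(pooled_variance)

signal_to_noise = signal / noise
# Two-sample t statistic on `df` degrees of freedom.
t_statistic = signal / (noise * sqrt(1 / len(control) + 1 / len(treated)))

print(f"signal={signal:.2f} noise={noise:.2f} S/N={signal_to_noise:.2f} t={t_statistic:.2f}")
```

With a signal/noise ratio near 1 and only 8 animals per group, the t statistic sits just short of the 5% critical value, consistent with the p=0.058 reported for the outbred stock.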
Example of a factorial compared with a single factor design: WBC

Four inbred strains
Strain | Control | Treated
CBA | 1.90 | 0.40
CBA | 2.60 | 0.20
C3H | 2.10 | 0.40
C3H | 2.20 | 0.40
BALB/c | 1.60 | 1.30
BALB/c | 0.50 | 1.40
C57BL | 2.30 | 0.80
C57BL | 2.20 | 1.10

One outbred stock
Strain | Control | Treated
CD-1 | 3.00 | 1.90
CD-1 | 1.70 | 1.90
CD-1 | 1.50 | 3.50
CD-1 | 2.00 | 1.20
CD-1 | 3.80 | 2.30
CD-1 | 0.90 | 1.00
CD-1 | 2.60 | 1.30
CD-1 | 2.30 | 1.60

White blood cell counts following chloramphenicol at 2500mg/kg

One outbred stock
Strain | N | 0 | 2500 | Signal (difference) | Noise (SD) | Signal/noise | p
CD-1 | 16 | 2.23 | 1.83 | 0.40 | 0.86 | 0.47 | 0.38

Four inbred strains
Strain | N | 0 | 2500 | Signal (difference) | Noise (SD) | Signal/noise | p
CBA | 4 | 2.25 | 0.30 | 1.95 | 0.34 | 5.73 |
C3H | 4 | 2.15 | 0.40 | 1.85 | 0.34 | 5.44 |
BALB/c | 4 | 1.05 | 1.35 | -0.30 | 0.34 | (0.88) |
C57BL | 4 | 2.25 | 0.95 | 1.30 | 0.34 | 3.82 |
Mean | 16 | 1.93 | 1.20 | 0.73 | 0.34 | 2.15 | <0.001
Dose * strain | | | | | | | <0.001

A factorial randomised block experiment to detect the effect of BHA on liver EROD activity
[Figure: layout with strains A/J, 129/Ola, NIH and BALB/c, each with one Treated and one Control mouse, in Block 1 and in Block 2]
The two blocks were separated by approximately 3 months
Festing MF (2003) Principles: the Need for Better Experimental Design. Trends Pharmacol Sci 24: pp 341-345.

A real experiment to detect the effect of BHA on liver EROD activity

Strain | Block 1 Treated | Block 1 Control | Block 2 Treated | Block 2 Control
A/J | 18.7 | 7.7 | 16.7 | 6.4
129/Ola | 17.9 | 8.4 | 14.4 | 6.7
NIH | 19.2 | 9.8 | 12.0 | 8.1
BALB/c | 26.3 | 9.7 | 19.8 | 6.0

Block 1 mean 14.7; Block 2 mean 11.3 (diff 3.4)
The two blocks were separated by approximately 3 months

Effects of BHA on liver EROD activity in four mouse strains (a 2x4 factorial randomised block experiment)
[Figure: EROD activity (0 to 25) for Control and BHA-treated mice of strains A/J, 129/Ola, NIH and BALB/c]
Treatment p<0.001
Strain p=0.05
Strain x Treatment, p=0.03
Std. Dev. 1.6
2 mice per mean (16 total), done as a randomised block design.

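The residual standard deviation and the strain-specific responses can be recomputed from the 16 EROD values on the preceding data slide. The sketch below assumes a particular reading of that slide (which value belongs to which strain, treatment and block); that mapping is an assumption, although it does reproduce the block means of 14.7 and 11.3 and a standard deviation of about 1.6.

```python
from math import sqrt

# (strain, treatment) -> (block 1 value, block 2 value); the cell mapping is assumed.
erod = {
    ("A/J", "Treated"): (18.7, 16.7), ("A/J", "Control"): (7.7, 6.4),
    ("129/Ola", "Treated"): (17.9, 14.4), ("129/Ola", "Control"): (8.4, 6.7),
    ("NIH", "Treated"): (19.2, 12.0), ("NIH", "Control"): (9.8, 8.1),
    ("BALB/c", "Treated"): (26.3, 19.8), ("BALB/c", "Control"): (9.7, 6.0),
}
n_blocks = 2
n_cells = len(erod)

grand_mean = sum(sum(pair) for pair in erod.values()) / (n_cells * n_blocks)
block_means = [sum(pair[k] for pair in erod.values()) / n_cells for k in range(n_blocks)]

# Residual: observation minus its cell mean and block mean, plus the grand mean.
ss_residual = 0.0
for pair in erod.values():
    cell_mean = sum(pair) / n_blocks
    for k, value in enumerate(pair):
        ss_residual += (value - cell_mean - block_means[k] + grand_mean) ** 2

df_residual = (n_cells - 1) * (n_blocks - 1)
residual_sd = sqrt(ss_residual / df_residual)

# Treatment response (treated minus control) within each strain, averaged over blocks.
responses = {
    strain: sum(erod[(strain, "Treated")]) / n_blocks - sum(erod[(strain, "Control")]) / n_blocks
    for strain in ("A/J", "129/Ola", "NIH", "BALB/c")
}

print(f"block means: {block_means[0]:.1f}, {block_means[1]:.1f}; residual SD: {residual_sd:.2f}")
for strain, response in responses.items():
    print(f"{strain}: response {response:.2f}")
```

The unequal responses across strains are what the Strain x Treatment term picks up, and they can only be seen because more than one strain was included.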
A well designed experiment
• Absence of bias
  • Correct experimental unit, randomisation, blinding
• High power
  • Low noise (uniform material, blocking, covariance)
  • High signal (sensitive subjects, high dose)
  • Large sample size
• Wide range of applicability
  • Replicate over other factors (e.g. sex, strain): factorial designs
• Simplicity
• Amenable to a statistical analysis

Conclusions
• Scope for improvement
  • Experiments often poorly designed
  • Many scientists have little training in experimental design and statistics
• Common errors:
  • Failure to identify experimental unit
  • Failure to randomise and use blinding
  • Lack of knowledge of sample size determination
  • Poor understanding of effects of variation
  • Failure to use/understand randomised block designs
  • Failure to understand factorial designs
• Greater investment in training would save animals, money and time

Festing MF (2003) Principles: the Need for Better Experimental Design. Trends Pharmacol Sci 24: pp 341-345.

An animal room as the experimental unit
Does the presence of rats affect breeding performance of mice?
[Figure: pups born per litter (5 to 15), with and without rats, for strains/stocks BALB/c, B6/JN, B6/N, CD1, CF1, CFW, DBA/2 and FVB. N=33]

An animal room for a period of time: repeated measures, within-subject, crossover or randomised block design
[Figure: animal rooms 1 to 4, each observed with rats and without rats in different periods. N=16]

Some factors (e.g. strain, sex) cannot be randomised, so special care is needed to ensure comparability
"CBA mice showed greater variability in body weights than TO mice..."
[Figure: outbred TO (8-12 weeks, commercial) compared with inbred CBA (12-16 weeks, home bred). Six cages of 7-9 mice of each strain: error bars are SEMs]

Body weight of mice housed 1, 2, 4 or 8 per cage
Chvedoff et al (1980) Arch.Toxicol. Suppl 4:435
[Figure: weight distributions (axis 35 to 55) by mice/cage: 8, SD=2.9; 4, SD=3.2; 2, SD=3.9; 1, SD=5.8]

The consequences of variability
Specification:
• Assume a treated and a control group
• Effect size to be detected of 5g (the signal) or more
• A 90% power
• A 5% significance level and a 2-sided t-test

Number/cage | Mean | SD | Signal/noise | Estimated group size
1 | 46.0 | 5.8 | 0.86 | 30
2 | 44.7 | 3.9 | 1.28 | 14
4 | 42.6 | 3.2 | 1.56 | 10
8 | 42.2 | 2.9 | 1.72 | 9

Chloramphenicol toxicity in mice: outbred CD1, 8 mice per level
[Figure: difference from control in Std. Devs. (signal/noise ratio, 0 to 8) against dose (0 to 2500 mg/kg) for HCT, HGB, LYMPH, NEUT, PLT, RBC, RETICS and WBC, with the effect size detectable with 90% power and 5% significance level, 2-sided, marked]
Re-drawn from Festing et al (2001) Fd. Chem.Tox. 39:375

Chloramphenicol toxicity in mice: 4 strains, 8 mice per level
[Figure: difference from control in Std. Devs. (signal/noise ratio, 0 to 9) against dose (0 to 2500 mg/kg) for HCT, HGB, LYMPH, NEUT, PLT, RBC, RETICS and WBC, with the effect size detectable with 90% power and 5% significance level, 2-sided, marked]

Mistakes in this experiment:
1. The cage is the experimental unit so there are 36, not 288 experimental units
2. The authors looked at the results before deciding the statistical analysis
3. They should have done a pilot study and then eliminated two of the treatments
4. A t-test is not the correct method of analysis
(Aim: to detect strain differences in diurnal pattern of blood alcohol. Single cage of 8 mice killed at each time point, 288 mice in total.)
