An Investigation of Confidence Intervals for Binary
Prediction Performance with Small Sample Size
Xiaoyu Jiang, Steve Lewitzky, Martin Schumacher
BioMarker Development
Novartis Institutes for BioMedical Research
Ascona Workshop 2013
Statistical Genomics and Data Integration for Personalized Medicine
BioMarker Development at Novartis
 Identify, develop, and deliver biomarkers for customized therapy (General Medicine)
 Biomarker uses
• Predict patient outcome
• Monitor patient progress; Identify disease subtypes; Understand drug mechanism
 Patient outcomes of interest
• Drug response and patient stratification
• Risk of adverse drug reaction (ADR)
• Optimal dosing
 Biomarker types
• Genetic (DNA)
• RNA (mRNA, miRNA)
• Protein
• Metabolite
• Imaging
• Others (e.g., clinical chemistry)
 Statistics methodologies
• Statistical predictive modeling
• Resampling
Overview of Workflow
Binary predictive performance estimation (when independent
sample is unavailable)
Resampling:
• Repeatedly divide dataset into training set and test set
• Feature selection on training set
• Train classifier on training set
• Predict test samples
• Evaluate predictions
3 | Ascona Workshop 2013 | Jiang X, Lewitzky S, Schumacher M | May 14, 2013
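The resampling workflow above can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: the function name `resampling_estimate`, its parameters, and the four plug-in callables are hypothetical. The point it shows is that feature selection runs inside the loop, on the training set only.

```python
import random
from statistics import mean

def resampling_estimate(X, y, select_features, fit, predict, score,
                        n_repeats=20, test_fraction=0.2, seed=0):
    """Mean test-set score over repeated stratified train/test splits.

    X is a list of feature rows and y a list of 0/1 labels. The four
    callables are hypothetical plug-ins:
      select_features(X, y) -> list of column indices
      fit(X, y)             -> model
      predict(model, X)     -> list of predictions
      score(y_true, y_pred) -> float
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        # Stratified split: hold out the same fraction of each class.
        test_idx = []
        for label in (0, 1):
            members = [i for i, v in enumerate(y) if v == label]
            rng.shuffle(members)
            n_test = max(1, round(test_fraction * len(members)))
            test_idx.extend(members[:n_test])
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]

        # Feature selection sees the training set only, so the test
        # samples never influence which markers enter the model.
        cols = select_features(X_train, y_train)
        X_train_sel = [[row[c] for c in cols] for row in X_train]
        X_test_sel = [[X[i][c] for c in cols] for i in test_idx]

        model = fit(X_train_sel, y_train)
        y_pred = predict(model, X_test_sel)
        scores.append(score([y[i] for i in test_idx], y_pred))
    return mean(scores)
```

Selecting features on the full dataset before splitting would leak test information into the model and make the estimate optimistic, which is why the selection step sits inside the loop here.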
Immediate Issues for Estimating Predictive Performance
In the Context of Small PoC Trials for Biomarker Discovery
 Is there a robust method for generating point estimate(s)
for predictive performance?
• Many methods (leave-one-out cross-validation, repeated k-fold
cross-validation, bootstrapping)
- Molinaro et al. (2005); Varma and Simon (2006); Jiang and Simon (2007)
• Mostly for misclassification rate
 How to evaluate the variability of such point estimate(s)
(confidence intervals)?
• Fewer methods
• Only for misclassification rate
Binary Predictive Model Performance Evaluation
Moving Parts
Factors that drive predictive performance:
• Classifier: Linear Discriminant Analysis
• Feature selection: univariate Welch t-test
• Sample size
• Ratio of class sizes
• # of truly predictive markers
• Effect size (ES)
• Resampling methods
• # of candidate markers
Objectives of simulation study:
• To understand the behavior of the confidence interval estimation method
• Identify a resampling method for point estimates in our setting
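As an illustration of the feature selection step, a univariate Welch t-test ranking can be sketched as follows. The helper names `welch_t` and `top_markers` are hypothetical, and ranking by the absolute t-statistic (rather than by p-value) is an assumption made to keep the sketch short.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances.

    Each sample needs at least two values, and at least one of them must
    have non-zero variance.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def top_markers(X, y, k=2):
    """Indices of the k markers with the largest |Welch t| between classes.

    X is a list of rows (one per subject), y a list of 0/1 class labels.
    """
    def abs_t(col):
        ones = [row[col] for row, label in zip(X, y) if label == 1]
        zeros = [row[col] for row, label in zip(X, y) if label == 0]
        return abs(welch_t(ones, zeros))
    return sorted(range(len(X[0])), key=abs_t, reverse=True)[:k]
```

The selected columns would then be passed to the classifier, which is trained on the training set only.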
Resampling Methods for Point Estimate
Repeated Stratified k-fold Cross-Validation
[Figure: samples 1-10 are randomly permuted and split into 5 folds (5-fold CV); each fold in turn serves as the test set]
• Leave-One-Out Cross-Validation (LOOCV): k = n.
• The LOOCV estimate is nearly unbiased but has large variance.
• Publications have shown that repeated 5-fold and 10-fold CV produce good point estimates of misclassification error rate.
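A minimal sketch of how repeated stratified k-fold splits might be generated is given below. The generator name `repeated_stratified_kfold` and its defaults are assumptions for illustration. Each class is shuffled and dealt round-robin into the folds so that every fold roughly preserves the class ratio.

```python
import random

def repeated_stratified_kfold(y, k=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated stratified k-fold CV.

    y is a list of class labels. Every repeat reshuffles the samples, so
    the fold assignments differ between repeats.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(k)]
        for label in sorted(set(y)):
            members = [i for i, v in enumerate(y) if v == label]
            rng.shuffle(members)
            # Deal this class round-robin so each fold keeps the class ratio.
            for pos, idx in enumerate(members):
                folds[pos % k].append(idx)
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            yield train_idx, sorted(test_idx)
```

For example, `list(repeated_stratified_kfold([0] * 5 + [1] * 5, k=5, n_repeats=2))` yields 10 splits, each with one test sample per class. LOOCV is not covered by this sketch, because a single held-out sample cannot be stratified.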
Resampling Methods for Point Estimate
Stratified bootstrapping with 0.632 correction
[Figure: samples 1-10 are sampled with replacement; the drawn samples form the training set and the samples never drawn form the test set]
• $P(\text{observing a subject in bootstrap sample}) = \lim_{n \to \infty} \left(1 - \left(1 - \frac{1}{n}\right)^{n}\right) = 1 - e^{-1} \approx 0.632$
• 0.632 correction: $\theta_{0.632} = 0.632 \cdot \theta_{\mathrm{Bootstrap}} + 0.368 \cdot \theta_{\mathrm{Resubstitution}}$
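The two formulas on this slide can be checked numerically with a short sketch. The function names are hypothetical. The first function evaluates the inclusion probability for finite n, and the second applies the fixed 0.632/0.368 weighting to a bootstrap (out-of-sample) estimate and a resubstitution estimate.

```python
from math import exp

def inclusion_probability(n):
    """Probability that a given subject appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def estimate_632(bootstrap_estimate, resubstitution_estimate):
    """Fixed-weight 0.632 combination of the two estimates."""
    return 0.632 * bootstrap_estimate + 0.368 * resubstitution_estimate

# The finite-n probability approaches the limit 1 - exp(-1) as n grows.
for n in (10, 40, 1000):
    print(n, round(inclusion_probability(n), 4))
print("limit", round(1 - exp(-1), 4))
```

The bootstrap estimate alone is pessimistic because each model is trained on only about 63.2% of the distinct subjects, while the resubstitution estimate is optimistic. The weighting combines the two.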
Resampling Method for Confidence Intervals
Bootstrap Case Cross-Validation with Bias Correction (BCCV-BC; Jiang, Varma and Simon 2007)
Need double resampling to estimate the sampling distribution of θ
• Draw B bootstrap samples from the data
• Run LOOCV on each bootstrap sample to obtain $\theta_1, \ldots, \theta_B$
• 100(1-2α)% empirical confidence interval: $(\theta_{\alpha}, \theta_{1-\alpha})$
• Bias correction: $\theta_{\mathrm{LOOCV}} - \theta_{\mathrm{BCCV}}$
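The double resampling can be sketched as below. This is a rough illustration under loudly stated assumptions, not the published procedure: the names `bccv_bc_interval` and `make_loco_accuracy` are hypothetical, the bias correction is assumed to be an additive shift of both limits by θ_LOOCV minus the mean of the bootstrap estimates, and the toy inner estimator (a nearest-class-mean rule on one marker) stands in for the real feature selection and classifier. The inner cross-validation removes every copy of the held-out case from the training data, which is an assumption about how overlap between training and test is avoided.

```python
import random
from statistics import mean

def bccv_bc_interval(n_cases, cv_estimate, n_boot=200, alpha=0.05, seed=0):
    """Bias-corrected percentile interval from bootstrap case cross-validation.

    cv_estimate(indices) is a hypothetical callable that returns a
    cross-validated performance estimate for the cases listed in `indices`
    (original case numbers, possibly repeated).
    """
    rng = random.Random(seed)
    theta_loocv = cv_estimate(list(range(n_cases)))

    # Outer resampling: bootstrap whole cases. Inner resampling: CV.
    thetas = sorted(
        cv_estimate([rng.randrange(n_cases) for _ in range(n_cases)])
        for _ in range(n_boot)
    )

    # Empirical percentiles (theta_alpha, theta_{1-alpha}) of theta_1..theta_B.
    k = int(alpha * n_boot)
    lower, upper = thetas[k], thetas[n_boot - 1 - k]

    # Assumed form of the correction: shift by theta_LOOCV - mean(theta_b).
    shift = theta_loocv - mean(thetas)
    return lower + shift, upper + shift


def make_loco_accuracy(x, y):
    """Toy estimator: leave-one-case-out accuracy of a nearest-class-mean
    rule on a single marker x with 0/1 labels y."""
    def estimate(indices):
        hits = []
        for case in set(indices):
            # Remove every copy of the held-out case from the training set.
            train = [i for i in indices if i != case]
            x1 = [x[i] for i in train if y[i] == 1]
            x0 = [x[i] for i in train if y[i] == 0]
            if not x1 or not x0:
                continue  # a class is missing from training; skip this split
            pred = int(abs(x[case] - mean(x1)) < abs(x[case] - mean(x0)))
            hits.extend([int(pred == y[case])] * indices.count(case))
        return mean(hits)
    return estimate
```

A call such as `bccv_bc_interval(len(x), make_loco_accuracy(x, y))` returns the shifted lower and upper limits. The shifted limits are not clipped, so they can fall outside [0, 1].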
Simulation Study
Predictive Performance Metrics for Binary Prediction
 Predictive performance measures
• Positive Predictive Value (PPV), Negative Predictive Value (NPV),
Sensitivity, Specificity
• Overall measure: Accuracy, Area Under ROC Curve (AUC)
 Evaluation of point estimate
• Bias, variance, RMSE
 Evaluation of confidence interval
• Empirical coverage probability
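For reference, the listed performance measures can be computed as in the following sketch. The function names are hypothetical. AUC is computed here as the proportion of (class 1, class 0) pairs in which the class 1 subject has the higher score, with ties counted as one half.

```python
def _ratio(num, den):
    """num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }

def auc(y_true, scores):
    """Area under the ROC curve from continuous scores (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return _ratio(wins, len(pos) * len(neg))
```

With small test sets, PPV or NPV can be undefined when no sample is predicted positive or negative. The sketch returns NaN in that case rather than raising an error.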
Simulation Study
Simulating Continuous Biomarker Data
 𝑛 samples from class 1 and class 0 with ratio 𝑟
• 𝑛 = 20, 30, 40 …
• 𝑟 = 1: 1, 1: 2, 1: 3 …
 𝑝 = 1000 continuous biomarker variables follow a multivariate normal (MVN) distribution: $\mathbf{x}_{p \times 1} \sim \mathrm{MVN}(\boldsymbol{\mu}_{p \times 1}, \Sigma_{p \times p})$
 10 predictive markers with mean of 1 for class 1 subjects, mean 0 for class 0
subjects; other markers have mean 0 for all subjects.
 Include the top 2 markers from feature selection in the model.
 Varying diagonal elements in Σ to change effect size for predictive markers.
 Predictive markers are correlated; the others have little to no correlation.
• Correlation coefficient among predictive markers is 0.4.
 500 simulations
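One way to generate data of this kind is sketched below. Several choices are assumptions, not taken from the slides: the function name `simulate_markers`, the use of a shared random factor to induce the 0.4 correlation, independent unit-variance noise markers, and defining the effect size as the mean difference (fixed at 1) divided by the marker standard deviation, so that the standard deviation is 1/ES.

```python
import random
from math import sqrt

def simulate_markers(n, ratio=(1, 1), p=1000, n_pred=10,
                     effect_size=1.4, rho=0.4, seed=0):
    """Simulate n subjects with p continuous markers.

    ratio is (class 1, class 0). The first n_pred markers are predictive:
    mean 1 in class 1, mean 0 in class 0, pairwise correlation rho.
    Returns (X, labels) with X as a list of rows.
    """
    rng = random.Random(seed)
    n1 = round(n * ratio[0] / sum(ratio))
    labels = [1] * n1 + [0] * (n - n1)
    sd = 1.0 / effect_size  # assumption: ES = mean difference / SD
    X = []
    for label in labels:
        # A factor shared by the predictive markers gives them an
        # equicorrelated structure with correlation rho.
        shared = rng.gauss(0, 1)
        row = [label + sd * (sqrt(rho) * shared + sqrt(1 - rho) * rng.gauss(0, 1))
               for _ in range(n_pred)]
        # Assumption: the remaining markers are independent N(0, 1) noise.
        row += [rng.gauss(0, 1) for _ in range(p - n_pred)]
        X.append(row)
    return X, labels
```

For example, `simulate_markers(40, ratio=(1, 3))` returns 10 class 1 subjects and 30 class 0 subjects, each with 1000 markers.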
Simulation Study
Calculating the True Value of Predictive Performance Metrics
 What is the true AUC of the predictive model built upon a dataset of
sample size n=40 with class 1 and class 0 ratio of 1:1?
 To assess bias of the resampling-based estimates
• Generate an independent dataset D_large of large size (n=10000) from the same simulation scheme
• Train the predictive model on the dataset of n=40
• Predict the class labels of D_large
• Evaluate against the true class labels of D_large
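The idea of scoring a model trained on a small dataset against a large independent dataset can be illustrated with a deliberately simplified sketch. Everything here is an assumption for illustration: a single marker instead of 1000, a midpoint cutoff instead of feature selection plus LDA, accuracy instead of AUC, and all function names.

```python
import random
from statistics import mean

def simulate(n, rng, effect_size=1.4):
    """Hypothetical one-marker scheme with class ratio 1:1.

    Class 1 ~ N(1, sd) and class 0 ~ N(0, sd), with sd = 1 / effect_size.
    """
    sd = 1.0 / effect_size
    y = [i % 2 for i in range(n)]
    x = [rng.gauss(label, sd) for label in y]
    return x, y

def fit_cutoff(x, y):
    """Stand-in model: the midpoint between the two class means."""
    m1 = mean(v for v, label in zip(x, y) if label == 1)
    m0 = mean(v for v, label in zip(x, y) if label == 0)
    return (m0 + m1) / 2

def accuracy(cutoff, x, y):
    """Fraction of subjects classified correctly by 'x > cutoff => class 1'."""
    return mean(int((v > cutoff) == (label == 1)) for v, label in zip(x, y))

def true_value(x_small, y_small, rng, n_large=10000):
    cutoff = fit_cutoff(x_small, y_small)      # train on the small dataset
    x_large, y_large = simulate(n_large, rng)  # independent D_large, same scheme
    return accuracy(cutoff, x_large, y_large)  # evaluate against the true labels

rng = random.Random(2013)
x_small, y_small = simulate(40, rng)
print(true_value(x_small, y_small, rng))
```

Because D_large is large and independent of the training data, the resulting value has little sampling noise and can serve as the reference against which a resampling-based estimate from the same n=40 dataset is compared.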
Simulation Results
Point Estimate
[Figure: RMSE of the AUC point estimate for LOOCV, 5-fold CV, and B632, in three panels: varying sample size (ratio 1:1, ES=1.4), varying ratio (n=80, ES=1.4), and varying effect size (n=80, ratio 1:1)]
• Three methods perform similarly based on point estimate bias.
• 0.632 bootstrapping has the smallest variance and hence the smallest RMSE.
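The link between bias, variance, and RMSE used in this comparison can be made concrete with a short sketch. The function name and the use of a single true value for all simulations are assumptions. With a fixed true value, RMSE² = bias² + variance, so methods with similar bias are ranked by their variance.

```python
from math import sqrt
from statistics import mean, pvariance

def summarize_point_estimates(estimates, true_value):
    """Bias, variance, and RMSE of simulated estimates around a true value."""
    errors = [est - true_value for est in estimates]
    bias = mean(errors)
    variance = pvariance(estimates)  # population variance across simulations
    rmse = sqrt(mean(err ** 2 for err in errors))
    return {"bias": bias, "variance": variance, "rmse": rmse}
```

Here `estimates` would hold one resampling-based estimate per simulated dataset, for example 500 AUC estimates from one method.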
Simulation Results
Empirical Coverage Probability with Varying Sample size
[Figure: empirical coverage probability for n = 20, 40, 60, 80, 120; ratio = 1:1, ES = 1.4]
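Empirical coverage probability is the fraction of simulated confidence intervals that contain the true value. A minimal sketch, with a hypothetical function name and one true value per simulation:

```python
def empirical_coverage(intervals, true_values):
    """Fraction of (lower, upper) intervals that contain their true value."""
    hits = sum(1 for (lower, upper), truth in zip(intervals, true_values)
               if lower <= truth <= upper)
    return hits / len(intervals)
```

A result above the nominal level indicates over-coverage (intervals that are too wide), and a result below it indicates under-coverage.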
Simulation Results
Empirical Coverage Probability with Varying Ratio
[Figure: empirical coverage probability for C1:C0 = 1:1, 1:2, 1:3, 1:4, 1:5; n = 80, ES = 1.4]
Simulation Results
Empirical Coverage Probability with Varying Effect Size
[Figure: empirical coverage probability for ES = 0.8, 1, 1.4, 2; n = 80, ratio = 1:1]
Summary and Discussion
 0.632 corrected bootstrapping has the lowest RMSE for point estimation.
 As expected, confidence interval coverage is best for larger sample sizes and a class size ratio of 1:1.
• Coverage appears reasonable for sample size >= 60.
 Small effect size leads to over-coverage; large effect size leads to under-coverage.
 Need to better understand
• Different behavior of AUC relative to other metrics
• Resampling method BCCV-BC
 Challenges and opportunities for new statistical methods
• Robust way of integrating different marker modalities
• Explore new methods for confidence interval estimation