EVALUATING THE CLINICAL UTILITY OF PREDICTION MODELS IN A HETEROGENEOUS
MULTICENTER POPULATION USING DECISION-ANALYTIC MEASURES: THE RANDOM-EFFECT WEIGHTED NET BENEFIT
Laure Wynants, MSc1, Dirk Timmerman, PhD, MD2, Sabine Van Huffel, PhD1 and Ben Van Calster,
PhD2, (1)KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical
Systems, Signal Processing and Data Analytics, Leuven, Belgium, (2)KU Leuven Department of
Development and Regeneration, Leuven, Belgium
Purpose: To investigate methods to evaluate the clinical utility of a prediction model in
heterogeneous multicenter datasets based on decision-analytic measures.
Method: We focus on the Net Benefit (NB) statistic from Vickers and Elkin (Med Decis Making
2006). NB is defined as (TP-w*FP)/N, with TP the number of true positives, FP the number of false
positives, and w the ‘harm-to-benefit ratio’ of treating a false positive versus a true positive. This
ratio equals the odds of the risk threshold t used to classify patients as positive or negative. A
model’s NB can be compared to the default strategies of classifying all as positive (treat all) or
negative (treat none). We averaged center-specific NBs for specific values of t using random-effect
weights 1/(se² + τ²), with se² the within-center and τ² the between-center variance of NB. We also
calculated center-specific Relative Utilities, i.e. normalized differences between the model’s NB and
the best default strategy. These were also averaged using random-effect weights. We present a case
study in which a prediction model (LR2) for malignancy of ovarian tumors is evaluated in a dataset of
5914 women recruited at 13 oncology referral centers and 11 non-oncology centers. We computed
separate weighted averages of NB for oncology and non-oncology centers, thereby re-estimating τ²
in each sub-population.
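The quantities defined above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the function names and toy inputs are hypothetical, and τ² is taken as given here (the study estimated it from the data):

```python
import numpy as np

def net_benefit(y_true, risk, t):
    """Net Benefit at risk threshold t: (TP - w*FP)/N, with w = t/(1-t)."""
    n = len(y_true)
    pos = risk >= t                      # patients classified as positive
    tp = np.sum(pos & (y_true == 1))     # true positives
    fp = np.sum(pos & (y_true == 0))     # false positives
    w = t / (1 - t)                      # harm-to-benefit ratio (odds of t)
    return (tp - w * fp) / n

def nb_treat_all(y_true, t):
    """NB of the 'treat all' default strategy; 'treat none' has NB = 0."""
    prev = np.mean(y_true)
    w = t / (1 - t)
    return prev - w * (1 - prev)

def relative_utility(nb_model, nb_all, prevalence):
    """Normalized difference between the model's NB and the best default."""
    best_default = max(nb_all, 0.0)      # treat none gives NB = 0
    return (nb_model - best_default) / (prevalence - best_default)

def weighted_nb(nbs, se2, tau2):
    """Random-effect weighted average of center-specific NBs."""
    w = 1.0 / (se2 + tau2)               # weights 1/(se^2 + tau^2)
    return np.sum(w * nbs) / np.sum(w)
```

Note that when τ² is large relative to the within-center variances, the weights become nearly equal across centers, so large centers lose the dominance they have in a pooled analysis.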
Result: There was considerable heterogeneity in NB: 97.4% (at t=0.1, 95% CI 96.9% to 97.9%) of the
variance in NB was due to differences between centers. The NB corresponding to t=0.1 was 0.339
(95% prediction interval 0.063-0.616) for oncology centers and 0.111 (0.026-0.196) for non-oncology
centers (see figure). LR2 was always better than the best default strategy in the average non-oncology center. However, in the average oncology center, the NB of LR2 was lower than the NB of
treating all patients when t ≤0.1. Risks of malignancy were underestimated in a number of oncology
centers. Refitting LR2 to resolve calibration issues improved the clinical utility.
Conclusion: We conclude that NB can be highly heterogeneous in multicenter studies. NB may
increase because of increased prevalence of malignancy, and decrease due to insufficient calibration
or reduced classification performance of the model in specific centers. This heterogeneity should be
recognized and explored using appropriate techniques.
Random-Effect Weighted Net Benefit Facilitates the Evaluation of Prediction Models
in a Heterogeneous Multicenter Population
Laure Wynants, MSc(1,2), Dirk Timmerman, PhD, MD(3,4), Sabine Van Huffel, PhD(1,2) and Ben Van Calster, PhD(3,5)
(1) KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium, (2) KU Leuven iMinds Medical IT Department, Leuven, Belgium, (3) KU Leuven Department of Development and Regeneration, Leuven, Belgium, (4) Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium, (5) Center for Medical Decision Sciences, Department of Public Health, Erasmus Medical Center, Rotterdam, The Netherlands.
Decision-analytic measures are used for the evaluation of clinical prediction models

[Figure: model life cycle — collect patient data on an event and predictors; develop a mathematical model to predict the event; evaluate the model in new patients; use the model in clinical practice; update the model if needed.]

o Model evaluation is traditionally done in terms of:
• Correctness of classification of events and non-events (C-index, sensitivity, specificity, …).
• Accuracy of predicted risks of experiencing the outcome (calibration).
o Decision-analytic measures quantify the clinical utility of the model, e.g. Net Benefit (NB).
o Net Benefit = (number of true positives − w × number of false positives) / total number of observations.
• w is the ‘harm-to-benefit ratio’ of treating a false positive versus a true positive.
• w = t/(1−t), where t is the risk threshold used to classify patients as positive (events) or negative (non-events).
• A model can be compared to the default strategies of classifying all as positive (treat all) or negative (treat none).
• The higher the NB, the more clinical utility.

Random-effect weights can be used to combine center-specific NBs in a multicenter dataset

o Multicenter studies: to enhance representativeness of data or reduce recruitment times.
o Models may have more clinical utility in one center than another.
o Random-effect weights 1/(se² + τ²) to combine NBs, with se² the within-center and τ² the between-center variance of NB.
o Random-effect weighting assumes a distribution of NBs instead of one “true” NB.
o We compared the NB computed from a multicenter dataset (pooled analysis) to the weighted average of center-specific NBs using random-effect weighting.
o We computed se² using the regular bootstrap.
o Proc Mixed (SAS) was used to compute the weighted NB.

[Figure: weights per center under τ² = 0 versus τ² > 0; bubble size represents the number of observations per center.]

Extension: Random-effect weighted Relative Utility

o Relative Utility = (NB(model) − max[NB(treat all), NB(treat none)]) / (outcome prevalence − max[NB(treat all), NB(treat none)]).
o An alternative measure of clinical utility.
o Easier to compare models at low cut-offs.
o If RU < 0, the model is “harmful”, i.e. it is better to use one of the default strategies.
o NB increases with the prevalence of events; RU has a more complex dependence on prevalence: negative association for t < prevalence, positive for t ≥ prevalence.

A case study on the clinical utility of the LR2 model to classify ovarian masses

o A logistic regression model to compute the probability of malignancy of an ovarian mass based on ultrasound and clinical characteristics.
o Evaluated in an international dataset of 5914 women recruited at 13 oncology referral centers and 11 non-oncology centers (regional or tertiary).
o Malignant tumors are more prevalent in oncology centers than in non-oncology centers.

The random-effect weighted NB is lower than the NB from the pooled analysis

o There is considerable between-center heterogeneity (I² = 97.4% at t = 0.1, 95% CI 96.9% to 97.9%).
o A few large centers with high NBs have a high impact in the pooled analysis, which is moderated by the random-effect weighting.

The heterogeneity in NB can be partly explained by differences between oncological and non-oncological centers

o The NB was remarkably low in a number of oncological centers.
o The probability of cancer was underestimated in these centers.

Refitting the LR2 model improves the calibration and the NB

o We refitted the LR2 model, re-estimating predictor effects, allowing for separate intercepts for oncological and non-oncological centers, and a random center effect.
o This yielded more accurate predicted probabilities and hence better NBs.

NB is used to compare competing prediction models

o LR2 and LR2-refitted perform well for all risk thresholds.
o LR2-refitted and simple rules outperform the other models at relevant risk thresholds for cancer detection (0.03 ≤ t ≤ 0.20).

Conclusion

o NB can be highly heterogeneous in multicenter studies.
o NB may increase because of increased prevalence of malignancy, and decrease due to insufficient calibration or reduced classification performance of the model in specific centers.
o This heterogeneity should be recognized and explored using appropriate techniques.

Acknowledgements

This work was supported by a PhD fellowship from the Flanders’ Agency for Innovation by Science and Technology (IWT Vlaanderen) to LW; a postdoctoral fellowship of the Research Foundation-Flanders (FWO) to BVC; a fundamental clinical research fellowship of the Research Foundation-Flanders (FWO) to DT.

References
Baker SG. Putting Risk Prediction in Perspective: Relative Utility Curves. Journal of the National Cancer
Institute. 2009; 101(22):1538-1542.
Higgins JP, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. Journal of
the Royal Statistical Society Series A, (Statistics in Society). 2009; 172(1):137-159.
Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002; 21(11):1539-1558.
Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med.
2000; 19(4):431-440.
Kaijser J, Bourne T, Valentin L et al. Improving strategies for diagnosing ovarian cancer: a summary of the
International Ovarian Tumor Analysis (IOTA) studies. Ultrasound Obstet Gynecol. 2013; 41(1):9.
Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and
Updating. New York, NY: Springer US; 2009.
Timmerman D, Testa AC, Bourne T et al. Logistic Regression Model to Distinguish Between the Benign
and Malignant Adnexal Mass Before Surgery: A Multicenter Study by the International Ovarian Tumor
Analysis Group. Journal of Clinical Oncology. 2005; 23(34):8794-8801.
van Klaveren D, Steyerberg E, Perel P et al. Assessing discriminative ability of risk models in clustered
data. BMC Med Res Methodol. 2014; 14(1).
Vickers AJ, Elkin EB. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med
Decis Mak. 2006; 26(6):565-574.
Vickers A, Cronin A, Elkin E et al. Extensions to decision curve analysis, a novel method for evaluating
diagnostic tests, prediction models and molecular markers. BMC Medical Informatics and Decision
Making. 2008; 8(1).