EVALUATING THE CLINICAL UTILITY OF PREDICTION MODELS IN A HETEROGENEOUS MULTICENTER POPULATION USING DECISION-ANALYTIC MEASURES: THE RANDOM-EFFECT WEIGHTED NET BENEFIT

Laure Wynants, MSc (1), Dirk Timmerman, PhD, MD (2), Sabine Van Huffel, PhD (1) and Ben Van Calster, PhD (2); (1) KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium; (2) KU Leuven, Department of Development and Regeneration, Leuven, Belgium

Purpose: To investigate methods to evaluate the clinical utility of a prediction model in heterogeneous multicenter datasets based on decision-analytic measures.

Method: We focus on the Net Benefit (NB) statistic from Vickers and Elkin (Med Decis Making 2006). NB is defined as (TP − w×FP)/N, with TP the number of true positives, FP the number of false positives, and w the 'harm-to-benefit ratio' of treating a false positive versus a true positive. This ratio equals the odds of the risk threshold t used to classify patients as positive or negative, i.e. w = t/(1−t). A model's NB can be compared to the default strategies of classifying all patients as positive (treat all) or negative (treat none). We averaged center-specific NBs for specific values of t using random-effect weights 1/(se²+τ²), with se² the within-center and τ² the between-center variance of NB. We also calculated center-specific Relative Utilities, i.e. normalized differences between the model's NB and the NB of the best default strategy. These were also averaged using random-effect weights. We present a case study in which a prediction model (LR2) for malignancy of ovarian tumors is evaluated in a dataset of 5,914 women recruited at 13 oncology referral centers and 11 non-oncology centers. We computed separate weighted averages of NB for oncology and non-oncology centers, thereby re-estimating τ² in each sub-population.
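The NB and Relative Utility definitions in the Method can be sketched in a few lines of code. The following is a minimal Python illustration (function names are hypothetical, not from the study) of NB at a threshold t, the treat-all and treat-none defaults, and the Relative Utility as the normalized difference from the best default strategy.

```python
import numpy as np

def net_benefit(y_true, risk, t):
    """NB = (TP - w*FP)/N, with w = t/(1-t) the harm-to-benefit
    ratio of treating a false positive versus a true positive."""
    y_true = np.asarray(y_true, dtype=bool)
    pos = np.asarray(risk) >= t          # classify positive when predicted risk >= t
    tp = np.sum(pos & y_true)
    fp = np.sum(pos & ~y_true)
    w = t / (1.0 - t)
    return (tp - w * fp) / len(y_true)

def nb_treat_all(y_true, t):
    """Treat all: every patient is classified positive."""
    y_true = np.asarray(y_true, dtype=bool)
    w = t / (1.0 - t)
    return (y_true.sum() - w * (~y_true).sum()) / len(y_true)

def nb_treat_none():
    """Treat none: no true or false positives, so NB = 0."""
    return 0.0

def relative_utility(y_true, risk, t):
    """RU = (NB_model - max(NB_all, NB_none)) / (prevalence - max(NB_all, NB_none))."""
    best_default = max(nb_treat_all(y_true, t), nb_treat_none())
    prevalence = np.mean(y_true)
    return (net_benefit(y_true, risk, t) - best_default) / (prevalence - best_default)
```

Evaluating these functions over a grid of thresholds t yields the familiar decision curve; RU < 0 at some t means one of the default strategies beats the model there.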
Result: There was considerable heterogeneity in NB: 97.4% (at t=0.1; 95% CI 96.9% to 97.9%) of the variance in NB was due to differences between centers. The NB at t=0.1 was 0.339 (95% prediction interval 0.063 to 0.616) for oncology centers and 0.111 (0.026 to 0.196) for non-oncology centers (see figure). LR2 was always better than the best default strategy in the average non-oncology center. However, in the average oncology center, the NB of LR2 was lower than the NB of treating all patients when t ≤ 0.1. Risks of malignancy were underestimated in a number of oncology centers. Refitting LR2 to resolve these calibration issues improved the clinical utility.

Conclusion: We conclude that NB can be highly heterogeneous in multicenter studies. NB may increase because of increased prevalence of malignancy, and decrease due to insufficient calibration or reduced classification performance of the model in specific centers. This heterogeneity should be recognized and explored using appropriate techniques.

Random-Effect Weighted Net Benefit Facilitates the Evaluation of Prediction Models in a Heterogeneous Multicenter Population

Laure Wynants, MSc (1,2), Dirk Timmerman, PhD, MD (3,4), Sabine Van Huffel, PhD (1,2) and Ben Van Calster, PhD (3,5); (1) KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium; (2) KU Leuven, iMinds Medical IT Department, Leuven, Belgium; (3) KU Leuven, Department of Development and Regeneration, Leuven, Belgium; (4) Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium; (5) Center for Medical Decision Sciences, Department of Public Health, Erasmus Medical Center, Rotterdam, The Netherlands.
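The random-effect weighting used throughout, with weights 1/(se²+τ²), can be sketched as follows. This is a minimal Python illustration, not the study's own implementation (which used SAS Proc Mixed): the bootstrap standard error and the DerSimonian-Laird moment estimator of τ² are assumptions standing in for the exact estimation procedure, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def center_nb_and_se(y, risk, t, n_boot=500):
    """Center-specific NB plus a bootstrap estimate of its
    within-center variance se^2 (bootstrap is an assumption here)."""
    def nb(idx):
        yt, pos = y[idx], risk[idx] >= t
        tp = np.sum(pos & (yt == 1))
        fp = np.sum(pos & (yt == 0))
        return (tp - (t / (1 - t)) * fp) / len(idx)
    n = len(y)
    est = nb(np.arange(n))
    boots = [nb(rng.integers(0, n, n)) for _ in range(n_boot)]
    return est, np.var(boots, ddof=1)

def random_effect_weighted_nb(nbs, se2):
    """Average center NBs with random-effect weights 1/(se2 + tau2);
    tau2 via the DerSimonian-Laird method-of-moments estimator."""
    nbs, se2 = np.asarray(nbs, float), np.asarray(se2, float)
    w_fixed = 1.0 / se2
    mu_fixed = np.sum(w_fixed * nbs) / np.sum(w_fixed)
    q = np.sum(w_fixed * (nbs - mu_fixed) ** 2)              # Cochran's Q
    c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
    tau2 = max(0.0, (q - (len(nbs) - 1)) / c)                # between-center variance
    w = 1.0 / (se2 + tau2)
    return np.sum(w * nbs) / np.sum(w), tau2
```

When τ² is large relative to se², the weights become nearly equal across centers, which is why a few large centers with high NBs dominate a pooled analysis but not the random-effect weighted average.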
Decision-analytic measures are used for the evaluation of clinical prediction models
o Model building cycle: collect patient data on an event and predictors → develop a mathematical model to predict the event → evaluate the model in new patients → use the model in clinical practice → update the model if needed.
o Model evaluation is traditionally done in terms of:
• Correctness of classification of events and non-events (C-index, sensitivity, specificity, …).
• Accuracy of predicted risks of experiencing the outcome (calibration).
o Decision-analytic measures quantify the clinical utility of the model, e.g. Net Benefit (NB).
o Net Benefit = (number of true positives − w × number of false positives) / total number of observations.
• w is the 'harm-to-benefit ratio' of treating a false positive versus a true positive.
• w = t/(1−t), where t is the risk threshold used to classify patients as positive (events) or negative (non-events).
• A model can be compared to the default strategies of classifying all as positive (treat all) or negative (treat none).
• The higher the NB, the more clinical utility.
o NB increases with the prevalence of events.
o NB is used to compare competing prediction models.

Random-effect weights can be used to combine center-specific NBs in a multicenter dataset
o Multicenter studies: to enhance representativeness of data or reduce recruitment times.
o Models may have more clinical utility in one center than another.
o Random-effect weights 1/(se²+τ²) to combine NBs, with se² the within-center and τ² the between-center variance of NB.
o Random-effect weighting assumes a distribution of NBs instead of one "true" NB.
o We compared the NB computed from a multicenter dataset (pooled analysis) to the weighted average of center-specific NBs using random-effect weighting.
o We computed se² using the regular bootstrap.
o Proc Mixed (SAS) was used to compute the weighted NB.
[Figure: center weights under τ²=0 and τ²>0, ranging from 0 to 0.6; bubble size represents the number of observations per center.]

A case study on the clinical utility of the LR2 model to classify ovarian masses
o LR2 is a logistic regression model to compute the probability of malignancy of an ovarian mass based on ultrasound and clinical characteristics.
o Evaluated in an international dataset of 5,914 women recruited at 13 oncology referral centers and 11 non-oncology centers (regional or tertiary).
o Malignant tumors are more prevalent in oncology centers than in non-oncology centers.

The random-effect weighted NB is lower than the NB from the pooled analysis
o There is considerable between-center heterogeneity (I² = 97.4% at t=0.1; 95% CI 96.9% to 97.9%).
o A few large centers with high NBs have a high impact in the pooled analysis, which is moderated by the random-effect weighting.

The heterogeneity in NB can be partly explained by differences between oncological and non-oncological centers
o The NB was remarkably low in a number of oncological centers.
o The probability of cancer was underestimated in these centers.

Refitting the LR2 model improves the calibration and the NB
o We refitted the LR2 model, re-estimating predictor effects, allowing for separate intercepts for oncological and non-oncological centers, and a random center effect.
o This yielded more accurate predicted probabilities and hence better NBs.
o LR2 and LR2-refitted perform well for all risk thresholds.
o LR2-refitted and simple rules outperform the other models at relevant risk thresholds for cancer detection (0.03 ≤ t ≤ 0.20).

Extension: random-effect weighted Relative Utility
o Relative Utility = (NB_model − max(NB_treat all, NB_treat none)) / (outcome prevalence − max(NB_treat all, NB_treat none)).
o An alternative measure of clinical utility; center-specific Relative Utilities can also be averaged using random-effect weights.
o Easier to compare models at low cut-offs.
o If RU < 0, the model is "harmful", i.e. it is better to use one of the default strategies.
o More complex dependence on prevalence: negative association for t < prevalence, positive for t ≥ prevalence.

Conclusion
o NB can be highly heterogeneous in multicenter studies.
o NB may increase because of increased prevalence of malignancy, and decrease due to insufficient calibration or reduced classification performance of the model in specific centers.
o This heterogeneity should be recognized and explored using appropriate techniques.

Acknowledgements
This work was supported by a PhD fellowship from the Flanders' Agency for Innovation by Science and Technology (IWT Vlaanderen) to LW; a postdoctoral fellowship of the Research Foundation-Flanders (FWO) to BVC; and a fundamental clinical research fellowship of the Research Foundation-Flanders (FWO) to DT.

References
Baker SG. Putting Risk Prediction in Perspective: Relative Utility Curves. Journal of the National Cancer Institute. 2009; 101(22):1538-1542.
Higgins JP, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society Series A (Statistics in Society). 2009; 172(1):137-159.
Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002; 21(11):1539-1558.
Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med. 2000; 19(4):431-440.
Kaijser J, Bourne T, Valentin L et al. Improving strategies for diagnosing ovarian cancer: a summary of the International Ovarian Tumor Analysis (IOTA) studies. Ultrasound Obstet Gynecol. 2013; 41(1):9.
Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer US; 2009.
Timmerman D, Testa AC, Bourne T et al. Logistic Regression Model to Distinguish Between the Benign and Malignant Adnexal Mass Before Surgery: A Multicenter Study by the International Ovarian Tumor Analysis Group. Journal of Clinical Oncology. 2005; 23(34):8794-8801.
van Klaveren D, Steyerberg E, Perel P et al.
Assessing discriminative ability of risk models in clustered data. BMC Med Res Methodol. 2014; 14(1).
Vickers AJ, Elkin EB. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med Decis Making. 2006; 26(6):565-574.
Vickers A, Cronin A, Elkin E et al. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Medical Informatics and Decision Making. 2008; 8(1).