Internal and external validation of predictive models: a simulation study of bias and precision in small samples

EW Steyerberg, SE Bleeker, HA Moll, DE Grobbee, KGM Moons
Journal of Clinical Epidemiology, 2003, Elsevier
We performed a simulation study to investigate the accuracy of bootstrap estimates of optimism (internal validation) and the precision of performance estimates in independent validation samples (external validation). We combined two data sets containing children presenting with fever without source (n=376+179=555; 120 bacterial infections). Random samples were drawn from this combined data set for the development (n=376) and validation (n=179) of logistic regression models. The models included statistically significant predictors for infection selected from a set of 57 candidate predictors. Model development, including the selection of predictors, and validation were repeated in a bootstrapping procedure. The resulting expected optimism estimate in the receiver operating characteristic (ROC) area was compared with the observed optimism according to independent validation samples. The average apparent ROC area was 0.74, which was expected (based on bootstrapping) to decrease by 0.07 to 0.67, whereas the observed decrease in the validation samples was 0.09 to 0.65. Omitting the selection of predictors from the bootstrap procedure led to a severe underestimation of the optimism (decrease 0.006). The standard error of the observed ROC area in the independent validation samples was large (0.05). We recommend bootstrapping for internal validation because it gives reasonably valid estimates of the expected optimism in predictive performance provided that any selection of predictors is taken into account. For external validation, substantial sample sizes should be used for sufficient power to detect clinically important changes in performance as compared with the internally validated estimate.
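The bootstrap procedure described above, in which predictor selection is repeated inside every bootstrap sample, might be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the data are simulated, the univariate z-test screen in `select_predictors` is a hypothetical stand-in for the paper's (unspecified here) significance-based selection, and the logistic model is fitted by plain gradient ascent to keep the example dependency-free. All function names are invented for this sketch.

```python
import math
import random

def auc(scores, labels):
    """ROC area: share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_predictors(X, y, z_crit=1.96):
    """Hypothetical screen (an assumption): keep columns with |two-sample z| > z_crit."""
    n1 = sum(y)
    n0 = len(y) - n1
    zs = []
    for j in range(len(X[0])):
        a = [row[j] for row, yi in zip(X, y) if yi == 1]
        b = [row[j] for row, yi in zip(X, y) if yi == 0]
        ma, mb = sum(a) / n1, sum(b) / n0
        va = sum((v - ma) ** 2 for v in a) / (n1 - 1)
        vb = sum((v - mb) ** 2 for v in b) / (n0 - 1)
        zs.append((ma - mb) / math.sqrt(va / n1 + vb / n0 + 1e-12))
    keep = [j for j, z in enumerate(zs) if abs(z) > z_crit]
    # Fall back to the single strongest predictor if nothing passes the screen.
    return keep or [max(range(len(zs)), key=lambda j: abs(zs[j]))]

def develop(X, y, steps=100, lr=0.5):
    """Select predictors, then fit a logistic model; returns a scoring function."""
    keep = select_predictors(X, y)
    rows = [[1.0] + [row[j] for j in keep] for row in X]  # intercept + selected columns
    w = [0.0] * (len(keep) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, yi in zip(rows, y):
            z = max(-30.0, min(30.0, sum(wi * xi for wi, xi in zip(w, x))))
            p = 1.0 / (1.0 + math.exp(-z))
            for k, xk in enumerate(x):
                grad[k] += (yi - p) * xk
        w = [wi + lr * g / len(y) for wi, g in zip(w, grad)]
    return lambda data: [
        sum(wi * xi for wi, xi in zip(w, [1.0] + [r[j] for j in keep])) for r in data
    ]

def bootstrap_optimism(X, y, B=25, seed=0):
    """Return (apparent ROC area, mean bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = auc(develop(X, y)(X), y)
    total = 0.0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        score = develop(Xb, yb)  # selection is redone in each bootstrap sample
        # Optimism: performance on the bootstrap sample minus performance on the original data.
        total += auc(score(Xb), yb) - auc(score(X), y)
    return apparent, total / B

# Simulated data (an assumption): 2 true predictors and 10 pure-noise candidates.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(12)] for _ in range(150)]
y = [int(data_rng.random() < 1 / (1 + math.exp(1.0 - 0.8 * r[0] - 0.6 * r[1]))) for r in X]

apparent, optimism = bootstrap_optimism(X, y)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={apparent - optimism:.3f}")
```

The optimism-corrected estimate is the apparent ROC area minus the mean optimism, which mirrors the abstract's arithmetic (0.74 - 0.07 = 0.67). Calling `select_predictors` once on the full data and reusing that fixed set inside the loop would correspond to the variant the abstract reports as severely underestimating optimism.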