
Comparison of Regression and Machine Learning Methods for Variable Selection to Develop a Clinical Prediction Model

Vey, Johannes Alois



Abstract

Clinical prediction models (CPMs) are increasingly finding their way into healthcare as they provide a prediction of a clinical outcome. Thus, they can support the decision-making of healthcare providers and their patients. A common challenge in developing a CPM is the identification of important predictor variables. To conduct variable selection and develop a CPM, a wide variety of methods exists. While some are based on traditional regression methods, others belong to the field of machine learning. Therefore, this thesis aimed to compare different variable selection methods in the development of a CPM for a continuous outcome in low-dimensional data and to provide recommendations for practical application.

Initially, a simulation study design was developed to make the comparison as fair as possible and to examine the strengths and weaknesses of the different methods. For this, four different data-generating processes with increasing complexity were developed, which contain realistic data structures as well as relevant challenges. Based on the simulated datasets, the following established and widely used methods were compared: linear regression with stepwise selection (LMSS), regularized linear regression with elastic net penalty (ENET), gradient boosting with linear regression models (GBM) and decision trees (GBT) as base learners, and the Boruta (RFB) as well as the Hapfelmeier (RFH) method for the random forest. Moreover, the multivariable fractional polynomials (MFP) regression model was applied as a benchmark. All methods selected an increasing number of variables as the sample size of the dataset increased. While LMSS, RFB, and RFH achieved better results regarding the correct inclusion of predictors and correct exclusion of non-predictors, ENET, GBM, and GBT selected nearly all variables. LMSS showed the best selection properties in the scenarios with low complexity and also identified predictors with non-linear functional form. In the more complex scenarios, the true inclusion frequency of predictors with non-linear relations to non-predictor variables decreased, especially for LMSS. RFB and RFH achieved the best selection properties in the scenarios of the greatest complexity. Predictive accuracy in test data generally improved as the sample size increased and was similar for all methods. However, while ENET, GBM, and GBT achieved good calibration by inherently utilizing regularization, the results suggested that RFB and RFH were underfitted and LMSS was overfitted in certain scenarios.
In addition to the simulation study, the methods were applied to a real dataset to develop a CPM for intraoperative blood loss during liver transplantations. The developed simulation design is freely available and can be used for further research, e.g., for investigations of additional methods. Furthermore, the design can be extended to include additional aspects, such as more variables.
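
The selection criteria mentioned above (correct inclusion of predictors, correct exclusion of non-predictors) can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `selection_summary`, the variable names, and the toy selections are hypothetical, and the exact definitions of the inclusion and exclusion frequencies used in the thesis may differ from the simple averages assumed here.

```python
from collections import Counter


def selection_summary(true_predictors, candidates, selections):
    """Summarize variable selection over simulation replicates.

    true_predictors: set of variable names that truly affect the outcome
    candidates:      list of all candidate variable names
    selections:      list of sets, one set of selected variables per replicate
    (All names are illustrative assumptions, not the thesis's code.)
    """
    n_rep = len(selections)
    counts = Counter(v for sel in selections for v in sel)

    # Per-variable inclusion frequency across replicates
    inclusion = {v: counts[v] / n_rep for v in candidates}

    predictors = [v for v in candidates if v in true_predictors]
    noise = [v for v in candidates if v not in true_predictors]

    # Average share of true predictors that were selected
    true_inclusion = sum(inclusion[v] for v in predictors) / len(predictors)
    # Average share of non-predictors that were correctly left out
    true_exclusion = sum(1 - inclusion[v] for v in noise) / len(noise)
    return inclusion, true_inclusion, true_exclusion


# Toy input: four candidate variables, two of them true predictors,
# and the selections from four hypothetical simulation replicates.
candidates = ["x1", "x2", "x3", "x4"]
true_predictors = {"x1", "x2"}
selections = [{"x1", "x2"}, {"x1", "x3"}, {"x1", "x2", "x3", "x4"}, {"x1"}]

inclusion, tif, tef = selection_summary(true_predictors, candidates, selections)
print(inclusion, tif, tef)
```

Under these assumed definitions, a method that selects nearly all variables scores high on true inclusion but low on true exclusion, which is why both quantities have to be read together.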

This thesis provides a comparison of traditional regression and machine learning methods for variable selection in the development of a CPM for a continuous outcome in low-dimensional data. A sample size of 250 to 500 observations is required for all methods to identify predictors sufficiently and achieve adequate predictive accuracy. LMSS can be recommended for data with low complexity as well as, possibly with adaptations, for more complex data structures. RFB and RFH are recommended for more complex data structures, particularly when interactions between variables might exist.

Document type: Dissertation
Supervisor: Kieser, Prof. Dr. sc. hum. Meinhard
Place of Publication: Heidelberg
Date of thesis defense: 16 December 2025
Date Deposited: 22 Jan 2026 14:05
Date: 2026
Faculties / Institutes: Medizinische Fakultät Heidelberg > Institut für Medizinische Biometrie
DDC-classification: 310 General statistics; 610 Medical sciences, Medicine