A Specialized Ensemble Learning Framework for Cardiovascular Risk Prediction in Type-2 Diabetes Using Advanced Feature Engineering
Keywords:
Type-2 Diabetes, Cardiovascular Disease, Risk Prediction, Machine Learning, Ensemble Models, Feature Engineering.
Abstract
Cardiovascular disease (CVD) remains the dominant cause of death and disability in people with Type-2 Diabetes (T2D). Commonly used tools such as SCORE2-Diabetes, QRISK, and the UKPDS engine are helpful but often fall short when applied to heterogeneous diabetic populations, particularly where data are imbalanced or key disease-specific markers are missing. In this study, we developed a prediction framework designed specifically for T2D patients, drawing on multiple data sources: electronic health records, subsets of the UK Biobank, and a large open diabetes dataset (≈100,000 cases) from Kaggle. Patients with prior CVD were excluded. The predictors covered demographics, conventional cardiovascular risk factors, and diabetes-related measures such as HbA1c variability, duration of diabetes, treatment exposure, renal function (eGFR), and albuminuria. Preprocessing included Word2Vec embeddings for richer feature representation, SMOTE-ENN to correct class imbalance, and statistical filtering to retain significant variables. The model combined a range of classifiers (support vector machines, logistic regression, decision trees, boosting methods, random forests, and artificial neural networks) into a heterogeneous ensemble using majority voting. Compared with traditional scores, the ensemble achieved stronger discrimination (AUC 0.89, 95% CI 0.87–0.91), better calibration (slope 0.98), and a lower Brier score (0.126). Clinical benefit was also higher, with a net reclassification improvement (NRI) of +0.18 and a net gain of +0.21 at the 10% risk threshold. Overall, the results show that incorporating diabetes-specific markers with ensemble learning improves cardiovascular risk prediction and offers a pathway toward more tailored prevention strategies in T2D.
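To make the ensemble rule and two of the reported metrics concrete, the following is a minimal Python sketch, not the implementation used in the study. It shows hard majority voting over base-classifier labels, the Brier score, and a decision-curve net benefit at a chosen risk threshold. The function names, the tie-breaking rule, and the reading of "net gain" as the standard decision-curve net benefit are assumptions made for illustration only.

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label predicted by most base classifiers.

    Assumption: ties are broken in favour of the larger label
    (the positive class, 1, in a binary problem).
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return max(winners)


def brier_score(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def net_benefit(y_true: Sequence[int], p_pred: Sequence[float],
                threshold: float) -> float:
    """Decision-curve net benefit: TP/n - (FP/n) * t / (1 - t).

    Patients with predicted risk >= threshold are treated as high risk.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical toy data, for illustration only (not study data).
toy_outcomes = [1, 0, 0, 1]
toy_risks = [0.9, 0.2, 0.05, 0.6]
toy_votes = [0, 0, 1]  # labels from three hypothetical base classifiers

ensemble_label = majority_vote(toy_votes)
toy_brier = brier_score(toy_outcomes, toy_risks)
toy_net_benefit = net_benefit(toy_outcomes, toy_risks, threshold=0.10)
```

A lower Brier score indicates predicted risks that sit closer to observed outcomes, while net benefit weighs true positives against false positives at the exchange rate implied by the threshold, which is why it is reported at a specific risk cut-off such as 10%.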



