Effects of Data Balancing in Diabetes Mellitus Detection: A Comparative XGBoost and Random Forest Learning Approach
DOI: https://doi.org/10.37933/nipes/7.1.2025.1
Abstract
Diabetes is a prevalent chronic disorder that contributes to many underlying health challenges; the World Health Organization has dubbed it the world’s deadliest disease and a silent killer. As a non-communicable disease, it is difficult to diagnose at an early stage because its types, broadly classified as type I, type II, gestational and pre-diabetes, progress through many stages. Diabetes accounts for over 2 million deaths annually through organ failure, high blood pressure and related complications. Early detection and warning for (pre)carrier patients has therefore become imperative. Real-world datasets pose a further problem: imbalanced class distributions lead to poor generalization, high misclassification rates and low accuracy. Our study applies data balancing techniques to the PIMA Indian Diabetes (PID) dataset to ascertain their impact. We use six known schemes (RUS, UPS, SMOTE, ADASYN, SMOTE-Tomek and SMOTE-ENN) to resolve the class imbalance in PID and evaluate how far each scheme improves performance. The study explores the tree-based XGBoost and Random Forest ensembles for identifying diabetes. The comparative empirical results show that XGBoost performed best with SMOTE-Tomek, while the Random Forest model performed best with SMOTE-ENN.
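To make the idea of data balancing concrete, the following is a minimal sketch of two of the schemes named above: random undersampling (RUS) and a simplified SMOTE-style oversampler. It is not the study's implementation. The function names, the toy feature values and the parameter k are assumptions chosen for illustration, and the data are not drawn from the PID dataset. In practice a maintained library such as imbalanced-learn would normally be used instead of hand-written code.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        indices = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_like_oversample(X, y, k=3, seed=0):
    """Grow smaller classes to the largest class size with synthetic rows.

    Each synthetic row lies on the segment between a randomly chosen
    row of the class and one of its k nearest same-class neighbours.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    majority_size = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        if len(rows) < 2:
            continue  # interpolation needs at least two rows
        for _ in range(majority_size - n):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            neighbours = sorted(others, key=lambda r: squared_distance(base, r))[:k]
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            X_out.append([b + gap * (o - b) for b, o in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 majority rows (label 0) and 3 minority rows (label 1).
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.3, 1.0],
     [0.7, 0.9], [1.0, 1.2], [3.0, 3.0], [3.2, 2.9], [2.8, 3.3]]
y = [0] * 8 + [1] * 3

X_rus, y_rus = random_undersample(X, y)
X_smote, y_smote = smote_like_oversample(X, y)
print("original:", Counter(y))
print("after RUS:", Counter(y_rus))
print("after SMOTE-like:", Counter(y_smote))
```

Undersampling discards majority rows and so loses information, while the interpolation approach keeps every original row and adds synthetic minority rows. The hybrid schemes listed above combine oversampling with a cleaning step.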