Effects of Data Balancing in Diabetes Mellitus Detection: A Comparative XGBoost and Random Forest Learning Approach

Ojugo Arnold; Fidelis Obukohwo Aghware; Margaret Dumebi Okpor; Maureen Ifeanyi Akazue; Bridget Ogheneovo Malasowe; Tabitha Chukwudi Aghaunor; Eferhire Valentine Ugbotu; Rita Erhovwo Ako; Victor Ochuko Geteloma; Christopher Chukwufunaya Odiakaose; Patrick Ogholuwarami Ejeh; Sunny Innocent Onyemenem

doi:10.37933/nipes/7.1.2025.1

Authors

Ojugo Arnold, Federal University of Petroleum Resources Effurun
Fidelis Obukohwo Aghware, University of Delta, Agbor
Margaret Dumebi Okpor, Delta State University of Science and Technology Ozoro
Maureen Ifeanyi Akazue, Delta State University Abraka
Bridget Ogheneovo Malasowe, University of Delta, Agbor
Tabitha Chukwudi Aghaunor, Robert Morris University, Pittsburgh, Pennsylvania
Eferhire Valentine Ugbotu, University of Salford
Rita Erhovwo Ako, Federal University of Petroleum Resources Effurun
Victor Ochuko Geteloma, Federal University of Petroleum Resources Effurun
Christopher Chukwufunaya Odiakaose, Dennis Osadebay University Asaba
Patrick Ogholuwarami Ejeh, Dennis Osadebay University Asaba
Sunny Innocent Onyemenem, Federal College of Education (Technical), Asaba

DOI:

https://doi.org/10.37933/nipes/7.1.2025.1

Abstract

Diabetes is a prevalent chronic disorder that contributes to many underlying health challenges; the World Health Organization has dubbed it the world’s deadliest disease and a silent killer. As a non-communicable disease, it is difficult to diagnose at an early stage because its types, broadly classified as type-I, type-II, gestational and pre-diabetes, morph through many stages. Diabetes accounts for over 2 million deaths annually through failed internal organs, high blood pressure, etc. Immediate action has thus become imperative for early detection and warning to (pre)carrier patients. Real-world datasets also carry an inherent problem: imbalanced class distributions lead to poor generalization, high misclassification rates and low accuracy. Our study applies data balancing techniques to the PIMA Indian Diabetes (PID) dataset to ascertain the impact of data balancing. We use six known schemes (RUS, UPS, SMOTE, ADASYN, SMOTE-Tomek and SMOTEEN) to resolve the class imbalance in PID and evaluate how much each scheme improves performance. The study explores the tree-based XGBoost and Random Forest ensembles for identifying diabetes. The comparative empirical results show that XGBoost performed best with SMOTE-Tomek, while the Random Forest model performed best with SMOTEEN.
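To make the idea of data balancing concrete, the two simplest schemes can be sketched in a few lines of standard-library Python: random undersampling (RUS) and random oversampling (which we assume is what UPS refers to). This is an illustrative sketch only, not the authors' pipeline; the function names, the fixed seed and the toy rows are hypothetical, and the synthetic-sample schemes (SMOTE, ADASYN and their hybrids) are not shown.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 8 negative (0) and 3 positive (1) records.
toy_rows = [[i, i * 2] for i in range(11)]
toy_labels = [0] * 8 + [1] * 3

under_rows, under_labels = random_undersample(toy_rows, toy_labels)
over_rows, over_labels = random_oversample(toy_rows, toy_labels)
print(Counter(under_labels), Counter(over_labels))
```

Undersampling discards majority-class records, while oversampling repeats minority-class records; in either case the resampling would normally be applied to the training split only, so that evaluation data keeps its original class distribution.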
