Effects of Data Balancing in Diabetes Mellitus Detection: A Comparative XGBoost and Random Forest Learning Approach

Authors

  • Ojugo Arnold Federal University of Petroleum Resources Effurun
  • Fidelis Obukohwo Aghware University of Delta, Agbor
  • Margaret Dumebi Okpor Delta State University of Science and Technology Ozoro
  • Maureen Ifeanyi Akazue Delta State University Abraka
  • Bridget Ogheneovo Malasowe University of Delta, Agbor
  • Tabitha Chukwudi Aghaunor Robert Morris University, Pittsburgh, Pennsylvania
  • Eferhire Valentine Ugbotu University of Salford
  • Rita Erhovwo Ako Federal University of Petroleum Resources Effurun
  • Victor Ochuko Geteloma Federal University of Petroleum Resources Effurun
  • Christopher Chukwufunaya Odiakaose Dennis Osadebay University Asaba
  • Patrick Ogholuwarami Ejeh Dennis Osadebay University Asaba
  • Sunny Innocent Onyemenem Federal College of Education (Technical), Asaba

DOI:

https://doi.org/10.37933/nipes/7.1.2025.1

Abstract

Diabetes is a prevalent chronic disorder that contributes to many underlying health challenges; the World Health Organization has dubbed it the world’s deadliest disease and a silent killer. As a non-communicable disease, it is difficult to diagnose at an early stage because it occurs in several types, broadly classified as type I, type II, gestational and pre-diabetes, each of which progresses through many stages. Diabetes accounts for over 2 million deaths annually through organ failure, high blood pressure and related complications. Early detection and warning for patients who have, or are at risk of developing, the disease has therefore become imperative. Real-world datasets pose a further problem: imbalanced class distributions lead to poor generalization, high misclassification rates and low accuracy. Our study applies data balancing techniques to the PIMA Indian Diabetes (PID) dataset to ascertain their impact. We use six known schemes (RUS, UPS, SMOTE, ADASYN, SMOTE-Tomek and SMOTE-ENN) to resolve the class imbalance in PID and evaluate how much each improves performance. The study explores the tree-based XGBoost and Random Forest ensembles for identifying diabetes. The comparative empirical results show that XGBoost performed best with SMOTE-Tomek, while Random Forest performed best with SMOTE-ENN.
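
As a rough illustration of what two of the balancing schemes named in the abstract do, the sketch below implements random undersampling (RUS) and the core interpolation step of SMOTE using only the Python standard library. It is not the authors' pipeline: the function names, the parameter k, the seeds and the toy data are all assumptions made for this example, and the data are synthetic rather than drawn from the PID dataset.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Drop rows at random until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority rows until the minority matches the largest class.

    Each synthetic row lies on the segment between a minority row and one of
    its k nearest minority neighbours. Needs at least two minority rows.
    """
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    counts = Counter(y)
    n_new = max(counts.values()) - counts[minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # k nearest minority neighbours of the base row, excluding the row itself
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: math.dist(r, base),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
        y_out.append(minority)
    return X_out, y_out


# Synthetic toy data (hypothetical, not PID): 8 negatives and 3 positives.
X = [[float(i), float(i % 3)] for i in range(8)]
X += [[10.0, 1.0], [11.0, 2.0], [12.0, 1.5]]
y = [0] * 8 + [1] * 3

X_rus, y_rus = random_undersample(X, y)
X_sm, y_sm = smote(X, y, minority=1, k=2)
print("RUS class counts:  ", Counter(y_rus))
print("SMOTE class counts:", Counter(y_sm))
```

On the toy data, undersampling shrinks the majority class to three rows, while SMOTE grows the minority class to eight by adding interpolated rows. In either case a classifier such as XGBoost or Random Forest would then be trained on the balanced rows; resampling is normally applied to the training split only, so that the test split keeps the original class distribution.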

Published

2025-03-05

How to Cite

Arnold, O., Aghware, F. O., Okpor, M. D., Akazue, M. I., Malasowe, B. O., Aghaunor, T. C., Ugbotu, E. V., Ako, R. E., Geteloma, V. O., Odiakaose, C. C., Ejeh, P. O., & Onyemenem, S. I. (2025). Effects of Data Balancing in Diabetes Mellitus Detection: A Comparative XGBoost and Random Forest Learning Approach. NIPES - Journal of Science and Technology Research, 7(1), 1–11. https://doi.org/10.37933/nipes/7.1.2025.1

Section

Articles