An Enhanced and Pertinent Diagnostic System for Diabetes Mellitus Using Machine Learning

Authors

  • Pravin S. Rahate, Assistant Professor, Department of Computer Engineering, Fr. C. Rodrigues Institute of Technology, Navi Mumbai, Maharashtra, India
  • Nilesh S. Bhelkar, Assistant Professor, Department of Artificial Intelligence and Data Science, MCT’s Rajiv Gandhi Institute of Technology, Andheri, Mumbai, Maharashtra, India
  • Manoj Patil, Associate Professor, Department of Computer Engineering, MCT’s Rajiv Gandhi Institute of Technology, Andheri, Mumbai, Maharashtra, India
  • Rahul S. Pachade, Associate Professor, Department of Artificial Intelligence and Data Science, Shah and Anchor Kutchhi Engineering College, Chembur, Mumbai, Maharashtra, India

Keywords:

Diabetes Diagnosis, Machine Learning, Ensemble Learning, Feature Selection, Boruta Algorithm, Stacking Classifier, Explainable AI, PIMA Dataset, Random Forest, SMOTE.

Abstract

Diabetes mellitus (DM) has emerged as one of the most significant public health challenges of the 21st century, affecting an estimated 463 million adults worldwide in 2019, with projections indicating a rise to 700 million by 2045. As a chronic metabolic disorder characterized by elevated blood glucose levels, diabetes is associated with severe complications including cardiovascular disease, kidney failure, neuropathy, and vision loss, making early and accurate diagnosis critically important for effective intervention. This manuscript presents an enhanced and pertinent diagnostic system for diabetes that integrates advanced machine learning techniques to achieve superior prediction accuracy while maintaining clinical interpretability. The proposed methodology encompasses a comprehensive five-stage pipeline: (1) robust data preprocessing including outlier detection and handling of class imbalance; (2) Boruta-based feature selection to identify the most salient predictors; (3) K-Means++ clustering for data stratification; (4) stacking ensemble learning combining multiple base classifiers; and (5) explainable AI frameworks (LIME and SHAP) for model transparency. Experimental validation on the PIMA Indian Diabetes Dataset (PIDD) demonstrates that the proposed stacking ensemble model achieves 98% accuracy, significantly outperforming single classifiers including Logistic Regression (77%), Random Forest (86.76%), and XGBoost. The Boruta-SMOTE-ENN-Tabu model further identifies critical risk factors including family history, age, central obesity, hyperlipidemia, and body mass index. Random Forest emerges as the most efficient individual technique, achieving the best accuracy among single classifiers. The integration of PSO-optimized weighted majority voting achieves 93.22% accuracy with 94.12% precision.
This research contributes a clinically viable, interpretable, and high-performance diagnostic system capable of early diabetes detection, thereby enabling timely intervention and improved patient outcomes.
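
As a rough illustration of the weighted majority voting mentioned above, the following minimal Python sketch combines the class labels predicted by several base classifiers, giving each classifier's vote a weight. It is not the authors' implementation: the function name `weighted_majority_vote`, the example votes, and the weights are all hypothetical, and the weights are fixed by hand here rather than found by PSO.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_majority_vote(
    predictions: Sequence[Hashable], weights: Sequence[float]
) -> Hashable:
    """Return the label whose supporting classifiers carry the most total weight."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per classifier prediction")
    if not predictions:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # max() returns the first label reaching the top total, so ties go to the
    # label that was seen first in `predictions`.
    return max(totals, key=totals.get)


# Hypothetical votes from three base classifiers for one patient record
# (1 = diabetic, 0 = non-diabetic) with hand-picked illustrative weights.
# In a PSO-optimized scheme the weights would be searched for, not assumed.
votes = [1, 0, 0]
weights = [0.6, 0.25, 0.15]
decision = weighted_majority_vote(votes, weights)
```

With these assumed weights the single heavily weighted classifier outvotes the other two, which is the behavior that distinguishes weighted voting from a plain majority vote.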

Published

04-12-2024