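Pendekatan Hybrid K-Means SMOTE dan Logistic Regression Untuk Deteksi Dini Diabetes Mellitus Pada Imbalanced Data (A Hybrid K-Means SMOTE and Logistic Regression Approach for Early Detection of Diabetes Mellitus on Imbalanced Data)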


Authors

  • Abdus Salam, Universitas 17 Agustus 1945 Jakarta, Jakarta Utara, Indonesia
  • Lukman Azhari, Universitas Muhammadiyah Tangerang, Tangerang, Indonesia
  • Ri Sabti Septarini, Universitas Muhammadiyah Tangerang, Tangerang, Indonesia
  • Nofitri Heriyani, Universitas Muhammadiyah Tangerang, Tangerang, Indonesia

DOI:

https://doi.org/10.47065/bulletincsr.v5i3.502

Keywords:

Diabetes Mellitus; K-Means SMOTE; Logistic Regression; Medical Classification; Imbalanced Data

Abstract

The rising global prevalence of Diabetes Mellitus calls for more accurate early detection, and machine learning-based approaches are a natural fit. A central challenge in medical classification is class imbalance: diabetic cases are far fewer than non-diabetic ones. This study develops a hybrid model that combines Logistic Regression with K-Means SMOTE to raise the sensitivity of early Diabetes Mellitus detection, especially for the minority class. Logistic Regression is chosen for its computational efficiency and interpretability, while K-Means SMOTE balances the class distribution by generating synthetic samples within clusters of minority-class data. The dataset, obtained from the Kaggle platform, consists of 2,000 records with 9 health-related features. In the evaluation, the model using K-Means SMOTE performs best, with an accuracy of 82.00%, an F1-score of 72.73% for the Diabetes class, and the highest ROC-AUC score, 87.48%. Compared with models trained without oversampling or with standard SMOTE, this approach improves generalization and sensitivity to positive cases. These findings have practical implications for building fairer and more effective machine learning-based early detection systems, particularly in healthcare facilities with limited resources.
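
The pipeline described in the abstract (cluster the minority class, synthesize samples inside the clusters, then fit Logistic Regression) can be illustrated with a minimal sketch. It is not the authors' implementation and not the `KMeansSMOTE` class from imbalanced-learn: the helper `cluster_oversample` is hypothetical, it interpolates between random pairs of minority points in the same cluster rather than between nearest neighbours, and the data is a synthetic stand-in from `make_classification` whose class ratio, feature count, and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def cluster_oversample(X, y, minority=1, n_clusters=3, seed=0):
    """Hypothetical, simplified cluster-based oversampling for binary labels.

    Clusters the minority class with K-Means, then creates synthetic points
    by linear interpolation between two random minority points that share a
    cluster, until both classes have the same number of samples.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_needed = int((y != minority).sum()) - len(X_min)
    if n_needed <= 0:
        return X, y  # already balanced (or minority is larger)

    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)
    # Interpolation needs at least two points, so drop smaller clusters.
    clusters = [X_min[labels == c] for c in range(n_clusters)]
    clusters = [c for c in clusters if len(c) >= 2]
    if not clusters:
        raise ValueError("no minority cluster has two or more samples")

    # Draw clusters in proportion to their size (an assumption of this sketch).
    sizes = np.array([len(c) for c in clusters], dtype=float)
    picks = rng.choice(len(clusters), size=n_needed, p=sizes / sizes.sum())

    synthetic = np.empty((n_needed, X.shape[1]))
    for row, c in enumerate(picks):
        i, j = rng.choice(len(clusters[c]), size=2, replace=False)
        a, b = clusters[c][i], clusters[c][j]
        synthetic[row] = a + rng.random() * (b - a)  # point on segment a-b

    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority, dtype=y.dtype)])
    return X_bal, y_bal


# Synthetic stand-in data; the shape and class ratio are illustrative only.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.66, 0.34], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the training split only, so the test split stays untouched.
X_bal, y_bal = cluster_oversample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
f1 = f1_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"F1 (positive class): {f1:.4f}  ROC-AUC: {auc:.4f}")
```

Oversampling is applied after the train/test split so that no synthetic point derived from test records leaks into training. The scores printed by this sketch come from synthetic data and are not comparable to the figures reported in the abstract.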

ARTICLE HISTORY

Published: 2025-04-25

How to Cite

Salam, A., Azhari, L., Septarini, R. S., & Heriyani, N. (2025). Pendekatan Hybrid K-Means SMOTE dan Logistic Regression Untuk Deteksi Dini Diabetes Mellitus Pada Imbalanced Data. Bulletin of Computer Science Research, 5(3), 219-227. https://doi.org/10.47065/bulletincsr.v5i3.502
