Increasing Accuracy of Classification in C4.5 Algorithm by Applying Principal Component Analysis for Diabetes Diagnosis

Michael Sitanggang; Elmanani Simamora; Froilan D. Mobo

doi:10.25217/numerical.v6i2.2610

Authors

Michael Sitanggang, Mathematics Department, State University of Medan
Elmanani Simamora, Mathematics Department, State University of Medan
Froilan D. Mobo, Department of Research, Development and Extension, Philippine Merchant Marine Academy, Philippines

Keywords:

Classification, Decision Tree C4.5, Diabetes Mellitus, Machine Learning, PCA

Abstract

The data revolution in medical records has increased the automation of medical devices in identifying the factors that cause disease, but it also poses challenges for analysis. According to WHO, about 6% of the world's population, more than 420 million people, live with type 1 or type 2 diabetes, and this number is estimated to rise beyond half a billion by 2030, which means that one in ten adults will be living with diabetes in the future. With its rapid development, machine learning has been applied to many aspects of medical care. In this study, we used Decision Tree C4.5 to predict diabetes mellitus. This research used a diabetes dataset obtained from the UCI Machine Learning Repository with 419 instances and 16 attributes. In this dataset, most of the attributes are continuous numeric types. This research reports the results of improving the C4.5 algorithm by applying principal component analysis (PCA). Many algorithms have been proposed to overcome misclassification and overfitting in Decision Tree C4.5 classification. Feature reduction is one option: it is intended to eliminate irrelevant data and mitigate outliers in the data so as to increase classification accuracy. Based on the results of the experiment, applying PCA to C4.5 increased accuracy by 6.55%.
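As a rough illustration of the idea described in the abstract (project continuous attributes onto a principal component, then choose a split with a C4.5-style criterion), the sketch below may help. It is not the authors' implementation: the toy data, the labels, and all function names are hypothetical, only the first principal component is extracted (by power iteration, as a stand-in for a full PCA), and the split criterion is a simplified gain ratio for a single binary threshold that omits pruning and other details of C4.5.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Simplified C4.5-style gain ratio for the binary split value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    p_left, p_right = len(left) / n, len(right) / n
    gain = entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
    split_info = -(p_left * math.log2(p_left) + p_right * math.log2(p_right))
    return gain / split_info


def best_threshold(values, labels):
    """Pick the midpoint between consecutive distinct values with the highest gain ratio."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


def first_principal_component(rows, iters=200):
    """Return (unit direction, projected scores) for the direction of largest variance.

    Uses power iteration on the sample covariance matrix; assumes the data
    are not constant and the starting vector is not orthogonal to the answer.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical toy data: two correlated continuous attributes and a class label.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (6.0, 5.8), (7.0, 7.1), (8.0, 8.0)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

direction, scores = first_principal_component(rows)
threshold = best_threshold(scores, labels)
ratio = gain_ratio(scores, labels, threshold)
print("direction:", direction)
print("threshold on first component:", threshold)
print("gain ratio:", ratio)
```

In this toy case the two attributes collapse into a single component on which one threshold separates the classes. Whether a real dataset behaves similarly depends on how much class-relevant variance the retained components capture.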
