Bagging Based Ensemble Classification Method on Imbalance Datasets

In recent years, class imbalance has been a challenging problem in the data mining community. Class imbalance occurs when one class in the data has far more instances than the others. This condition makes classification suboptimal because the larger class has more influence on the classifier. Handling class imbalance is very important in some applications, for example detecting fraud in banking operations, network faults, cancer diagnosis, and prediction of technical failures. This study applies bagging-based ensemble methods to overcome the class imbalance problem on 14 datasets. The purpose of this research is to assess the ability of several bagging-based ensemble methods to overcome the class imbalance problem. The results show that the OverBagging method is more stable than the other bagging-based methods across the various datasets.
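As an illustration of the resampling idea behind OverBagging, the following is a minimal sketch and not the implementation used in this study. It assumes the usual description of OverBagging, in which every class in a bag is sampled with replacement up to the size of the largest class. The function name overbagging_bag and the toy labels are hypothetical.

```python
import random
from collections import Counter

def overbagging_bag(labels, rng):
    """Return the indices of one OverBagging-style bag.

    Assumption: every class is sampled with replacement until it
    reaches the size of the largest class, so minority classes are
    oversampled while the majority class is bootstrapped.
    """
    # Group the row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Each class is resampled to the size of the largest class.
    target = max(len(idx) for idx in by_class.values())
    bag = []
    for idx in by_class.values():
        bag.extend(rng.choices(idx, k=target))
    return bag

# Hypothetical toy data: 8 majority cases (0) and 2 minority cases (1).
labels = [0] * 8 + [1] * 2
rng = random.Random(0)
bag = overbagging_bag(labels, rng)
bag_counts = Counter(labels[i] for i in bag)
print(bag_counts)  # both classes appear 8 times in the bag
```

In a full ensemble, one base classifier such as a CART tree would be trained on each bag and the predictions combined by majority vote. Because every bag is as large as the number of classes times the majority class size, training is slower than with undersampling-based variants.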

Published In : IJCSN Journal Volume 6, Issue 6

Date of Publication : December 2017

Pages : 670-676

Figures : 04

Tables : 05

Mr. L. Hakim : Master's student in the Department of Statistics, Bogor Agricultural University. His main interests are data mining and bioinformatics.

Dr. B. Sartono : Currently works as a lecturer in the Department of Statistics, Bogor Agricultural University. His main interests are data mining and experimental design.

A. Saefuddin : Received the M.Sc. and Ph.D. degrees from the University of Guelph, Canada. He is a professor in the Department of Statistics, Bogor Agricultural University. He is also serving as the Rector of Al-Azhar University Indonesia in Jakarta. His expertise is in genetics and biostatistics.

Ensemble, Boosting, Bagging, Class Imbalance, Classification

Overall, bagging-based methods improve results for the minority class, as evidenced by their higher sensitivity values compared with the CART method, although the overall specificity of the CART method is superior to that of the bagging-based methods. This illustrates that the CART method is not able to predict the minority class well. The OverBagging method is stable across various datasets, in both extreme and non-extreme imbalance cases. However, the OverBagging method takes a long time to compute. Another stable method is Roughly Balanced Bagging, which as a whole predicts the minority class better than the other methods, except on extremely imbalanced data, where Bagging Ensemble Variation performs better than Roughly Balanced Bagging. However, Bagging Ensemble Variation is not capable of predicting trees with an equal number of opportunities.
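The trade-off between sensitivity and specificity described above can be illustrated with a small sketch. It assumes the minority class is treated as the positive class. The labels and the always-majority predictions are hypothetical toy values, not results from this study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Compute sensitivity (true positive rate) and specificity
    (true negative rate), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical toy data: 8 majority cases (0) and 2 minority cases (1).
y_true = [0] * 8 + [1] * 2
# A classifier that always predicts the majority class.
y_pred = [0] * 10

sens, spec = sensitivity_specificity(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(sens, spec, accuracy)  # 0.0 1.0 0.8
```

A classifier that ignores the minority class still reaches perfect specificity and 80% accuracy here, which is why sensitivity is the more informative measure when judging how well a method handles the minority class.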
