Empirical Study of Different Classifiers with Feature Extraction for E-Mail Spam Filtering

E-mail, or electronic mail, has long been a principal mode of communication for both professional and personal use. Over the last few years, however, the volume of e-mail spam has increased rapidly. Among the various approaches developed to combat spam, filtering is an important and popular one. This paper presents an empirical study on several e-mail datasets. In the first step, various classifiers (naive Bayes, SVM, k-NN and decision tree) were applied to the datasets and their performance was observed. In the second step, the important features were extracted from the datasets and the performance of the classifiers was observed again. The objective of this paper is to highlight the findings of this empirical study, which also help to determine a good classifier for spam filtering. The paper also describes feature extraction and the different classifiers.
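The two-step procedure described in the abstract (classify on all features, then classify again on a reduced feature set) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy messages, the document-frequency-difference score used to rank features, and all function names are hypothetical, and only a naive Bayes classifier is shown.

```python
import math
from collections import Counter

# Toy labelled messages (hypothetical data, not the paper's datasets).
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap money offer free", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
    ("lunch on monday with the team", "ham"),
]
TEST = [("claim your free money", "spam"), ("monday project meeting", "ham")]
LABELS = ("spam", "ham")


def tokenize(text):
    return text.lower().split()


def select_features(data, k):
    """Keep the k words whose document frequency differs most between classes.

    This scoring rule is an assumption chosen for brevity; the paper does not
    say which feature extraction method it uses.
    """
    df = {label: Counter() for label in LABELS}
    for text, label in data:
        df[label].update(set(tokenize(text)))
    vocab = set(df["spam"]) | set(df["ham"])

    def score(word):
        return abs(df["spam"][word] - df["ham"][word])

    # Sort by descending score, breaking ties alphabetically.
    return set(sorted(vocab, key=lambda w: (-score(w), w))[:k])


def train_nb(data, features=None):
    """Multinomial naive Bayes counts, optionally restricted to `features`."""
    counts = {label: Counter() for label in LABELS}
    docs = Counter()
    for text, label in data:
        docs[label] += 1
        for word in tokenize(text):
            if features is None or word in features:
                counts[label][word] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab


def predict(model, text):
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in LABELS:
        total = sum(counts[label].values())
        logp = math.log(docs[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words outside the vocabulary are ignored
                # Add-one (Laplace) smoothing.
                logp += math.log((counts[label][word] + 1) / (total + len(vocab)))
        if logp > best_score:
            best_label, best_score = label, logp
    return best_label


def accuracy(model, data):
    return sum(predict(model, text) == label for text, label in data) / len(data)


full_model = train_nb(TRAIN)                # step 1: all features
selected = select_features(TRAIN, k=5)      # step 2: reduced feature set
reduced_model = train_nb(TRAIN, selected)
print("all features:     ", accuracy(full_model, TEST))
print("selected features:", accuracy(reduced_model, TEST))
```

On a corpus this small both models behave alike; the point is only the shape of the comparison, in which the same classifier is evaluated once on the full vocabulary and once on the selected features.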

Published In : IJCSN Journal Volume 3, Issue 3

Date of Publication : 01 June 2014

Pages : 71 - 76

Figures : 02

Tables : 03

Publication Link : Empirical Study of Different Classifiers with Feature Extraction for E-Mail Spam Filtering

Himadri Sekhar Atta : is a final-year Master of Technology student in the Computer Science and Engineering department of the Institute of Engineering and Management, Kolkata, West Bengal. He received his Bachelor of Technology degree in Computer Science and Engineering in 2012 from Kanad Institute of Engineering and Management, Burdwan, West Bengal.

Classifiers

Feature Extraction

Filtering

Spam

Spam Filtering

The experimental results clearly show the effect of different classifiers on classifying a mail as spam or legitimate. Feature extraction helped to identify the attributes that matter most for classification, and it also improved the classifiers' performance in predicting the class. Building a spam filter with feature extraction is therefore the better approach. The results also indicate that the SVM and k-NN classifiers performed better than the other classifiers, so these classifiers are good candidates for implementing a spam filter. We could develop a spam filter that directly accesses an incoming mail online, removes unnecessary URLs (if present) or features, and classifies the mail as spam or legitimate.
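The filter envisaged in the last sentence might look something like the sketch below: URLs are stripped from an incoming message, and the remaining words are classified by k-NN. Everything specific here is an assumption made for illustration: the URL regular expression, cosine similarity over word counts, k = 3, the toy training messages, and the function names. None of it is taken from the paper.

```python
import math
import re
from collections import Counter

# Simple URL pattern (an assumption; real mail needs more careful parsing).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def preprocess(text):
    """Drop URLs, then lowercase and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", URL_RE.sub(" ", text).lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Majority vote among the k training messages most similar to `text`."""
    query = Counter(preprocess(text))
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled messages standing in for a real training set.
LABELLED = [
    ("Win a free prize now http://example.com/win", "spam"),
    ("Cheap loans, free money: www.example.org", "spam"),
    ("Claim your free offer today", "spam"),
    ("Agenda for the project meeting on Monday", "legitimate"),
    ("Please review the attached project report", "legitimate"),
    ("Lunch with the team on Monday", "legitimate"),
]
train = [(Counter(preprocess(text)), label) for text, label in LABELLED]

incoming = "Free money offer, visit http://example.net/deal now"
print(knn_predict(train, incoming))
```

A lazy learner such as k-NN keeps the training messages themselves rather than a fitted model, so new labelled mail can be appended to `train` without retraining, at the cost of comparing each incoming message against every stored one.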
