Feature reduction techniques for Arabic text categorization

2009, Journal of the American Society for Information Science and Technology

https://doi.org/10.1002/ASI.21173

Abstract

This paper presents and compares three feature reduction techniques applied to Arabic text: stemming, light stemming, and word clusters. Their effects were studied and analyzed using the K-nearest-neighbor classifier. Stemming reduces words to their stems. Light stemming, by comparison, removes common affixes from words without reducing them to their stems. Word clustering groups synonymous words into clusters, and each cluster is represented by a single word. The purpose of these methods is to reduce the size of document vectors without affecting the accuracy of the classifier. The comparison metrics are the size of document vectors, classification time, and accuracy (in terms of precision and recall). Several experiments were carried out using four representations of the same corpus: the first uses stem vectors, the second uses light-stem vectors, the third uses word clusters, and the fourth uses the original words (without any transformation) as document representatives. The corpus consists of 15,000 documents that fall into three categories: sports, economics, and politics. In terms of vector size and classification time, the stemmed vectors were the smallest and required the least time to classify a testing dataset of 6,000 documents. The light-stemmed vectors outperformed the other three representations in terms of classification accuracy.
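To make the idea concrete, the following is a minimal sketch of how light stemming and word clusters can shrink document vectors before K-nearest-neighbor classification. It is not the authors' implementation: the affix lists, the cluster mapping, the toy documents, and the names `light_stem`, `word_cluster`, `vectorize`, `cosine`, and `knn_predict` are all assumptions made for illustration, and cosine similarity with majority voting is one plausible choice for the classifier rather than the one confirmed by the paper.

```python
import math
from collections import Counter

# Hypothetical affix lists for illustration only; the paper's lists are not given here.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")

# Hypothetical cluster map: each word points to its cluster's representative.
CLUSTERS = {"منتخب": "فريق"}


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix, keeping min_len letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word


def word_cluster(word):
    """Replace a word by its cluster representative, if it belongs to a cluster."""
    return CLUSTERS.get(word, word)


def vectorize(tokens, reduce=lambda w: w):
    """Build a term-frequency vector after applying a feature reduction function."""
    return Counter(reduce(t) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query_vec, training, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(training, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus (assumed data): two sports documents and two economics documents.
train_docs = [
    (["الفريق", "اللاعب", "الرياضة"], "sports"),
    (["لاعبون", "فريق", "رياضة"], "sports"),
    (["السوق", "البنك", "الاقتصاد"], "economics"),
    (["بنك", "سوق", "اقتصاد"], "economics"),
]
query_doc = ["لاعبين", "والرياضة"]

# Vocabulary size with original words versus light stems.
raw_vocab = {t for tokens, _ in train_docs for t in tokens}
stem_vocab = {light_stem(t) for tokens, _ in train_docs for t in tokens}
print(len(raw_vocab), len(stem_vocab))

# With original words the query shares no feature with any training document.
raw_sims = [cosine(vectorize(query_doc), vectorize(tokens)) for tokens, _ in train_docs]
print(raw_sims)

# With light stems the inflected forms collapse onto shared features.
stem_train = [(vectorize(tokens, light_stem), label) for tokens, label in train_docs]
prediction = knn_predict(vectorize(query_doc, light_stem), stem_train, k=3)
print(prediction)
```

In this toy setting the reduction both shrinks the vocabulary and lets inflected forms of the same word match, which is the trade-off between vector size and accuracy that the experiments measure on the full corpus.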
