Text Classification for Intelligent
2004
Abstract
In stock portfolio management, software agents that evaluate the risks associated with the individual companies in a portfolio should be able to read electronic news articles written to give investors an indication of a company's financial outlook. News reports on a company's financial outlook are positively correlated with the company's attractiveness as an investment. However, the volume of such reports makes it impossible for financial analysts or investors to track and read each one. A system that automatically classifies news reports as reflecting positively or negatively on a company's financial outlook would therefore be very helpful. To accomplish this task, we treat the analysis of news articles as a text classification problem. We developed a text classification algorithm that classifies financial news articles using a combination of reduced but highly informative word feature sets and a variant of the weighted majority algorithm. By clustering words, represented in a latent semantic vector space by latent semantic analysis (LSA), into groups with similar concepts, we are able to find semantically coherent word groups. To handle the problem of expensive data labeling, we propose "Self-Confident" sampling, a method for learning with unlabeled data. Vote entropy is the information-theoretic criterion used to assign a label to an unlabeled document. Compared with naive Bayes classification boosted by Expectation Maximization (EM), the proposed method achieved better accuracy. Two criteria are used to evaluate the methods: how well they improve their performance with unlabeled data after being initially trained on a small number of human-labeled articles, and how well they classify the latest financial news articles, most of which were not seen during training.
The contribution of this work lies in the new classification method that we propose and in the sampling technique we used for improving classification accuracy.
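The role of vote entropy in labeling unlabeled documents can be illustrated with a minimal sketch. The abstract does not give the exact formula or decision rule, so the following are assumptions: the entropy is normalized by log min(k, |C|) as in committee-based sampling, a document is self-labeled with the committee's majority class only when the entropy falls at or below a threshold, and the names `vote_entropy`, `self_confident_label`, and `max_entropy` are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes, num_classes=2):
    """Normalized entropy of committee votes (assumed form, not from the paper).

    Returns 0.0 when the committee is unanimous and 1.0 when the votes
    are split evenly. Assumes num_classes >= 2.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    k = len(votes)
    counts = Counter(votes)
    if len(counts) == 1:
        return 0.0  # unanimous committee: no disagreement
    entropy = -sum((n / k) * math.log(n / k) for n in counts.values())
    return entropy / math.log(min(k, num_classes))


def self_confident_label(votes, num_classes=2, max_entropy=0.5):
    """Return the majority label if the committee agrees strongly enough.

    Hypothetical decision rule: accept the majority vote only when the
    vote entropy is at or below max_entropy; otherwise return None and
    leave the document unlabeled.
    """
    if vote_entropy(votes, num_classes) <= max_entropy:
        return Counter(votes).most_common(1)[0][0]
    return None


# Nine of ten committee members vote "positive": low entropy, label accepted.
confident = self_confident_label(["positive"] * 9 + ["negative"])
# Four of five vote "positive": entropy is above 0.5, document stays unlabeled.
uncertain = self_confident_label(["positive"] * 4 + ["negative"])
```

Under these assumptions, a lower `max_entropy` adds fewer but more reliable self-labeled documents to the training set, while a higher value adds more documents at the risk of propagating labeling errors.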