Machine learning in automated text categorization

Fabrizio Sebastiani

doi:10.1145/505282.505283

47 pages

Abstract

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
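
The three problems named in the abstract (document representation, classifier construction, and classifier evaluation) can be illustrated with a minimal sketch. It assumes a bag-of-words representation and a multinomial naive Bayes learner with add-one smoothing; both are choices made here for brevity and are not prescribed by the abstract. The category names, the toy documents, and every function name are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation (assumed): a lowercase bag of words."""
    return text.lower().split()


def train(labeled_docs, alpha=1.0):
    """Classifier construction: estimate multinomial naive Bayes parameters
    from preclassified (text, category) pairs, with add-alpha smoothing."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    total_docs = sum(doc_counts.values())
    model = {}
    for category in doc_counts:
        denom = sum(word_counts[category].values()) + alpha * len(vocab)
        model[category] = {
            "log_prior": math.log(doc_counts[category] / total_docs),
            "log_like": {
                w: math.log((word_counts[category][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def classify(model, text):
    """Return the category with the highest log-score.
    Words never seen in training are ignored."""
    tokens = tokenize(text)

    def score(category):
        params = model[category]
        return params["log_prior"] + sum(
            params["log_like"][w] for w in tokens if w in params["log_like"]
        )

    return max(model, key=score)


def accuracy(model, labeled_docs):
    """Classifier evaluation: fraction of held-out documents labeled correctly."""
    correct = sum(classify(model, text) == cat for text, cat in labeled_docs)
    return correct / len(labeled_docs)


# Hypothetical toy data, for illustration only.
TRAIN = [
    ("wheat prices rise on export demand", "grain"),
    ("corn and wheat harvest forecast", "grain"),
    ("central bank raises interest rates", "finance"),
    ("bank profits and interest income", "finance"),
]
TEST = [
    ("wheat export forecast", "grain"),
    ("interest rates and bank lending", "finance"),
]

model = train(TRAIN)
print(classify(model, "wheat export forecast"))
print(accuracy(model, TEST))
```

Accuracy is used here only because it is the simplest measure to compute on a toy test set; it is a stand-in for evaluation in general, not a claim about which measures the survey recommends.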

The amount of information available online is increasing exponentially. While this information is a valuable resource, its sheer volume limits its value. Many research projects and companies are exploring the use of personalized applications that manage this deluge by tailoring the information presented to individual users. These applications all need to gather, and exploit, some information about individuals in order to be effective. This area is broadly called user profiling. This chapter surveys some of the most popular techniques for collecting information about users and for representing and building user profiles. In particular, explicit information techniques are contrasted with implicitly collected user information using browser caches, proxy servers, browser agents, desktop agents, and search logs. We discuss in detail user profiles represented as weighted keywords, semantic networks, and weighted concepts. We review how each of these profiles is constructed and give examples of projects that employ each of these techniques. Finally, a brief discussion of the importance of privacy protection in profiling is presented.

This chapter discusses user profiles specifically designed for providing personalized information access. Other types of profiles, built using different construction techniques, are described elsewhere in this book. In particular, Chapter 4 [40] discusses generic user modeling systems that are broader in scope, not necessarily focused on Internet applications. Related research on collaborative recommender systems, discussed in Chapter 9 of this book [81], combines information from multiple users in order to provide improved information services. Concern over privacy protection is growing in parallel with the demand for personalized features. These two trends seem to be in direct opposition to each other, so privacy protection must be a crucial component of every personalization system.
A detailed discussion can be found in Chapter 21 of this book [39]. There is a wide variety of applications to which personalization can be applied and a wide variety of devices on which to deliver the personalized information. Early personalization research focused on personalized filtering and/or rating systems for e-mail [49], electronic newspapers [14, 16], Usenet newsgroups [41, 58, 86, 91, 106], and Web documents [4]. More recently, personalization efforts have focused on improving navigation effectiveness by providing browsing assistants [9, 13] and adaptive Web sites [69]. Because search is one of the most common activities performed today, many projects are now focusing on personalized Web search [46, 88, 92]; more details on the subject can be found in Chapter 6 of this book [52]. However, personalized approaches to searching other types of collections, e.g., short stories [76], Java source code [100], and images [14], have also been explored. Commercial products are also adopting personalized features, for example, Yahoo!'s personalized Web portals [110] and Google Labs' personalized search [30]. The aforementioned systems are just a few examples that illustrate the breadth of applications for which personalized approaches are being investigated. Nichols [63] and Oard and Marchionini [64] provide a general overview of some of the issues and approaches to personalized rating and filtering, and Pretschner [71] describes approximately 45 personalization systems. Most personalization systems are based on some type of user profile, a data instance of a user model that is applied to adaptive interactive systems. User profiles may include demographic information, e.g., name, age, country, education level, etc., and may also represent the interests or preferences of either a group of users or a single person.
Personalization of Web portals, for example, may focus on individual users, e.g., displaying news about specifically chosen topics or the market summary of specifically selected stocks, or on a group of users for whom distinctive characteristics were identified, e.g., displaying targeted advertising on e-commerce sites. In order to construct an individual user's profile, information may be collected explicitly, through direct user intervention, or implicitly, through agents that monitor user activity. Although profiles are typically built only from topics of interest to the user, some projects have explored including information about non-relevant topics in the profile [35, 104]. In these approaches, the system is able to use both kinds of topics to identify relevant documents and discard non-relevant documents at the same time. Profiles that can be modified or augmented are considered dynamic, in contrast to static profiles that maintain the same information over time. Dynamic profiles that
[Figure 2.1: diagram of the user profiling process. Labels: User; Explicit info; Implicit info; Data Collection; Profile Constructor; Keyword profile; Semantic Net profile; Concept profile; Technology or Application; Personalized Services]
As shown in Figure 2.1, the user profiling process generally consists of three main phases. First, an information collection process is used to gather raw information about the user. As described in Section 2.2, depending on the information collection process selected, different types of user data can be extracted. The second phase focuses on user profile construction from the user data. Section 2.3 summarizes a variety of ways in which profiles may be represented, and Section 2.4 describes some of the ways a profile may be constructed. The final phase, in which a technology or application exploits information in the user profile in order to provide personalized services, is discussed in Parts II and III of this book.
2.2 Collecting Information About Users
The first phase of a profiling technique collects information about individual users. A basic requirement of such a system is that it must be able to uniquely identify users. This task is described in more detail in Section 2.2.1. The information collected may be explicitly input by the user or implicitly gathered by a software agent. It may be collected on the user's client machine or gathered by the application server itself. Depending on how the information is collected, different data about the users may be extracted. Several options, and their impacts, are discussed in Section 2.2.2. In

Background: Patient healthcare trajectory is a recently emerging topic in the literature, encompassing broad concepts. However, the rationale for studying patients' trajectories and the definition of the trajectory concept remain a public health challenge. Our research focused on patients' trajectories based on disease management and care, while also considering medico-economic aspects of the associated management. We illustrated this concept with an example: a myocardial infarction (MI) occurring in a patient's hospital trajectory of care. The patient follow-up was traced via the prospective payment system. We applied a semi-automatic text mining process to conduct a comprehensive review of patient healthcare trajectory studies. This review investigated how the concept of trajectory is defined and studied, and what it achieves. Methods: We performed a PubMed search to identify reports that had been published in peer-reviewed journals between January 1, 2000 and October 31, 2015. Fourteen search questions were formulated to guide our review. A semi-automatic text mining process based on a semantic approach was performed to conduct a comprehensive review of patient healthcare trajectory studies. Text mining techniques were used to explore the corpus from a semantic perspective in order to answer non-a priori questions. Complementary review methods on a selected subset were used to answer a priori questions. Results: Among the 33,514 publications initially selected for analysis, only 70 relevant articles were semi-automatically extracted and thoroughly analysed. Oncology is particularly prevalent due to its already well-established processes of care. For the trajectory theme, 80% of articles were distributed in 11 clusters. These clusters contain distinct semantic information, for example health outcomes (29%), care process (26%) and administrative and financial aspects (16%). Conclusion: This literature review highlights the recent interest in the trajectory concept.
The approach is also gradually being used to monitor trajectories of care for chronic diseases such as diabetes, organ failure or coronary artery and MI trajectory of care, to improve care and reduce costs. Patient trajectory is undoubtedly an essential approach to be further explored in order to improve healthcare monitoring.

Health care professionals produce abundant information in their daily clinical practice, and this information is stored in many diverse sources, generally in textual form. The extraction of insights from all the gathered information, mainly unstructured and lacking normalization, is one of the major challenges in computational medicine. In this respect, text mining (TM) assembles different techniques to derive valuable insights from unstructured textual data, and it has therefore become especially relevant in medicine. The aim of this paper is therefore to provide an extensive review of existing techniques and resources to perform TM tasks in medicine. In this review, more than 90 relevant research studies have been analyzed, describing the most important practical applications, terminological resources, tools, and open challenges of TM in medicine. This article is categorized under: Application Areas > Health Care; Algorithmic Development > Biological Data Mining; Algorithmic Development > Hierarchies and Trees; Algorithmic Development > Model Combining
KEYWORDS: medicine, text mining, text mining tools
1 | INTRODUCTION
Over the last decades, the quantity of information produced daily in medicine has grown considerably, with a special emphasis on that generated by health care professionals in their general daily practices (Feldman, Hazekamp, & Chawla, 2016). The health of patients is regularly described by thousands of doctors; the results come in the form of textual information that is stored in files of different formats such as clinical records, discharge summaries, clinical monitoring sheets, or radiological reports. As a consequence of these unstructured textual data sources, the extraction of useful knowledge for decision-making and the reusability of such information are hampered.
Currently, the main problem faced by any health care professional is not simply obtaining any available clinical information from databases but obtaining a promising subset that includes the most relevant and useful information. The final aim is therefore to transform this information into knowledge that professionals in the field can leverage in their daily practice. Nevertheless, this is not a trivial task, since clinical information differs markedly from other kinds of information and usually has some special features: high ambiguity and complex vocabulary; absence of terminological standardization; short sentences that may contain grammatical errors; overuse of acronyms; a mix of structured and unstructured data; and a narrative writing style. The discovery of hidden knowledge in this amount of unstructured information is essential to support the decision-making process that professionals carry out every day. In this regard, the term text mining (TM) gathers the most useful techniques to derive high-quality structured information from unstructured textual data (Feldman & Sanger,

The development of COVID-19 vaccines has been a great relief in many countries that have been affected by the pandemic. As a result, many governments have made significant efforts to purchase and administer vaccines to their populations. However, administering such vaccines is typically confronted with people's reluctance and fear. Like any other important event, COVID-19 vaccines have attracted people's discussions on social media and impacted their opinions about vaccination. Objective: The goal of this study is twofold. First, it conducts a sentiment analysis around COVID-19 vaccines by automatically analyzing Arabic users' tweets. This analysis has been spread over time to better capture the changes in vaccine perceptions. This will provide us with some insights into the most popular and accepted vaccine(s) in the Arab countries, as well as the reasons behind people's reluctance to take the vaccine. Second, it develops models to detect any vaccine-related tweets, to help with gathering all information related to people's perception of the virus, and potentially detecting vaccine-related tweets that are not necessarily tagged with the virus's main hashtags. Methods: Arabic tweets were collected by the authors from January 1st, 2021, until April 20th, 2021. We deployed various Natural Language Processing (NLP) techniques to distill our selected tweets. The curated dataset included in the analysis consisted of 1,098,376 unique tweets. To achieve the first goal, we designed state-of-the-art sentiment analysis techniques to extract knowledge related to the degree of acceptance of all existing vaccines and the main obstacles preventing the wider audience from accepting them. To achieve the second goal, we tackle the detection of vaccine-related tweets as a binary classification problem, where various Machine Learning (ML) models were designed to identify such tweets regardless of whether they use the vaccine hashtags or not.
Results: Generally, we found that the highest positive sentiments were registered for Pfizer-BioNTech, followed by Sinopharm-BIBP and Oxford-AstraZeneca. In addition, we found that 38% of the overall tweets showed negative sentiment, and only 12% had a positive sentiment. It is important to note that the majority of the sentiments vary between neutral and negative, showing a lack of conviction about the importance of vaccination among the large majority of tweeters. This paper extracts the top concerns raised by the tweets and advocates taking them into account when promoting vaccination. Regarding the identification of vaccine-related tweets, the Logistic Regression model scored the highest accuracy of 0.82. We conclude with implications that public health authorities and the scholarly community should take into account to improve vaccine acceptance.
