Self Organization of a Massive Document Collection

2002, Neural Networks, …

https://doi.org/10.1109/72.846729

Abstract

This paper explores the self-organization of massive document collections, emphasizing the limitations of traditional keyword searches and the potential of arranging documents visually in a two-dimensional space. It discusses the use of self-organizing maps (SOM) for data retrieval, especially in exploratory data analysis, and points to their computational advantages over classical methods such as multidimensional scaling. Experimental results comparing several projection methods illustrate how well the SOM supports document classification and retrieval.
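To make the idea of a self-organizing map concrete, here is a minimal sketch of the online SOM update rule on toy two-dimensional data. It is an illustration only, not the large-scale procedure used in the paper: the function names `train_som` and `best_matching_unit`, the 4 x 4 grid, the decay schedules, and the synthetic clusters are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((wi - xi) ** 2 for wi, xi in zip(weights[u], x)),
    )


def train_som(data, rows, cols, steps=2000, seed=0):
    """Train a small rectangular SOM with the online (stepwise) update rule."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise every unit with a copy of a randomly chosen sample.
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    for t in range(steps):
        frac = t / steps
        lr = 0.5 * (1.0 - frac) + 0.01      # learning rate decays over time
        sigma = 2.0 * (1.0 - frac) + 0.5    # neighbourhood radius shrinks
        x = rng.choice(data)
        br, bc = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            # Gaussian neighbourhood measured on the grid, not in data space.
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
            for i in range(dim):
                w[i] += lr * h * (x[i] - w[i])
    return weights


# Two well-separated synthetic clusters stand in for two groups of documents.
rng = random.Random(1)
cluster_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
cluster_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
som = train_som(cluster_a + cluster_b, rows=4, cols=4)

unit_a = best_matching_unit(som, (0.0, 0.0))
unit_b = best_matching_unit(som, (10.0, 10.0))
print("map position for cluster A centre:", unit_a)
print("map position for cluster B centre:", unit_b)
```

After training, samples from different clusters land on different map units, which is the property a document map relies on when it places similar documents near each other.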

Key takeaways

  1. The WEBSOM methodology organizes 7 million patent abstracts into a 2D similarity map for enhanced data exploration.
  2. Using self-organizing maps (SOM) reduces computational cost significantly compared to classical projection methods such as multidimensional scaling.
  3. Dimensionality reduction techniques, including random projections, maintain classification accuracy while speeding up processing.
  4. Document maps enhance user interaction by visually displaying related documents based on content similarity.
  5. Exploratory data analysis techniques enable users to discover relevant information beyond initial search queries.
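The claim that random projections preserve enough structure for classification can be illustrated with a short sketch. The code below projects synthetic high-dimensional vectors through a Gaussian random matrix and compares pairwise cosine similarities before and after. The Gaussian matrix, the dimensions (1000 reduced to 200), the uniform random stand-in data, and the helper names `random_projection` and `cosine` are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def random_projection(vectors, target_dim, seed=0):
    """Project each vector to target_dim dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    source_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(target_dim)
    # One random row per input dimension; entries are N(0, 1) / sqrt(target_dim).
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
        for _ in range(source_dim)
    ]
    projected = []
    for vec in vectors:
        out = [0.0] * target_dim
        for value, row in zip(vec, matrix):
            if value:  # zero entries contribute nothing, so skip them
                for j in range(target_dim):
                    out[j] += value * row[j]
        projected.append(out)
    return projected


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Synthetic stand-ins for document vectors (hypothetical data).
rng = random.Random(1)
docs = [[rng.random() for _ in range(1000)] for _ in range(6)]
reduced = random_projection(docs, target_dim=200)

# Largest change in pairwise cosine similarity caused by the projection.
max_err = max(
    abs(cosine(docs[i], docs[j]) - cosine(reduced[i], reduced[j]))
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
)
print("largest change in cosine similarity:", round(max_err, 3))
```

Because the projection matrix is random rather than learned, it costs almost nothing to construct, which is why this kind of reduction is attractive for very large collections.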
