Automatic clustering for the web usage mining
2003
Abstract
In this paper we present an approach based on two hybrid clustering methods for Web Usage Mining (WUM). The WUM process comprises three steps: pre-processing, data mining and result analysis. We first give a brief description of the WUM process and of Web data, then present in Section 2 the pre-processing step and the data warehouse that we employed. Two hybrid clustering methods, based on Principal Component Analysis (PCA), Multiple Correspondence Analysis (MCA) and Dynamic Clustering, are used to analyse the Web logs taken from INRIA's Web servers. The results obtained by applying these methods, together with their interpretation, are presented in Section 4. Finally, we outline some perspectives and future work.
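To make the idea of a hybrid method concrete, the following is a minimal sketch of a factorial step followed by a partitioning step. It is not the authors' implementation. It assumes that plain k-means may stand in for Dynamic Clustering (of which k-means is a special case), that the PCA is run on standardised variables, and that each navigation is summarised by three numeric features. The function names, the feature choice and the synthetic data are all hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Standardise the columns of X and project onto the top principal axes."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardised data: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:n_components].T

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain k-means, used here as a stand-in for Dynamic Clustering."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Allocation step: assign each point to its nearest centre.
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Representation step: recompute each centre (keep it if the cluster is empty).
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def hybrid_cluster(X, n_components, k, seed=0):
    """Hybrid method: factorial reduction first, then partitioning."""
    Z = pca_reduce(X, n_components)
    return kmeans(Z, k, seed=seed)

# Hypothetical navigation features: (pages viewed, duration in s, distinct topics).
rng = np.random.default_rng(42)
short_visits = rng.normal(loc=[2, 30, 1], scale=0.5, size=(20, 3))
long_visits = rng.normal(loc=[25, 600, 8], scale=0.5, size=(20, 3))
X = np.vstack([short_visits, long_visits])

labels, centers = hybrid_cluster(X, n_components=2, k=2)
print(labels)
```

Clustering on the factorial coordinates rather than on the raw variables is the point of the hybrid design: the reduction removes redundancy between correlated variables before distances are computed. For categorical navigation descriptors, MCA would take the place of the PCA step.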
AxIS Team-Project, INRIA Sophia Antipolis, 2004, Route des Lucioles, BP 93, 06902 Sophia Antipolis, FRANCE. E-mail: Brigitte.Trousse@sophia.inria.fr