Concept Mining: A Conceptual Understanding based Approach
2009
Acknowledgements
This thesis would not have been possible without the support of many individuals, to whom I would like to express my gratitude. I will always be indebted to my supervisors, Prof. Fakhri Karray and Prof. Mohamed Kamel, for their support, encouragement, guidance, and, most importantly, trust. Prof. Karray's trust and support were instrumental in giving me the confidence to achieve many accomplishments. I would like to thank Prof. Karray for his encouragement and guidance throughout my research. Prof. Kamel's input and guidance were invaluable to the quality and contribution of the work presented in this thesis, as well as in other publications. I would like to thank Prof. Kamel for his advice, valuable insights, and feedback throughout my research. Without them this research would not have been possible. I would also like to thank many faculty members of the University of Waterloo, most notably my committee members, Prof. Chrysanne DiMarco, Prof. Krzysztof Czarnecki, and Prof. Kostas Kontogiannis, for their valuable input and suggestions.
- A Vector Space with Two Dimensions
- 1 Concept-based Model
- 2 Conceptual Ontological Graph (COG) Representation
- 1 Concept-based Model For Text Clustering
- 2 Clustering Improvements (F-Measure)
- 3 Clustering Improvements (Entropy)
- 4 Clustering Range of Improvements (F-Measure)
- 5 Clustering Range of Improvements (Entropy)
- 6 Concept-based Model For Text Categorization
- 7 Categorization Improvements (F-Micro)
- 8 Categorization Improvements (F-Macro)
- 9 Categorization Improvements (Error)
- 10 Categorization Range of Improvements (F-Micro)
- 11 Categorization Range of Improvements (F-Macro)
- 12 Categorization Range of Improvements (Error)
- 13 Concept-based Model For Text Retrieval
- 14 Retrieval Improvements (bpref)
- 15 Retrieval Improvements (P10)
- 16 Retrieval Improvements (MAP)
- 17 Retrieval Range of Improvements (bpref)
- 18 Retrieval Range of Improvements (P10)
- 19 Retrieval Range of Improvements (MAP)
- 20 Standard Deviation (F-Measure)
- 21 Standard Deviation (Entropy)
- 22 Standard Deviation (F-Macro)
- 23 Standard Deviation (F-Micro)
- 24 Standard Deviation (Error)
- 25 Standard Deviation (bpref)
- 26 Standard Deviation (P10)
- 27 Standard Deviation (MAP)
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Concept-based Mining Model", Dynamic and Advanced Data Mining for Progressing Technological Development: Innovations and Systemic Approaches, by IGI Publishing (formerly called "Idea Group Publishing"), IRM Press, Information Science Publishing, CyberTech Publishing, and Information Science Reference. 2007.
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Concept-based Text Clustering", Evolving Application Domains of Data Warehousing and Mining: Trends and Solutions, by IGI Publishing (formerly called "Idea Group Publishing"), IRM Press, Information Science Publishing, CyberTech Publishing, and Information Science Reference. 2007.
Journal Articles - Submitted
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "An Efficient Concept-based Mining Model for Enhancing Text Clustering", IEEE Transactions on Knowledge and Data Engineering (TKDE).
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "An Efficient Model For Enhancing Text Classification Using Sentence Semantics", Special Issue of Computational Intelligence Journal. (ADMA08 paper has been selected)
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "An Efficient Concept-based Method for Semantic Text Analysis", Computational Linguistics Journal.
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "An Efficient Concept-based Retrieval Model For Enhancing Search Engine Quality", Knowledge and Information Systems Journal (KAIS).
Journal Articles - In Preparation
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Enhancing Text Categorization Using Concept-based Model", ACM Transactions on Knowledge Discovery From Data (TKDD).
Conference Proceedings
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "A Concept-based Model for Enhancing Text Categorization", 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), USA 2007, pp. 629-637. [Full paper in research track with acceptance rate: less than 8%]
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Enhancing Text Clustering Using Concept-based Mining Model", IEEE International Conference on Data Mining (ICDM), Hong Kong 2006, pp. 1043-1048. [Full paper in research track with acceptance rate: less than 10%]
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Enhancing Search Engine Quality Using Concept-based Text Retrieval", IEEE/WIC/ACM International Conference on Web Intelligence (WI), USA 2007, pp. 26-32. [Acceptance rate: 17%]
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Enhancing Text Classification Using Sentence Semantics", Advanced Data Mining and Applications (ADMA), 2008, pp. 87-98. [Acceptance rate: 26%] Best Paper Award Nomination
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Enhancing Text Retrieval Performance Using Conceptual Ontological Graph", Workshop on Ontology Mining and Knowledge Discovery from Semi-Structured Documents (MSD), IEEE International Conference on Data Mining, ICDM, Hong Kong 2006, pp. 39-44. [Acceptance rate: 30%]
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "A Concept-based Graph Representation for Enhancing Text Categorization", Text Mining Workshop (TMW09), SIAM International Conference on Data Mining (SDM), 2009.
- Shady H. Shehata, Fakhri Karray, Mohamed Kamel, "Concept-based Mining Model for Learning Objects", LORNET Scientific Conference on Learning Systems of the Future: Integrating Knowledge and Services (I2LOR06), Montreal, Canada 2006.
- Christopher Brooks, Scott Bateman, Wengang Liu, Gordon McCalla, Jim Greer, Dragan Gašević, Timmy Eap, Griff Richards, Khaled Hammouda, Shady H. Shehata, Mohamed Kamel, Fakhri Karray, Jelena Jovanovic, "Issues and Directions with Educational Metadata", LORNET Scientific Conference on Learning Systems of the Future: Integrating Knowledge and Services (I2LOR06), Montreal, Canada 2006.
- Shady H. Shehata, Fakhri Karray, Mohamed Kamel, "Concept Mining using Conceptual Ontological Graph (COG)", LORNET Scientific Conference on Portals and Services for Knowledge Management and Learning on the Semantic Web (I2LOR05), Montreal, Canada 2005.
- Shady H. Shehata, Fakhri Karray, Mohamed Kamel, Anas Vaqar and Hazem Shehata. "A Framework for Ontology Construction from Text Documents", International Conference of e-Learning Applications, 2005.
- Shady H. Shehata, Jan Bakus, Fakhri Karray and Mohamed Kamel, "The Effect of Verb Argument Structure on Document Classification", International Conference in Machine Intelligence (ACIDCA-ICMI), 2005.
- Yu Sun, Fakhri Karray, Shady H. Shehata, Otman Basir, Mohamed Kamel and Jiping Sun, "Measures of Fuzzy Event for Determination of Semantic Meaning", International Conference in Machine Intelligence (ACIDCA-ICMI), 2005.
- Shady H. Shehata, Fakhri Karray, Mohamed Kamel, "Multi-Agent Framework for Enhancing Ontology based on Fuzzy Inferencing", LORNET Scientific Conference on Towards the Educational Semantic Web (I2LOR-04), Montreal, Canada, November 18th and 19th, 2004.
Posters
- Shady Shehata, Fakhri Karray, Mohamed Kamel, "Enhancing Text Categorization, Retrieval and Clustering Using Concept-based Model", 4th annual LORNET Scientific Conference I2LOR-07 in Montreal (Nov 4-7), 2007. First Position Award For Best Poster