Web-based affiliation matching

2009

Abstract

Authors of scholarly publications state their affiliations in many different forms. This heterogeneity makes bibliographic analysis of institutions impossible unless the affiliation data is comprehensively cleaned and consolidated. We investigate automatic approaches to consolidating affiliation data in order to reduce manual work and make affiliation analysis scalable. In particular, we propose setting up a reference database of affiliation strings found in publications. A key step in this task is comparing different affiliation strings to determine whether or not they match. For affiliation matching we investigate web-based similarity measures that utilize the cognitive power of current search engines. These measures determine the similarity of two affiliations based on how much the URLs in the result sets of their web searches overlap. We evaluate the effectiveness of affiliation matching based on URL overlap, both on its own and in combination with the Soft TF-IDF similarity measure.
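The URL-overlap idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how overlap is computed, so the Jaccard coefficient, the reduction of each URL to its host name, and the function names `normalize_url` and `url_overlap` are all assumptions made for illustration. The result lists are hard-coded, hypothetical examples rather than real search engine output.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to its lowercase host without a leading 'www.'.

    Assumed normalization; URLs without a scheme yield an empty host.
    """
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def url_overlap(results_a: list[str], results_b: list[str]) -> float:
    """Jaccard overlap of the normalized hosts in two search result lists."""
    hosts_a = {normalize_url(u) for u in results_a}
    hosts_b = {normalize_url(u) for u in results_b}
    if not hosts_a and not hosts_b:
        return 0.0  # no evidence either way
    return len(hosts_a & hosts_b) / len(hosts_a | hosts_b)


# Hypothetical result URLs for three affiliation strings.
results_mit = [
    "http://web.mit.edu/",
    "http://www.mit.edu/education",
    "https://en.wikipedia.org/wiki/MIT",
]
results_mit_long = [
    "http://www.mit.edu/",
    "https://en.wikipedia.org/wiki/Massachusetts_Institute_of_Technology",
    "http://web.mit.edu/admissions",
]
results_stanford = [
    "https://www.stanford.edu/",
    "https://en.wikipedia.org/wiki/Stanford_University",
]

same = url_overlap(results_mit, results_mit_long)
different = url_overlap(results_mit, results_stanford)
print(f"same institution: {same:.2f}, different institutions: {different:.2f}")
```

In this toy data the two spellings of the same institution share all hosts, while the unrelated pair still overlaps on a general-purpose site. That residual overlap suggests why a threshold, or a combination with a string measure such as Soft TF-IDF, may be needed in practice.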
