Web-based affiliation matching
2009
Abstract
Authors of scholarly publications state their affiliations in many different forms. This heterogeneity makes bibliographic analysis of institutions impossible unless the affiliation data is comprehensively cleaned and consolidated. We investigate automatic approaches to consolidating affiliation data in order to reduce manual work and make affiliation analysis scalable. In particular, we propose to set up a reference database of affiliation strings found in publications. A key step in this task is comparing different affiliation strings to determine whether or not they refer to the same institution. For affiliation matching we investigate web-based similarity measures that exploit the capabilities of current search engines. These measures determine the similarity of two affiliations from the degree to which the URLs in the result sets of their respective web searches overlap. We evaluate the effectiveness of affiliation matching based on URL overlap, both on its own and in combination with the Soft TF-IDF similarity measure.
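As an illustration of the URL-overlap idea, the following minimal sketch scores two affiliation strings by the overlap of the URLs returned for each. The abstract does not state the exact overlap formula, so the Jaccard coefficient, the URL normalization, the function names `normalize`, `url_overlap` and `is_match`, and the threshold value are all assumptions made for this example. The result lists are taken as plain input; obtaining them from a search engine is left out.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to lowercase host plus path (assumed normalization).

    Drops the scheme, a leading "www.", the query string, the fragment
    and any trailing slash, so trivially different URLs compare equal.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def url_overlap(urls_a, urls_b) -> float:
    """Jaccard overlap of two search result URL lists, in [0, 1].

    Jaccard is an assumption here; other overlap measures are possible.
    """
    set_a = {normalize(u) for u in urls_a}
    set_b = {normalize(u) for u in urls_b}
    if not set_a or not set_b:
        return 0.0  # no evidence for a match if either search returned nothing
    return len(set_a & set_b) / len(set_a | set_b)


def is_match(urls_a, urls_b, threshold: float = 0.3) -> bool:
    """Decide a match by thresholding the overlap (threshold is hypothetical)."""
    return url_overlap(urls_a, urls_b) >= threshold
```

In use, `urls_a` and `urls_b` would hold the top result URLs of two web searches, one per affiliation string. A combined matcher could additionally weigh this score against a string similarity such as Soft TF-IDF; how the two are combined is not specified in the abstract.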