Bayesian Non-Exhaustive Classification A Case Study
2016, Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
https://doi.org/10.1145/2983323.2983714
Abstract
The named entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains the records of exactly one person. Most existing solutions operate in batch mode, where all records to be disambiguated are available to the algorithm from the start. More realistic settings, however, require that name disambiguation be performed online and that the method be able to identify records of new ambiguous entities that have no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for the online name disambiguation task. The proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse Wishart data model, which enables identification of new ambiguous entities that have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is both efficient and effective. As a case study, we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method outperforms existing methods for online name disambiguation.
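To make the idea of online classification under a Dirichlet process prior concrete, the following is a minimal sketch, not the authors' implementation. It assumes a much simpler data model than the Normal × Normal × Inverse Wishart model named above: each class is an isotropic Gaussian with known variance `sigma2`, and class means have a zero-mean Gaussian prior with variance `tau2`. The function names, the dictionary layout of a cluster, and all parameter values are hypothetical. Each incoming record is assigned once, either to an existing class (with prior weight proportional to its size) or to a new class (with prior weight proportional to `alpha`).

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)


def assignment_probs(x, clusters, alpha=1.0, sigma2=1.0, tau2=10.0):
    """Probabilities of x joining each existing cluster or a new one (last entry).

    Assumption: known within-class variance sigma2 and a N(0, tau2 * I) prior
    on class means, a simplification of the model described in the abstract.
    """
    d = len(x)
    log_w = []
    for c in clusters:
        n, s = c["n"], c["sum"]
        shrink = tau2 / (n * tau2 + sigma2)
        post_mean = [shrink * si for si in s]   # posterior mean of the class mean
        pred_var = sigma2 * shrink + sigma2     # posterior variance + noise variance
        # Dirichlet process prior weight n, times the posterior predictive density.
        log_w.append(math.log(n) + log_normal_pdf(x, post_mean, pred_var))
    # New class: prior weight alpha, times the prior predictive density.
    log_w.append(math.log(alpha) + log_normal_pdf(x, [0.0] * d, tau2 + sigma2))
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [wi / z for wi in w]


def one_sweep(stream, clusters, rng, **kwargs):
    """Assign each record exactly once, in arrival order, updating clusters in place."""
    labels = []
    for x in stream:
        p = assignment_probs(x, clusters, **kwargs)
        k = rng.choices(range(len(p)), weights=p)[0]
        if k == len(clusters):                  # the record opens a new class
            clusters.append({"n": 0, "sum": [0.0] * len(x)})
        clusters[k]["n"] += 1
        clusters[k]["sum"] = [s + xi for s, xi in zip(clusters[k]["sum"], x)]
        labels.append(k)
    return labels


if __name__ == "__main__":
    known = [{"n": 5, "sum": [25.0, 25.0]}, {"n": 5, "sum": [-25.0, -25.0]}]
    records = [[5.2, 4.9], [20.0, -20.0], [-4.8, -5.1]]
    print(one_sweep(records, known, random.Random(0)))
```

In this sketch a record far from every known class receives almost all of its probability mass on the final "new class" entry, which is the behavior that lets a non-exhaustive classifier discover entities absent from the training data.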