Biographical Semi-Supervised Relation Extraction Dataset
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
https://doi.org/10.1145/3477495.3531742
Abstract
Extracting biographical information from online documents is a popular research topic in the information extraction (IE) community. Various natural language processing (NLP) techniques, such as text classification, text summarisation and relation extraction (RE), are commonly used to achieve this. Among these techniques, RE is the most common, since it can be used directly to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, in which ML models are trained on annotated datasets. However, few annotated datasets exist for RE, because the annotation process is costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed at digital humanities (DH) and historical research, is compiled automatically by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs and evaluating it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but, as we discuss at the end of this paper, it can be useful for other purposes as well.
CCS CONCEPTS
• Computing methodologies → Information extraction; Language resources.
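The alignment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes plain case-insensitive substring matching in place of NER, and the function name `align_sentences`, the relation labels and the example facts are all hypothetical.

```python
from typing import Dict, List, Tuple


def align_sentences(
    sentences: List[str], subject: str, facts: Dict[str, str]
) -> List[Tuple[str, str, str, str]]:
    """Label a sentence with a relation when it mentions both the subject
    and the value of a structured fact about that subject.

    Assumption: substring matching stands in for the NER-based matching
    mentioned in the abstract.
    """
    pairs = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Skip sentences that never mention the article's subject.
        if subject.lower() not in lowered:
            continue
        for relation, value in facts.items():
            if value.lower() in lowered:
                pairs.append((sentence, subject, value, relation))
    return pairs


# Hypothetical article sentences and structured facts (relation -> value).
sentences = [
    "Ada Lovelace was born in London.",
    "Lovelace worked with Charles Babbage.",
    "Ada Lovelace died in 1852.",
]
facts = {"birthplace": "London", "deathdate": "1852"}

pairs = align_sentences(sentences, "Ada Lovelace", facts)
for sentence, head, tail, relation in pairs:
    print(f"{relation}: ({head}, {tail}) <- {sentence}")
```

In this sketch the second sentence is skipped because it does not contain the full subject name, which shows why robust entity recognition matters for recall in a setup of this kind.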