DNorm: disease name normalization with pairwise learning to rank

2013, Bioinformatics

https://doi.org/10.1093/BIOINFORMATICS/BTT474

Abstract

Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text, the task of disease name normalization (DNorm), than for other normalization tasks in biomedical text mining research.

Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.

Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.
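The pairwise learning-to-rank idea named in the abstract can be illustrated with a minimal sketch. It assumes a bilinear score, score(m, n) = m^T W n, between a bag-of-words mention vector m and a name vector n, and a margin ranking update applied whenever the correct name fails to outscore an incorrect name by at least 1. The toy names and mentions, the binary vectors, the learning rate and the function names (`build_vocab`, `vectorize`, `score`, `train`) are all assumptions made for illustration; they are not taken from MEDIC, the NCBI disease corpus or the DNorm implementation.

```python
import numpy as np

# Toy data invented for illustration (not from MEDIC or the NCBI disease corpus).
NAMES = ["lung neoplasms", "lung diseases", "colon neoplasms",
         "colon diseases", "breast neoplasms", "breast diseases"]
# Each training pair is (mention text, index of the correct name in NAMES).
TRAIN = [("lung cancer", 0), ("colon cancer", 2)]


def build_vocab(texts):
    """Map every whitespace-separated token to a vector index."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Binary bag-of-words vector (a simplification; real systems often use TF-IDF)."""
    vec = np.zeros(len(vocab))
    for tok in text.split():
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec


def score(W, m, n):
    """Bilinear similarity between a mention vector and a name vector."""
    return float(m @ W @ n)


def train(names, pairs, vocab, lr=0.25, epochs=20):
    """Pairwise margin ranking: push the correct name above each incorrect one."""
    name_vecs = [vectorize(name, vocab) for name in names]
    W = np.eye(len(vocab))  # start from plain token overlap
    for _ in range(epochs):
        for mention, pos in pairs:
            m = vectorize(mention, vocab)
            for neg in range(len(names)):
                if neg == pos:
                    continue
                margin = score(W, m, name_vecs[pos]) - score(W, m, name_vecs[neg])
                if margin < 1.0:  # ranking constraint violated: take a gradient step
                    W += lr * np.outer(m, name_vecs[pos] - name_vecs[neg])
    return W


vocab = build_vocab(NAMES + [mention for mention, _ in TRAIN])
W = train(NAMES, TRAIN, vocab)
query = vectorize("breast cancer", vocab)  # mention never seen in training
for name in ("breast neoplasms", "breast diseases"):
    print(name, score(W, query, vectorize(name, vocab)))
```

In this toy run W starts as the identity, so "breast cancer" initially scores "breast neoplasms" and "breast diseases" equally, because each shares only the token "breast" with the mention. After training on the lung and colon examples, the learned weight linking "cancer" to "neoplasms" lets the unseen mention prefer "breast neoplasms", which is the kind of similarity that pure lexical matching cannot capture.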

Key takeaways

  1. DNorm achieves 0.782 micro-averaged and 0.809 macro-averaged F-measure, improving on the highest-performing baseline by 0.121 and 0.098, respectively.
  2. DNorm is the first machine learning approach for disease name normalization, leveraging pairwise learning to rank (pLTR).
  3. The NCBI disease corpus consists of 793 abstracts with an average of 5.08 disease mentions per abstract.
  4. DNorm effectively handles variations in disease name mentions, including abbreviations and morphological changes.
  5. The primary goal is to enhance entity-specific semantic search and computer-assisted biocuration in biomedical literature.
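The difference between the micro-averaged and macro-averaged F-measure reported above can be made concrete with a short sketch. It assumes that evaluation counts (true positives, false positives, false negatives) are collected per abstract, that micro-averaging pools the counts before computing F, and that macro-averaging computes F per abstract and then takes the mean. The unit of averaging and the example counts are assumptions for illustration, not figures from the paper.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from raw counts; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f(counts):
    """Pool counts over all abstracts, then compute a single F-measure."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_measure(tp, fp, fn)


def macro_f(counts):
    """Compute the F-measure per abstract, then average the scores."""
    return sum(f_measure(*c) for c in counts) / len(counts)


# Hypothetical (tp, fp, fn) counts for three abstracts.
example_counts = [(8, 2, 2), (1, 0, 0), (0, 1, 1)]
print(micro_f(example_counts), macro_f(example_counts))
```

Micro-averaging weights every annotation equally, so abstracts with many disease mentions dominate, whereas macro-averaging weights every abstract equally regardless of how many mentions it contains.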
