Pooling annotated corpora for clinical concept extraction

Kavishwar B Wagholikar; Manabu Torii; Siddhartha R Jonnalagadda; Hongfang Liu

doi:10.1186/2041-1480-4-3

2013

Abstract

Background: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, creating the annotations requires considerable expense and labor. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora to train machine taggers.

