OntoGene web services for biomedical text mining

2014, BMC Bioinformatics

Abstract

Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in database curation or in the exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene), which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them.

Cite this article as: Rinaldi et al.: OntoGene web services for biomedical text mining. BMC Bioinformatics 2014, 15(Suppl 14):S6. doi:10.1186/1471-2105-15-S14-S6