Papers by Alfonso Valencia

The TIPS track consisted of a novel experimental task under the umbrella of the BioCreative text mining challenges, with the aim of carrying out, for the first time, a text mining challenge with particular focus on the continuous assessment of technical aspects of text annotation web servers, specifically of biomedical online named entity recognition systems. A total of 13 teams registered annotation servers, implemented in various programming languages and supporting up to 12 different general annotation types. The continuous evaluation period took place from February to March 2017. The systematic and continuous evaluation of server responses accounted for testing periods of low activity and of moderate to high activity. Moreover, three document provider settings were covered, including NCBI PubMed. For a total of 4,092,502 requests, the median response time for most servers was below 3.74 s, with a median of 10 annotations per document. Most of the servers showed great reliabilit...
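The continuous benchmarking described above amounts to repeatedly sending documents to each registered annotation server and recording response times and annotation counts. Below is a minimal sketch of such a probe in Python; the endpoint URL, payload format and response fields are hypothetical placeholders and do not reflect the actual TIPS/BeCalm protocol.

```python
import time
import statistics
import requests  # third-party HTTP client

# Hypothetical annotation endpoint; real TIPS servers expose their own schema.
SERVER_URL = "http://example.org/annotate"

def probe(documents):
    """Send each document to the server, returning response times and annotation counts."""
    times, counts = [], []
    for doc in documents:
        start = time.monotonic()
        resp = requests.post(SERVER_URL, json={"text": doc}, timeout=30)
        elapsed = time.monotonic() - start
        resp.raise_for_status()
        annotations = resp.json().get("annotations", [])  # assumed response field
        times.append(elapsed)
        counts.append(len(annotations))
    return times, counts

if __name__ == "__main__":
    docs = ["BRCA1 is associated with breast cancer.", "p53 regulates the cell cycle."]
    times, counts = probe(docs)
    print("median response time (s):", statistics.median(times))
    print("median annotations/document:", statistics.median(counts))
```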

Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications - JNLPBA '04, 2004
The tagging of biological entities, and in particular gene and protein names, is an essential step in the analysis of textual information in Molecular Biology and Biomedicine. The problem is harder than was originally thought because of the highly dynamic nature of the research area, in which new genes and their functions are constantly being discovered, and because of the lack of commonly accepted standards. An impressive collection of techniques has been used to detect protein and gene names in the last four to five years, ranging from typical NLP to purely bioinformatics approaches. We explore here the relationship between protein/gene names and expressions used to characterize protein/gene function. These expressions are captured in a collection of patterns derived from an original set of manually derived expressions, extended to cover lexical variants and filtered against known cases of pattern/name associations. Applying these patterns to a large collection of curated sentences, we found a significant number of patterns with a very strong tendency to appear only in sentences in which a protein/gene name is simultaneously present. This approach is part of a larger effort to incorporate contextual information so as to make biological information less ambiguous.
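The pattern/name analysis described above can be approximated as matching a set of functional-expression patterns against sentences and counting, for each pattern, how often it co-occurs with a protein/gene name. A minimal sketch follows; the patterns and the name list are invented toy examples rather than the resources used in the paper.

```python
import re
from collections import Counter

# Toy functional-expression patterns and a toy protein/gene lexicon (illustrative only).
PATTERNS = {
    "phosphorylates": re.compile(r"\bphosphorylates\b", re.I),
    "is required for": re.compile(r"\bis required for\b", re.I),
}
PROTEIN_NAMES = {"p53", "BRCA1", "RCC1"}

def cooccurrence(sentences):
    """Count, per pattern, the sentences in which it appears with and without a protein/gene name."""
    with_name, without_name = Counter(), Counter()
    for sent in sentences:
        has_name = any(name.lower() in sent.lower() for name in PROTEIN_NAMES)
        for label, pattern in PATTERNS.items():
            if pattern.search(sent):
                (with_name if has_name else without_name)[label] += 1
    return with_name, without_name

sentences = [
    "BRCA1 is required for DNA repair.",
    "This kinase phosphorylates p53 in vitro.",
    "The buffer is required for the assay.",
]
print(cooccurrence(sentences))
```

Patterns whose matches fall almost entirely into the with_name counts are the ones identified as strong contextual indicators of a protein/gene mention.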

Database : the journal of biological databases and curation, 2014
BioCreative: Critical Assessment of Information Extraction in Biology is an international community-wide effort for evaluating text mining (TM) and information extraction systems applied to the biological domain. The Challenge Evaluations and the accompanying BioCreative Workshops bring together the TM and biology communities to drive the development of practically relevant TM systems. One of the main goals of this initiative is that the resulting systems facilitate more efficient literature information access for biologists in general, but also provide tools that can be directly integrated into the biocuration workflow and the knowledge discovery process carried out by databases. Beyond addressing the current barriers faced by TM technologies applied to biological literature, BioCreative has further been conducting user requirement analyses, user-based evaluations and fostering standards development for TM tool reuse and integration. This DATABASE virtual issue captures the major results from the Fourth BioCreative Challenge Evaluation Workshop, and is the sixth special issue devoted to BioCreative. Built on the success of the previous Challenge Evaluations and Workshops (BioCreative I, II, II.5, III, 2012) (1-5), the BioCreative IV Workshop was held in Bethesda, MD, on October 7-9, 2013. BioCreative is distinct from other challenges in the bioNLP domain in how it selects its specific tasks, or tracks. From its inception, the organizers have worked with biocuration teams to define and evaluate tasks of importance to curation of the biomedical literature. Over the years, BioCreative has collaborated with curators from a variety of databases, including Gene Ontology Annotation (6), IntAct (7), MINT (8), BioGRID (9), Flybase (10), Mouse Genome Database (11), TAIR (12), CTD (13) and WormBase. This has enabled BioCreative to leverage existing standards, resources (especially, the knowledge captured in curated databases) and the expertise of the curators and to propose tracks that respond to their needs.

There is an increasing need to facilitate automated access to information relevant for chemical compounds and drugs described in text, including scientific articles, patents or health agency reports. A number of recent efforts have implemented natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining). Due to the lack of manually labeled Gold Standard datasets together with comprehensive annotation guidelines, both the implementation as well as the comparative assessment of ChemNLP technologies are opaque. Two key components for most chemical text mining technologies are the indexing of documents with chemicals (chemical document indexing - CDI) and finding the mentions of chemicals in text (chemical entity mention recognition - CEM). These two tasks formed part of the chemical compound and drug named entity recognition (CHEMDNER) task introduced at the fourth BioCreative challenge, a community effort to evaluate biomedical text mining applications. For this task, the CHEMDNER text corpus was constructed, consisting of 10,000 abstracts containing a total of 84,355 mentions of chemical compounds and drugs that have been manually labeled by domain experts following specific annotation guidelines. This corpus covers representative abstracts from major chemistry-related sub-disciplines such as medicinal chemistry, biochemistry, organic chemistry and toxicology. A total of 27 teams - 23 academic and 4 commercial groups, comprising 87 researchers - submitted results for this task. Of these teams, 26 provided submissions for the CEM subtask and 23 for the CDI subtask. Teams were provided with the manual annotations of 7,000 abstracts to implement and train their systems and then had to return predictions for the 3,000 test set abstracts during a short period of time. When comparing exact matches of the automated results against the manually labeled Gold Standard annotations, the best teams reached an F-score of 87.39% for the CEM task and of 88.20% for the CDI task. This can be regarded as a very competitive result when compared to the expected upper boundary, the agreement between two human annotators, at 91%. In general, the technologies used to detect chemicals and drugs by the teams included machine learning methods (particularly CRFs using a considerable range of different features), in combination with chemistry-related lexical resources and manual rules (e.g., to cover abbreviations, chemical formula or chemical identifiers). By promoting the availability of the software of the participating systems as well as through the release of the CHEMDNER corpus to enable implementation of new tools, this work fosters the development of text mining applications like the automatic extraction of biochemical reactions, toxicological properties of compounds, or the detection of associations between genes or mutations and drugs in the context of pharmacogenomics.
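The exact-match evaluation used above compares predicted entity spans against the Gold Standard spans and reports precision, recall and balanced F-score. A minimal sketch with invented example spans:

```python
def exact_match_f1(gold, predicted):
    """Precision, recall and F1 over exact (doc_id, start, end) entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans returned by the system that are also in the gold standard
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Invented example spans: (document id, start offset, end offset)
gold = [("d1", 0, 7), ("d1", 20, 29), ("d2", 5, 12)]
pred = [("d1", 0, 7), ("d2", 5, 12), ("d2", 30, 38)]
print(exact_match_f1(gold, pred))  # (0.667, 0.667, 0.667) up to rounding
```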
SOA-Based Integration of Text Mining Services
2009 Congress on Services - I, 2009

Linking Literature, Information, and Knowledge for Biology, 2010
With the increasing availability of textual information related to biological research, such information has become an important component of many bioinformatics applications. Much recent work aims to develop practical tools to facilitate the use of the literature for annotating the vast amounts of molecular data, including gene sequences, transcription profiles and biological pathways. The broad area of biomedical text mining is concerned with using methods from natural language processing, information extraction, information retrieval and summarization to automate knowledge discovery from biomedical text. In the biomedical domain, research has focused on several complex text-based applications, including the identification of relevant literature (information retrieval) for specific information needs, the extraction of experimental findings for assistance in building biological knowledge bases, and summarization, aiming to present key biological facts in a succinct form. Automated natural language processing (NLP) began in 1947 with the introduction of the idea of machine translation by Warren Weaver, and work on automated (still mechanical) dictionary lookup for translation by Andrew Booth. This work was continued throughout the 1950s in research on automatic translation by Bar Hillel, Garvin and others. In the 1950s, work on transformational grammars by Zellig Harris [3] formed the basis for computational linguistics, which was continued by Noam Chomsky, relating natural languages to formal grammars. The field made rapid progress starting in the late 1980s, thanks to a series of conferences focused on evaluation of text mining and information extraction systems: the Message Understanding Conferences (MUCs).

Comparative and Functional Genomics, 2003
An increasing number of groups are now working in the area of text mining, focusing on a wide range of problems and applying both statistical and linguistic approaches. However, it is not possible to compare the different approaches, because there are no common standards or evaluation criteria; in addition, the various groups are addressing different problems, often using private datasets. As a result, it is impossible to determine how well the existing systems perform, and particularly what performance level can be expected in real applications. This is similar to the situation in text processing in the late 1980s, prior to the Message Understanding Conferences (MUCs). With the introduction of a common evaluation and standardized evaluation metrics as part of these conferences, it became possible to compare approaches, to identify those techniques that did or did not work and to make progress. This progress has resulted in a common pipeline of processes and a set of shared tools av...
Comparative and functional genomics, 2005
Nucleic Acids Research, 2007
iHOP provides fast, accurate, comprehensive, and up-to-date summary information on more than 80 000 biological molecules by automatically extracting key sentences from millions of PubMed documents. Its intuitive user interface and navigation scheme have made iHOP extremely successful among biologists, counting more than 500 000 visits per month (iHOP access statistics: ihop-net.org/UniPub/iHOP/info/logs/). Here we describe a public programmatic API that enables the integration of main iHOP functionalities in bioinformatic programs and workflows.

Nucleic Acids Research, 2006
An entire family of methodologies for predicting protein interactions is based on the observed fact that families of interacting proteins tend to have similar phylogenetic trees due to co-evolution. One application of this concept is the prediction of the mapping between the members of two interacting protein families (which protein within one family interacts with which protein within the other). The idea is that the real mapping would be the one maximizing the similarity between the trees. Since the exhaustive exploration of all possible mappings is not feasible for large families, current approaches use heuristic techniques which do not ensure the best solution to be found. This is why it is important to check the results proposed by heuristic techniques and to manually explore other solutions. Here we present TSEMA, the server for efficient mapping assessment. This system calculates an initial mapping between two families of proteins based on a Monte Carlo approach and allows the user to interactively modify it based on performance figures and/or specific biological knowledge. All the explored mappings are graphically shown over a representation of the phylogenetic trees. The system is freely available at . Standalone versions of the software behind the interface are available upon request from the authors.
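The mapping problem described here searches over one-to-one assignments between the members of two protein families for the assignment under which the two phylogenetic trees agree best. Below is a minimal sketch under simplifying assumptions: trees are summarized as pairwise distance matrices, agreement is measured by correlation, and the search is a greedy random-swap hill climb. It illustrates the general idea only and is not the TSEMA implementation.

```python
import random
import numpy as np

def tree_agreement(dist_a, dist_b, mapping):
    """Correlation between the two distance matrices under a given mapping (higher is better)."""
    permuted_b = dist_b[np.ix_(mapping, mapping)]  # reorder family B's matrix according to the mapping
    return np.corrcoef(dist_a.ravel(), permuted_b.ravel())[0, 1]

def monte_carlo_mapping(dist_a, dist_b, steps=10000, seed=0):
    """Random-swap search for a high-scoring one-to-one mapping between two families."""
    rng = random.Random(seed)
    n = dist_a.shape[0]
    mapping = list(range(n))  # start from the identity assignment
    best = tree_agreement(dist_a, dist_b, mapping)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]  # propose swapping two assignments
        score = tree_agreement(dist_a, dist_b, mapping)
        if score >= best:
            best = score  # keep the move
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # revert the move
    return mapping, best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = rng.random((5, 5))
    dist_a = (d + d.T) / 2
    np.fill_diagonal(dist_a, 0)
    perm = [2, 0, 1, 4, 3]
    dist_b = dist_a[np.ix_(perm, perm)]  # family B is a relabelled copy of family A
    print(monte_carlo_mapping(dist_a, dist_b))
```

A fuller Monte Carlo scheme would also accept some score-decreasing swaps (e.g. with a Metropolis criterion) to escape local optima, which is one reason interactive inspection of alternative mappings remains useful.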

Journal of Molecular Biology, 1999
RCC1, the regulator of chromosome condensation, is the guanine nucleotide exchange factor (GEF) for the nuclear Ras-like GTP-binding protein Ran. Its structure was solved by X-ray crystallography and revealed a seven-bladed β-propeller, one side of which was proposed to be the interaction site with Ran. To gain more insight into this interaction, alanine mutagenesis studies were performed on conserved residues on the surface of the structure. Purified mutant proteins were analysed by steady-state kinetic analysis of their GEF activities towards Ran. A number of residues were identified whose mutation affected either the K_M or k_cat of the overall reaction, or had no effect. Mutants were further analysed by plasmon surface resonance in order to get more information on individual steps of the complex reaction pathway. Ran-GDP was coupled to the sensor chip and reacted with RCC1 mutants to categorise them into different groups, demonstrating the usefulness of plasmon surface resonance in the study of complex multi-step kinetic processes. A docking solution of Ran-RCC1 structures in combination with sequence analysis allows prediction of the site of interaction between RCC1 and Ran and proposes a model for the Ran-RCC1 structure which corresponds to and extends the biochemical data. Three invariant residues which most severely affect the k_cat of the reaction, D128, D182 and H304, are located in the centre of the Ran-RCC1 interface and interfere with switch II and the phosphate binding area. The structural model suggests that different guanine nucleotide exchange factors use a similar interaction site on their respective GTP-binding proteins, but that the molecular mechanisms for the release of nucleotides are likely to be different.
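For reference, the steady-state parameters K_M and k_cat discussed above come from the standard Michaelis-Menten treatment of the exchange reaction, with Ran-GDP as the substrate and RCC1 as the catalyst; a mutation can therefore lower the turnover number, weaken the apparent substrate affinity, or leave both unchanged. In standard notation (background, not an equation from the paper):

```latex
v = \frac{k_{\mathrm{cat}}\,[\mathrm{E}]_0\,[\mathrm{S}]}{K_M + [\mathrm{S}]},
\qquad [\mathrm{E}]_0 = \text{RCC1 concentration}, \quad [\mathrm{S}] = \text{Ran-GDP concentration}.
```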

FEBS Letters, 2008
We propose that the combination of human expertise and automatic text-mining systems can be used to create a first generation of electronically annotated information (EAI) that can be added to journal abstracts and that is directly related to the information in the corresponding text. The first experiments have concentrated on the annotation of gene/protein names and those of organisms, as these are the best resolved problems. A second generation of systems could then attempt to address the problems of annotating protein interactions and protein/gene functions, a more difficult task for text-mining systems. EAI will permit easier categorization of this information, it will help in the evaluation of papers for their curation in databases, and it will be invaluable for maintaining the links between the information in databases and the facts described in text. Additionally, it will contribute to the efforts towards completing database information and creating collections of annotated t...

Database, 2012
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-)automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provides a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

BMC Bioinformatics, 2011
Background The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools which are useful to the communities of researchers and database curators in the biological sciences. To this end BioCreative I was held in 2004, BioCreative II in 2007, and BioCreative II.5 in 2009. Each of these workshops involved human-annotated test data for several basic tasks in text mining applied to the biomedical literature. Participants in the workshops were invited to compete in the tasks by constructing software systems to perform the tasks automatically and were given scores based on their performance. The results of these workshops have benefited the community in several ways. They have 1) provided evidence for the most effective methods currently available to solve specific problems; 2) revealed the current state of the art for performance on those problems; and 3) provided gold standard data and results on that data by which future advances...

Bioinformatics, 2012
Motivation: The exponential growth of scientific literature has resulted in a massive amount of unstructured natural language data that cannot be directly handled by means of bioinformatics tools. Such tools generally require structured data, often generated through a cumbersome process of manual literature curation. Herein, we present MyMiner, a free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities. The usefulness and efficiency of this application have been tested for a range of real-life annotation scenarios of various research topics. Availability: http://myminer.armi.monash.edu.au. Contacts: david.salgado@monash.edu and christophe.marcelle@monash.edu Supplementary Information: Supplementary data are available at Bio...

Bioinformatics, 2005
Motivation: The World Wide Web has profoundly changed the way in which we access information. Searching the internet is easy and fast, but more importantly, the interconnection of related contents makes it intuitive and closer to the associative organization of human memory. However, the information retrieval tools currently available to researchers in biology and medicine lag far behind the possibilities that the layman has come to expect from the internet. Results: By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource. iHOP (Information Hyperlinked over Proteins) is an online service that provides this gene-guided network as a natural way of accessing millions of PubMed abstracts and brings all the advantages of the internet to scientific literature research. Navigating across interrelated sentences within this network is closer to human intuition than the use of conventional keyword search...
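The gene-guided network described above can be viewed as an index from gene/protein identifiers to the sentences that mention them, so that any two abstracts sharing a gene become navigable neighbours. A toy sketch of that data structure follows; the mini-corpus and gene lexicon are invented examples, and a real system such as iHOP relies on full gene/protein name recognition rather than plain substring matching.

```python
from collections import defaultdict

# Invented mini-corpus: (abstract id, sentence text)
SENTENCES = [
    ("a1", "BRCA1 interacts with BARD1 in DNA repair."),
    ("a2", "Mutations in BRCA1 predispose to breast cancer."),
    ("a3", "BARD1 stabilizes BRCA1 in the nucleus."),
]
GENES = {"BRCA1", "BARD1"}  # toy lexicon

def build_index(sentences):
    """Map each gene to the sentences that mention it (the 'hyperlinks' of the network)."""
    index = defaultdict(list)
    for abstract_id, text in sentences:
        for gene in GENES:
            if gene in text:
                index[gene].append((abstract_id, text))
    return index

index = build_index(SENTENCES)
# Navigation: from any sentence mentioning BRCA1, jump to every other sentence about BRCA1.
for abstract_id, text in index["BRCA1"]:
    print(abstract_id, "->", text)
```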