Academia.edu

Gene Normalization

10 papers
1 follower
About this topic
Gene normalization is the process of standardizing gene names and identifiers across different databases and studies to ensure consistency and accuracy in genomic research. This involves mapping various nomenclatures to a unified system, facilitating data integration, comparison, and interpretation in bioinformatics and molecular biology.
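The core mapping step described above, resolving many surface forms of a gene name to a single identifier, can be sketched as a dictionary lookup with simple string folding. This is a minimal illustration only; the synonym table and "GENE:..." identifiers below are invented, not real database records.

```python
# Minimal sketch of dictionary-based gene name normalization: fold case
# and punctuation, then look the mention up in a synonym table.
# The synonyms and "GENE:..." identifiers are invented for illustration.
import re

SYNONYMS = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "brca1": "GENE:0002",
}

def normalize_mention(mention: str):
    # Lowercase and drop whitespace, hyphens, underscores, and slashes,
    # a common first step before fuzzier disambiguation.
    key = re.sub(r"[\s\-_/]+", "", mention.lower())
    return SYNONYMS.get(key)
```

A real system would layer species disambiguation and machine-learned filtering on top of such a lookup, as several of the papers listed below describe.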

Key research themes

1. What are effective data normalization strategies for accurate microRNA and gene expression quantification in qPCR and RNA-seq experiments?

This theme addresses the critical challenge of data normalization in gene expression quantification techniques such as quantitative real-time PCR (qPCR) and RNA sequencing (RNA-seq), with a focus on microRNAs and mRNAs. Normalization is fundamental to correcting for technical variability (e.g., differing RNA input, sequencing depth, or batch effects) to ensure accurate, reproducible, and biologically meaningful expression measurements. The lack of consensus on optimal endogenous or exogenous reference genes and normalization procedures leads to variability and complicates cross-study comparison. The theme explores selecting reference genes with stable expression across conditions, normalization algorithms for RNA-seq counts, and new approaches integrating genomic information to improve normalization robustness.

Key finding: This comprehensive review highlights the lack of consensus on optimal normalization strategies for microRNA quantification using qPCR and microarrays. The authors analyze endogenous small RNAs commonly used as normalizers and...
Key finding: This study systematically evaluates the expression stability of 12 previously recommended reference genes across two sub-clones of the MCF-7 breast cancer cell line over multiple passages, including under nutrient stress...
Key finding: By combining RNA-seq transcriptomic datasets from diverse Chinese hamster ovary (CHO) cell lines and culture conditions with qPCR validation, the study identifies four mRNAs (Gnb1, Fkbp1a, Tmed2, and Mmadhc) exhibiting highly...
Key finding: This paper systematically evaluates multiple RNA-seq read count normalization methods, including established approaches (DESeq median-of-ratios, TMM, Upper Quartile) and novel per-gene normalization after per-sample global...
Key finding: Introducing a novel normalization method for RNA-seq data, this study integrates DNA copy number alteration (CNA) information to adjust gene expression measurements, recognizing that CNAs explain a significant fraction (~15%)...
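The count normalization methods compared in this theme can be illustrated with the median-of-ratios idea used by DESeq: each sample's size factor is the median ratio of its counts to a geometric-mean reference sample. This is a simplified sketch with an invented count matrix, not code from any of the cited studies.

```python
# Sketch of DESeq-style median-of-ratios normalization for RNA-seq counts.
# The counts matrix below is invented illustration data (genes x samples).
import numpy as np

def median_of_ratios_size_factors(counts):
    """counts: genes x samples array of raw read counts."""
    log_counts = np.log(counts.astype(float))
    # Keep only genes with nonzero counts in every sample, as in DESeq.
    finite = np.all(np.isfinite(log_counts), axis=1)
    # Geometric mean per gene across samples (the reference pseudo-sample).
    log_geo_mean = log_counts[finite].mean(axis=1)
    # Size factor per sample: median ratio of its counts to the reference.
    ratios = log_counts[finite] - log_geo_mean[:, None]
    return np.exp(np.median(ratios, axis=0))

counts = np.array([
    [100, 200, 50],
    [ 30,  60, 15],
    [ 10,  20,  5],
    [ 90, 180, 45],
])
sf = median_of_ratios_size_factors(counts)   # sequencing-depth estimates
normalized = counts / sf                     # depth-corrected counts
```

In this toy example sample 2 was sequenced at twice the depth of sample 1 and sample 3 at half, so the size factors recover [1, 2, 0.5] and the normalized columns agree.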

2. How can integrated bioinformatics frameworks and knowledgebases enhance gene normalization by providing standardized and context-aware reference gene annotations for RT-qPCR and gene set management?

Accurate gene normalization not only depends on the appropriate experimental design but also on the availability of standardized, well-curated reference gene annotations and gene sets that consider species, tissue specificity, developmental stages, and experimental conditions. This theme revolves around the development of community-curated databases and computational platforms that aggregate experimentally validated internal control genes and gene sets. Such resources enable reproducible normalization across diverse biological contexts, facilitating proper interpretation and cross-study comparisons.

Key finding: ICG provides a publicly editable wiki-based knowledgebase integrating over 750 experimentally validated internal control genes across 73 animal species, 115 plants, fungi, and bacteria. It includes detailed application...
Key finding: MyGeneset.info offers integrated access to curated and user-submitted gene sets from multiple sources (e.g., Wikipathways, Reactome, GO) along with up-to-date gene annotations via APIs, supporting species across humans and...
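Resources like these catalogue internal control genes validated for stable expression. As a toy illustration of what "stable" means, candidate reference genes can be ranked by the coefficient of variation of their measurements; the gene names and values below are invented, and real studies use richer measures such as geNorm's M-value.

```python
# Rank candidate reference genes by coefficient of variation (CV),
# a simple stand-in for dedicated stability measures like geNorm's M.
# Gene names and per-sample values are invented illustration data.
import statistics

expression = {
    "GAPDH": [24.1, 24.3, 24.2, 24.4],
    "ACTB":  [20.0, 22.5, 19.1, 23.8],
    "RPL13": [18.2, 18.2, 18.3, 18.2],
}

def cv(values):
    # Coefficient of variation: relative spread around the mean.
    return statistics.stdev(values) / statistics.mean(values)

# Lower CV = more stable candidate reference gene.
ranked = sorted(expression, key=lambda g: cv(expression[g]))
```

Here the hypothetical RPL13 varies least across samples and ACTB most, so RPL13 would be the preferred normalizer among the three.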

3. How do statistical machine learning and computational approaches contribute to addressing batch effects, gene clustering, and orthology-independent gene normalization in expression data?

Batch effects and heterogeneity in high-throughput gene expression data pose significant challenges for normalization and downstream analysis. Advanced computational approaches, such as artificial intelligence-based normalization, block mixture models for eQTL-driven gene clustering, and orthogonal shared basis factorization for cross-species expression comparison, enhance gene normalization by capturing underlying biological and technical structure without relying solely on physical gene homology or simplistic assumptions. These methods improve the accuracy of gene expression interpretation and facilitate comparative transcriptomic analyses.

Key finding: The authors introduce an artificial intelligence-driven normalization method aiming to reduce batch effects in transcriptome data without imposing assumptions on gene expression distribution. Unlike traditional normalization...
Key finding: This study presents a Gaussian block mixture model integrating gene clustering, genetic mapping, and network reconstruction by simultaneously modeling genotype-specific gene expression clustering patterns. Applied to C....
Key finding: The paper introduces the orthogonal shared basis factorization (OSBF) method, a joint matrix factorization approach estimating a common expression subspace across species that captures conserved gene co-expression patterns...
Key finding: Challenging the conventional wisdom that better batch normalization (BatchNorm) statistics arise from larger mini-batches, this study shows that GhostNorm, which normalizes smaller ‘ghost batches’ independently within...
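As a reduced illustration of batch-effect correction in the spirit of this theme, the sketch below mean-centers expression values within each batch. Real methods such as ComBat add empirical-Bayes shrinkage and covariate handling; the data and batch labels here are invented.

```python
# Simplest batch-effect adjustment: per-batch mean-centering of
# expression values, then restoring the global mean so overall levels
# stay interpretable. Data and batch labels are invented illustrations.
import numpy as np

def center_by_batch(X, batches):
    """X: samples x genes matrix; batches: per-sample batch labels."""
    Xc = X.astype(float).copy()
    for b in np.unique(batches):
        idx = batches == b
        # Remove this batch's own mean profile.
        Xc[idx] -= Xc[idx].mean(axis=0)
    # Add back the global per-gene mean.
    return Xc + X.mean(axis=0)

X = np.array([[5.0, 2.0], [6.0, 3.0],     # batch A
              [9.0, 6.0], [10.0, 7.0]])   # batch B, shifted upward
batches = np.array(["A", "A", "B", "B"])
adjusted = center_by_batch(X, batches)
```

After adjustment the two batches share the same per-gene mean, while per-gene global means are unchanged; this removes the additive batch shift but, unlike the methods above, ignores batch-specific variance and any confounding with biology.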

All papers in Gene Normalization

Background: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500...
Gene expression analysis is fundamental for understanding biological processes, and quantitative real-time PCR (qRT-PCR) has become a widely used method for validating expression measurements. Proper normalization across multiple samples and...
Studies on the expression of genes in different contexts are essential to our understanding of the functioning of organisms and their adaptations to the environment. Gene expression studies require steps of normalization, which are done...
Given the large amount of data stored in biological databases, the management of uncertainty and incompleteness in them is a non-trivial problem. To cope with the large amount of sequences being produced, a significant number of genes and...
For our participation in the CDR task of BioCreative 5, we have adapted the OntoGene system and optimized it for disease recognition (DNER Task) and identification of chemical-disease relationships (CID Task). For the DNER Task we have...
Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical...
The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are one of the most searched biomedical entities in...
Background We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene...
Background The biological research literature is a major repository of knowledge. As the amount of literature increases, it will get harder to find the information of interest on a particular topic. There has been an increasing amount of...
Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to...
Hepatitis C Virus (HCV) causes significant morbidity worldwide with restricted treatment options and lack of a universal cure, which necessitates the design of novel drugs. Researchers face an enormous growth of literature with very small...
The automatic assignment of species information to the corresponding genes in a research article is a critically important step in the gene normalization task, whereby a gene mention is normalized and linked to a database record or an...
Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task,...
Background: This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein...
We present an approach towards the automatic detection of names of proteins, genes, species, etc. in biomedical literature and their grounding to widely accepted identifiers. The annotation is based on a large term list that contains the...
The rapidly increasing number of available PubMed documents calls for an automatic approach to the identification and normalization of disease mentions in order to increase the precision and effectiveness of information retrieval....
Several research results have shown that specifying the information about certain entities is the most common information demand of information retrieval users. These needs should be answered by returning specific entities, their properties...
Recent advances in biological research reveal that the majority of the experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings...
This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles. The approach uses a two-stage pipeline, combining a dictionary-based entity recognizer with a...
The Second BioCreAtIvE Challenge provided an ideal opportunity to evaluate biomedical NLP techniques. Prior to the Challenge, an information extraction pipeline was developed to extract entities and relations relevant to the biomedical...
Background: Identifying protein-protein interactions (PPIs) from literature is an important step in mining the function of individual proteins as well as their biological network. Since it is known that PPIs have distinctive patterns in...
A considerable effort has been made to extract biological and chemical entities, as well as their relationships, from the scientific literature, either manually through traditional literature curation or by using information extraction...
Gene mention normalization (GN) refers to the automated mapping of gene names to a unique identifier, such as an NCBI Entrez Gene ID. Such knowledge helps in indexing and retrieval, linkage to additional information (such as sequences),...
This paper presents an approach towards high performance extraction of biomedical entities from the literature, which is built by combining a high-recall dictionary-based technique with a high-precision machine learning filtering step. The...
The Rat Genome Database (RGD) is the premier repository of rat genomic, genetic and physiologic data. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism...
Background: With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many...
As suggested in recent studies, species recognition and disambiguation is one of the most critical and challenging steps in many downstream text-mining applications such as the gene normalization task and protein-protein interaction...
Kinases are enzymes that mediate phosphate transfer. Extracting information on kinases from biomedical literature is an important task which has direct implications for applications such as drug design. In this work, we develop KinDER,...
Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific...
Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of...
Evidence in support of relationships among biomedical entities, such as protein-protein interactions, can be gathered from a multiplicity of sources. The larger the pool of evidence, the more likely a given interaction can be considered...
The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this...
Background: Reliable information extraction applications have been a long sought goal of the biomedical text mining community, a goal that if reached would provide valuable tools to benchside biologists in their increasingly difficult...
Biomedical text mining methods and technologies have improved significantly in the last decade. Considerable efforts have been invested in understanding the main challenges of biomedical literature retrieval and extraction and proposing...
Background The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools which are useful to the communities of researchers and database curators in the biological sciences. To this...
BACKGROUND: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The...
Background Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a...
An important research topic in Bioinformatics involves the exploration of vast amounts of biological and biomedical scientific literature (BioLiterature). Over the last few decades, text-mining systems have exploited this BioLiterature to...
Background: The selection of relevant articles for curation, and linking those articles to experimental techniques confirming the findings became one of the primary subjects of the recent BioCreative III contest. The contest's...
Background: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical...