Papers by Siddhartha Jonnalagadda

Background: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system, MedTagger, that consisted of dictionary lookup, part-of-speech (POS) tagging, and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions.

Objective: This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity. Materials and methods: The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in order of decreasing precision and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve. Results: The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, BLANC, and CEAF F scores) for the training set and 0.843 for the test set. Discussion: A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data, especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts. Conclusion: Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.
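The multi-pass sieve idea in the abstract above can be illustrated with a minimal Python sketch. This is not the MedCoref implementation: the `Markable` class, the two rules (`exact_match`, `head_match`), and the "last token is the head word" heuristic are hypothetical stand-ins chosen only to show how deterministic rules are applied one pass at a time, most precise first, with later passes building on clusters formed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Markable:
    text: str       # surface string of the mention (assumed non-empty)
    sem_type: str   # e.g. "problem", "treatment", "test", "person", "pronoun"


def exact_match(a, b):
    # Hypothetical high-precision rule: same type and same string (case-insensitive).
    return a.sem_type == b.sem_type and a.text.lower() == b.text.lower()


def head_match(a, b):
    # Hypothetical lower-precision rule: same type and same last token.
    # Treating the last token as the head word is an assumption of this sketch.
    if a.sem_type != b.sem_type or a.sem_type == "pronoun":
        return False
    return a.text.lower().split()[-1] == b.text.lower().split()[-1]


def run_sieves(markables, sieves):
    """Apply each sieve in order; each pass may merge existing clusters."""
    parent = list(range(len(markables)))  # union-find over mention positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for sieve in sieves:                      # most precise rule first
        for j, mention in enumerate(markables):
            for i in range(j - 1, -1, -1):    # nearest preceding candidate first
                if find(i) != find(j) and sieve(markables[i], mention):
                    parent[find(j)] = find(i)
                    break

    clusters = {}
    for i, mention in enumerate(markables):
        clusters.setdefault(find(i), []).append(mention.text)
    # Only chains with at least two mentions are coreference links.
    return [chain for chain in clusters.values() if len(chain) > 1]


notes = [
    Markable("chest pain", "problem"),
    Markable("aspirin", "treatment"),
    Markable("the pain", "problem"),
    Markable("Aspirin", "treatment"),
]
chains = run_sieves(notes, [exact_match, head_match])
```

In this toy input the exact-match pass links the two aspirin mentions, and the looser head-match pass then links "chest pain" with "the pain". A learned pronoun resolver, as the abstract describes, could be slotted in as one more callable at the end of the sieve list.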
Systematic Analysis of Cross-Institutional Medication Description Patterns in Clinical Notes
Abstract: In clinical notes, medication information follows certain semantic patterns and some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them for natural language processing tools to effectively extract comprehensive medication information. We examined both semantic and context patterns and compared those found in Mayo Clinic and i2b2 challenge data. We found that some ...

Journal of Biomedical Informatics
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text.

Evaluating Distributional Semantic and Feature Selection for Extracting Relationships from Biological Text
The constant flow of biomolecular findings being published each day challenges our ability to develop methods to automatically extract the knowledge expressed in text to potentially influence new discoveries. Finding relations between the biological entities (e.g. proteins and genes) in text is a challenging task. To facilitate the extraction process, a relation can be decomposed into a trigger and the complementary arguments (e.g. theme, site). Several approaches have been proposed based on machine learning which generally use a common set of features for all trigger types. Here we evaluate the impact of applying a feature selection method for trigger classification. Our proposed method uses a greedy feature selection algorithm to find an optimal set of attributes for each trigger type. We show that using the customized set of features can improve classification results significantly (up to 53.96% in F-measure). In addition, we evaluated different settings for including semantic features in the classifiers. We found that using semantic features can improve classification results and found the best setting for each trigger type.
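As a rough illustration of greedy feature selection, the sketch below uses forward selection: it starts from an empty set and repeatedly adds the single feature that most improves a score, stopping when no feature helps. The abstract does not say which greedy variant the authors used, so forward selection is an assumption here, and `greedy_select`, `toy_score`, and the feature names are hypothetical. In practice the score would be something like a cross-validated F-measure for one trigger type, and the procedure would be run once per trigger type.

```python
def greedy_select(features, score):
    """Forward greedy selection: add the best-scoring feature until nothing improves.

    `score` is any callable that maps a list of feature names to a number
    (higher is better); it is supplied by the caller.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate each candidate feature when added to the current set.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no single feature improves the score any further
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy stand-in for a per-trigger-type evaluation (hypothetical weights).
TOY_WEIGHTS = {"pos": 3, "lemma": 2, "suffix": -1}


def toy_score(subset):
    return sum(TOY_WEIGHTS[f] for f in subset)


chosen, final_score = greedy_select(["suffix", "pos", "lemma"], toy_score)
```

With these toy weights the procedure keeps `pos` and `lemma` and rejects `suffix`, since adding it would lower the score. Greedy selection evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing combinations that only help jointly.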
Analysis of Cross-Institutional Medication Information Annotations in Clinical Notes
Towards a semantic lexicon for clinical natural language processing
Abstract: A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text.
Background: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora for training machine taggers.
Feasibility of pooling annotated corpora for clinical concept extraction
Abstract: Availability of annotated corpora has facilitated application of machine learning algorithms to concept extraction from clinical notes. However, it is expensive to prepare annotated corpora in individual institutions, and pooling of annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling of corpora from two different sources can improve performance and portability of resultant machine learning taggers for medical problem detection.
Abstract: Online health knowledge resources contain answers to most of the information needs raised by clinicians in the course of care. However, significant barriers limit the use of these resources for decision-making, especially clinicians' lack of time. Existing solutions are suboptimal when information needs cannot be met without substantial cognitive effort and time.
Background: Rapid identification of subject experts for medical topics helps improve the implementation of discoveries by shortening the time to market for drugs, aiding clinical trial recruitment, etc. Identifying people who influence opinion through social network analysis is gaining prominence. In this work, we explore how to combine named entity recognition from unstructured news articles with social network analysis to discover opinion leaders for a given medical topic.
Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules
Objective: This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity. Materials and methods: The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun.
An Effective Approach to Biomedical Information Extraction with Limited Training Data
In the current millennium, extensive use of computers and the internet caused an exponential increase in information. Few research areas are as important as information extraction, which primarily involves extracting concepts and the relations between them from free text. ...
… of Human Language …, Jan 1, 2009
Arxiv preprint arXiv:1001.4273, Jan 1, 2010
IEEE IEEE/ACM …, Jan 1, 2010
… Workshop, 2009. BIBMW …, Jan 1, 2009
The 3rd International …, Jan 1, 2009
… Linguistics and Intelligent …, Jan 1, 2010