Goran Nenadic

Papers by Goran Nenadic

MedMine: Examining Pre-trained Language Models on Medication Mining

arXiv (Cornell University), Aug 7, 2023

Automatic medication mining from clinical and biomedical text has become a popular topic due to its real impact on healthcare applications and the recent development of powerful language models (LMs). However, fully automatic extraction models still face obstacles that must be overcome before they can be deployed directly into clinical practice. Such obstacles include their imbalanced performance on different entity types and clinical events. In this work, we examine current state-of-the-art pre-trained language models (PLMs) on such tasks via fine-tuning, including the monolingual model Med7 and the multilingual large language model (LLM) XLM-RoBERTa. We compare their advantages and drawbacks using historical medication mining shared task data sets from the n2c2-2018 challenges. We report the findings from these fine-tuning experiments so that they can facilitate future research on addressing these obstacles, for instance, how to combine their outputs, merge such models, or improve their overall accuracy by ensemble learning and data augmentation. MedMine is part of the M3 Initiative: https://github.com/HECTA-UoM/M3

Topic Modelling of Swedish Newspaper Articles about Coronavirus: a Case Study using Latent Dirichlet Allocation Method

arXiv (Cornell University), Jan 8, 2023

Topic Modelling (TM) is a natural language processing (NLP) method for discovering topics in a collection of documents. Being an unsupervised method, it is a valuable tool when trying to summarise the main topics and topic changes in large quantities of data. In this study, we apply two prevalent topic modelling techniques, Latent Dirichlet Allocation (LDA) and BERTopic, to analyse the change of topics in Swedish newspaper articles about COVID-19. We describe the corpus we created, which includes 6515 articles, the methods applied, and statistics on topic changes over a period of approximately one year and two months, from 17th January 2020 to 13th March 2021. We hope this work can be an asset for grounding applications of topic modelling and can inspire similar case studies in an era of pandemics, to support socioeconomic impact research as well as clinical and healthcare analytics.

An Analysis of PubMed Abstracts From 1946 to 2021 to Identify Organizational Affiliations in Epidemiological Criminology: Descriptive Study

Interactive Journal of Medical Research

Background: Epidemiological criminology refers to health issues affecting incarcerated and nonincarcerated offender populations, a group recognized as being challenging to conduct research with. Notwithstanding this, an urgent need exists for new knowledge and interventions to improve health, justice, and social outcomes for this marginalized population. Objective: To better understand research outputs in the field of epidemiological criminology, we examined the lead author’s affiliation by analyzing peer-reviewed published outputs to determine countries and organizations (eg, universities, governmental and nongovernmental organizations) responsible for peer-reviewed publications. Methods: We used a semiautomated approach to examine the first-author affiliations of 23,904 PubMed epidemiological studies related to incarcerated and offender populations published in English between 1946 and 2021. We also mapped research outputs to the World Justice Project Rule of Law Index to better unde...

EDU-level Extractive Summarization with Varying Summary Lengths

arXiv (Cornell University), Oct 8, 2022

Extractive models usually formulate text summarization as extracting a fixed top-k set of salient sentences from the document as a summary. Few works have exploited extracting finer-grained Elementary Discourse Units (EDUs), and those offer little analysis and justification for the choice of extractive unit. Further, the strategy of selecting a fixed top-k set of salient sentences fits the summarization need poorly, as the number of salient sentences varies across documents and therefore a common or best k does not exist in reality. To fill these gaps, this paper first conducts a comparative analysis of oracle summaries based on EDUs and sentences, which provides evidence from both theoretical and experimental perspectives to justify and quantify that EDUs make summaries with higher automatic evaluation scores than sentences. Then, considering this merit of EDUs, this paper further proposes an EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in the document, generate multiple candidate summaries with varying lengths based on various k values, and encode and score candidate summaries, in an end-to-end training manner. Finally, EDU-VL is evaluated on single- and multi-document benchmark datasets and shows improved ROUGE scores in comparison with state-of-the-art extractive models, and further human evaluation suggests that EDU-constituent summaries maintain good grammaticality and readability.

A Text Mining Model for Answering Checklist Questions Automatically from Parasitology Literature

2020 International Conference on Computing and Information Technology (ICCIT-1441), 2020

Complete reporting of Experimental Meta-data (EM) is necessary for reproducing and understanding biomedical experiments and results. Experimental Metadata Reporting Checklist Questions (EMR-CLQs) have been designed and used by journals as guidelines to capture EM and evaluate the quality of the reporting. Automatically answering EMR-CLQs is necessary to check completeness and clarity of EM, which can be useful for the peer-review process. Moreover, automatically extracted EMR-CLQ answers can be used to search the relevant literature efficiently for the meta-data analysis process. This paper shows the possibility of answering different types of EMR-CLQs automatically by understanding the structure of both the EMR-CLQs and the biomedical article. A text mining model (rule-based approach), based on information extraction techniques and the structure of the biomedical articles and the EMR-CLQs, is proposed as a first model in the biomedical reproducibility domain to answer EMR-CLQs automatically. The model was used to automatically answer five EMR-CLQs of two different types: Main and Attribute questions. We evaluated the feasibility of the model against gold-standard data of 58 full-text articles annotated by domain experts. The results show the possibility of answering the EMR-CLQs automatically, with a mean F-measure of 75% and 73% for the development and testing datasets, respectively.

MASK: A Success Story for An International Collaboration

International Journal of Population Data Science, 2020

Introduction: A significant amount of valuable information in Electronic Health Records (EHR) such as laboratory test results or echocardiogram interpretations is embedded in lengthy free-text fields. Often patients’ personal information is also included in these narratives. Privacy legislation in different jurisdictions requires de-identification of this information prior to making it available for research. This process can be challenging and time-consuming. In particular, rule-based algorithms may lead to over-masking of essential medical terms, conditions, or devices that are named after individuals. Objectives and Approach: We aimed to enhance ICES’ existing rule-based application to make it contextually-driven by applying Artificial Intelligence (AI). The ICES team collaborated with computer scientists at the University of Manchester who had already published work in this area and Evenset, a Toronto-based software company. Based on the Manchester University de-identification frame...

A curation pipeline for bio-derived chemical feedstocks

F1000Research, 2016

A Framework for Evaluation of Machine Reading Comprehension Gold Standards

ArXiv, 2020

Machine Reading Comprehension (MRC) is the task of answering a question over a paragraph of text. While neural MRC systems gain popularity and achieve noticeable performance, issues are being raised with the methodology used to establish their performance, particularly concerning the data design of gold standards that are used to evaluate them. There is but a limited understanding of the challenges present in this data, which makes it hard to draw comparisons and formulate reliable hypotheses. As a first step towards alleviating the problem, this paper proposes a unifying framework to systematically investigate the present linguistic features, required reasoning and background knowledge and factual correctness on one hand, and the presence of lexical cues as a lower bound for the requirement of understanding on the other hand. We propose a qualitative annotation schema for the first and a set of approximative metrics for the latter. In a first application of the framework, we analys...

Early Phase Validation of a Decision Support System within the Exemplar of Aneurysmal Subarachnoid Haemorrhage

Extracting useful software development information from mobile application reviews: A survey of intelligent mining techniques and tools

Expert Systems with Applications, 2018

Mobile application (app) websites such as Google Play and AppStore allow users to review their downloaded apps. Such reviews can be useful for app users, as they may help users make an informed decision; such reviews can also be potentially useful for app developers, if they contain valuable information concerning user needs and requirements. However, in order to unleash the value of app reviews for mobile app development, intelligent mining tools that can help discern relevant reviews from irrelevant ones must be provided. This paper surveys the state of the art in the development of such tools and techniques behind them. To gain insight into the maturity of the current support mining tools, the paper will also find out what app development information these tools have discovered and what challenges they are facing. The results of this survey can inform the development of more effective and intelligent app review mining techniques and tools.

Automatic Extraction of Mental Health Disorders From Domestic Violence Police Narratives: Text Mining Study

Journal of medical Internet research, Jan 13, 2018

Vast numbers of domestic violence (DV) incidents are attended by the New South Wales Police Force each year in New South Wales and recorded as both structured quantitative data and unstructured free text in the WebCOPS (Web-based interface for the Computerised Operational Policing System) database regarding the details of the incident, the victim, and person of interest (POI). Although the structured data are used for reporting purposes, the free text remains untapped for DV reporting and surveillance purposes. In this paper, we explore whether text mining can automatically identify mental health disorders from this unstructured text. We used a training set of 200 DV recorded events to design a knowledge-driven approach based on lexical patterns in text suggesting mental health disorders for POIs and victims. The precision returned from an evaluation set of 100 DV events was 97.5% and 87.1% for mental health disorders related to POIs and victims, respectively. After applying our app...

Identification of Occupation Mentions in Clinical Narratives

Lecture Notes in Computer Science, 2016

A patient’s occupation is an important variable used for disease surveillance and modeling, but such information is often only available in free-text clinical narratives. We have developed a large occupation dictionary that is used as part of both knowledge-driven (dictionary and rules) and data-driven (machine-learning) methods for the identification of occupation mentions. We have evaluated the approaches on both public and non-public clinical datasets. A machine-learning method using linear chain conditional random fields trained on a minimalistic set of features achieved up to 88% \( F_{1} \)-measure (token-level), with the occupation feature derived from the knowledge-driven method showing a notable positive impact across the datasets (up to an additional 32% \( F_{1} \)-measure).

NTCIR-11

Building a web application to visualise and explore epidemiological literature

Improving Project Management through Ontology-Driven Text Mining

Using local grammars for agreement modeling in highly inflective languages

Digital methods to enhance the usefulness of patient experience data in services for long-term conditions: the DEPEND mixed-methods study

Health Services and Delivery Research, 2020

Background: Collecting NHS patient experience data is critical to ensure the delivery of high-quality services. Data are obtained from multiple sources, including service-specific surveys and widely used generic surveys. There are concerns about the timeliness of feedback, that some groups of patients and carers do not give feedback and that free-text feedback may be useful but is difficult to analyse. Objective: To understand how to improve the collection and usefulness of patient experience data in services for people with long-term conditions using digital data capture and improved analysis of comments. Design: The DEPEND study is a mixed-methods study with four parts: qualitative research to explore the perspectives of patients, carers and staff; use of computer science text-analytics methods to analyse comments; co-design of new tools to improve data collection and usefulness; and implementation and process evaluation to assess use of the tools and any impacts. Setting: Services fo...

MC-DRE: Multi-Aspect Cross Integration for Drug Event/Entity Extraction

Proceedings of the 32nd ACM International Conference on Information and Knowledge Management

Extracting meaningful drug-related information chunks, such as adverse drug events (ADE), is crucial for preventing morbidity and saving many lives. Most ADEs are reported via an unstructured conversation with the medical context, so applying a general entity recognition approach is not sufficient. In this paper, we propose a new multi-aspect cross-integration framework for drug entity/event detection by capturing and aligning different context/language/knowledge properties from drug-related documents. We first construct multi-aspect encoders to describe semantic, syntactic, and medical document contextual information by conducting the following slot tagging tasks: main drug entity/event detection, part-of-speech tagging, and general medical named entity recognition. Then, each encoder conducts cross-integration with other contextual information in three ways: the key-value cross, attention cross, and feedforward cross, so the multi-encoders are integrated in depth. Our model outperforms all SOTA on two widely used tasks, flat entity detection and discontinuous event extraction.

Investigating Massive Multilingual Pre-Trained Machine Translation Models for Clinical Domain via Transfer Learning

Massively multilingual pre-trained language models (MMPLMs) have been developed in recent years, demonstrating superpowers and the pre-knowledge they acquire for downstream tasks. This work investigates whether MMPLMs can be applied to clinical domain machine translation (MT) towards entirely unseen languages via transfer learning. We carry out an experimental investigation using Meta-AI's MMPLMs "wmt21-dense-24-wide-en-X and X-en (WMT21fb)", which were pre-trained on 7 language pairs and 14 translation directions including English to Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese, and the opposite direction. We fine-tune these MMPLMs towards the English-Spanish language pair, which did not exist at all in their original pre-trained corpora, either implicitly or explicitly. We prepare carefully aligned clinical domain data for this fine-tuning, which is different from their original mixed domain knowledge. Our experimental results show that the fine-tuning is very successful using just 250k well-aligned in-domain EN-ES segments for three translation sub-tasks: clinical cases, clinical terms, and ontology concepts. It achieves evaluation scores very close to those of another MMPLM, NLLB from Meta-AI, which included Spanish as a high-resource setting in the pre-training. To the best of our knowledge, this is the first work to successfully use MMPLMs for clinical domain transfer-learning NMT on languages totally unseen during pre-training.

Predicting Perfect Quality Segments in MT Output with Fine-Tuned OpenAI LLM: Is it possible to capture editing distance patterns from historical data?

arXiv (Cornell University), Jul 31, 2023

Translation Quality Evaluation (TQE) is an essential step of the modern translation production process. TQE is critical in assessing both machine translation (MT) and human translation (HT) quality without reference translations. The ability to evaluate or even simply estimate the quality of translation automatically may open significant efficiency gains through process optimization. This work examines whether state-of-the-art large language models (LLMs) can be used for this purpose. We take OpenAI models as the best state-of-the-art technology and approach TQE as a binary classification task. On eight language pairs including English to Italian, German, French, Japanese, Dutch, Portuguese, Turkish, and Chinese, our experimental results show that fine-tuned gpt3.5 can demonstrate good performance on translation quality prediction tasks, i.e. whether the translation needs to be edited. Another finding, from comparing the performance of three different versions of OpenAI models (curie, davinci, and gpt3.5 with 13B, 175B, and 175B parameters, respectively), is that simply increasing the size of LLMs does not lead to apparently better performance on this task.
