Word embeddings have found their way into a wide range of natural language processing tasks, including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, as well as hidden patterns and trends in the data, they fail to offer interpretability. Interpretability is a key means of justification, which is integral to biomedical applications. We present an inclusive study on the interpretability of word embeddings in the medical domain, focusing on the role of sparse methods. Qualitative and quantitative measurements and metrics for the interpretability of word vector representations are provided. For the quantitative evaluation, we introduce an extensive categorized dataset that can be used to quantify interpretability based on category theory. Intrinsic and extrinsic evaluations of the studied methods are also presented. As for the latter, we propose datasets which can be utilized for effective extrinsic evaluati...
Dataset of COVID-19 outbreak and potential predictive features in the USA
This dataset provides information related to the outbreak of COVID-19 disease in the United States, including data from each of 3142 US counties from the beginning of the outbreak (January 2020) until September 2020. This data is collected from many public online databases and includes the daily number of COVID-19 confirmed cases and deaths, as well as 33 features that may be relevant to the pandemic dynamics: demographic, geographic, climatic, traffic, public-health, social-distancing-policy adherence, and political characteristics of each county. We anticipate many researchers will use this dataset to train models that can predict the spread of COVID-19 and to identify the key driving factors.
Improved models that can accurately predict COVID-19 dynamics are vital to managing the pandemic and its consequences. We use machine learning techniques to design an adaptive learner that, based on the epidemiological data available at any given time, produces a model that accurately forecasts the number of reported COVID-19 deaths and cases in the United States up to 10 weeks into the future, with a mean absolute percentage error of 9%. In addition to being the most accurate long-range COVID predictor so far developed, it captures the observed periodicity in daily reported numbers. Its effectiveness is based on three design features: (1) producing different model parameters to predict the number of COVID deaths (and cases) from each time and for a given number of weeks into the future, (2) systematically searching over the available covariates and their historical values to find an effective combination, and (3) training the model using “last-fold partitioning”, where each...
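The error figure quoted above is a mean absolute percentage error (MAPE). As a minimal sketch of how that metric is usually computed (the function name `mape` and the sample numbers are illustrative assumptions, not taken from the paper):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Hypothetical helper for illustration; it uses the standard MAPE
    definition and is not the authors' evaluation code.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # MAPE is undefined when a true value is zero.
        raise ValueError("actual values must be non-zero")
    # Average of |actual - predicted| / |actual| over all observations.
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


# Made-up weekly counts, used only to exercise the function.
print(mape([100, 200], [90, 220]))
```

Because each error is divided by the true value, MAPE is scale-free, but it is undefined for periods with zero reported counts, which is why the sketch rejects them.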
Heterogeneous networks are large graphs consisting of different types of nodes and edges. They are an important category of complex networks, but the processes of knowledge extraction and relation discovery from these networks are complicated and time-consuming. Moreover, the scale of these networks is steadily increasing. Thus, scalable and accurate methods are required for efficient knowledge extraction. In this paper, two distributed label propagation algorithms for heterogeneous networks, namely DHLP-1 and DHLP-2, are introduced. The Apache Giraph platform, which provides a vertex-centric programming model for designing and running distributed graph algorithms, is employed. Complex heterogeneous networks have many examples in the real world and are widely used today for modeling complicated processes. Biological networks are one such example. As a case study, we have measured the efficiency of our proposed DHLP-1 and DHLP-2 algorithms on a biological network consist...
Prediction and discovery of disease-causing genes are among the main missions of biology and medicine. In recent years, researchers have developed several methods based on gene/protein networks for the detection of causative genes. However, because of the presence of false positives in these networks, the results of these methods often lack accuracy and reliability. This problem can be solved by using multiple genomic sources to reduce noise in data. However, network integration can also affect the quality of the integrated network. In this paper, we present a method named RWRHN (random walk with restart on a heterogeneous network) with fuzzy fusion or RWRHN-FF. In this method, first, four gene-gene similarity networks are constructed based on different genomic sources and then integrated using the type-II fuzzy voter scheme. The resulting gene-gene network is then linked to a disease-disease similarity network, which itself is constructed by the integration of four sources, through...
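The random-walk-with-restart idea named above can be illustrated on a single small graph. The following is a minimal sketch of the generic iteration p ← (1 − r)·W·p + r·p0 with a column-normalized adjacency matrix W; it is not the RWRHN-FF method itself (no heterogeneous network, no fuzzy fusion), and the function name, the restart value, and the toy graph are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7,
                             tol=1e-10, max_iter=1000):
    """Generic random walk with restart on one homogeneous graph.

    adjacency: square list-of-lists of non-negative edge weights.
    seeds: indices of the seed nodes (e.g. known disease genes).
    Returns one proximity score per node.
    """
    n = len(adjacency)
    # Column-normalize so each column of W sums to 1 (or 0 if isolated).
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
          for j in range(n)] for i in range(n)]
    # Restart distribution: uniform over the seed nodes.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(W[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:  # stop once the scores have converged
            break
    return p


# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
scores = random_walk_with_restart(path, seeds=[0])
print(scores)
```

In this toy example the scores fall off with distance from the seed, which is the property that makes such scores usable for ranking candidate nodes.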
Papers by Zeinab Maleki