Papers by Erfaneh Gharavi

Bioengineering, Mar 8, 2024
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
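A minimal sketch of the embedding-distance retrieval step this abstract describes, assuming region sets and query strings have already been mapped into the shared co-embedding space; `regionset_embeddings` and the query-encoding call are hypothetical placeholders, not the authors' API.

```python
# Sketch only: rank region sets against a query by cosine similarity in a
# shared co-embedding space. All variable names are illustrative assumptions.
import numpy as np

def cosine_rank(query_vec, matrix, top_k=5):
    """Return indices of the top_k rows of `matrix` closest to `query_vec`."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every region set
    return np.argsort(-sims)[:top_k]

# Task 1 (query string -> region sets), assuming a hypothetical label encoder:
# query_vec = embed_query("K562 H3K27ac")
# hits = cosine_rank(query_vec, regionset_embeddings)
```

The same distance computation covers the other two tasks by swapping what is compared: label embeddings against a region-set embedding for label suggestion, and region-set embeddings against a query region-set embedding for set-to-set retrieval.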
Using RST-based deep neural networks to improve text representation
Signal and Data Processing, Jun 1, 2023

bioRxiv (Cold Spring Harbor Laboratory), Aug 28, 2023
Background: Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of supervision from curated metadata and can condense rich biological knowledge from publicly available data into region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings and to tune model training to yield optimal results. Methods: To bridge this gap, we propose four evaluation metrics: the cluster tendency test (CTT), the reconstruction test (RCT), the genome distance scaling test (GDST), and the neighborhood preserving test (NPT). The CTT and RCT are statistical methods that evaluate how well region embeddings can be clustered and how much the embeddings preserve the information contained in the training data. The GDST and NPT exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much of this information is captured by individual region embeddings and by a set of region embeddings. Results: We demonstrate the utility of these statistical and biological tests for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings. Availability: Code is available at https://github.com/databio/geniml.
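The sketch below illustrates only the intuition behind a genome-distance scaling check (regions close on the genome should tend to have similar embeddings); it is not the paper's exact GDST statistic, and `regions` (chromosome, start, end tuples) and `embeddings` are assumed inputs.

```python
# Illustrative check: does embedding distance grow with genomic distance?
# Not the published GDST; inputs are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def distance_scaling(regions, embeddings, n_pairs=10000, seed=0):
    rng = np.random.default_rng(seed)
    gd, ed = [], []
    for _ in range(n_pairs):
        i, j = rng.integers(len(regions), size=2)
        ci, si, ei = regions[i]
        cj, sj, ej = regions[j]
        if i == j or ci != cj:
            continue  # genomic distance is only defined within a chromosome
        gd.append(abs((si + ei) / 2 - (sj + ej) / 2))      # midpoint distance in bp
        ed.append(np.linalg.norm(embeddings[i] - embeddings[j]))
    return spearmanr(gd, ed)  # positive correlation suggests distance is preserved
```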

bioRxiv (Cold Spring Harbor Laboratory), Aug 21, 2023
As available genomic interval data increases in scale, we require fast systems to search it. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but these approaches lead to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Results: Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string; suggesting new labels for database region sets; and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.

Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings
Motivation: Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) is now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. Results: We implemented our approach in scEmbed, an unsupervised machine learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed is competitive with alternative scATAC embedding approaches in terms of clustering ability ...
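A rough sketch of the transfer-learning idea, under the assumption that a cell can be represented by pooling pre-trained region embeddings over its accessible regions and then clustered; this is not the scEmbed implementation, and `region_embeddings` and `cell_regions` are hypothetical.

```python
# Sketch: pool pre-trained region embeddings per cell, then cluster the cells.
# `region_embeddings` (region id -> vector) and `cell_regions` (per-cell lists
# of accessible region ids) are assumed inputs.
import numpy as np
from sklearn.cluster import KMeans

def embed_cell(region_ids, region_embeddings, dim=100):
    vecs = [region_embeddings[r] for r in region_ids if r in region_embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# cell_matrix = np.stack([embed_cell(r, region_embeddings) for r in cell_regions])
# clusters = KMeans(n_clusters=10, n_init=10).fit_predict(cell_matrix)
```

Because the region embeddings are learned once on reference data, new datasets only need the cheap pooling and clustering steps, which is where the claimed computational advantage comes from.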

Predicting customers' future demand using data mining analysis: A case study of wireless communication customer
The 5th Conference on Information and Knowledge Technology, 2013
Due to high competition in today's business and the need for satisfactory communication with customers, companies understand the inevitable necessity to focus not only on preventing customer churn but also on predicting customers' needs and providing the best services for them. The purpose of this article is to predict future services needed by wireless users using data mining techniques. For this purpose, the customer database of an ISP in Shiraz, which logs customer usage of wireless internet connections, is utilized. Since an internet service is defined by three main factors (time, speed, and traffic), we predict each separately. First, future service demand is predicted by implementing a simple Recency, Frequency, Monetary (RFM) model as a baseline. Other factors, such as duration since first use, the slope of the customer's usage curve, percentage of activation, bytes in, bytes out, the number of retries to establish a connection, and customer lifetime value, are then added to the RFM model. Each of the R, F, and M criteria is then alternately omitted and the result is evaluated. Assessment is done through an analysis node, which determines the accuracy of the evaluated data among the partitioned data. The results show that CART and C5.0 are the best algorithms for predicting future services in this case. As for the features, duration and transferred bytes are the most important after RFM. An ISP may use the model discussed in this article to meet customers' demands and ensure their loyalty and satisfaction.
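A hedged sketch of how an RFM-style feature table plus a decision tree (standing in for the CART model mentioned above) might look; the `usage_log` columns and the label vector are hypothetical and not taken from the paper.

```python
# Sketch: build recency/frequency/monetary features plus extra usage features
# from a hypothetical per-session log, then fit a CART-style decision tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def rfm_features(usage_log, now):
    g = usage_log.groupby("customer_id")
    return pd.DataFrame({
        "recency": (now - g["session_date"].max()).dt.days,  # days since last connection
        "frequency": g.size(),                               # number of sessions
        "monetary": g["payment"].sum(),                       # total spend
        "bytes_in": g["bytes_in"].sum(),
        "bytes_out": g["bytes_out"].sum(),
    })

# X = rfm_features(usage_log, pd.Timestamp.today())
# model = DecisionTreeClassifier().fit(X, future_demand_labels)  # CART-style tree
```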
Multi-level text document similarity estimation and its application for plagiarism detection
Iran Journal of Computer Science

Neural computing & applications (Print), Nov 7, 2019
The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and different types of obfuscation in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, sentence pairs with the highest similarity are considered as the candidates or seeds of plagiarism cases. To filter and merge these seeds, a set of parameters, including the Jaccard similarity and merging thresholds, are tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, regulates a unique set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. With the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian, and Arabic on the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach achieves considerable performance on different datasets in various languages. Our online threshold tuning approach, without any training dataset, works as well as, or in some cases even better than, the training-based method.
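A minimal sketch of the seeding step described above, assuming a pre-trained `word_vectors` lookup: sentences are embedded by averaging word vectors, and source/suspicious pairs above a cosine threshold become candidate seeds to be filtered and merged later; the threshold value is illustrative.

```python
# Sketch: average word vectors into sentence vectors, then collect
# source/suspicious sentence pairs above a cosine-similarity threshold.
import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def candidate_seeds(src_sents, susp_sents, word_vectors, threshold=0.9):
    seeds = []
    for i, s in enumerate(src_sents):
        sv = sentence_vector(s, word_vectors)
        for j, t in enumerate(susp_sents):
            tv = sentence_vector(t, word_vectors)
            denom = np.linalg.norm(sv) * np.linalg.norm(tv)
            if denom and sv @ tv / denom >= threshold:
                seeds.append((i, j))  # later filtered/merged with Jaccard and merge thresholds
    return seeds
```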

Plagiarism detection is defined as the automatic identification of reused text materials. General availability of the internet and easy access to textual information enhance the need for automated plagiarism detection. In this regard, different algorithms have been proposed to perform the task of plagiarism detection in text documents. Due to the drawbacks and inefficiency of traditional methods and the lack of proper algorithms for Persian plagiarism detection, in this paper we propose a deep learning based method to detect plagiarism. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods are used to combine the word vectors for sentence representation. By comparing representations of source and suspicious sentences, sentence pairs with the highest similarity are considered as candidates for plagiarism. The final plagiarism decision is made using a two-level evaluation method. Our method has been used in PAN2016 Persian plagiar...

2019 Systems and Information Engineering Design Symposium (SIEDS)
Readily available, trustworthy, and usable medical information is vital to promoting global health. Cochrane is a non-profit medical organization that conducts and publishes systematic reviews of medical research findings. Over 3000 Cochrane Reviews are presently used as evidence in Wikipedia articles. Currently, Cochrane's researchers manually search Wikipedia pages related to medicine in order to identify Wikipedia articles that can be improved with Cochrane evidence. Our aim is to streamline this process by applying existing document similarity and information retrieval methods to automatically link Wikipedia articles and Cochrane Reviews. Potential challenges to this project include document length and the specificity of the corpora. These challenges distinguish this problem from ordinary document representation and retrieval problems. For our methodology, we worked with data from 7400 Cochrane Reviews, ranging from one to several pages in length, and 33,000 Wikipedia articles categorized as medical. We explored different methods of document vectorization, including TF-IDF, LDA, LSA, word2vec, and doc2vec. For every document in both corpora, similarity to each document in the opposing set was calculated using established vector similarity metrics such as cosine similarity and KL divergence. Labeled data for this unsupervised task was not available. Models were evaluated by comparing the results to two standards: (1) Cochrane Reviews currently cited in Wikipedia articles and (2) a dataset provided by a medical expert that indicates which Cochrane Reviews could be considered for specific Wikipedia articles. Our system performs best using TF-IDF document representation and cosine similarity.
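A short sketch of the best-performing setup reported here (TF-IDF plus cosine similarity), fitting one vectorizer over both corpora and scoring every Wikipedia article against every Cochrane Review; the input lists are hypothetical placeholders for the two collections.

```python
# Sketch: TF-IDF vectors shared across both corpora, cosine similarity between
# every Wikipedia article and every Cochrane Review, top-k reviews per article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_corpora(wiki_texts, cochrane_texts, top_k=3):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(wiki_texts + cochrane_texts)
    wiki, cochrane = X[:len(wiki_texts)], X[len(wiki_texts):]
    sims = cosine_similarity(wiki, cochrane)          # rows: articles, cols: reviews
    return sims.argsort(axis=1)[:, ::-1][:, :top_k]   # top_k review indices per article
```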

ArXiv, 2020
Due to the increasing amount of data on the internet, finding a highly informative, low-dimensional representation for text is one of the main challenges for efficient natural language processing tasks, including text classification. This representation should capture the semantic information of the text while retaining its relevance for document classification. This approach maps documents with similar topics to nearby regions of the vector space. To obtain representations for large texts, we propose the utilization of deep Siamese neural networks. To embed document topic relevance in the distributed representation, we use a Siamese neural network to jointly learn document representations. Our Siamese network consists of two multi-layer perceptron sub-networks. We examine our representation for the text categorization task on the BBC news dataset. The results show that the proposed representations outperform the conventional and state-of-the-art representatio...
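A hedged PyTorch sketch of a Siamese multi-layer perceptron: both documents pass through the same tower, and a contrastive loss pulls same-topic pairs together; layer sizes, the loss, and the margin are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: two inputs share one MLP tower; a contrastive objective brings
# same-topic document pairs close and pushes different-topic pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseMLP(nn.Module):
    def __init__(self, in_dim=2000, hidden=256, out_dim=64):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, a, b):
        return self.tower(a), self.tower(b)   # shared weights on both branches

def contrastive_loss(za, zb, same_topic, margin=1.0):
    # same_topic: float tensor of 1s (same topic) and 0s (different topics)
    d = F.pairwise_distance(za, zb)
    return (same_topic * d.pow(2) + (1 - same_topic) * F.relu(margin - d).pow(2)).mean()
```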

Topic identification, as a specific case of text classification, is one of the primary steps toward knowledge extraction from raw textual data. In such tasks, words are treated as a set of features. Due to the high dimensionality and sparseness of the feature vectors resulting from traditional feature selection methods, most text classification methods proposed for this purpose lack performance and accuracy. When dealing with tweets, which are limited in the number of words, these problems are even more pronounced. To alleviate such issues, we propose a new topic identification method for Spanish tweets based on deep representations of Spanish words. In the proposed method, words are represented as multi-dimensional vectors; in other words, words are replaced with their equivalent vectors, which are calculated based on a transformation of the raw text data. An average aggregation technique is used to transform the word vectors into a tweet representation. Our...
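A short sketch of this pipeline under the assumption that pre-trained Spanish word vectors are available as a `spanish_vectors` dictionary: tweet vectors are built by averaging word vectors and fed to an off-the-shelf classifier; the classifier choice is illustrative, not the paper's.

```python
# Sketch: average pre-trained Spanish word vectors into a tweet vector,
# then train a standard topic classifier on the resulting matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tweet_vector(tokens, spanish_vectors, dim=300):
    vecs = [spanish_vectors[t] for t in tokens if t in spanish_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# X = np.stack([tweet_vector(t.split(), spanish_vectors) for t in tweets])
# clf = LogisticRegression(max_iter=1000).fit(X, topic_labels)
```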
ArXiv, 2020
During a disease outbreak, timely non-medical interventions are critical in preventing the disease from growing into an epidemic and ultimately a pandemic. However, taking quick measures requires the capability to detect the early warning signs of the outbreak. This work collects Twitter posts surrounding the 2020 COVID-19 pandemic expressing the most common symptoms of COVID-19, including cough and fever, geolocated to the United States. Through examining the variation in Twitter activity at the state level, we observed a temporal lag between the rises in the number of symptom-reporting tweets and officially reported positive cases, which varies between 5 and 19 days.
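An illustrative sketch of one way to estimate such a lag, not necessarily the authors' method: slide the daily symptom-tweet series against the daily case series for a state and keep the shift with the highest correlation. Both series are assumed to be equal-length numpy arrays, with `max_lag` shorter than the series.

```python
# Sketch: lag estimation by maximizing the lagged Pearson correlation between
# daily symptom-tweet counts and daily reported cases for one state.
import numpy as np

def best_lag(tweet_counts, case_counts, max_lag=30):
    lags = range(1, max_lag + 1)
    corrs = [np.corrcoef(tweet_counts[:-k], case_counts[k:])[0, 1] for k in lags]
    return lags[int(np.argmax(corrs))]   # days by which tweets lead reported cases
```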

Siamese Discourse Structure Recursive Neural Network for Semantic Representation
2019 IEEE 13th International Conference on Semantic Computing (ICSC), 2019
Finding a highly informative, low-dimensional representation for texts, specifically long texts, is one of the main challenges for efficient information storage and retrieval. This representation should capture the semantic and syntactic information of the text while retaining relevance for large-scale similarity search. We propose the utilization of Rhetorical Structure Theory (RST) to consider text structure in the representation. In addition, to embed document relevance in the distributed representation, we use a Siamese neural network to jointly learn document representations. Our Siamese network consists of two sub-networks of recursive neural networks built over the RST tree. We examine our approach on two datasets, a subset of the Reuters corpus and the BBC news dataset. Our model outperforms latent Dirichlet allocation document modeling on both datasets. Our method also outperforms latent semantic analysis document representation by 3% and 6% on the B...
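A rough sketch of the recursive composition idea over an RST tree: each internal node's vector is a nonlinear combination of its children's vectors, so the root vector summarizes the whole document. The weights here are random stand-ins for trained parameters, and the node class is hypothetical.

```python
# Sketch: bottom-up composition over a binary RST tree; leaves carry vectors
# for elementary discourse units, internal nodes combine their children.
import numpy as np

DIM = 100
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))   # composition weights (left; right)
b = np.zeros(DIM)

class Node:
    def __init__(self, left=None, right=None, leaf_vec=None):
        self.left, self.right, self.leaf_vec = left, right, leaf_vec

def compose(node):
    if node.leaf_vec is not None:                 # leaf: elementary discourse unit vector
        return node.leaf_vec
    child = np.concatenate([compose(node.left), compose(node.right)])
    return np.tanh(W @ child + b)                 # internal RST node representation
```

In the Siamese setup, two documents are composed with the same parameters and the resulting root vectors are compared, so relevance is learned jointly with the representation.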


Existing simulations designed for cultural and interpersonal skill training rely on pre-defined responses with a menu option selection interface. Using a multiple-choice interface and restricting trainees' responses may limit the trainees' ability to apply the lessons in real life situations. These systems also use a simplistic evaluation model, where trainees' selected options are marked as either correct or incorrect. This model may not capture sufficient information to drive an adaptive feedback mechanism that improves trainees' cultural awareness. This paper describes the design of a dialogue-based simulation for cultural awareness training. The simulation is built around a disaster management scenario involving a joint coalition between the US and Chinese armies. Trainees are able to engage in realistic dialogue with the Chinese agent. Their responses at different points are evaluated by different multi-label classification models. Based on training on...
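A rough sketch of evaluating a free-text trainee response with a multi-label classifier, assuming TF-IDF features and a one-vs-rest logistic regression; the labels and example response are invented for illustration and are not from the paper.

```python
# Sketch: multi-label classification of a free-text response, so a single
# response can trigger several feedback labels at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Y = mlb.fit_transform(response_label_sets)   # e.g. [{"polite", "culturally_aware"}, ...]
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
# clf.fit(training_responses, Y)
# predicted = mlb.inverse_transform(
#     clf.predict(["We should consult the Chinese liaison officer first."])
# )
```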

Data Collection Methods for Building a Free Response Training Simulation
Most past research in the area of serious games for simulation has focused on games with constrained, multiple-choice based dialogue systems. Recent advancements in natural language processing research make free-input, text classification-based dialogue systems more feasible, but an effective framework for collecting training data for such systems has not yet been developed. This paper presents methods for collecting and generating data for training a free-input, classification-based system. Various data crowdsourcing prompt types are presented. A binary category system, which increases the fidelity of the labeling to make free-input classification more effective, is presented. Finally, a data generation algorithm based on the binary data labeling system is presented. Future work will use the data crowdsourcing and generation methods presented here to implement a free-input dialogue system in a virtual reality (VR) simulation designed for cultural competency training.

A Fast Multi-level Plagiarism Detection Method Based on Document Embedding Representation
Nowadays, global networks facilitate access to vast amounts of textual information and, as a consequence, make plagiarism easier. Given the amount of text material produced every day, the need for an automated, fast plagiarism detection system is more crucial than ever. Plagiarism detection is defined as the identification of reused text materials. In this regard, different algorithms have been proposed to perform the task of plagiarism detection in text documents. Due to the limited semantic representation and computational inefficiency of traditional plagiarism detection algorithms, in this paper we propose an embedding-based document representation to detect plagiarism in documents using a two-level decision making approach. The method is language-independent and works well across various languages. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods are used to combine the word vectors in order to repre...
