Papers by Pashutan Modaresi
CLEF (Working Notes), 2016
Author masking is the task of paraphrasing a document so that its writing style no longer matches that of its original author. This task was introduced as part of the 2016 PAN Lab on Digital Text Forensics, for which a total of three research teams submitted their results. This work describes our methodology to evaluate the submitted obfuscation systems based on their safety, soundness and sensibleness. For the first two dimensions, we introduce automatic evaluation measures and for sensibleness we report our manual evaluation results.
FIRE (Working Notes), 2016
We developed an approach to automatically predict the personality traits of Java developers based on their source code for the PR-SOCO challenge 2016. The challenge provides a data set consisting of source code with their associated developers' personality traits (neuroticism, extraversion, openness, agreeableness, and conscientiousness). Our approach adapts features from the authorship identification domain and utilizes features that were specifically engineered for the PR-SOCO challenge. We experiment with two learning methods: linear regression and k-nearest neighbors regressor. The results are reported in terms of the Pearson product-moment correlation and root mean square error.
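The two evaluation measures named in the abstract above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' evaluation script; the function names `pearson_r` and `rmse` are assumptions made here, and no data from the challenge is reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rmse(y_true, y_pred):
    """Root mean square error between gold scores and predicted scores."""
    n = len(y_true)
    if n != len(y_pred) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```

In a setting like the one described, `y_true` would hold the annotated score for one trait per developer and `y_pred` the regressor's output. The two measures are complementary: correlation ignores a constant offset in the predictions, while RMSE penalizes it.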
arXiv (Cornell University), Jan 3, 2017
In this work, we present the results of a systematic study to investigate the (commercial) benefits of automatic text summarization systems in a real world scenario. More specifically, we define a use case in the context of media monitoring and media response analysis and claim that even using a simple query-based extractive approach can dramatically save the processing time of the employees without significantly reducing the quality of their work.

Identification and Evaluation of Keyphrases: Fuzzy Set based Scoring and Turing Test Inspired Evaluation
Automatic keyphrase extraction aims at extracting a compact representation of a single document, which can be used for numerous applications such as indexing, classification or summarization. Existing keyphrase extraction approaches typically consist of two steps: an extraction step to select the candidate phrases using some heuristics, and a scoring step to rank the extracted candidate phrases based on their importance in the text. Existing approaches to automatic keyphrase extraction mainly define the set of phrases of a document as a crisp set and, by scoring and ranking the phrases, select the keyphrases of the document. In this work we define the set of phrases in a document to be a fuzzy set, and based on the membership values of the phrases, we select the ones with higher membership values as the keyphrases of the document. Moreover, we propose a novel evaluation method inspired by the Turing test, which can be used for extractive summarization tasks.
CLEF (Working Notes), 2014
In this work we describe our approach to solve the author verification problem introduced in the PAN 2014 Author Identification task. The author verification task presents participants with a set of problems where each problem consists of a set of documents written by the same author and a questioned document with an unknown author. The task is then to decide whether the questioned document has the same author as the other documents or not. Inspired by a psychological personality model, our approach uses basic lexical feature extraction and fuzzy clustering. Using the created fuzzy clusters, the membership values of documents to the clusters can be computed. The distribution of the cluster membership values will be used finally to solve the verification problem.
On Definition of Automatic Text Summarization
Research in the continuously growing field of automatic text summarization is branched into extractive and abstractive approaches. Over the past few decades, major advances have occurred in extractive summarization and a smooth transition from extractive to abstractive approaches can be observed in recent years. Despite these advances, a proper definition of automatic text summarization has been mainly neglected by researchers. In this work we emphasize the importance of an appropriate definition of automatic text summarization. We review previous definitions of text summarization, investigate their properties and propose our own definition.

Author profiling deals with the study of various profile dimensions of an author such as age and gender. This work describes our methodology proposed for the task of cross-genre author profiling at PAN 2016. We address gender and age prediction as a classification task and approach this problem by extracting stylistic and lexical features for training a logistic regression model. Furthermore, we report the effects of our cross-genre machine learning approach for the author profiling task. With our approach, we achieved the first place for gender detection in English and tied for second place in terms of joint accuracy. For Spanish, we tied for first place.
We describe our submitted algorithm to the text alignment sub-task of the plagiarism detection task in the PAN 2014 challenge, which achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those that are relevant for a given pair of documents, we generate seeds of atomic plagiarism cases. These are then merged by an agglomerative single-linkage strategy using a defined distance measure.
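The merging step mentioned in the abstract above can be illustrated with a small sketch of agglomerative single-linkage clustering. Everything specific here is an assumption made for illustration: seeds are reduced to one-dimensional `(start, end)` character spans in a single document, the distance is the character gap between spans, and `max_gap` is a hypothetical threshold. The abstract does not state the paper's actual distance measure, and this sketch does not attempt to reproduce it.

```python
def span_distance(a, b):
    """Gap in characters between two (start, end) spans; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def single_linkage_merge(seeds, max_gap):
    """Merge seed spans agglomeratively with single linkage.

    Two clusters are merged when their closest pair of spans is at most
    `max_gap` apart (hypothetical threshold). Returns the merged spans sorted.
    """
    clusters = [[seed] for seed in seeds]

    def cluster_distance(c1, c2):
        # Single linkage: distance of the closest pair of members.
        return min(span_distance(a, b) for a in c1 for b in c2)

    while True:
        best = None  # (distance, i, j) of the closest mergeable pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if d <= max_gap and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]

    # Collapse each cluster to the span that covers all of its members.
    return sorted((min(s for s, _ in c), max(e for _, e in c)) for c in clusters)
```

Single linkage chains nearby seeds together: three spans that are each within `max_gap` of their neighbor end up in one merged case even when the outermost two are far apart.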
Proceedings of the 8th Annual Meeting of the Forum for Information Retrieval Evaluation, 2016
Given a set of sentences, a sentence orderer permutes the sentences in a way that the final text is linguistically coherent and semantically understandable. In this work, we focus on the binary and ternary tasks of ordering a pair of sentences regarding their linguistic coherence. We propose a methodology to automatically collect and annotate sentence ordering corpora in the news domain for English and German documents. Furthermore, we introduce a data-driven end-to-end neural architecture to learn the order of a pair of sentences and also recognize the cases where no ordering can be determined due to missing context.

Simurg
Proceedings of the 8th Annual Meeting of the Forum for Information Retrieval Evaluation, 2016
Abstractive single document summarization is considered a challenging problem in the field of artificial intelligence and natural language processing. Meanwhile, and specifically in the last two years, several deep learning summarization approaches were proposed that once again attracted the attention of researchers to this field. It is a well-known issue that deep learning approaches do not work well with small amounts of data. With some exceptions, this is, unfortunately, the case for most of the datasets available for the summarization task. Besides this problem, it should be considered that phonetic, morphological, semantic and syntactic features of the language are constantly changing over time and unfortunately most of the summarization corpora are constructed from old resources. Another problem is the language of the corpora. Not only in the summarization field, but also in other fields of natural language processing, most of the corpora are only available in English. In addition to the above problems, license terms and fees of the corpora are obstacles that prevent many academics and specifically non-academics from accessing these data. This work describes an open source framework to create an extendable multilingual corpus for abstractive single document summarization that addresses the above-mentioned problems. We describe a tool consisting of a scalable crawler and a centralized key-value store database to construct a corpus of an arbitrary size using a news aggregator service.
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016
This paper describes our participation in the SemEval-2016 Task 1: Semantic Textual Similarity (STS). We developed three methods for the English subtask (STS Core). The first method is unsupervised and uses WordNet and word2vec to measure a token-based overlap. In our second approach, we train a neural network on two features. The third method uses word2vec and LDA with regression splines.
From Phrases to Keyphrases
Proceedings of the 12th International Conference on Advances in Mobile Computing and Multimedia, 2014
Automatic keyphrase extraction aims at extracting a compact representation of a single document which can be used for various applications such as indexing, classification or summarization. Existing methods for keyphrase extraction usually define the set of phrases of a document as a crisp set and by scoring the phrases, they select the keyphrases of the document. In this work we define the set of phrases inside a document to be a fuzzy set, and based on the membership values of the phrases, we select the ones with higher membership values as the keyphrases of the document. Moreover we propose a novel evaluation method inspired by the Turing test which can be used for extractive summarization tasks.