In digital document analysis for forensic applications, when an anonymous document is presented and the available tools cannot determine its true author, methods that identify the characteristics of the author profile (gender, age, personality, etc.) are of vital importance. We propose a simple classification method based on the similarity between objects, considering different features for document representation (a document corresponds to the set of tweets of a user): the terms used in the tweets, as well as the opinion and subjectivity characteristics present in them. Our goal is to classify, based on the content of the tweets, the gender and language variety of an author from an unknown set of that author's tweets. In the experiments we observed good results in gender classification but low values in language variety classification. We processed only the English dataset.
Author Profiling is an important field for detecting demographic characteristics of users based on the texts they write. Our main contribution focuses on determining a reduced subset of features that represent frequent lexical words for each profile of Mexican Twitter users. The new subset of features was obtained from the frequency of words in a profile (e.g., students), employing the theory of Transition Points. All objects are represented in this new feature space, formed by all the reduced subsets computed for each class or profile. The classification phase was carried out using Support Vector Machines provided by the Weka platform. The results obtained were good for Gender but need more effort for Location and Occupation, because the main factor affecting the results is unbalanced class distribution, which impacts the construction of the reduced vocabulary.
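The vocabulary reduction described in this abstract can be illustrated with a minimal sketch. It assumes one formulation of the transition point found in the literature, TP = (sqrt(1 + 8*I1) - 1) / 2, where I1 is the number of words that occur exactly once, and it keeps the words whose frequency lies within a relative neighborhood of TP. The abstract does not state the formula or the neighborhood width, so both, along with the names `transition_point`, `reduced_vocabulary`, and `width`, are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def transition_point(freqs: Counter) -> float:
    """Assumed formulation: TP = (sqrt(1 + 8 * I1) - 1) / 2,
    where I1 is the number of words that occur exactly once."""
    i1 = sum(1 for count in freqs.values() if count == 1)
    return (sqrt(1 + 8 * i1) - 1) / 2


def reduced_vocabulary(tokens, width=0.4):
    """Keep words whose frequency lies within +/- `width` (relative) of TP.
    The neighborhood width is a hypothetical parameter."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {word for word, count in freqs.items() if low <= count <= high}


# Toy "profile": all tokens written by users of one class (e.g. students).
profile_tokens = (
    "exam exam exam class class class homework homework party bus rain".split()
)
vocab = reduced_vocabulary(profile_tokens)
```

In this toy profile three words occur once, so the transition point is 2 and only the word with frequency 2 survives. Under the abstract's description, the final feature space would be formed from the reduced vocabularies computed for every class.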
The goal of the Style Change Detection task is to determine whether a document was written by more than one author and, if so, to delimit which paragraph (or, more generally, which portion of text) corresponds to each of them. The objective of our proposal is to build a paragraph representation based on general style features computed from character, lexical, and syntactic features, without the use of semantic words. The paragraphs were grouped employing a non-overlapping variant of the B0-maximal clustering algorithm, where the overlap was eliminated by considering the order of paragraphs in the document.
Identifying the authorship of an anonymous or disputed document is a cornerstone of automatic forensic applications. Moreover, it is a challenging task for both humans and computers. Clustering documents according to the linguistic style of the authors who wrote them has been little studied by the research community. To address this problem, the PAN Evaluation Framework has become the first effort to promote the development of author clustering. This article proposes a graph-based method, specifically β-compact clustering, for discovering the groups of documents written by the same author. The β-compact algorithm is based on the analysis of the similarity between documents: two documents belong to the same group as long as the similarity between them exceeds the threshold β and is the maximum similarity with respect to other documents. In our proposal we evaluated different linguistic features and similarity measures presented in previous works of auth...
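The grouping rule stated in this abstract can be sketched as follows. This is one possible reading, not necessarily the authors' implementation: each document is linked to its most similar neighbor(s) when that maximum similarity exceeds β, and the groups are the connected components of the resulting graph. The function name `beta_compact_clusters` and the precomputed similarity matrix are assumptions made for illustration.

```python
def beta_compact_clusters(sim, beta):
    """Group items by linking each item to its most similar neighbor(s)
    when that maximum similarity exceeds beta, then returning the
    connected components. `sim` is a symmetric matrix (list of lists)."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            continue
        best = max(sim[i][j] for j in others)
        if best <= beta:
            continue  # no neighbor is similar enough; i stays on its own
        for j in others:
            if sim[i][j] == best:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pairwise similarities between four documents.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]
clusters_strict = beta_compact_clusters(sim, beta=0.5)
clusters_loose = beta_compact_clusters(sim, beta=0.35)
```

With β = 0.5 only documents 0 and 1 are grouped, and documents 2 and 3 remain singletons. Lowering β to 0.35 also merges documents 2 and 3, which shows how the threshold controls the granularity of the author groups.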
This paper describes the proposal presented in the TAG-it author profiling task at EVALITA 2020 for sub-task 1. The main objective is to predict the gender and age of blog users from their posts, as well as the topic they wrote about. Our proposal uses an ensemble of machine learning algorithms with three of the most widely used classifiers and a character n-gram language model represented as a Bag of Words. To address this task we presented two different strategies aimed at finding the best possible results.
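As a rough illustration of the representation mentioned in this abstract, the sketch below builds a bag of character n-grams for a post and combines the labels predicted by three classifiers with a simple majority vote. The abstract names neither the classifiers nor the combination rule, so the voting scheme, the n-gram size, and the names `char_ngrams` and `majority_vote` are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of overlapping character n-grams (n-gram -> count)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def majority_vote(predictions):
    """Return the label predicted by most classifiers (hard voting)."""
    return Counter(predictions).most_common(1)[0][0]


# Character bigrams of a toy string.
bag = char_ngrams("banana", n=2)

# Hypothetical labels produced by three independent classifiers for one user.
label = majority_vote(["F", "M", "F"])
```

Character n-grams capture sub-word cues such as suffixes and punctuation habits without language-specific tokenization, which is one reason they are common in author profiling.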
False or unverified information spreads just like accurate information on social media platforms, possibly going viral and influencing public opinion and its decisions. Fake news is one of the most popular forms of false and unverified information and should be identified as soon as possible to minimize its dramatic effects. To face this challenge, in this paper we describe the system we developed to participate in the Author Profiling task “Profiling Fake News Spreaders on Twitter” proposed at the PAN 2020 Forum. Our proposal learns two representations for each tweet in an account’s profile. The first is based on CNN and LSTM networks that analyze the tweets at word level; the second is learned using the same architecture without sharing the weights, but this time the tweets are analyzed at character level. These representations are used for modeling the accounts’ profiles. Also conceived for the whole account’s profile is a general represen...
Masking an author's writing style has been used by novelists who wish to pass unnoticed, as well as by people who aim to give information without being linked to it. The PAN evaluation framework presents the task of paraphrasing or changing the writing style of a document while maintaining the topic being discussed. We propose a method that performs transformations on sentences with an unsupervised approach, i.e., without prior data on the author or on the linguistic characteristics of a document collection. We make syntactic and semantic changes using dictionaries and semantic resources, as well as syntactic rules for sentence simplification. In the evaluation section, we present the observed strengths and weaknesses of the proposal.
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020
English. This paper describes our system for participating in the TAG-it Author Profiling task at EVALITA 2020. The task aims to predict the age and gender of blog users from their posts, as well as the topic they wrote about. Our proposal combines representations learned by RNNs at word and sentence levels, Transformer neural networks, and hand-crafted stylistic features. All these representations are mixed and fed into a fully connected layer of a feed-forward neural network in order to make predictions for the addressed subtasks. Experimental results show that our model achieves encouraging performance.
Nowadays, the percentage of information available in English on the World Wide Web is decreasing, because other languages such as Chinese, Spanish, Arabic, and Portuguese are gaining acceptance and diffusion. This phenomenon has made multilingualism one of the main challenges for the intelligent processing, management, and retrieval of documents. To address this problem effectively, computational systems need new models to be designed or traditional document representation models to be improved. The availability of multilingual concept repositories and semantic networks has opened an attractive approach to modeling documents written in different languages as concept vectors in a common representation space. This work presents a new concept-based representation using the Multilingual Central Repository. Our proposal applies a word sense disambiguation of gr...
Abstract. Authorship analysis has become a decisive tool for the analysis of digital documents in the forensic sciences. We propose an Authorship Verification method that analyzes the similarities between an author's documents by neighborhood, without estimating thresholds from training. We implement two strategies for representing an author's documents, one based on instances and the other on computing the centroid. We evaluate collections according to the number of samples, the textual genres, and the topic addressed. We analyze the contribution of each comparison function and each feature used, as well as a majority-vote combination of each function-feature pair used in the similarity between documents. The tests were performed using the public collections of the PAN 2014 and 2015 competitions. The results obtained are promising and allow us to evaluate our proposal and identify the future work to be developed. Keywords. Authorship analysis, authorship verification, comparison functions, linguistic features.
Authorship analysis is an important task for different text applications, for example in the field of digital forensic text analysis. Hence, we propose an authorship analysis method that compares the average similarity of a text of unknown authorship with all the texts of an author. Under this idea, a text that was not written by an author would not exceed the average similarity with the known texts; a text of unknown authorship is considered as written by the author only if it exceeds the average similarity obtained between the texts written by that author and if it obtains the highest value when its average similarity is compared with that of the rest of the authors. For each linguistic feature we obtain a majority vote using different functions, and for the final decision we divide the number of features whose vote attributes the unknown text to the author by the total number of features analyzed. The results obtained for each language in the PAN 2015 authorship verification competition are presented in the overview of the task.
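The core decision rule of this abstract can be sketched for a single feature and a single similarity function. The sketch assumes word counts as the feature and cosine similarity as the function, neither of which is specified in the abstract, and it omits both the comparison against other authors and the voting across features. The names `cosine` and `written_by_author` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def written_by_author(known_texts, unknown_text):
    """Accept the unknown text only if its average similarity to the known
    texts exceeds the average similarity among the known texts themselves."""
    if len(known_texts) < 2:
        raise ValueError("at least two known texts are needed")
    known = [Counter(t.lower().split()) for t in known_texts]
    unknown = Counter(unknown_text.lower().split())
    pairs = list(combinations(known, 2))
    internal = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    external = sum(cosine(unknown, k) for k in known) / len(known)
    return external > internal


known_texts = ["the cat sat on the mat", "the cat lay on the rug"]
accepted = written_by_author(known_texts, "the cat sat on the rug")
rejected = written_by_author(known_texts, "quantum flux capacitors hum loudly")
```

Because the acceptance threshold is derived from the author's own texts, no threshold has to be estimated from a separate training set. A fuller version would repeat this test for each feature and function pair and combine the outcomes by voting, as the abstract describes.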
Authorship analysis is an important task for different text applications, for example in the field of digital forensic text analysis. Hence, we propose an authorship analysis method that compares the average similarity of a text of unknown authorship with all the texts of an author. Under this idea, a text that was not written by an author would not exceed the average similarity with the known texts, and a text of unknown authorship would be considered as written by the author only if it exceeds the average similarity obtained between the texts written by that author. The experiments were carried out using the data provided in the PAN 2014 competition for Spanish articles for the task of authorship verification. We ran experiments using different similarity functions and 17 linguistic features. We analyze the results obtained with each function-feature pair against the baseline of the competition. Additionally, we introduce a text filtering phase that deletes all the sample text of an author ...
Papers by Daniel Castro