Papers by Marcelo Mendoza

Twitter sentiment analysis, the task of automatically retrieving opinions from tweets, has received increasing interest from the web mining community. This is due to its importance in a wide range of fields such as business and politics. People express sentiments about specific topics or entities with different strengths and intensities, where these sentiments are strongly related to their personal feelings and emotions. A number of methods and lexical resources have been proposed to analyze sentiment from natural language texts, addressing different opinion dimensions. In this article, we propose an approach for boosting Twitter sentiment classification using different sentiment dimensions as meta-level features. We combine aspects such as opinion strength, emotion and polarity indicators, generated by existing sentiment analysis methods and resources. Our research shows that the combination of sentiment dimensions provides significant improvement in Twitter sentiment classification tasks such as polarity and subjectivity.
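
As a rough illustration of the meta-level feature idea (not the paper's exact pipeline), the sketch below stacks the outputs of several sentiment resources into a single feature vector and feeds it to a standard classifier; the toy scoring functions and example tweets are hypothetical stand-ins for real lexicons and data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical wrappers around existing resources; real ones would query
    # lexicons for polarity, opinion strength and emotion scores.
    def toy_polarity(tweet):
        return tweet.lower().count("good") - tweet.lower().count("bad")

    def toy_strength(tweet):
        return float(tweet.count("!"))

    def meta_features(tweet):
        # Stack the per-resource scores into one meta-level feature vector.
        return np.array([toy_polarity(tweet), toy_strength(tweet)])

    tweets = ["good flight, good crew", "bad service!!", "good price"]
    labels = [1, 0, 1]  # 1 = positive, 0 = negative
    X = np.vstack([meta_features(t) for t in tweets])
    clf = LogisticRegression().fit(X, labels)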

Topic models have been of growing interest in the last decade. In particular, techniques based on probabilistic latent variable models provide a solid theoretical base and a flexible framework that allows for the modeling of various kinds of document collections. These models introduce a set of latent variables that capture the relationships between terms, documents and other attributes of the collection that are not evidently manifested but can be modeled as unobserved relationships. The flexibility of this modeling family allows one to incorporate relevant properties of the text such as polysemy, grouping sets of terms that describe concepts into topics. The use of latent variables also allows one to make inferences about the presence of topics in each document. Topic models are fundamentally divided into two broad approaches: the techniques derived from Probabilistic Latent Semantic Analysis (PLSA) [1], which introduce latent variables without assuming distribution priors, and the techniques based on Latent Dirichlet Allocation (LDA) [2], which assume priors over topics and vocabulary by using a Dirichlet distribution. Both approaches have strengths and weaknesses. On the one hand, PLSA fits the model using the Expectation-Maximization (EM) algorithm [3], a standard method for parameter inference in latent variable models, but it tends to overfit the data, limiting generalization, because EM can only guarantee convergence to local optima. On the other hand, LDA addresses this limitation by introducing Dirichlet priors on the vocabulary and on topic distributions over documents, which corresponds to a Bayesian regularization of the input. This improves the generalization capability of the models, but it introduces computational difficulties in parameter estimation, addressed using Monte Carlo methods through Gibbs sampling. We propose to explore the use of regularization operators on the EM estimators of PLSA, which allows for control of the trade-off between generalization and overfitting that is inherent in local optimization methods. We introduce the eliteness-versus-background concept to model the production of text from two components. The idea is that when producing text, the author selects words from the elite of the term distributions associated with each topic. However, there are also words that are part of the natural language background and do not correspond to specific terms of any topic but rather to common terms that cut across topics. To model this linguistic phenomenon, we propose to modify PLSA, introducing sparsification over the latent variables associated with the terms and smoothing over a single latent variable that is capable of modeling the background.
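
A minimal sketch of the underlying idea, under simplifying assumptions: standard PLSA fitted with EM, plus one extra component whose word distribution is held fixed at the corpus frequencies to play the role of the background. The paper's sparsification operator and regularized estimators are omitted here; this is not the authors' exact method.

    import numpy as np

    def plsa_with_background(N, K, iters=50, seed=0):
        # N: (docs x terms) count matrix; K learned topics plus one fixed
        # background component whose word distribution equals corpus frequencies.
        rng = np.random.default_rng(seed)
        D, W = N.shape
        Pwz = rng.dirichlet(np.ones(W), size=K + 1)     # P(w|z)
        Pwz[K] = N.sum(axis=0) / N.sum()                # background component
        Pzd = rng.dirichlet(np.ones(K + 1), size=D)     # P(z|d)
        for _ in range(iters):
            # E-step: responsibilities P(z|d,w)
            R = Pzd[:, None, :] * Pwz.T[None, :, :]
            R /= R.sum(axis=2, keepdims=True) + 1e-12
            # M-step: expected counts
            C = N[:, :, None] * R
            Pzd = C.sum(axis=1)
            Pzd /= Pzd.sum(axis=1, keepdims=True)
            upd = C.sum(axis=0).T
            Pwz[:K] = upd[:K] / upd[:K].sum(axis=1, keepdims=True)  # background stays fixed
        return Pzd, Pwz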

We assess media bias in cable news reporting compared to online news stories. We make use of large-scale data resources to operationalize media bias on three levels: gatekeeping, or news selection; coverage, or differential attention to news; and the degree of subjectivity in news statements. We analyze the captions of about 140 cable channels in the U.S. and hundreds of online news stories for six months, an observation window that coincides with the 2012 Republican primaries. Our findings suggest that cable channels are more similar to each other than to online news sources, but that similarities vary across the three levels of bias. The comparison between online news and cable channels also suggests that some of the differences are not related to systematic bias but to the amount of diversity that the two media allow (cable being more restricted by space constraints than online media).

Collaborative filtering (CF) is one of the most successful recommender techniques. It is based on the idea that people often get the best recommendations from someone with similar tastes to their own. Broadly, there are memory-based and model-based CF techniques. As a representative memory-based CF technique, neighborhood-based CF uses some measure to compute the similarity between users. In model-based CF, clustering algorithms group users according to their ratings and use the cluster as the neighborhood. A shortcoming of all these methods is the over-exploitation of data locality: they disregard the global data structure, which affects recall and diversity. To address this limitation, we propose to explore the use of a spectral clustering strategy to infer the user cluster structure. Then, to expand the search for relevant users/items, we use the Bray-Curtis coefficient, a measure that is able to exploit the global cluster structure to infer user proximity. Compared to traditional similarity metrics, our approach is more flexible because it can capture relationships considering the overall cluster structure, enriching recommendations. We perform an experimental comparison of the proposed method against traditional prediction algorithms using three widely known benchmark data sets. Our experimental results show that our proposal is feasible.
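
For reference, the Bray-Curtis dissimilarity between two non-negative vectors u and v is sum(|u_i - v_i|) / sum(u_i + v_i). Below is a hedged sketch of how it could yield a proximity score between two users described by per-cluster rating mass; the cluster vectors are made up for illustration.

    import numpy as np
    from scipy.spatial.distance import braycurtis

    # Hypothetical per-cluster rating mass for two users, one entry per spectral cluster.
    u = np.array([4.0, 0.0, 7.5, 1.0])
    v = np.array([3.0, 2.0, 6.0, 0.0])

    # Bray-Curtis dissimilarity lies in [0, 1]; here 1 - dissimilarity is used as proximity.
    proximity = 1.0 - braycurtis(u, v)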

We propose a new method for spectra modeling that uses Splatalogue (a spectral line catalog) as a training data set to learn species and transitions in data cubes captured by observatories. Our model is based on Latent Dirichlet Allocation, a probabilistic generative model that is capable of capturing the co-occurrence of emission lines in different channels. We use Splatalogue to create a channel vocabulary, processing each species as a document in the topic model domain. The model comprises a collection of species/transitions over a comprehensive collection of channel-energy pairs. Then, we extend the model using Labeled Latent Dirichlet Allocation, exploring the capabilities of our approach to label lines in an unsupervised fashion. To the best of our knowledge, this is the first time that a probabilistic generative model has been used to label spectra in Astronomy. The main advantage of our proposal is the ability to model sparse, high-dimensional data, using posterior inference to label new, unseen data. Our Splatalogue-based mixed membership model encodes the human knowledge acquired by astronomers over decades of labeling spectral lines. Experimental results show that our proposal is feasible.
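
As a rough stand-in (plain LDA rather than the Labeled LDA extension, and with an invented channel-token encoding), the sketch below treats each species as a document over a discretized channel vocabulary and fits a topic model with scikit-learn.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # Each "document" is one species written as a bag of channel tokens; the
    # "ch####" encoding is hypothetical, standing in for binned rest frequencies.
    species_docs = [
        "ch0231 ch0232 ch0987",
        "ch0231 ch0450 ch0451",
        "ch0450 ch0451 ch0987 ch0988",
    ]

    X = CountVectorizer().fit_transform(species_docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    theta = lda.transform(X)   # per-species mixture over channel topics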

Opinions in forums and social networks are released by millions of people due to the increasing number of users who use Web 2.0 platforms to opine about brands and organizations. For enterprises or government agencies it is almost impossible to track everything people say, producing a gap between user needs/expectations and organizational actions. To bridge this gap we created Viscovery, a platform for opinion summarization and trend tracking that is able to analyze a stream of opinions recovered from forums. To do this we use dynamic topic models, which allow us to uncover the hidden structure of topics behind opinions, characterizing vocabulary dynamics. We extend dynamic topic models for incremental learning, a key aspect needed in Viscovery for model updating in near real time. In addition, we include sentiment analysis in Viscovery, allowing us to separate positive/negative words for a specific topic at different levels of granularity. Viscovery allows users to visualize representative opinions and terms in each topic. At a coarse level of granularity, the dynamics of the topics can be analyzed using a 2D topic embedding, suggesting longitudinal topic merging or segmentation. In this paper we report our experience developing this platform, sharing lessons learned and opportunities that arise from the use of sentiment analysis and topic modeling in real-world applications.

Text classification is a challenge in document labeling tasks such as spam filtering and sentiment analysis. Due to the descriptive richness of generative approaches such as probabilistic Latent Semantic Analysis (pLSA), documents are often modeled using these kinds of strategies. Recently, a supervised extension of pLSA (spLSA [10]) has been proposed for human action recognition in the context of computer vision. In this paper we propose to extend spLSA for use in text classification. We do this by introducing two extensions to spLSA: a) Regularized spLSA, and b) Label uncertainty in spLSA. We evaluate the proposal on spam filtering and sentiment analysis classification tasks. Experimental results show that spLSA outperforms pLSA in both tasks. In addition, our extensions favor fast convergence, suggesting that the use of spLSA may reduce training time while achieving the same accuracy as more expensive methods such as sLDA or SVM.

Usually time series are controlled by generative processes which display changes over time. On many occasions, two or more generative processes may switch, forcing the abrupt replacement of a fitted time series model by another one. We claim that the incorporation of past data can be useful in the presence of concept shift. We believe that history tends to repeat itself and that, from time to time, it is desirable to discard recent data and reuse old data to perform model fitting and forecasting. We address this challenge by introducing an ensemble method that deals with long-memory time series. Our method starts by segmenting historical time series data to identify data segments which present model consistency. Then, we project the time series by using data segments which are close to current data. By using a dynamic time warping alignment function, we try to anticipate concept shifts, looking for similarities between current data and the prequel of a past shift. We evaluate our proposal on non-stationary and non-linear time series. To achieve this we perform forecasting accuracy testing against well-known state-of-the-art methods such as neural networks and threshold autoregressive models. Our results show that the proposed method anticipates many concept shifts.
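
As a small illustration of the alignment step (a classic DTW distance, not the paper's full ensemble method), a current window could be compared against the prequel segments that preceded past shifts; the closest prequel hints at an imminent shift. The segments below are made up.

    import numpy as np

    def dtw_distance(a, b):
        # Classic dynamic-time-warping distance between two 1-D series.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    current = np.sin(np.linspace(0, 3, 30))
    prequels = {"shift_a": np.sin(np.linspace(0, 3, 25)),   # hypothetical historical prequels
                "shift_b": np.cos(np.linspace(0, 3, 25))}
    closest = min(prequels, key=lambda k: dtw_distance(current, prequels[k]))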

Link prediction is the problem of inferring whether potential edges between pairs of vertices in a graph will be present or absent in the near future. To perform this task it is usual to use information provided by a number of available and observed vertices/edges. A number of edge scoring methods based on this information can then be created. Usually, these methods assess local structures of the observed graph, assuming that vertices which are closer in the original period of observation will be more likely to form a link in the future. In this paper we explore the combination of local and global features to conduct link prediction in online social networks. The contributions of the paper are twofold: a) we evaluate a number of strategies that combine global and local features, tackling the locality assumption of link prediction scoring methods, and b) we only use network topology-based features, avoiding the inclusion of informational or transactional features that involve heavy computational costs. We evaluate our proposal using real-world data provided by Skout Inc., an affinity online social network with millions of users around the world. Our results show that our proposal is feasible.
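
A minimal sketch of topology-only scoring, assuming networkx and a toy graph in place of the Skout data: local neighborhood scores (Jaccard, Adamic-Adar) can be complemented with a simple global signal such as shortest-path distance between the candidate endpoints.

    import networkx as nx

    G = nx.karate_club_graph()          # stand-in for the observed social graph
    pairs = [(0, 9), (1, 33), (5, 24)]  # hypothetical unlinked pairs to score

    # Local, neighborhood-based scores.
    jaccard = {(u, v): p for u, v, p in nx.jaccard_coefficient(G, pairs)}
    adamic = {(u, v): p for u, v, p in nx.adamic_adar_index(G, pairs)}

    # A simple global feature: graph distance between the two endpoints.
    distance = {(u, v): nx.shortest_path_length(G, u, v) for u, v in pairs}
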
Design of a system for image registration and compensation based on spectral analysis
Proceedings 20th International Conference of the Chilean Computer Science Society, 2000
Estimation of the values of the parameters of affine transformations is crucial in dealing with the compensation problem in image acquisition and image registration. In this paper we investigate these problems using spectral, Fourier and complex analysis. We furthermore introduce ...
IFIP International Federation for Information Processing, 2006
We present a method to help a user redefine a query by suggesting a list of similar queries. The proposed method is based on click-through data in which sets of similar queries can be identified. The scientific literature shows that similar queries are useful for the identification of the different information needs behind a query. Unlike most previous work, in this paper we focus on the discovery of better queries rather than related queries. We show with experiments over real data that the identification of better queries is useful for query disambiguation and query specialization.
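
One simple way to realize the click-through idea (a hedged sketch with made-up data, not the paper's method for identifying better queries) is to represent each query by the text of its clicked documents and rank other logged queries by cosine similarity.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical click-through data: query -> concatenated text of clicked documents.
    clicks = {
        "cheap flights": "airline tickets booking low fares",
        "flight deals": "discount airline fares booking offers",
        "pandas dataframe": "python pandas dataframe tutorial",
    }

    queries = list(clicks)
    V = TfidfVectorizer().fit_transform(clicks.values())
    sims = cosine_similarity(V)
    np.fill_diagonal(sims, -1.0)                   # ignore self-similarity
    suggestion = queries[int(np.argmax(sims[0]))]  # best suggestion for queries[0]
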
Algoritmo híbrido de extracción de reglas usando algoritmos genéticos y árboles de clasificación
In this chapter we cover the main advances and challenges in the area of Web data mining. This area can be defined as the use of data mining techniques to extract useful information from the World Wide Web. As an information repository, the Web poses multiple challenges to those who seek to extract useful information from it, among them the variety of formats and styles in which its documents are written, the uneven quality of these documents, the broad range of topics they cover, and the volatility of the available content and resources. This scenario makes the Web a source of information that has itself required the development of specific algorithms and methods to extract useful information from it.
Twitter has become the sounding board of a citizenry that pours its thoughts, criticisms, opinions and mockery into this social network, all the more strongly when it comes to politics. It is therefore worthwhile to provide a detailed description of the ecosystem surrounding Twitter and political elections. In this work we review the different aspects that converge in the political activity that takes place on Twitter during elections. We also propose two analyses that differ from those existing in the literature, using real data extracted through the Twitter Streaming API for three different types of elections, which allows us to find results that can well be extrapolated to other political contexts.
The hard-won road to credibility
Análisis de la incidencia de los estudios de magíster en la movilidad laboral y social

In this paper we deal with the problem of automatic detection of query intent in search engines. We studied features that have shown good performance in the state of the art, combined with novel features extracted from click-through data. We show that the combination of these features gives good precision results. In a second stage, four text-based classifiers were studied to test the usefulness of text-based features. With a low rate of false positives (less than 10%), the proposed classifiers can detect query intent in over 90% of the evaluation instances. However, due to a notable class imbalance, the proposed classifiers show poor results in detecting transactional intents. We address this problem by including a cost-sensitive learning strategy that handles the skewed data distribution. Finally, we explore the use of classifier ensembles, which allow us to achieve the best performance for the task.
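
As an illustration of the cost-sensitive idea under assumed data (not the paper's exact setup or features), class weights inversely proportional to class frequency penalize errors on the rare transactional class more heavily.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for skewed query-intent data: the positive (transactional)
    # class makes up only about 10% of the samples.
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                               random_state=0)

    # class_weight="balanced" reweights the loss inversely to class frequency,
    # a standard cost-sensitive strategy for skewed label distributions.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)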

The emergence of learning support platforms poses new challenges for the Learning Objects (LOs for short) community. One of them, perhaps the most important, is to provide Learning Object Metadata management platforms that simplify the description and classification of LOs. Given the large number of LO repositories, their varying quality, and the gaps in the completeness and reliability of the information describing them, it is necessary to design and implement a platform that allows LO metadata synthesis and improvement using classification algorithms. In this paper the design and implementation of a platform for LO support is presented. One of the attributes of the platform is that it makes the categorization of LOs easier. The platform makes it possible to manage, describe and categorize digital objects from various sources with imperfect information. The design is flexible enough to automatically incorporate classification algorithms.
Computer grids are systems containing heterogeneous, autonomous and geographically distributed nodes. The proper functioning of a grid depends mainly on the efficient management of grid resources to carry out the various jobs that users send to the grid. This paper proposes an algorithm that uses intelligent agents in each node to perform global scheduling in a collaborative and coordinated way. The algorithm was implemented in a grid simulation environment that allows the incorporation of intelligent agents. This simulation environment was designed and developed to run and analyze the behavior of the proposed algorithm, which numerically outperforms two well-known algorithms in terms of balancing the load and making use of the grid's capacity without giving preference to any node.
We perform an automatic analysis of television news programs, based on the closed captions that accompany them. Specifically, we collect all the news broadcast on over 140 television channels in the US during a period of six months. We start by segmenting, processing, and annotating the closed captions automatically. Next, we focus on the analysis of their linguistic style and on mentions of people using NLP methods. We present a series of key insights about news providers and people in the news, and we discuss the biases that can be uncovered by automatic means. These insights are contrasted by looking at the data from multiple points of view, including qualitative assessment.