Semi-supervised learning (SSL) stands out for using a small amount of labeled points for data clustering and classification. In this scenario, graph-based methods allow the analysis of local and global characteristics of the available data by identifying classes or groups regardless of the data distribution and by representing submanifolds in Euclidean space. Most methods used in the literature for SSL classification pay little attention to graph construction. However, regular graphs can achieve better classification accuracy than traditional constructions such as the k-nearest neighbor (kNN) graph, since kNN favors the formation of hubs and is not appropriate for high-dimensional data. Nevertheless, methods commonly used for generating regular graphs have high computational cost. We tackle this problem by introducing an alternative method for generating regular graphs with better runtime performance than the methods usually found in the area. Our technique is based on the preferential selection of vertices according to topological measures, such as closeness, producing a regular graph at the end of the process. Experiments using the global and local consistency method for label propagation show that our method provides a classification rate better than or equal to that of kNN.
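A minimal sketch of the idea (not the paper's exact procedure): estimate closeness on a preliminary kNN graph, then complete vertex degrees preferentially, connecting each vertex to its closest eligible partner until degree k is reached. All names and the selection rule are illustrative assumptions.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def regular_graph_by_closeness(X, k=4):
    X = np.asarray(X)
    n = len(X)
    # Preliminary kNN graph used only to estimate closeness centrality.
    knn = nx.from_scipy_sparse_array(kneighbors_graph(X, k, mode="connectivity"))
    closeness = nx.closeness_centrality(knn)

    g = nx.Graph()
    g.add_nodes_from(range(n))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Preferentially complete the degree of high-closeness vertices first.
    for u in sorted(range(n), key=lambda v: -closeness[v]):
        while g.degree(u) < k:
            candidates = [v for v in range(n)
                          if v != u and g.degree(v) < k and not g.has_edge(u, v)]
            if not candidates:
                break  # a few vertices may remain below degree k in this greedy sketch
            v = min(candidates, key=lambda c: dist[u, c])  # closest eligible vertex
            g.add_edge(u, v)
    return g

# Example: g = regular_graph_by_closeness(np.random.rand(100, 5), k=4)
```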
In this article we show how the accuracy of a rule-based first-order theory may be increased by combining it with a case-based approach in a classification task. Case-based learning is used when the rule language bias is exhausted. This is achieved in an iterative approach: in each iteration, theories consisting of first-order rules are induced and the covered examples are removed. The process stops when it is no longer possible to find rules of satisfactory quality. The remaining examples are then handled as cases. The case-based approach proposed here is also, to a large extent, new. Instead of only storing the cases as provided, it has a learning phase in which, for each case, it constructs and stores a set of explanations with support and confidence above given thresholds. These explanations have different levels of generality, and the maximally specific one corresponds to the case itself. The same case may have different explanations representing different perspectives of the case. Therefore, to classify a new case, the method looks for relevant stored explanations applicable to it. The different possible views of the case given by the explanations correspond to considering different sets of conditions/features to analyze the case. In other words, they lead to different ways of computing the similarity between known cases/explanations and the new case to be classified (as opposed to the commonly used global metric). Experimental results with significant improvement have been obtained on a corpus of Portuguese texts for the task of part-of-speech tagging.
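The iterative split between rules and cases can be summarized with a short sketch; the rule learner is supplied by the caller and each rule is modeled as a (covers, quality) pair, an illustrative interface rather than the one used in the article.

```python
# Minimal sketch of the iterative rule/case split, assuming a rule learner
# passed in by the caller. Each induced rule is a pair (covers, quality),
# where covers is a predicate over examples.
def iterative_rule_then_case(examples, induce_rules, min_quality=0.8):
    rule_base, remaining = [], list(examples)
    while remaining:
        rules = induce_rules(remaining)                    # induce one theory
        good = [r for r in rules if r[1] >= min_quality]   # keep satisfactory rules only
        if not good:
            break                                          # rule language bias exhausted
        rule_base.extend(good)
        remaining = [e for e in remaining
                     if not any(covers(e) for covers, _ in good)]  # drop covered examples
    return rule_base, remaining                            # leftovers become the case base
```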
Assigning a category to a given word (tagging) depends on the particular word and on the categories (tags) of neighboring words. A theory that is able to assign tags to a given text can naturally be viewed as a recursive logic program. This article describes how iterative induction, a technique that has proven powerful in the synthesis of recursive logic programs, has been applied to the task of part-of-speech tagging. The main strategy consists of inducing a succession T1, T2, ..., Tn of theories, using in the induction of theory Ti all the previously induced theories. Each theory in the sequence may have lexical rules, context rules and hybrid ones. This iterative strategy is, to a large extent, independent of the underlying inductive algorithm. Here we consider one particular relational learning algorithm, CSC(RC), and we induce first-order theories from positive examples and background knowledge that are able to successfully tag a relatively large corpus in Portuguese.
In this article we discuss in detail two techniques for rule and case integration. Case-based learning is used when the rule language is exhausted. Initially, all the examples are used to induce a set of rules of satisfactory quality. The examples that are not covered by these rules are then handled as cases. The case-based approach used also combines rules and cases internally. Instead of only storing the cases as provided, it has a learning phase in which, for each case, it constructs and stores a set of explanations with support and confidence above given thresholds. These explanations have different levels of generality, and the maximally specific one corresponds to the case itself. The same case may have different explanations representing different perspectives of the case. Therefore, to classify a new case, the method looks for relevant stored explanations applicable to it. The different possible views of the case given by the explanations correspond to considering different sets of conditions/features to analyze the case. In other words, they lead to different ways of computing the similarity between known cases/explanations and the new case to be classified (as opposed to the commonly used fixed metric).
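As a rough illustration of the explanation-storing phase, the sketch below treats a case as a set of attribute/value conditions and keeps the generalizations (condition subsets) whose support and confidence clear the given thresholds. The propositional representation is a simplifying assumption, since the article works with first-order explanations.

```python
from itertools import combinations

def explanations_for(case, label, data, min_support=2, min_confidence=0.8):
    conditions = sorted(case.items())            # case as attribute/value pairs
    kept = []
    for size in range(1, len(conditions) + 1):   # from most general to most specific
        for subset in combinations(conditions, size):
            covered = [(x, y) for x, y in data
                       if all(x.get(a) == v for a, v in subset)]
            support = len(covered)
            if support < min_support:
                continue
            confidence = sum(1 for _, y in covered if y == label) / support
            if confidence >= min_confidence:
                kept.append((dict(subset), support, confidence))
    return kept

# Toy usage on attribute/value data (hypothetical features):
data = [({"word": "a", "prev": "DET"}, "ART"), ({"word": "a", "prev": "PREP"}, "PREP")]
print(explanations_for({"word": "a", "prev": "DET"}, "ART", data, min_support=1))
```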
Multidimensional projections are valuable tools to generate visualizations that support exploratory analysis
of a wide variety of complex high-dimensional data. However, projection mappings obtained from different
techniques vary considerably, and users exploring the mappings or selecting between projection techniques
still have limited assistance in their task. Current methods to assess projection quality fail to capture
properties that are paramount to user interpretation, such as the capability of conveying class information,
or the preservation of groups and neighborhoods from the original space. In this paper we propose a
unifying framework to derive objective measures of the
local behavior of projection mappings that support
interpreting the mappings and comparing solutions regarding several properties. A quality value is
computed for each data point, from which a single global value may also be assigned to the projection.
Measures are computed from a recently introduced data graph model known as Extended Minimum Spanning Tree
(EMST). Measurements of the topology of EMST
graphs, built relative to the original and projected data representations, are scale independent and afford evaluation of multiple properties. We introduce measures of visual properties and of preservation of properties from the original space. They are targeted at (i) depicting class segregation capability; (ii) quantifying neighborhood purity regarding classes; (iii) evaluating neighborhood preservation; and
finally (iv) evaluating group preservation. We introduce the
measures and illustrate how they can inform users about the local and global behavior of projection
techniques considering multiple mappings of artificial and real data sets.
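The EMST-based measures themselves are not reproduced here; the sketch below only illustrates the general idea of a per-point quality value that can also be averaged into a single global score, using neighborhood preservation (Jaccard overlap of each point's k-neighborhood before and after projection) as a simple stand-in.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def per_point_neighborhood_preservation(X_high, X_low, k=10):
    def knn_sets(X):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop the point itself
        return [set(row) for row in idx]
    orig, proj = knn_sets(X_high), knn_sets(X_low)
    local = np.array([len(o & p) / len(o | p) for o, p in zip(orig, proj)])
    return local, local.mean()  # per-point values and a global summary
```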
Non-stationary classification problems concern changes in the data distribution over a classifier's lifetime. To face this problem, learning algorithms must reconcile attributes that are essential but difficult to achieve together, such as good classification performance, stability, and low associated costs (processing time and memory). This paper presents an extension of the K-associated optimal graph learning algorithm to cope with classification over non-stationary domains. The algorithm relies on a graph structure consisting of many disconnected components (subgraphs). Such a graph enhances data representation by locally fitting groups of data according to a purity measure, which, in turn, quantifies the overlap between vertices of different classes. As a result, the graph can be used to accurately estimate the probability that unlabeled data belong to a given class. The proposed algorithm benefits from the dynamical evolution of the graph by updating its set of components as new data is presented over time, removing old components as new ones arise. Experimental results on artificial and real domains and further statistical analysis show that the proposed algorithm is an effective solution to non-stationary classification problems.
Graphs are a powerful representation formalism that has been widely employed in machine learning and data mining. In this paper, we present a graph-based classification method consisting of the construction of a special graph, referred to as the K-associated graph, which is capable of representing similarity relationships among data cases and the proportion of class overlap. The main properties of K-associated graphs as well as the classification algorithm are described. Experimental evaluation indicates that the proposed technique captures the topological structure of the training data and leads to good results on classification tasks, particularly for noisy data. In comparison to other well-known classification techniques, the proposed approach shows the following interesting features: (1) a new measure, called purity, is introduced not only to characterize the degree of overlap among classes in the input data set, but also to construct the K-associated optimal graph for classification; (2) nonlinear classification with automatic local adaptation to the input data: in contrast to the K-nearest neighbor classifier, which uses a fixed K, the proposed algorithm is able to automatically consider different values of K in order to best fit the corresponding overlap of classes in different data subspaces, revealing both the local and global structure of the input data; (3) the proposed classification algorithm is nonparametric, implying high efficiency and no need for model selection in practical applications.
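One plausible reading of the purity measure, sketched below as an approximation rather than the paper's exact definition: for each vertex of a component, take the fraction of its K nearest neighbors that share its label, and average over the component.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def component_purity(X, y, component, K=3):
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    idx = nn.kneighbors(X[component], return_distance=False)[:, 1:]  # skip self
    same = [np.mean(y[neighbors] == y[v]) for v, neighbors in zip(component, idx)]
    return float(np.mean(same))

# Example: purity of a component formed by points 0, 1 and 2.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1, 1])
print(component_purity(X, y, component=[0, 1, 2], K=2))
```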
In many situations, individuals or groups of individuals are faced with the need to examine sets of documents to achieve an understanding of their structure and to locate relevant information. In that context, this paper presents a framework for visual text mining to support exploration of both the general structure and the relevant topics within a textual document collection. Our approach starts by building a visualization from the text data set. On top of that, a novel technique is presented that generates and filters association rules to detect and display topics from a group of documents. Results have shown a very consistent match between topics extracted using this approach and those actually present in the data set.
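The sketch below is not the paper's pipeline; it only illustrates mining pairwise term association rules from a document-term presence list and keeping high-support, high-confidence rules as candidate topic descriptors.

```python
from itertools import permutations

docs = [
    {"graph", "classification", "purity"},
    {"graph", "classification", "label"},
    {"projection", "visualization", "document"},
    {"projection", "visualization", "quality"},
]

def term_rules(docs, min_support=0.4, min_confidence=0.9):
    n = len(docs)
    vocabulary = set().union(*docs)
    rules = []
    for a, b in permutations(vocabulary, 2):
        support_a = sum(1 for d in docs if a in d) / n
        support_ab = sum(1 for d in docs if a in d and b in d) / n
        if support_ab >= min_support and support_ab / support_a >= min_confidence:
            rules.append((a, b, support_ab, support_ab / support_a))
    return rules

for a, b, s, c in term_rules(docs):
    print(f"{a} -> {b}  support={s:.2f} confidence={c:.2f}")
```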
Semi-Supervised Learning (SSL) techniques have become very relevant since they require only a small set of labeled data. In this context, graph-based algorithms have gained prominence in the area due to their capacity to exploit, besides information about data points, the relationships among them. Moreover, data represented as graphs allow the use of collective inference (vertices can affect each other), propagation of labels (autocorrelation among neighbors) and use of the neighborhood characteristics of a vertex. An important step in graph-based SSL methods is the conversion of tabular data into a weighted graph. Graph construction has a key role in the quality of the classification in graph-based methods. This paper explores a method for graph construction that uses the available labeled data. We provide extensive experiments showing that the proposed method has many advantages: good classification accuracy, quadratic time complexity, no sensitivity to the parameter k > 10, sparse graph formation with average degree around 2, and hub formation from the labeled points, which facilitates the propagation of labels.
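As a loose illustration (not the proposed construction method), the sketch below shows one way labeled points naturally become hubs: each unlabeled point links to its nearest labeled point, plus a single nearest-neighbor edge overall, keeping the average degree low.

```python
import numpy as np
import networkx as nx

def labeled_hub_graph(X, labeled_idx):
    X = np.asarray(X)
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    labeled = np.array(labeled_idx)
    for v in range(n):
        g.add_edge(v, int(np.argmin(dist[v])))                        # nearest neighbor overall
        if v not in labeled_idx:
            g.add_edge(v, int(labeled[np.argmin(dist[v, labeled])]))  # nearest labeled point
    return g

# Example: g = labeled_hub_graph(np.random.rand(200, 8), labeled_idx=[0, 1, 2, 3])
```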
Extraction of protein-protein interactions from scientific papers is a relevant task in the biomedical field. Machine learning-based methods, such as kernel-based ones, represent the state of the art in this task. Many efforts have focused on obtaining new types of kernels in order to employ syntactic information, such as parse trees, to extract interactions from sentences. These methods have reached the best performance on this task. Nevertheless, parse trees have not been exploited by other machine learning-based methods such as Bayesian networks. The advantage of using Bayesian networks is that we can exploit the structure of the parse trees to learn the Bayesian network structure, i.e., the parse trees provide the random variables and also possible relations among them. Here we use the syntactic relation as a causal dependence between variables. Hence, our proposed method learns a Bayesian network from parse trees. The evaluation was carried out over five protein-protein interaction benchmark corpora. Results show that our method is competitive in comparison with state-of-the-art methods.
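A hedged sketch of the structural idea: each head-to-dependent relation of a (dependency-style) parse is read as a directed edge between random variables, yielding a candidate Bayesian network skeleton. The toy parse and variable names are assumptions for illustration; the paper's construction may differ in detail.

```python
import networkx as nx

# Toy parse of "ProtA interacts with ProtB": token -> (head, relation)
parse = {
    "ProtA": ("interacts", "nsubj"),
    "with":  ("interacts", "prep"),
    "ProtB": ("with", "pobj"),
}

bn_structure = nx.DiGraph()
bn_structure.add_node("interacts")                        # root of the tree
for dependent, (head, rel) in parse.items():
    bn_structure.add_edge(head, dependent, relation=rel)  # causal direction: head -> dependent

assert nx.is_directed_acyclic_graph(bn_structure)         # a valid BN skeleton must be a DAG
print(list(bn_structure.edges(data=True)))
```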
In this paper, we present some preliminary results indicating that complex network properties may be useful to improve the performance of active learning algorithms. In fact, centrality measures derived from networks generated from the data allow ranking the instances to find the best ones to be presented to a human expert for manual classification. We discuss how to rank the instances based on the network vertex properties of closeness and betweenness. Such measures, used in isolation or combined, enable identifying regions in the data space that characterize prototypical or critical examples in terms of the classification task. Results obtained on different data sets indicate that, compared to random selection of training instances, the approach reduces the error rate and variance, as well as the number of instances required to reach representatives of all classes.
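A minimal sketch of centrality-based ranking over a kNN graph built from the data; the combination rule (a plain sum of closeness and betweenness) is an illustrative assumption.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def rank_by_centrality(X, k=5):
    g = nx.from_scipy_sparse_array(kneighbors_graph(X, k, mode="connectivity"))
    closeness = nx.closeness_centrality(g)
    betweenness = nx.betweenness_centrality(g)
    score = {v: closeness[v] + betweenness[v] for v in g}
    return sorted(score, key=score.get, reverse=True)  # most central instances first

X = np.random.rand(60, 4)
print(rank_by_centrality(X)[:10])  # 10 most "prototypical" candidates to show the expert
```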
In Information Visualization, adding and removing data elements can strongly impact the underlying visual space. We have developed an inherently incremental technique (incBoard) that maintains a coherent disposition of elements from a dynamic multidimensional data set on a 2D grid as the set changes. Here, we introduce a novel layout that uses pairwise similarity from grid neighbors, as defined in incBoard, to reposition elements on the visual space, free from the constraints imposed by the grid. The board continues to be updated and can be displayed alongside the new space. As similar items are placed together, while dissimilar neighbors are moved apart, it supports users in the identification of clusters and subsets of related elements. Densely populated areas identified in the incSpace can be efficiently explored with the corresponding incBoard visualization, which is not susceptible to occlusion. The solution remains inherently incremental and maintains a coherent disposition of elements, even for fully renewed sets. The algorithm considers relative positions for the initial placement of elements, and raw dissimilarity to fine-tune the visualization. It has low computational cost, with complexity depending only on the size of the currently viewed subset, V. Thus, a data set of size N can be sequentially displayed in O(N) time, reaching O(N²) only if the complete set is displayed simultaneously.
In this paper, we propose a new graph-based classifier which uses a special network, referred to as the optimal K-associated network, for modeling data. The K-associated network is capable of representing (dis)similarity relationships among data samples and data classes. Here, we describe the main properties of the K-associated network as well as the classification algorithm based on it. Experimental evaluation indicates that the model based on an optimal K-associated network captures the topological structure of the training data, leading to good results on the classification task, particularly for noisy data.
In this paper, we present an extension of the optimal K-associated network classifier to perform online classification. The static classifier uses a special network, referred to as the optimal network, to classify a test pattern. This network is constructed through an iterative process based on the K-associated network and on a measure called purity. The good results obtained with the static classifier on stationary data sets have motivated the development of an incremental version. Given the network's capability of representing similarity relationships among patterns and data classes, here we present an extension that implements incremental learning to handle online classification for non-stationary data sets. Results on non-stationary data comparing the proposed method and two state-of-the-art ensemble classification methods are provided.
Currently, online social networks and social media have become increasingly popular, showing exponential growth. This fact has attracted increasing research interest, in turn facilitating the emergence of new interdisciplinary research directions, such as social network analysis. In this scenario, link prediction is one of the most important tasks, since it deals with the problem of the existence of a future relation among members of a social network. Previous techniques for link prediction were based on structural (or topological) information. Nevertheless, structural information is not enough to achieve good performance in the link prediction task on large-scale social networks. Thus, the use of additional information, such as the interests or behaviors that nodes have within their communities, may improve link prediction performance. In this paper, we analyze the viability of using a set of simple and inexpensive techniques that combine structural with community information for predicting the existence of future links in a large-scale online social network, such as Twitter. Twitter, a microblogging service, has emerged as a useful source of informative data shared by millions of users whose relationships require no reciprocation. The Twitter network was chosen because it is not yet well understood, mainly due to the occurrence of directed and asymmetric links. Experiments show that our proposals can be used efficiently to improve unsupervised link prediction.
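A simple sketch of combining structural and community information: score a candidate link by its common neighbors plus a bonus when both endpoints fall in the same detected community. The weighting is illustrative, not the measure evaluated in the paper.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def community_aware_score(g, communities, u, v, bonus=1.0):
    membership = {node: i for i, c in enumerate(communities) for node in c}
    common = len(list(nx.common_neighbors(g, u, v)))
    same_community = membership.get(u) == membership.get(v)
    return common + (bonus if same_community else 0.0)

g = nx.karate_club_graph()
communities = greedy_modularity_communities(g)
print(community_aware_score(g, communities, 0, 9))
```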
A cluster in a graph is a densely connected group of vertices that is sparsely connected to other groups. Hence, when predicting a future link between a pair of vertices, their common neighbors may play different roles depending on whether or not they belong to the same cluster. Based on that, we propose a new measure (WIC) for link prediction between a pair of vertices that considers the sets of their intra-cluster or within-cluster (W) and between-cluster or inter-cluster (IC) common neighbors. We also propose a set of measures, referred to as W forms, that use only the set of within-cluster common neighbors instead of the set of all common neighbors, as usually considered in the basic local similarity measures. Consequently, a clustering scheme must previously be applied to the graph. Using three different clustering algorithms, we compared the WIC measure with ten basic local similarity measures and their counterpart W forms on ten real networks. Our analyses suggest that clustering information, no matter which clustering algorithm is used, improves link prediction accuracy.
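The sketch below splits the common neighbors of a candidate pair into within-cluster (W) and inter-cluster (IC) sets; the final combination used here is an illustrative choice, not necessarily the exact WIC formula from the paper.

```python
import networkx as nx

def w_and_ic_sets(g, clusters, u, v):
    membership = {node: i for i, c in enumerate(clusters) for node in c}
    common = set(nx.common_neighbors(g, u, v))
    same = membership[u] == membership[v]
    w = {z for z in common if same and membership[z] == membership[u]}
    return w, common - w          # within-cluster and inter-cluster common neighbors

def wic_score(g, clusters, u, v):
    w, ic = w_and_ic_sets(g, clusters, u, v)
    return len(w) / (len(ic) + 1)  # +1 only avoids division by zero in this sketch

g = nx.karate_club_graph()
clusters = [set(range(17)), set(range(17, 34))]  # toy two-cluster split
print(wic_score(g, clusters, 0, 33))
```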
In image processing, edge detection is a valuable tool for extracting features from an image. This detection reduces the amount of information to be processed, since redundant information (considered less relevant) can be discarded. The technique of edge detection consists of determining the points of a digital image whose intensity changes sharply. These changes are due, for example, to discontinuities in the orientation of a surface. A well-known method of edge detection is the Difference of Gaussians (DoG). The method consists of subtracting two Gaussian kernels, one with a standard deviation smaller than the other. The convolution of this kernel difference with the input image results in the edge detection of the image. This paper introduces a method for extracting edges using DoG with kernels based on the q-Gaussian probability distribution, derived from the q-statistics proposed by Constantino Tsallis. To demonstrate the method's potential, we compare the proposed method with the traditional DoG using Gaussian kernels. The results show that the proposed method can extract edges with more accurate detail.
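A hedged sketch of a DoG-style detector with 2-D q-Gaussian kernels, using the Tsallis q-exponential of -r²/(2σ²); parameter values are illustrative, not those tuned in the paper.

```python
import numpy as np
from scipy.ndimage import convolve

def q_gaussian_kernel(sigma, q=1.5, radius=6):
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    r2 = (xx**2 + yy**2) / (2.0 * sigma**2)
    if np.isclose(q, 1.0):
        k = np.exp(-r2)                               # ordinary Gaussian limit
    else:
        base = np.maximum(1.0 - (1.0 - q) * r2, 0.0)  # q-exponential (clipped at zero for q < 1)
        k = base ** (1.0 / (1.0 - q))
    return k / k.sum()

def q_dog_edges(image, sigma1=1.0, sigma2=2.0, q=1.5):
    dog = q_gaussian_kernel(sigma1, q) - q_gaussian_kernel(sigma2, q)
    return convolve(image.astype(float), dog, mode="nearest")

image = np.zeros((64, 64)); image[:, 32:] = 1.0       # toy step edge
print(np.abs(q_dog_edges(image)).max())               # strongest response sits on the edge
```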
Semi-supervised learning is a machine learning paradigm in which the induced hypothesis is improved by taking advantage of unlabeled data. It is particularly useful when labeled data is scarce. Co-training is a widely adopted semi-supervised approach that assumes the availability of two views of the training data, a restrictive assumption for most real-world tasks. In this paper, we propose a one-view Co-training approach that combines two different k-Nearest Neighbors (kNN) strategies, referred to as global and local kNN. In global kNN, the nearest neighbors selected to classify a new instance are given by the training examples that include this instance as one of their own k nearest neighbors. In local kNN, on the other hand, the neighborhood considered when classifying a new instance is computed with the traditional kNN approach. We carried out experiments showing that a combination of these strategies significantly improves the classification accuracy in Co-training, particularly when a single view of the training data is available. We also introduce an optimized algorithm to cope with the time complexity of computing the global kNN, which enables tackling real classification problems. (ECAI 2010, H. Coelho et al. (Eds.), IOS Press, 2010.)
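The two neighborhoods can be contrasted in a short sketch: local kNN is the usual set of the query's k nearest training points, while global kNN collects the training points that would count the query among their own k nearest neighbors (reverse nearest neighbors). Names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_and_global_knn(X_train, x_query, k=3):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    local = nn.kneighbors([x_query], return_distance=False)[0]

    # For each training point, check whether the query enters its k-neighborhood
    # once the query is added as a candidate neighbor (self is always included).
    augmented = np.vstack([X_train, x_query])
    nn_aug = NearestNeighbors(n_neighbors=k + 1).fit(augmented)
    neigh = nn_aug.kneighbors(X_train, return_distance=False)
    query_id = len(X_train)
    global_ = np.array([i for i, row in enumerate(neigh) if query_id in row])
    return local, global_

# Example: local, global_ = local_and_global_knn(np.random.rand(50, 3), np.random.rand(3), k=3)
```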
This paper presents a fast technique for generating maps of document collections that, besides being able to group (and separate) documents by their content, runs at very manageable computational cost, generating maps of pre-processed text in a matter of seconds. Based on multidimensional projection techniques and an algorithm for projection improvement, it results in a surface map that allows the user to identify a number of important relationships between documents and groups of documents, reflected as visual attributes such as height, color and isolines, as well as aural attributes (such as pitch). The map is interactive, allowing further exploration and narrowing of focus during a search task. The technique, named IDMAP (Interactive Document Map), is fully described in this paper. The results are bound to support a large number of applications that rely on the retrieval and examination of document collections.