Papers by José Salvador Sánchez Garreta
A machine vision system for on-line fruit color classification
Simple and fast recomputation of proximity graphs in prototype selection

Neural Computing and Applications, 2010
In many real applications, data are not all available at the same time, or it is not affordable to process them all in a batch process; rather, instances arrive sequentially in a stream. The scenario of streaming data introduces new challenges to the machine learning community, since difficult decisions have to be made. The problem addressed in this paper is that of classifying incoming instances for which one attribute arrives only after a given delay. In this formulation, many open issues arise, such as how to classify the incomplete instance, whether to wait for the delayed attribute before performing any classification, or when and how to update a reference set. Three different strategies are proposed which address these issues differently. Orthogonally to these strategies, three classifiers of different characteristics are used. Keeping on-line learning strategies independent of the classifiers facilitates system design and contrasts with the common alternative of carefully crafting an ad hoc classifier. To assess how good learning is under these different strategies and classifiers, they are compared using learning curves and final classification errors for fifteen data sets. Results indicate that learning in this stringent context of streaming data and delayed attributes can successfully take place even with simple on-line strategies. Furthermore, active strategies behave generally better than more conservative passive ones. Regarding the classifiers, it was found that simple instance-based classifiers such as the well-known nearest neighbor may outperform more elaborate classifiers such as the support vector machines…
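As an illustration of the general setting only (not of the paper's specific strategies, which are not detailed here), a minimal on-line nearest-neighbor sketch might classify on the attributes available so far and grow its reference set once the delayed attribute has arrived; all names and data below are illustrative:

```python
import numpy as np

class DelayedAttributeNN:
    """Illustrative on-line 1-NN for streams where one attribute arrives
    late: classify now on the attributes already available, and add the
    complete instance to the reference set once the delayed value is known.
    A sketch of the problem setting, not the paper's proposed strategies."""

    def __init__(self):
        self.X, self.y = [], []

    def classify(self, partial):
        """Classify using only the first len(partial) attributes."""
        if not self.X:
            return None
        d = len(partial)
        ref = np.asarray(self.X)[:, :d]          # truncate reference set
        i = np.argmin(np.linalg.norm(ref - np.asarray(partial), axis=1))
        return self.y[i]

    def update(self, full_instance, label):
        """Grow the reference set once the instance is complete."""
        self.X.append(list(full_instance))
        self.y.append(label)

clf = DelayedAttributeNN()
clf.update([0.0, 0.0, 0.0], 'a')
clf.update([5.0, 5.0, 5.0], 'b')
label = clf.classify([0.2, 0.1])   # the third attribute has not arrived yet
```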
A linear program for nearest neighbour-like decision tree induction

Aprendizaje y clasificación basados en criterios de vecindad. métodos alternativos y análisis comparativo
The content of this Thesis bears directly on a set of classification and learning techniques based on neighborhood criteria over metric spaces, some of which (for example, the k-Nearest Neighbors rule) have been regarded as a quasi-optimal approximation for a large number of estimation problems, and have also become a mandatory point of reference for the development of many other procedures. Throughout this work, we propose a set of concepts and methods as alternatives to the classical neighborhood-based approaches, with the aim of overcoming certain deficiencies derived basically from the loss of effectiveness as the quantity and quality of the information they use decreases, as well as from the time complexity that their application may entail. The schemes introduced in each section are empirically compared with the main conventional techniques, in order to evaluate and assess the advantages and drawbacks of the behavior exhibited by each of them. In general, the results guarantee a certain superiority of the new methods on real problems, while also preserving the optimal behavior in the asymptotic case.

Feature Dimensionality vs. Distribution of Sample Types: A Preliminary Study on Gene-Expression Microarrays
In gene-expression microarray data sets each sample is defined by hundreds or thousands of measurements. High-dimensionality data spaces have been reported as a significant obstacle to applying machine learning algorithms, owing to the associated phenomenon called 'curse of dimensionality'. The analysis and interpretation of these data sets have been defined as a very challenging problem. The hypothesis proposed in this paper is that there may exist some correlation between dimensionality and the types of samples (safe, borderline, rare and outlier). To examine our hypothesis, we have carried out a series of experiments over four gene-expression microarray databases because these data correspond to a typical example of the so-called 'curse of dimensionality' phenomenon. The results show that there indeed exist meaningful relationships between dimensionality and the proportion of each type of samples, demonstrating that the amount of safe samples increases and the total number of border…

Addressing the Links Between Dimensionality and Data Characteristics in Gene-Expression Microarrays
In gene-expression microarray data sets each sample is defined by hundreds or thousands of measurements. High-dimensionality data spaces have been reported as a significant obstacle to applying machine learning algorithms, owing to the associated phenomenon called 'curse of dimensionality'. Therefore the analysis (and interpretation) of these data sets has become a challenging problem. The hypothesis set out in this paper is that the curse of dimensionality is directly linked to other intrinsic data characteristics, such as class overlapping and class separability. To examine our hypothesis, here we have carried out a series of experiments over four gene-expression microarray databases because these data correspond to a typical example of the so-called 'curse of dimensionality' phenomenon. The results show that there exist meaningful relationships between dimensionality and some specific complexities that are inherent to data (especially, class separability and geometry…

Deep transfer learning for the recognition of types of face masks as a core measure to prevent the transmission of COVID-19
Applied Soft Computing
The use of face masks in public places has emerged as one of the most effective non-pharmaceutical measures to lower the spread of COVID-19 infection. This has led to the development of several detection systems for identifying people who do not wear a face mask. However, not all face masks or coverings are equally effective in preventing virus transmission or illness caused by viruses and therefore, it appears important for those systems to incorporate the ability to distinguish between the different types of face masks. This paper implements four pre-trained deep transfer learning models (NasNetMobile, MobileNetv2, ResNet101v2, and ResNet152v2) to classify images based on the type of face mask (KN95, N95, surgical and cloth) worn by people. Experimental results indicate that the deep residual networks (ResNet101v2 and ResNet152v2) provide the best performance with the highest accuracy and the lowest loss.
Pattern Recognition and Image Analysis
Here we compare the performance (predictive accuracy and processing time) of different neural network ensembles with that of nearest neighbor classifier ensembles. Concerning the connectionist models, the multilayer perceptron and the modular neural network are employed. Experiments on several real-problem data sets demonstrate a certain superiority of the nearest-neighbor-based schemes, in terms of both accuracy and computing time. When comparing the neural network ensembles, one can observe a better behavior of the multilayer perceptron than that of the modular networks.
This paper analyzes a generalization of a new metric to evaluate classification performance in imbalanced domains, combining an estimate of the overall accuracy with a simple index of how dominant the class with the highest individual accuracy is. A theoretical analysis shows the merits of this metric when compared to other well-known measures.
The class imbalance problem has been reported as an important challenge in various fields such as Pattern Recognition, Data Mining and Machine Learning. A less explored research area is how to evaluate classifiers on imbalanced data sets. This work analyzes the behaviour of performance measures widely used on imbalanced problems, as well as other metrics recently proposed in the literature. We perform two theoretical analyses, based on Pearson correlation and operations over a 2×2 confusion matrix, with the aim of showing the strengths and weaknesses of those performance metrics in the presence of skewed distributions.
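The measures typically compared in such analyses can all be derived from the four cells of a 2×2 confusion matrix; a minimal sketch (function and variable names are illustrative, not taken from the paper):

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Performance measures often compared on imbalanced problems,
    computed from the cells of a 2x2 confusion matrix."""
    tpr = tp / (tp + fn)               # sensitivity (positive-class recall)
    tnr = tn / (tn + fp)               # specificity (negative-class recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = (tpr * tnr) ** 0.5        # geometric mean of class-wise rates
    return {"accuracy": accuracy, "tpr": tpr, "tnr": tnr, "g_mean": g_mean}

# A skewed example: 10 positives vs 990 negatives. Overall accuracy looks
# high, while the geometric mean exposes the poor minority-class rate.
m = imbalance_metrics(tp=5, fn=5, fp=10, tn=980)
```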
A realistic appearance-based representation of side-view gait sequences is introduced here. It is based on a prior method in which a set of appearance-based features of a gait sample is used for gender recognition. These features are computed from parameter values of ellipses that fit body parts enclosed by regions previously defined while ignoring well-known facts of the human body structure. This work presents an improved regionalization method supported by adaptive heuristic rules to better adjust regions to body parts. As a result, more realistic ellipses and a more meaningful feature space are obtained. Gender recognition experiments conducted on the CASIA Gait Database show better classification results when using the new features.

Computación y Sistemas, 2019
Research carried out by the scientific community has shown that the performance of classifiers depends not only on the learning rule, but also on the complexities inherent in the data sets. Some traditional classifiers have been commonly used in the context of classification problems (Neural Networks, C4.5, SVM, among others). However, the associative approach has been explored more in the retrieval context than in the classification task, and its performance has hardly been analyzed in the presence of several data complexities. The present investigation analyzes the performance of the associative approach (CHA, CHAT and the original Alpha Beta) under three classification problems (class imbalance, overlapping and atypical patterns). The results show that the CHAT algorithm recognizes the minority class better than the rest of the classifiers in the context of class imbalance, whereas the CHA model ignores the minority class in most cases. In addition, the CHAT algorithm requires well-defined decision boundaries, since its performance increases when Wilson's method is applied. It was also noted that when a balance between the rates is emphasized, the performance of the three classifiers (RB, RFBR and CHAT) increases. The original Alpha Beta model shows poor performance when the data are pre-processed. The performance of the classifiers increases significantly when the SMOTE method is applied, which does not occur without pre-processing or with sub-sampling, in the context of class imbalance.

Information Sciences, 2015
Microarray gene expression data sets usually contain a large number of genes, but a small number of samples. In this article, we present a two-stage classification model combining feature selection with the dissimilarity-based representation paradigm. In the preprocessing stage, the ReliefF algorithm is used to generate a subset with a number of top-ranked genes; in the learning/classification stage, the samples represented by the previously selected genes are mapped into a dissimilarity space, which is then used to construct a classifier capable of separating the classes more easily than a feature-based model. The ultimate aim of this paper is not to find the best subset of genes, but to analyze the performance of the dissimilarity-based models by means of a comprehensive collection of experiments for the classification of microarray gene expression data. To this end, we compare the classification results of an artificial neural network, a support vector machine…
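The dissimilarity-based mapping described above can be sketched as follows, assuming Euclidean distance and a small representation (prototype) set; all names and data are illustrative:

```python
import numpy as np

def to_dissimilarity_space(X, prototypes):
    """Map feature vectors into a dissimilarity space: each sample is
    re-represented by its Euclidean distances to a set of prototypes,
    so the new dimensionality equals the number of prototypes."""
    diffs = X[:, None, :] - prototypes[None, :, :]   # broadcast pairwise
    return np.sqrt((diffs ** 2).sum(axis=2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))         # 6 samples over 50 selected genes
R = X[:3]                            # representation set of 3 prototypes
D = to_dissimilarity_space(X, R)     # 6 samples in a 3-D dissimilarity space
```

Any ordinary feature-based classifier can then be trained on `D` instead of `X`.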

Applied Soft Computing, 2016
This paper presents an alternative technique for financial distress prediction systems. The method is based on a type of neural network, which is called hybrid associative memory with translation. While many different neural network architectures have successfully been used to predict credit risk and corporate failure, the power of associative memories for financial decision-making has not been explored in any depth as yet. The performance of the hybrid associative memory with translation is compared to four traditional neural networks, a support vector machine and a logistic regression model in terms of their prediction capabilities. The experimental results over nine real-life data sets show that the associative memory here proposed constitutes an appropriate solution for bankruptcy and credit risk prediction, performing significantly better than the rest of models under class imbalance and data overlapping conditions in terms of the true positive rate and the geometric mean of true positive and true negative rates.
When a multiple classifier system is employed, one of the most popular methods to accomplish the classifier fusion is simple majority voting. However, when the performance of the ensemble members is not uniform, the efficiency of this type of voting is generally affected negatively. In the present paper, new functions for dynamic weighting in classifier fusion are introduced. Experimental results with several real-problem data sets from the UCI Machine Learning Database Repository demonstrate the advantages of these novel weighting strategies over the simple voting scheme.
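The contrast between simple majority voting and weighted fusion can be sketched as follows; the paper's specific dynamic weighting functions are not reproduced here, and the weights below are illustrative:

```python
def weighted_vote(predictions, weights):
    """Fuse member predictions by weighted voting; with equal weights
    this reduces to simple majority voting."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Three members predict class 'a', two predict 'b'. Weights reflecting
# estimated member competence let a strong minority override the majority.
preds = ['a', 'a', 'a', 'b', 'b']
majority = weighted_vote(preds, [1, 1, 1, 1, 1])               # -> 'a'
weighted = weighted_vote(preds, [0.2, 0.2, 0.2, 0.9, 0.9])     # -> 'b'
```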
The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. An imbalanced training set means that one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, may produce an important deterioration of the classification accuracy, in particular with patterns belonging to the less represented classes. In this paper we present a study concerning the relative merits of several re-sizing techniques for handling the imbalance issue. We also assess the convenience of combining some of these techniques.
Two extensions of Wilson's original editing method are introduced in this paper. These new algorithms are based on estimating probabilities from the k nearest neighbors of an instance, in order to obtain more compact edited sets while maintaining the classification rate. Several experiments with synthetic and real data sets are carried out to illustrate the behavior of the algorithms proposed here and to compare their performance with that of other traditional techniques.
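For reference, the baseline these extensions build on, classic Wilson editing, can be sketched as follows; the probability-based extensions themselves are not reproduced here, and the example data are illustrative:

```python
import numpy as np

def wilson_editing(X, y, k=3):
    """Classic Wilson editing: discard every instance misclassified by
    its k nearest neighbours under a leave-one-out majority vote. The
    paper's extensions replace this hard vote with k-NN probability
    estimates to obtain more compact edited sets."""
    X, y = np.asarray(X, float), np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # exclude the instance itself
        nn = np.argsort(d)[:k]
        labels, counts = np.unique(y[nn], return_counts=True)
        if labels[np.argmax(counts)] == y[i]:  # keep if neighbours agree
            keep.append(i)
    return X[keep], y[keep]

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [0.15]]
y = [0, 0, 0, 1, 1, 1, 1]            # the last point is mislabeled noise
Xe, ye = wilson_editing(X, y, k=3)   # the noisy point is discarded
```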
In this work, we present a clustering algorithm able to find clusters of different sizes, shapes and densities, and to deal with overlapping cluster distributions and background noise. The algorithm is divided into two stages. In the first stage, local density is estimated at each data point. In the second stage, a hierarchical approach merges clusters according to an introduced cluster distance, based on heuristic measures of how modes overlap in a distribution. Experimental results on synthetic and real databases show the validity of the method.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012
A wide range of classification models have been explored for financial risk prediction, but conclusions on which technique behaves better may vary when different performance evaluation measures are employed. Accordingly, this paper proposes the use of multi-criteria decision making tools to give a ranking of algorithms. More specifically, the selection of the most appropriate credit risk prediction method is here modeled as a multi-criteria decision making problem that involves a number of performance measures (criteria) and classification techniques (alternatives). An empirical study is carried out to evaluate the performance of ten algorithms over six real-life credit risk data sets. The results reveal that the use of a unique performance measure may lead to unreliable conclusions, whereas this situation can be overcome by the application of multi-criteria decision making techniques.