Comparative Study of Microarray Based Disease Prediction - A Survey
2019, International Journal of Scientific Research in Computer Science, Engineering and Information Technology
https://doi.org/10.32628/CSEIT195435…
Abstract
Recognizing gene expression patterns is an important research problem in the diagnosis of genetic diseases. Microarrays capture gene behavior in a form that can aid detection, so they are used both to analyze samples that may be normal or affected and to diagnose various gene-based diseases. Various clustering and classification techniques have been applied to address the challenges of handling microarray data. High dimensionality is one of the major challenges, and it brings with it the possibility of redundant, irrelevant, and noisy data. To address this problem, a feature selection step that extracts an optimal set of features is introduced into clustering and classification techniques. This survey examines several techniques for classification, gene clustering, and feature selection, covering supervised, unsupervised, and semi-supervised methods. It aims to identify a suitable semi-supervised algorithm for detecting new or difficult mutated diseases, and it shows how the semi-supervised approach has evolved and how it outperforms existing algorithms.
Related papers
A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a training set containing labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve diagnosis and treatment selection for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. This paper compares a dimension reduction technique, namely the Partial Least Squares (PLS) method, with a hybrid feature selection scheme, and evaluates the relative performance of four supervised classification procedures incorporating those methods: Radial Basis Function Network (RBFN), Multilayer Perceptron Network (MLP), Support Vector Machine with a polynomial kernel function (Polynomial-SVM), and Support Vector Machine with an RBF kernel function (RBF-SVM). Experimental results show that the Partial Least Squares (PLS) regression method is an appropriate feature selection method and that a combined use of different classification and feature selection approaches makes it possible to construct high-performance classification models for microarray data.
Integrated Intelligent Research, 2012
In 1999, T. R. Golub first presented an idea for classifying cancer at the molecular level, which boosted research in cancer diagnosis to a whole new level. Researchers began to analyze the disease at the genetic level with the help of microarray databases, and many new algorithms were then designed to classify different types of cancer. The objective of this paper is to present a tool designed exclusively to predict and classify leukemia into its types. The leukemia dataset published by Golub is used for this purpose. The first step is to identify the most significant genes causing cancer from the training set. These selected genes are then used to build a classifier based on decision rules, and eventually to predict the type of leukemia. This decision-rule classifier is found to work with an accuracy of 94%. The algorithm is quite simple in terms of complexity. It is possible to use a minimum number of genes for classification rather than a large set of genes. The genes that are responsible for the prognosis of cancer are mainly selected for designing the classifier.
A DNA microarray can record the expression levels of a huge number of genes in one experiment. Previous research has shown that this technology can be helpful in classifying cancers and their treatment outcomes. Normally, cancer microarray data has a limited number of samples, each with a tremendous number of gene expression levels as features. Identifying the relevant genes involved in different kinds of cancer remains a challenge. To extract useful gene information from cancer microarray data, gene selection algorithms were examined systematically in this study and an integrated gene selection framework was proposed. Using feature ranking based on the absolute value of the two-sample t-test with a pooled variance estimate as the evaluation criterion, combined with sequential forward feature selection, we show that classification performance at least as good as published results can be obtained on therapy outcomes for breast cancer patients. We also show that the combined use of different feature selection and classification approaches makes it feasible to select strongly relevant genes with high confidence.
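The ranking criterion named in the abstract above (absolute value of a two-sample t statistic with a pooled variance estimate) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names `pooled_t_statistic` and `rank_genes` and the toy data are assumptions made for the example, and the sequential forward selection step is not shown.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic using a pooled variance estimate.

    Assumes each group has at least two values and that the pooled
    variance is non-zero (otherwise the division below fails).
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


def rank_genes(samples, labels):
    """Return gene indices sorted by decreasing |t|.

    samples: list of per-sample expression vectors (one value per gene)
    labels:  list of 0/1 class labels, one per sample
    """
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        group0 = [s[g] for s, y in zip(samples, labels) if y == 0]
        group1 = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append(abs(pooled_t_statistic(group0, group1)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


# Made-up toy data: 6 samples x 3 genes. Gene 1 differs most between classes.
toy_samples = [
    [5.0, 1.0, 3.0],
    [5.2, 1.2, 2.0],
    [4.8, 0.8, 4.0],
    [5.1, 3.0, 3.5],
    [4.9, 3.2, 2.5],
    [5.3, 2.8, 3.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]
ranking = rank_genes(toy_samples, toy_labels)
```

In this toy example, `ranking[0]` is the index of the gene with the largest absolute t statistic; a forward selection procedure would then add genes in this order while a chosen classifier's performance keeps improving.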
Feature selection has become a fundamental tool for processing high-dimensional data. DNA microarray technology is used to study a large number of genes simultaneously, which helps in determining the expression levels of the genes. Gene selection from high-dimensional gene expression data is essential for the prediction and classification of disease. Gene expression data can be represented as a matrix and usually contains irrelevant, redundant, and noisy entries, which makes the data difficult to study and analyze. The prime purpose of feature selection approaches is to mitigate the curse of dimensionality and to improve the performance and accuracy of classification and clustering algorithms by eliminating irrelevant features and reducing noise. This paper explains the taxonomy of feature selection methods, stating their respective pros and cons. It also presents a review of a few feature selection approaches, mainly those that have been proposed over the past few years.
2014
Microarray data classification is one of the fastest-emerging clinical applications in the medical community. The classification process takes the detection of relevant and irrelevant probes into account, which is fundamental for subsequent classification. In this thesis, an efficient technique is proposed for the precise classification of microarray genes from a microarray gene expression dataset. The proposed technique performs classification in three phases, namely feature extraction, dimensionality reduction, and gene classification. Initially, Principal Component Analysis (PCA) is applied for dimension reduction, and significant features are extracted from the high-dimensional microarray data. The original data is projected onto a lower dimension by selecting the eigenvectors of the covariance matrix that show cumulative variance up to a level of 100%. After the implementation of PCA, the reduced feature matrix so obtained is divided into two sets ...
Computers in Biology and Medicine, 2011
Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. In cancer classification, available training data sets are generally of a fairly small sample size compared to the number of genes involved. Along with training data limitations, this constitutes a challenge to certain classification methods. Feature (gene) selection can be used to successfully extract those genes that directly influence classification accuracy and to eliminate genes which have no influence on it. This significantly improves calculation performance and classification accuracy. In this paper, correlation-based feature selection (CFS) and the Taguchi-genetic algorithm (TGA) method were combined into a hybrid method, and the K-nearest neighbor (KNN) method with leave-one-out cross-validation (LOOCV) served as a classifier for eleven classification profiles to calculate the classification accuracy. Experimental results show that the proposed method reduced redundant features effectively and achieved superior classification accuracy. The classification accuracy obtained by the proposed method was higher in ten out of the eleven gene expression data set test problems when compared to other classification methods from the literature.
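The evaluation scheme mentioned in the abstract above, a K-nearest neighbor classifier scored with leave-one-out cross-validation, can be illustrated with a short sketch. It is an assumption-laden example rather than the paper's code: `knn_predict`, `loocv_accuracy`, the Euclidean distance, the value of `k`, and the toy data are all chosen here for illustration, and the CFS and TGA feature selection steps are not shown.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    neighbors = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def loocv_accuracy(xs, ys, k=3):
    """Leave-one-out cross-validation: hold out each sample once and predict it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Made-up toy data: two well-separated groups of four samples, two features each.
toy_x = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
toy_y = ["A", "A", "A", "A", "B", "B", "B", "B"]
accuracy = loocv_accuracy(toy_x, toy_y, k=3)
```

LOOCV is a common choice for microarray studies because the number of samples is small, so every sample is used for testing exactly once while the rest are used for training.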
2015
Microarray gene expression data has high dimensionality, i.e., a small number of samples with a large number of genes. Using machine learning techniques for knowledge discovery in such data has become a rich area for researchers. Not all of these genes carry information that is useful for a given diagnostic test, so feature selection has become very important in both the research and application communities of data mining. This paper demonstrates the importance of finding the most informative genes in the database by using a statistical gene selection technique to reduce time and cost and to increase the efficiency of the classifier. We applied the T-Test statistical feature selection technique and the K-Nearest neighbor (KNN) classifier to two public microarray data sets, the SRBCT and Leukemia datasets. The feature selection is done on the whole available datasets and the data reduction results are then divided into training and testing and supplemented to the KNN classif...
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016
Recently, feature selection and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as gene expression microarray data. Gene expression microarray data comprises up to hundreds of thousands of features with a relatively small sample size. Because learning algorithms usually do not work well with this kind of data, reducing the data dimensionality becomes a challenge. A huge number of gene selection techniques are applied to select a subset of relevant features for model construction and to seek better cancer classification performance. This paper presents the basic taxonomy of feature selection and also reviews the state-of-the-art gene selection methods by grouping the literature into three categories: supervised, unsupervised, and semi-supervised. The comparison of experimental results on the top 5 representative gene expression datasets indicates that the classification accuracy of unsupervised and semi-supervised feature selection is competitive with supervised feature selection.
Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. Compared to the number of genes involved, available training data sets generally have a fairly small sample size in cancer type classification. These training data limitations constitute a challenge to certain classification methodologies. A reliable selection method for genes relevant for sample classification is needed in order to speed up the processing rate, decrease the predictive error rate, and to avoid incomprehensibility due to the large number of genes investigated. In this study, we combined information gain and an improved binary particle swarm optimization as a hybrid method to implement feature selection, and the K-nearest neighbor (K-NN) method serves as an evaluator for gene expression data classification problems. Experimental results show that this method effectively simplifies feature selection and reduces the total number of featur...
A typical microarray experiment yields the expression level of a large number of genes for a small number of samples. Given a classification of the samples, the goal of feature selection is to identify a small subset of relevant genes that are differentially expressed across sample classes. We present a new method for feature selection that combines a solution for the Min (α,β)-Feature Set Problem and a clustering algorithm, the Arithmetic-Harmonic Cut, to robustly identify relevant features. We apply our method to the NCI60 cancer dataset and evaluate the effectiveness and performance of the new algorithm for the classification of cancer cell lines.
