Received: 11-08-2015
Accepted: 08-03-2016
Evaluation of Feature Selection Methods for Gene Expression Data Classification
Keywords
Classification, gene expression data, feature selection, Random Forest, Regularized Random Forest, Guided Regularized Random Forest
Abstract
Selecting the genes that are relevant to a disease is a challenging task in gene expression studies. Most gene selection studies have focused on assessing the association between an individual gene and the disease. However, diseases are thought to have a complex etiology that involves complicated interactions among many genes. The Random Forest (RF) method has recently been used successfully to identify genetic factors that affect some complex diseases. Although RF performs well on some data sets of moderate size, it still has difficulty selecting informative genes and building accurate prediction models. In this paper, we investigate several methods for learning advanced random forests that allow one to select a subset of informative genes (those most relevant to the disease). These methods can therefore reduce dimensionality and perform well in prediction on high-dimensional data sets. We analyze their performance to find the most robust method for each objective of interest (the accuracy of the prediction model or the smallest possible set of relevant genes), based on experimental results on 8 publicly available gene expression data sets from the repository of biomedical data sets (Kent Ridge) and the bioinformatics data sets (Bioinformatics).
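The abstract does not spell out how a regularized forest selects genes, so the following is only a minimal sketch of the general idea, as we understand it from the cited Deng and Runger papers and not from this paper itself. At each tree node, the split gain of a gene that is not yet in the selected set is multiplied by a penalty in (0, 1]; in the guided variant the penalty depends on an importance score from a preliminary random forest. The function names, the toy gains, and the importance scores below are all hypothetical.

```python
def regularized_gain(gain, feature, selected, penalty):
    """Return the gain unchanged for an already selected feature,
    otherwise shrink it by the penalty coefficient (assumed in (0, 1])."""
    if feature in selected:
        return gain
    return penalty * gain


def guided_penalty(importance, max_importance, gamma):
    """Guided penalty (assumed form): features with a higher normalized
    importance from a preliminary forest are penalized less."""
    return (1.0 - gamma) + gamma * (importance / max_importance)


def pick_split_feature(gains, selected, penalties):
    """Pick the feature with the largest regularized gain at one node
    and add it to the selected set (a no-op if it is already there)."""
    best = max(
        gains,
        key=lambda f: regularized_gain(gains[f], f, selected, penalties[f]),
    )
    selected.add(best)
    return best


# Toy example with made-up numbers: gene g2 has a slightly larger raw gain
# than g1, but g1 is already selected, so g2's gain is penalized.
gains = {"g1": 0.30, "g2": 0.32, "g3": 0.10}
importances = {"g1": 0.9, "g2": 0.3, "g3": 0.1}
top = max(importances.values())
penalties = {g: guided_penalty(v, top, gamma=0.5) for g, v in importances.items()}
selected = {"g1"}
chosen = pick_split_feature(gains, selected, penalties)
print(chosen, sorted(selected))
```

In this toy setting the unpenalized rule would split on g2, whereas the penalized rule reuses g1 and keeps the selected set small; a new gene enters the set only when its penalized gain still beats every gene already selected.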
References
Bioinformatics Research Group, http://eps.upo.es/bigs/datasets.html.
Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8.
Breiman L. (2001). Random forests. Machine Learning, 45(1): 5-32.
Bureau, A., Dupuis, J., Falls, K., Lunetta, K.L., Hayward, B., Keith, T.P., Van Eerdewegh, P. (2005). Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 28(2): 171-182.
Bø TH., Jonassen I. (2002). New feature subset selection procedures for classification of expression profiles. Genome Biology, 3(4): 0017.1-0017.11.
Deng H. and G. Runger (2013). Gene selection with guided regularized random forest. Pattern Recognition, 46: 3483-3489.
Deng H. and G. Runger (2012). Feature selection via regularized trees. International Joint Conference on Neural Networks (IJCNN).
Deng H. (2013). Guided random forest in the RRF package, http://arxiv.org/abs/1306.0237.
Díaz-Uriarte R. (2005). Supervised methods with genomic data: a review and cautionary view. In Data analysis and visualization in genomics and proteomics. Edited by Azuaje F, Dopazo J. New York: Wiley, pp.193-214.
Dudoit S, Fridlyand J, Speed TP (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc., 97(457): 77-87.
Furlanello C, Serafini M, Merler S, Jurman G (2003). An accelerated procedure for recursive feature ranking on microarray data. Neural Netw, 16: 641-648.
Goldstein B. A., Hubbard, A. E., Cutler, A., Barcellos, L. F. (2010). An application of Random Forests to a genome-wide association dataset: Methodological considerations and new findings. BMC Genetics, 11: 49.
Goldstein B. A., Polley, E. C., Briggs, F. B. S. (2011). Random Forests for Genetic Association Studies. Statistical Applications in Genetics and Molecular Biology, 10(1): 32.
Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21: 1509-1515.
Kent Ridge Bio-medical Dataset, http://datam.i2r.a-star.edu.sg/datasets/krbd/
Jirapech-Umpai T, Aitken S (2005). Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics, 6: 148.
Lee JW, Lee JB, Park M, Song SH (2005). An extensive evaluation of recent classification tools applied to microarray data. Computational Statistics and Data Analysis, 48: 869-885.
Lunetta, K.L., Hayward, L.B., Segal, J., Van Eerdewegh, P. (2004). Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(1): 32.
Li Y, Campbell C, Tipping M (2002). Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18: 1332-1339.
Li T, Zhang C, Ogihara M (2004). A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20: 2429-2437.
Roepman P, Wessels LF, Kettelarij N, Kemmeren P, Miles AJ, Lijnzaad P, Tilanus MG, Koole R, Hordijk GJ, van der Vliet PC, Reinders MJ, Slootweg PJ, Holstege FC (2005). An expression profile for diagnosis of lymph node metastases from primary head and neck squamous cell carcinomas. Nat Genet, 37: 182-186.
van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415: 530-536.
Yang Q. and X. Wu (2006). Challenging Problems in Data Mining Research. Journal of Information Technology and Decision Making 5(4): 597-604.
Yeung KY, Bumgarner RE, Raftery AE (2005). Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21: 2394-2402.
Liaw A. and M. Wiener (2002). Classification and regression by randomForest. R News, 2(3): 18-22.
Winham, S.J., Colby, C. L., Freimuth, R., Wang, X., Andrade, M., Huebner, M., Biernacka, J. M. (2012). SNP interaction detection with Random Forests in high-dimensional genetic data. BMC Bioinformatics, 13: 164.