Received: 22-10-2014
Accepted: 20-12-2014
A New Feature Sampling Method in Learning Random Forest for SNP Data Analysis
Keywords
Genome-wide Association Study, machine learning, data mining, random forest
Abstract
Genome-wide association studies (GWAS) have recently succeeded in identifying genetic variants that affect some complex diseases. Most GWAS have used single-SNP (single-nucleotide polymorphism) approaches that assess the association between each individual SNP and the disease. However, complex diseases are thought to have complex etiologies involving complicated interactions among many SNPs. Thus, different approaches are needed to identify SNPs that influence disease risk jointly or through complex interactions. The Random Forest (RF) method has recently been used successfully in GWAS to identify genetic factors that affect some complex diseases. Although RF achieves good prediction accuracy on some data sets of moderate size, it still struggles in GWAS to select informative SNPs and build accurate prediction models. In this paper, we propose a new two-stage sampling method for learning random forests. The proposed method selects a subset of informative SNPs that are most relevant to the disease. It thereby reduces the dimensionality and can perform well on high-dimensional data sets. We conducted experiments on two genome-wide SNP data sets to demonstrate the effectiveness of the proposed method.
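The abstract does not spell out the two stages, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes that stage one scores each SNP against case/control status with a Pearson chi-square statistic and keeps the top-scoring SNPs, and that stage two draws a random feature subset from that reduced pool for each tree. The function names, the parameters `keep`, `mtry`, and `n_trees`, and the toy genotype data are all hypothetical.

```python
import random
from collections import Counter

def chi_square(snp, labels):
    """Pearson chi-square statistic for one SNP (genotype codes) vs. class labels."""
    n = len(labels)
    cell = Counter(zip(snp, labels))   # observed genotype/label counts
    row = Counter(snp)                 # genotype totals
    col = Counter(labels)              # class totals
    stat = 0.0
    for g in row:
        for c in col:
            expected = row[g] * col[c] / n
            stat += (cell[(g, c)] - expected) ** 2 / expected
    return stat

def two_stage_sample(snps, labels, keep, mtry, n_trees, seed=0):
    """Assumed stage 1: keep the `keep` best-scoring SNPs.
    Assumed stage 2: draw `mtry` of those SNPs at random for each tree."""
    rng = random.Random(seed)
    scores = [chi_square(column, labels) for column in snps]
    ranked = sorted(range(len(snps)), key=lambda j: scores[j], reverse=True)
    informative = ranked[:keep]
    subsets = [rng.sample(informative, mtry) for _ in range(n_trees)]
    return informative, subsets

# Toy data: 8 samples (4 cases, 4 controls); each inner list is one SNP column.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
snps = [
    [2, 2, 2, 1, 0, 0, 0, 1],  # tracks case status closely
    [0, 1, 2, 0, 1, 2, 0, 1],  # mostly noise
    [1, 1, 1, 1, 1, 1, 1, 1],  # constant, carries no information
    [2, 2, 1, 1, 0, 0, 1, 0],  # tracks case status
]
informative, subsets = two_stage_sample(snps, labels, keep=2, mtry=1, n_trees=5)
print(informative, subsets)
```

In a full pipeline each subset would define the candidate features for one tree; tree induction itself is omitted here because the abstract gives no details about it.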