A New Feature Sampling Method in Learning Random Forest for SNP Data Analysis

Nguyễn Văn Hoàng (*) ¹, Phan Thị Thu Hồng ¹, Nguyễn Thanh Tùng ¹, Nguyễn Thị Thủy ¹

¹ Faculty of Information Technology, Vietnam National University of Agriculture

Keywords

Genome-wide Association Study, machine learning, data mining, random forest

Abstract

Genome-wide association studies (GWAS) have recently succeeded in identifying genetic variants that affect some complex diseases. Most GWAS have used single-SNP (single-nucleotide polymorphism) approaches, which assess the association between each individual SNP and the disease. However, complex diseases are thought to involve complex etiologies, including complicated interactions among many SNPs. Different approaches are therefore needed to identify SNPs that influence disease risk jointly or through complex interactions. The Random Forest (RF) method has recently been used successfully in GWAS to identify genetic factors that affect some complex diseases. Although RF achieves good prediction accuracy on some data sets of moderate size, it still struggles in GWAS to select informative SNPs and to build accurate prediction models. In this paper, we propose a new two-stage sampling method for learning random forests. The proposed method selects a subset of informative SNPs that are most relevant to the disease. It thereby reduces dimensionality and can perform well on high-dimensional data sets. We conducted experiments on two genome-wide SNP data sets to demonstrate the effectiveness of the proposed method.
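The abstract does not spell out the two stages, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that stage 1 screens SNPs with a Pearson chi-square test against the case/control label, and that stage 2 draws each tree's candidate features from the screened subset instead of from all SNPs. The function names (`chi_square`, `screen_informative`, `sample_subspace`), the parameters `keep` and `mtry`, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def chi_square(snp, labels):
    """Pearson chi-square statistic of one SNP column against class labels."""
    n = len(snp)
    observed = Counter(zip(snp, labels))   # genotype/label cell counts
    genotype_totals = Counter(snp)
    class_totals = Counter(labels)
    stat = 0.0
    for g, g_n in genotype_totals.items():
        for c, c_n in class_totals.items():
            expected = g_n * c_n / n
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return stat


def screen_informative(columns, labels, keep):
    """Stage 1 (assumed): rank SNPs by chi-square, keep the top `keep` indices."""
    scores = [chi_square(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


def sample_subspace(informative, mtry, rng):
    """Stage 2 (assumed): draw `mtry` candidate SNPs from the screened subset."""
    return rng.sample(informative, min(mtry, len(informative)))


# Toy data: each inner list is one SNP (genotypes coded 0/1/2) over 8 samples.
columns = [
    [0, 0, 0, 0, 2, 2, 2, 2],  # SNP 0: tracks the label exactly
    [0, 1, 2, 0, 1, 2, 0, 1],  # SNP 1: unrelated to the label
    [1, 1, 1, 1, 1, 1, 1, 1],  # SNP 2: constant, carries no information
    [0, 0, 0, 1, 2, 2, 2, 2],  # SNP 3: strongly associated with the label
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = control, 1 = case

rng = random.Random(0)
informative = screen_informative(columns, labels, keep=2)
subspace = sample_subspace(informative, mtry=1, rng=rng)
print("informative SNPs:", informative)
print("candidate SNPs for one tree node:", subspace)
```

In a full forest, `sample_subspace` would be called at every node split, so that trees only ever consider SNPs that passed the screening step; this is where the dimensionality reduction mentioned in the abstract would come from under these assumptions.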

Received: 22-10-2014

Accepted: 20-12-2014

Issue: Vol. 13 No. 2 (2015)
