Received: 22-10-2014 / Accepted: 20-12-2014
Genome-wide association studies (GWAS) have recently succeeded in identifying genetic variants that affect some complex diseases. Most GWAS have used single-SNP (single-nucleotide polymorphism) approaches, which assess the association between each individual SNP and the disease. However, complex diseases are thought to have complex etiologies that involve complicated interactions among many SNPs. Different approaches are therefore needed to identify SNPs that influence disease risk jointly or through complex interactions. The random forest (RF) method has recently been used successfully in GWAS to identify genetic factors that affect some complex diseases. Although RF achieves good prediction accuracy on some data sets of moderate size, it still has difficulty selecting informative SNPs and building accurate prediction models on GWAS data. In this paper, we propose a new two-stage sampling method for learning random forests. The proposed method selects a subset of informative SNPs that are most relevant to the disease. It thereby reduces dimensionality and can perform well on high-dimensional data sets. We conducted experiments on two genome-wide SNP data sets to demonstrate the effectiveness of the proposed method.