Received: 15-02-2017
Accepted: 08-01-2018
DOI:
Evaluation of Data Classification Methods Based on Random Forest and SVM Before and After Using the Guided Random Forest Feature Selection Method for High-Dimensional Gene Data
Keywords
Classification, high dimensional data, machine learning, random forest, feature selection
Abstract
Data classification is a common technique for mining potential knowledge from large databases. Many classifiers have been proposed, such as AdaBoost, support vector machines (SVM), neural networks, random forests (RF), and C4.5. Among these classifiers, RF and SVM give more accurate and efficient results on high-dimensional data. However, their performance may degrade on gene data with thousands of features, because gene data usually contain many redundant features that are uninformative for classification. Using a subset of selected genes may therefore give better performance than using all the features. In this study, we summarized data classification algorithms based on random forest models and SVM, and evaluated their effectiveness for classifying high-dimensional data. We then evaluated the combination of feature selection using guided random forest (GRF) with classifiers such as RF, WSRF, RUF, and SVM. Seven gene data sets were used to evaluate the methods. Experimental results showed that the combination not only increased the accuracy but also decreased the execution time of the algorithms.
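The select-then-classify workflow described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation. It assumes scikit-learn is available, uses synthetic data in place of the seven gene data sets, and substitutes a plain random-forest importance filter for GRF (GRF itself is distributed in the R package RRF). The data sizes, the number of trees, and the choice to keep 100 features are all illustrative assumptions.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a gene data set: few samples, many features,
# only a handful of them informative (all sizes are assumptions).
X, y = make_classification(
    n_samples=200, n_features=2000, n_informative=20, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: rank features with a random forest fitted on the training split only
# and keep the 100 highest-ranked ones. This is an importance filter used as a
# stand-in for GRF, not GRF itself.
ranker = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(ranker, max_features=100, threshold=-np.inf)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)


def make_models():
    """Return fresh, unfitted classifiers for one comparison run."""
    return {
        "RF": RandomForestClassifier(n_estimators=200, random_state=0),
        "SVM": SVC(kernel="linear"),
    }


def fit_and_score(model, X_tr, X_te):
    """Fit the model and return (test accuracy, fit time in seconds)."""
    start = time.perf_counter()
    model.fit(X_tr, y_train)
    elapsed = time.perf_counter() - start
    return model.score(X_te, y_test), elapsed


# Step 2: train each classifier on all features and on the selected subset.
feature_sets = {
    "all features": (X_train, X_test),
    "selected features": (X_train_sel, X_test_sel),
}
results = {}
for label, (X_tr, X_te) in feature_sets.items():
    for name, model in make_models().items():
        results[(name, label)] = fit_and_score(model, X_tr, X_te)

for (name, label), (acc, secs) in results.items():
    print(f"{name:>3} on {label:<17}: accuracy={acc:.3f}, fit time={secs:.3f}s")
```

The selector is fitted on the training split only, so the test samples do not influence which features are kept. Whether accuracy or fit time actually improves depends on the data and the settings, so the printed numbers from this sketch should not be read as a reproduction of the reported results.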