EVALUATING CLASSIFICATION PERFORMANCE ON HIGH-DIMENSIONAL GENE DATA USING RANDOM FOREST, SVM, AND THEIR COMBINATION WITH GUIDED RANDOM FOREST FEATURE SELECTION

Hoàng Thị Hà

Evaluation of Data Classification Methods Based on Random Forest, SVM before and after Using Feature Selection Method Guided Random Forest for High Dimensional Gene Data

Hoang Thi Ha (*) ¹

¹ Faculty of Information Technology, Vietnam National University of Agriculture

Keywords

Classification, high dimensional data, machine learning, random forest, feature selection

Abstract

Data classification is a common method for mining potential knowledge from large databases. Several methods have been proposed, such as AdaBoost, Support Vector Machine (SVM), neural networks, random forest (RF), and C4.5. Among these classifiers, random forest and SVM have provided more accurate and efficient results for high-dimensional data. However, the performance of these algorithms may degrade when working on gene data with thousands of features, because gene data usually contain many redundant features that are uninformative for classification. Therefore, using a subset of selected genes may give better performance than using all the features. In this study, we summarized classification algorithms based on random forest models and SVM, and evaluated their effectiveness for classifying high-dimensional data. Next, we evaluated the combination of feature selection using guided random forest (GRF) with classifiers such as RF, WSRF, RUF, and SVM. Seven gene data sets were used to evaluate the methods. Experimental results showed that the combination not only increased the accuracy but also decreased the execution time of the algorithms.



Received: 15-02-2017

Accepted: 08-01-2018

Issue: Vol. 15 No. 12 (2017)
