Received: 24-02-2022
Accepted: 20-12-2022
Multi-label Classification and its Application for Vietnamese SMS classification
Keywords
Multi-label classification, SMS classification, spam messages, algorithm adaptation methods, problem transformation methods
Abstract
Most mobile device users are regularly bothered by large numbers of scam messages and by advertising messages in fields such as entertainment, shopping, finance, and real estate. A single SMS message can belong to one or more of these message types at the same time, so single-label classification methods are inappropriate for this task. In this study, we summarize multi-label classification techniques, collect a dataset of 2,000 Vietnamese SMS messages (SMSVN), and improve the accuracy of multi-label classification methods by applying preprocessing techniques that normalize and clean the data. We also apply several well-known classifiers to this dataset. The results show that, after preprocessing, most of the multi-label classification techniques achieved higher accuracy and lower classification error. The Classifier Chains technique with a Naïve Bayes model was suitable for classifying Vietnamese SMS data.
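To make the multi-label setting concrete, the following minimal sketch encodes each message's label set as a 0/1 indicator vector and scores a prediction in two ways. The label names, the toy annotations, and the choice of Hamming loss and subset accuracy as metrics are all assumptions for illustration; the abstract does not state which measures or label inventory the study used.

```python
# Hypothetical label inventory, loosely based on the message types
# mentioned in the abstract (not the actual SMSVN label set).
LABELS = ["scam", "entertainment", "shopping", "finance", "real_estate"]


def to_indicator(label_set):
    """Encode a set of labels as a 0/1 vector, one position per label."""
    return [1 if name in label_set else 0 for name in LABELS]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of messages whose whole label set is predicted exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


# Invented toy annotations (not taken from the SMSVN dataset).
y_true = [
    to_indicator({"shopping", "finance"}),
    to_indicator({"scam"}),
    to_indicator({"entertainment", "shopping"}),
]
# A hypothetical classifier output that misses "finance" on the first message.
y_pred = [
    to_indicator({"shopping"}),
    to_indicator({"scam"}),
    to_indicator({"entertainment", "shopping"}),
]

print(hamming_loss(y_true, y_pred))
print(subset_accuracy(y_true, y_pred))
```

A single-label classifier would have to pick one type for the first message and discard the other. The indicator encoding keeps both, and the two metrics show how a partially correct prediction is penalized lightly per label but counted as a full miss per message.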