MỘT SỐ PHƯƠNG PHÁP PHÂN LỚP ĐA NHÃN VÀ ỨNG DỤNG PHÂN LOẠI TIN NHẮN SMS TIẾNG VIỆT

Hoàng Thị Hà; Đào Xuân Dương; Lê Thị Nhung

Multi-label Classification and its Application for Vietnamese SMS classification

Hoang Thi Ha (*)¹, Dao Xuan Duong², Le Thi Nhung¹

¹ Faculty of Information Technology, Vietnam National University of Agriculture

² Công ty Cổ phần Tin học Viễn thông Bưu điện

Keywords

Multi-label classification, SMS classification, spam messages, algorithm adaptation methods, problem transformation methods

Abstract

Most mobile device users are regularly bothered by large numbers of scam and advertising messages in fields such as entertainment, shopping, finance, and real estate. A single SMS message can belong to more than one of these message types at the same time, so single-label classification methods are not appropriate for classifying such messages. In this study, we summarize multi-label classification techniques, collect a dataset of 2,000 Vietnamese SMS messages (SMSVN), and improve the accuracy of multi-label classification methods by applying preprocessing techniques to normalize and clean the data. We also apply several well-known classifiers to this dataset. The results show that, after preprocessing, most of the multi-label classification techniques achieved higher accuracy and lower classification error. The Classifier Chains technique with a Naïve Bayes model was suitable for classifying Vietnamese SMS data.
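The Classifier Chains idea named in the abstract can be illustrated with a minimal sketch: one binary Naïve Bayes model is trained per label, and each model in the chain also sees the labels that come earlier in the chain as extra features. This is not the authors' implementation. The class names `BinaryNaiveBayes` and `ClassifierChain`, the label names, and the toy English messages are all assumptions made for illustration; the SMSVN dataset and the preprocessing steps are not reproduced here.

```python
import math
from collections import Counter


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one 0/1 label, with Laplace smoothing."""

    def fit(self, docs, labels):
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = {0: Counter(), 1: Counter()}
        for doc, y in zip(docs, labels):
            self.token_counts[y].update(doc)
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        best, best_score = 0, -math.inf
        for c in (0, 1):
            # Smoothed log prior for class c.
            score = math.log((self.class_counts[c] + 1) / (self.n_docs + 2))
            total = sum(self.token_counts[c].values()) + len(self.vocab)
            for tok in doc:
                if tok in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((self.token_counts[c][tok] + 1) / total)
            if score > best_score:
                best, best_score = c, score
        return best


class ClassifierChain:
    """Chain of binary models; model i also sees labels 0..i-1."""

    def __init__(self, n_labels):
        self.models = [BinaryNaiveBayes() for _ in range(n_labels)]

    @staticmethod
    def _augment(doc, previous):
        # Encode earlier labels in the chain as extra pseudo-tokens.
        return list(doc) + [f"__label{j}={v}" for j, v in enumerate(previous)]

    def fit(self, docs, label_rows):
        for i, model in enumerate(self.models):
            # Training uses the true values of the earlier labels.
            feats = [self._augment(d, row[:i]) for d, row in zip(docs, label_rows)]
            model.fit(feats, [row[i] for row in label_rows])
        return self

    def predict(self, doc):
        predicted = []
        for model in self.models:
            # Prediction feeds each model the labels predicted so far.
            predicted.append(model.predict(self._augment(doc, predicted)))
        return predicted


# Hypothetical label set and toy messages (not taken from SMSVN).
LABELS = ["advertising", "finance", "scam"]
docs = [
    "big sale discount shop now".split(),
    "low interest loan apply now".split(),
    "you won a prize send fee to claim".split(),
    "urgent bank account locked send password".split(),
]
label_rows = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]

chain = ClassifierChain(n_labels=len(LABELS)).fit(docs, label_rows)
pred = chain.predict("send fee to claim prize".split())
print(dict(zip(LABELS, pred)))
```

Passing earlier labels down the chain is what lets this approach capture dependencies between message types, which a set of independent per-label classifiers would ignore. The label order is a design choice and can affect the results.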


Received: 24-02-2022

Accepted: 20-12-2022

Issue: Vol. 20 No. 12 (2022)
