Research on Web Attack Traffic Detection Based on TF-IDF and Random Forest Algorithm

Journal of Information Security Research ›› 2018, Vol. 4 ›› Issue (11): 1040-1045.



Received: 2018-11-17   Online: 2018-11-15   Published: 2018-11-17


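Zhu Pengcheng (祝鹏程)¹, Chen Jie (陈洁)², Huang Cheng (黄诚)², Liu Qiang (刘强)²

1. College of Electronics and Information Engineering, Sichuan University
2. College of Cybersecurity, Sichuan University

Corresponding author: Zhu Pengcheng
About the authors:
Zhu Pengcheng, master's student; main research interests: Web security, network attack and defense technology. zpc.scu@gmail.com
Fang Yong, Ph.D., professor; main research interests: information security, network information countermeasures. yfang@scu.edu.cn
Huang Cheng, Ph.D.; main research interests: information security, network attack and defense technology. opcodesec@gmail.com
Liu Qiang, master's student; main research interests: Web security, network attack and defense technology. chance67vip@163.com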

Abstract

Abstract: With the rapid development of network and application technology, Web servers have become the main attack target of hackers. However, traditional Web intrusion detection systems based on regular-expression feature matching suffer from problems such as a rule base that is difficult to maintain and a bloated feature base. Detection models based on machine learning algorithms typically still require features to be extracted manually, and their recognition rate remains low. To address these problems, this paper proposes a new model that applies the TF-IDF algorithm at both the word level and the character level, combines the two resulting term-frequency matrices into feature vectors, and classifies the vector set with the random forest algorithm to separate malicious traffic from normal traffic. Experiments show that the model's detection rate reaches 98.7%. The experimental results also show that the model extracts features automatically and simplifies the detection method, making it well suited to detecting malicious Web traffic.
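The feature-extraction step described in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the tokenizer, the character n-gram length `n=3`, the smoothed IDF formula, the `w:`/`c:` column prefixes, and the sample payloads are all assumptions chosen for illustration. It computes one TF-IDF matrix over word tokens and one over character n-grams, then concatenates the two row by row.

```python
import math
import re
from collections import Counter


def word_tokens(payload):
    """Split a payload into lowercase alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", payload.lower())


def char_ngrams(payload, n=3):
    """Return the overlapping character n-grams of a lowercased payload."""
    text = payload.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_matrix(docs_tokens):
    """Dense TF-IDF matrix using relative term frequency and a smoothed IDF."""
    vocab = sorted({tok for toks in docs_tokens for tok in toks})
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in docs_tokens for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in docs_tokens:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty payloads
        rows.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows


def combined_features(payloads, n=3):
    """Concatenate word-level and character-level TF-IDF rows per payload."""
    word_vocab, word_rows = tfidf_matrix([word_tokens(p) for p in payloads])
    char_vocab, char_rows = tfidf_matrix([char_ngrams(p, n) for p in payloads])
    names = ["w:" + t for t in word_vocab] + ["c:" + t for t in char_vocab]
    rows = [w + c for w, c in zip(word_rows, char_rows)]
    return names, rows


# Hypothetical payloads: one benign query string and two attack-like strings.
payloads = [
    "id=1&name=alice",
    "id=1' or '1'='1",
    "q=<script>alert(1)</script>",
]
names, rows = combined_features(payloads)
```

Each row of `rows` is one payload's combined feature vector, and `names` labels its columns. Character n-grams keep punctuation such as `<`, `'`, and `=`, which word tokenization discards; this is one plausible reason to combine the two views.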

Key words: TF-IDF, Random Forest, Data normalization, Feature extraction, Web attack traffic detection

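Abstract (translated from Chinese): With the development of network technology and applications, Web servers have inevitably become the main attack target of hackers. Traditional Web intrusion detection systems based on regular-expression matching suffer from rule bases that are difficult to maintain and bloated feature bases; conventional detection models based on machine learning suffer from complex feature extraction and low recognition rates. To address these problems, this paper proposes a Web attack traffic detection model built on TF-IDF and random forest. The model uses the TF-IDF algorithm to build a term-frequency matrix that automatically extracts features of the payload, and uses the random forest algorithm for classification modeling to distinguish normal traffic from attack traffic. Experimental results show that the method achieves a 98.7% detection rate on attack traffic, extracts features automatically, simplifies the detection method, and is suitable for detecting Web attack traffic.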
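The classification stage can be sketched in the same spirit. The block below is a deliberately simplified stand-in, not the authors' model: it builds a toy forest of depth-1 trees (decision stumps), each trained on a bootstrap sample with a random feature subset and combined by majority vote, whereas a full random forest grows deeper trees. The function names, the tree count, the seed, and the toy feature vectors (standing in for TF-IDF rows, with label 0 for normal and 1 for attack) are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    default = majority(y, 0)
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left, default), majority(right, default))
    if best is None:  # no split possible: always predict the majority class
        return (feature_ids[0], 0.0, default, default)
    return best[1:]


def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(d), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy feature vectors: low values for normal traffic, high values for attacks.
normal = [[0.02 * i, 0.2 - 0.02 * i, 0.01 * i] for i in range(8)]
attack = [[0.8 + 0.02 * i, 1.0 - 0.02 * i, 0.9 + 0.01 * i] for i in range(8)]
X = normal + attack
y = [0] * 8 + [1] * 8
forest = fit_forest(X, y)
pred_normal = predict(forest, [0.05, 0.10, 0.03])
pred_attack = predict(forest, [0.90, 0.90, 0.95])
```

Fixing the seed makes the bootstrap samples and feature subsets reproducible, so repeated training runs yield the same forest. In practice a library implementation with deeper trees would replace `fit_stump`.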


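Cite this article: Zhu Pengcheng, Chen Jie, Huang Cheng, Liu Qiang. Research on Web Attack Traffic Detection Based on TF-IDF and Random Forest Algorithm [J]. Journal of Information Security Research, 2018, 4(11): 1040-1045.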

