Journal of Information Security Research ›› 2018, Vol. 4 ›› Issue (11): 1040-1045.


Research on Web Attack Traffic Detection Based on TF-IDF and Random Forest Algorithm

  

  • Received: 2018-11-17 Online: 2018-11-15 Published: 2018-11-17

Research on a Web Attack Traffic Detection Method Based on TF-IDF and the Random Forest Algorithm

Zhu Pengcheng (祝鹏程)1, Chen Jie (陈洁)2, Huang Cheng (黄诚)2, Liu Qiang (刘强)2

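  1. College of Electronics and Information Engineering, Sichuan University
  2. College of Cybersecurity, Sichuan University
  • Corresponding author: Zhu Pengcheng
  • About the authors: Zhu Pengcheng, master's student; main research interests: Web security and network attack and defense technology. zpc.scu@gmail.com Fang Yong, Ph.D., professor; main research interests: information security and network information countermeasures. yfang@scu.edu.cn Huang Cheng, Ph.D.; main research interests: information security and network attack and defense technology. opcodesec@gmail.com Liu Qiang, master's student; main research interests: Web security and network attack and defense technology. chance67vip@163.com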

Abstract: With the rapid development of network and application technology, Web servers have become the main attack target of hackers. However, traditional Web intrusion detection systems based on regular-expression matching suffer from rule bases that are difficult to maintain and bloated feature bases. Detection models based on machine learning algorithms often still require manual feature extraction, and their recognition rate is not high. To address these problems, this paper proposes a model that applies the TF-IDF algorithm at both the word level and the character level, combines the two resulting term-frequency matrices into feature vectors, and classifies the vectors with the random forest algorithm to separate malicious traffic from normal traffic. In the experiments, the model's detection rate reached 98.7%. The results also show that the model extracts features automatically and simplifies the detection method, making it well suited to detecting malicious Web traffic.

Key words: TF-IDF, Random Forest, Data normalization, Feature extraction, Web attack traffic detection

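Abstract (translated from the Chinese): With the development of network technology and applications, Web servers have inevitably become the main attack target of hackers. Traditional Web intrusion detection systems based on regular-expression matching suffer from rule bases that are hard to maintain and bloated feature bases; conventional detection models based on machine learning suffer from complex feature extraction and a low recognition rate. To address these problems, this paper proposes a Web attack traffic detection model built on TF-IDF and random forest. The model uses the TF-IDF algorithm to build a term-frequency matrix and automatically extract features from the payload, then uses the random forest algorithm for classification modeling to distinguish normal traffic from attack traffic. Experimental results show that the method reaches a detection rate of 98.7% on attack traffic, extracts features automatically, simplifies the detection method, and is suitable for detecting Web attack traffic.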

Key words (translated from the Chinese): TF-IDF, Random Forest, Data normalization, Feature extraction, Web attack traffic detection
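
As a rough illustration of the feature-construction step the abstract describes, the sketch below computes TF-IDF weights over word tokens and over character n-grams and concatenates the two vectors for each request. The tokenizer, the n-gram length, the smoothed IDF formula, the sample payloads, and all function names are assumptions made for illustration; none of them are taken from the paper.

```python
import math
import re
from collections import Counter


def word_tokens(payload):
    """Split a payload on non-alphanumeric characters (an assumed tokenizer)."""
    return [t for t in re.split(r"[^0-9A-Za-z_]+", payload.lower()) if t]


def char_tokens(payload, n=2):
    """Overlapping character n-grams; n=2 is an illustrative choice."""
    s = payload.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def fit_idf(token_lists):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf, vocab):
    """TF-IDF weights in vocabulary order; terms outside the vocabulary are ignored."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty payloads
    return [counts[t] / total * idf[t] for t in vocab]


def build_features(payloads, char_n=2):
    """Concatenate word-level and character-level TF-IDF vectors per payload."""
    word_lists = [word_tokens(p) for p in payloads]
    char_lists = [char_tokens(p, char_n) for p in payloads]
    word_idf, char_idf = fit_idf(word_lists), fit_idf(char_lists)
    word_vocab, char_vocab = sorted(word_idf), sorted(char_idf)
    rows = [
        tfidf_vector(w, word_idf, word_vocab) + tfidf_vector(c, char_idf, char_vocab)
        for w, c in zip(word_lists, char_lists)
    ]
    return rows, word_vocab, char_vocab


# Hypothetical request payloads: one benign, two attack-like.
payloads = [
    "/index.php?id=1",
    "/index.php?id=1' or '1'='1",
    "/search?q=<script>alert(1)</script>",
]
features, word_vocab, char_vocab = build_features(payloads)
```

In the setup the abstract describes, rows like these, paired with normal/attack labels, would be the input to a random forest classifier; that training step is omitted here.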