Journal of Information Security Research ›› 2019, Vol. 5 ›› Issue (4): 303-308.

• Academic Papers •

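Research on Anti-Scanning Technology Based on Machine Learning

Tang Qibiao (唐其彪), Yang Bo (杨勃), Pan Limin (潘利民)

  1. Storm Center, Hangzhou DBAPPSecurity Information Technology Co., Ltd.
  • Received: 2019-04-08 Online: 2019-04-15 Published: 2019-04-08
  • Corresponding author: Tang Qibiao
  • About the authors: Tang Qibiao, M.S.; research interests: network security, intelligent cloud WAF protection, and adversarial defense. qibiao.tang@dbappsecurity.com.cn Yang Bo, M.S., senior engineer; research interests: threat intelligence, situational awareness, financial security, and smart-city security. bob.yang@dbappsecurity.com.cn Pan Limin, M.S., senior engineer; research interests: cloud security, perimeter security, data security, and intelligent security. limin.pan@dbappsecurity.com.cn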

Abstract: With the development of Internet technology, Web application systems are now widely used in government portals, e-commerce, and other Internet industries. They make life and work more convenient but also introduce network security risks. Attackers use scanning to find server vulnerabilities to exploit, and the large volume of packets that scanning generates also consumes substantial network bandwidth, which can prevent normal network communication. To address this problem, this paper proposes a method that identifies scanning behavior by parsing client access logs, extracting seven features, and classifying them with the naive Bayes algorithm. Four features are computed over a 2 s window: the response code of the current IP's request, the current IP's share of the requests from all IPs, the share of the current IP's requests that returned 404, and the variance of the ports accessed by the current IP. Three features are computed over the most recent 100 log entries: the current IP's share of the requests, the number of 404 responses returned to the current IP, and the variance of the ports accessed by the current IP. The naive Bayes implementation in Spark MLlib is trained on scan logs stored in HDFS, and the model is updated on a schedule to maintain the ability to counter malicious scanning. Finally, scanning IPs are blocked at the network layer with iptables. The method improves recognition accuracy, lowers the false positive rate, effectively reduces malicious traffic, and protects customer websites.

Key words: anti-scanning, machine learning, naive Bayes algorithm, network security, Spark, iptables
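
The seven features named in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `LogEntry` fields (`ts`, `ip`, `status`, `port`), the function names, and the use of population variance are all hypothetical, since the abstract does not specify the log format, which port is meant, or how the variance is computed.

```python
from collections import namedtuple
from statistics import pvariance

# Hypothetical log record: timestamp in seconds, client IP, HTTP status, port.
LogEntry = namedtuple("LogEntry", "ts ip status port")


def _window_stats(window, ip):
    """Return (entries of `ip`, its share of the window, its 404 count, its port variance)."""
    mine = [e for e in window if e.ip == ip]
    share = len(mine) / len(window) if window else 0.0
    n404 = sum(1 for e in mine if e.status == 404)
    # Population variance is an assumption; a single port gives 0.
    port_var = float(pvariance([e.port for e in mine])) if mine else 0.0
    return mine, share, n404, port_var


def extract_features(logs, index, time_window=2.0, count_window=100):
    """Build the 7-element feature vector for logs[index].

    `logs` must be sorted by ascending timestamp. Both windows end at,
    and include, the current entry.
    """
    cur = logs[index]
    history = logs[: index + 1]
    # Entries seen within the last `time_window` seconds.
    recent = [e for e in history if cur.ts - e.ts <= time_window]
    # The most recent `count_window` entries.
    last_n = history[-count_window:]

    mine_t, share_t, n404_t, var_t = _window_stats(recent, cur.ip)
    _, share_n, n404_n, var_n = _window_stats(last_n, cur.ip)

    return [
        cur.status,             # 1. response code of the current request
        share_t,                # 2. this IP's share of requests in the 2 s window
        n404_t / len(mine_t),   # 3. share of this IP's requests that were 404 (2 s)
        var_t,                  # 4. port variance for this IP (2 s)
        share_n,                # 5. this IP's share of the last 100 entries
        n404_n,                 # 6. this IP's 404 count in the last 100 entries
        var_n,                  # 7. port variance for this IP (last 100 entries)
    ]
```

A vector built this way for each log entry, paired with a scan/normal label, is the kind of input a naive Bayes classifier would be trained on. An IP that requests many missing pages across many ports in a short burst would show high values for features 2 through 4.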