A Malicious Code Detection Method Based on Data Mining and Machine Learning

Journal of Information Security Research (信息安全研究) ›› 2016, Vol. 2 ›› Issue (1): 74-79.

Liao Guohui (廖国辉)

College of Electronics and Information Engineering, Sichuan University

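Received: 2015-11-25 Online: 2016-01-05 Published: 2016-01-18
Corresponding author: Liao Guohui
About the authors: Liao Guohui, master's student; main research interests: malicious code detection, network information processing, and information security. sculiaoguohui@yeah.net. Liu Jiayong (刘嘉勇), professor and doctoral supervisor; main research interests: information security theory and applications, network information processing, and information security. ljy@scu.edu.cn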

Abstract

Abstract: In recent years, malicious code has used junk instructions (also called flower instructions), packing, and similar techniques to evade antivirus detection, and existing methods cannot accurately identify variants of malicious code. Given the threat that malicious code poses to computer security, how quickly it spreads, and how many varieties exist, this paper applies data mining and machine learning to identify and detect malicious code. First, it proposes a malicious code detection framework based on data mining and machine learning and extracts code features at three levels: the text structure layer, the byte layer, and the code layer. Second, it applies principal component analysis to reduce the dimensionality of the combined features from the three layers. Finally, it identifies and classifies malicious code with several classification methods. The results show that every classification method based on the combined features achieves a recognition accuracy above 90%, with the decision tree achieving the highest accuracy. The approach can therefore detect variants of malicious code effectively and offers an effective method for malware detection and removal.

Key words: malicious code, multidimensional feature, data mining, machine learning, code detection
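
The pipeline summarized in the abstract (feature extraction, principal component analysis, classification) can be illustrated with a minimal sketch. The abstract does not specify the concrete features, the number of retained components, or the classifier settings, so everything below is an assumption made for illustration: `byte_features` is a hypothetical byte-layer feature (a 16-bin byte histogram plus Shannon entropy), `pca` uses power iteration with deflation, and the classifier is a one-level decision tree (a decision stump) rather than a full decision tree. The samples are synthetic byte strings, not real malware, and high entropy is used only as a toy stand-in for packed content.

```python
import math
import random
from collections import Counter


def byte_features(data: bytes) -> list[float]:
    """Hypothetical byte-layer features: 16-bin byte histogram plus Shannon entropy."""
    n = max(len(data), 1)
    counts = Counter(data)
    hist = [0.0] * 16
    for value, count in counts.items():
        hist[value >> 4] += count / n  # bin each byte by its high nibble
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return hist + [entropy]


def pca(rows, k, iters=200):
    """Project rows onto the top-k principal components (power iteration + deflation)."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (len(rows) - 1)
            for b in range(d)] for a in range(d)]
    rng = random.Random(0)
    components = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue; subtract it out (deflation).
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]


def fit_stump(X, y):
    """Fit a one-level decision tree: the (feature, threshold, label) with fewest errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[j] <= t else 1 - left_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, left_label)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_label = stump
    return left_label if row[j] <= t else 1 - left_label


def demo():
    """Train and evaluate on synthetic data; returns training accuracy."""
    rng = random.Random(42)
    # Label 0: low-entropy, text-like bytes. Label 1: high-entropy random bytes.
    plain = [bytes(rng.choice(b"abcdefgh ") for _ in range(512)) for _ in range(10)]
    dense = [bytes(rng.randrange(256) for _ in range(512)) for _ in range(10)]
    X = [byte_features(s) for s in plain + dense]
    y = [0] * 10 + [1] * 10
    reduced = pca(X, k=2)  # 17 features -> 2 principal components
    stump = fit_stump(reduced, y)
    correct = sum(predict_stump(stump, row) == label for row, label in zip(reduced, y))
    return correct / len(y)


if __name__ == "__main__":
    print("training accuracy:", demo())
```

The sketch reports accuracy on its own training data only, so it says nothing about how a real detector generalizes. In practice, library implementations such as scikit-learn's `PCA` and `DecisionTreeClassifier` would usually replace the hand-written `pca` and `fit_stump` helpers.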

Liao Guohui. A Malicious Code Detection Method Based on Data Mining and Machine Learning[J]. Journal of Information Security Research, 2016, 2(1): 74-79.
