Journal of Information Security Research ›› 2016, Vol. 2 ›› Issue (1): 74-79.


A Malicious Code Detection Method Based on Data Mining and Machine Learning


  • Received: 2015-11-25 Online: 2016-01-05 Published: 2016-01-18



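  1. College of Electronics and Information Engineering, Sichuan University (四川大学电子信息学院)
  • Corresponding author: Liao Guohui (廖国辉)
  • About the authors: Liao Guohui (廖国辉) is a master's student whose main research interests are malicious code detection, network information processing, and information security. Liu Jiayong (刘嘉勇) is a professor and doctoral supervisor whose main research interests are information security theory and applications, network information processing, and information security.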

Abstract: In recent years, malicious code has used junk instructions ("flower instructions"), packers, and other techniques to evade detection by antivirus software, and existing methods cannot accurately identify variants of malicious code. Given the threat that malicious code poses to computer security, as well as its rapid spread and wide variety, this paper applies data mining and machine learning to recognize and detect malicious code. First, it proposes a malicious code detection framework based on data mining and machine learning and extracts code features from three layers: the text structure layer, the byte layer, and the code layer. Second, it applies principal component analysis to reduce the dimensionality of the combined feature matrix. Finally, it recognizes and classifies malicious code using several classification methods. The results show that every classification method based on the combined feature matrix achieves an accuracy above 90%, with the decision tree achieving the highest accuracy. The approach can effectively detect variants of malicious code and offers an effective method for malware removal.

Key words: malicious code, multidimensional feature, data mining, machine learning, code detection
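
The abstract names the byte layer as one of three feature sources but does not say which byte-level features are used. As an illustration only, the sketch below assumes two common choices: a normalized byte-value histogram and the Shannon entropy of the byte distribution (packed or encrypted code often has higher entropy). The function names and the sample inputs are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List


def byte_histogram(data: bytes) -> List[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte distribution, in bits per byte."""
    return float(sum(-p * math.log2(p) for p in byte_histogram(data) if p > 0))


def byte_layer_features(data: bytes) -> List[float]:
    """Concatenate histogram and entropy into one 257-dimensional vector."""
    return byte_histogram(data) + [shannon_entropy(data)]


# Hypothetical samples standing in for real executables (assumption).
samples = [b"\x00" * 64, bytes(range(256))]

# One row per sample; in a full pipeline, rows like these would be joined
# with text-structure-layer and code-layer features before dimensionality
# reduction and classification.
feature_matrix = [byte_layer_features(sample) for sample in samples]
```

In a pipeline like the one the abstract describes, a matrix of this kind would then be reduced with principal component analysis and passed to a classifier such as a decision tree; those steps are omitted here.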
