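Malware Detection Based on Application Programming Interface Sequence and Convolutional Neural Network

信息安全研究 (Journal of Information Security Research), 2020, Vol. 6, Issue 3: 212-219.

Wang Xingfeng (王兴凤), Huang Kunming (黄琨茗), Zhang Wenjie (张文杰)

College of Cybersecurity, Sichuan University (四川大学网络空间安全学院)

Received: 2020-03-02 Online: 2020-03-10 Published: 2020-03-02
Corresponding author: Wang Xingfeng

About the authors: Wang Xingfeng is a master's student whose main research interest is cyberspace security. 1209380476@qq.com
Huang Kunming is a master's student whose main research interest is malware detection. 1135750477@qq.com
Zhang Wenjie is a master's student whose main research interest is malware detection. 1265616844@qq.com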

Abstract

Abstract: Convolutional neural networks (CNNs) have been widely used in many fields. Windows Application Programming Interface (API) sequences have a sequential structure in which each call depends on the calls before and after it, so detecting malware with a convolutional neural network alone ignores the contextual semantics of each API "word". This paper therefore uses a word embedding model to pretrain word vectors from API sequences, and fuses five convolution kernels of different sizes to compensate for the tendency of conventional convolutional networks to lose the temporal and grammatical information in a sequence. Sample files are run in the Cuckoo sandbox, their dynamic API sequences are extracted and deduplicated, and the pretrained word vectors are fed into a multi-kernel fusion CNN to train a malware detection model. Finally, the model is evaluated on a test set, where it reaches an accuracy of 98.1%, indicating that the proposed method detects malware effectively.

Key words: malware, API sequence, word embedding, multi-kernel fusion, CNN
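
The pipeline summarized in the abstract (deduplicate an API trace, map each call to a word vector, then apply several convolution kernels of different widths and pool each one) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: deduplication is read as collapsing consecutive repeated calls, the five kernel widths are set to 2 through 6, the embedding dimension is 8, and random vectors and weights stand in for the pretrained word vectors and the trained kernels. The sample trace is hypothetical.

```python
import random
from itertools import groupby


def dedup_consecutive(api_calls):
    """Collapse each run of identical consecutive API calls into a single call."""
    return [name for name, _ in groupby(api_calls)]


def build_embeddings(vocab, dim, seed=0):
    """Stand-in for pretrained word vectors: one random vector per API name."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-1, 1) for _ in range(dim)] for name in sorted(vocab)}


def conv_max_pool(seq_vectors, kernel, bias=0.0):
    """Slide one kernel (k rows x dim weights) over the sequence,
    apply ReLU, and keep the maximum activation (max-over-time pooling)."""
    k = len(kernel)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(seq_vectors) - k + 1):
        window = seq_vectors[start:start + k]
        activation = bias + sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        best = max(best, activation)
    return best


def multi_kernel_features(seq_vectors, kernels):
    """Fuse the pooled outputs of all kernels into one feature vector."""
    return [conv_max_pool(seq_vectors, kernel) for kernel in kernels]


# Hypothetical trace of Windows API calls, as a sandbox might record them.
trace = [
    "LdrLoadDll", "LdrLoadDll", "NtCreateFile", "NtWriteFile", "NtWriteFile",
    "NtWriteFile", "NtClose", "RegOpenKeyExW", "RegSetValueExW",
    "RegSetValueExW", "RegCloseKey", "NtClose",
]
calls = dedup_consecutive(trace)

DIM = 8                          # assumed embedding dimension
KERNEL_SIZES = (2, 3, 4, 5, 6)   # assumed widths for the five kernels

embeddings = build_embeddings(set(calls), DIM)
seq_vectors = [embeddings[name] for name in calls]

rng = random.Random(1)
kernels = [
    [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(size)]
    for size in KERNEL_SIZES
]
features = multi_kernel_features(seq_vectors, kernels)
print(calls)
print(features)
```

In a real model each kernel width would have many filters, the fused feature vector would feed a classifier layer, and all weights would be learned from labeled samples; the sketch only shows how kernels of different widths look at API n-grams of different lengths within the same sequence.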

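Wang Xingfeng, Huang Kunming, Zhang Wenjie. Malware Detection Based on Application Programming Interface Sequence and Convolutional Neural Network[J]. Journal of Information Security Research, 2020, 6(3): 212-219.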
