Journal of Information Security Research (信息安全研究) ›› 2020, Vol. 6 ›› Issue (3): 212-219.

• Academic Papers •

Malware Detection Based on API Sequences and Convolutional Neural Networks

Wang Xingfeng (王兴凤), Huang Kunming (黄琨茗), Zhang Wenjie (张文杰)

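  1. School of Cyber Science and Engineering, Sichuan University (四川大学网络空间安全学院)
  • Received: 2020-03-02 Online: 2020-03-10 Published: 2020-03-02
  • Corresponding author: Wang Xingfeng (王兴凤)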

Malware Detection Based on Application Programming Interface Sequence and Convolutional Neural Network

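  • Received: 2020-03-02 Online: 2020-03-10 Published: 2020-03-02
  • About the authors: Wang Xingfeng (王兴凤), master's student; main research interest: cyberspace security. 1209380476@qq.com. Huang Kunming (黄琨茗), master's student; main research interest: malware detection. 1135750477@qq.com. Zhang Wenjie (张文杰), master's student; main research interest: malware detection. 1265616844@qq.com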

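Abstract: Convolutional neural networks (CNNs) are widely used in many fields. Windows API sequences carry sequential dependencies between calls, so detecting malicious code with a CNN alone ignores the contextual semantics of each call. This work therefore trains a word embedding model on the API sequences and fuses five convolution kernels of different sizes to compensate for the tendency of conventional convolutional networks to lose the temporal and syntactic information in a sequence. Sample files are run in the Cuckoo sandbox; the dynamic API sequences are extracted and deduplicated, word vectors are pretrained, and the vectors are fed into the multi-kernel fusion CNN to train a malware detection model. Finally, the model is evaluated on a test set, where accuracy reaches 98.1%, indicating that the proposed method detects malicious code effectively.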

Key words: malware, API sequence, word embedding, multi-kernel fusion, convolutional neural network
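The abstract says the dynamic API sequences are deduplicated before word vectors are trained, but it does not say how. The following is a minimal sketch under one common reading, namely that runs of identical consecutive calls are collapsed into a single call. The function names, the padding scheme, and the sample trace are all hypothetical and are not taken from the paper.

```python
from itertools import groupby


def collapse_repeats(api_calls):
    """Collapse runs of identical consecutive API names into a single call."""
    return [name for name, _ in groupby(api_calls)]


def build_vocab(sequences):
    """Map each distinct API name to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for seq in sequences:
        for name in seq:
            vocab.setdefault(name, len(vocab))
    return vocab


def encode(seq, vocab, max_len):
    """Turn API names into ids, truncated or right-padded to max_len."""
    ids = [vocab[name] for name in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Hypothetical trace, standing in for calls extracted from a sandbox report.
trace = ["NtOpenFile", "NtReadFile", "NtReadFile", "NtReadFile",
         "NtClose", "NtOpenFile"]

deduped = collapse_repeats(trace)
vocab = build_vocab([deduped])
encoded = encode(deduped, vocab, max_len=6)

print(deduped)   # ['NtOpenFile', 'NtReadFile', 'NtClose', 'NtOpenFile']
print(encoded)   # [1, 2, 3, 1, 0, 0]
```

Collapsing only consecutive repeats keeps the order of distinct calls, which matters if the downstream model is meant to exploit the sequence structure. The integer ids would then index into whatever embedding table the word vector model produces.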

Abstract: Convolutional neural networks (CNNs) have been widely used in many fields. Calls in a Windows Application Programming Interface (API) sequence are structurally dependent on one another, so using a convolutional neural network alone to detect malware ignores the contextual semantics of the words. Therefore, this paper uses a word embedding model to pretrain on the API sequences and form word vectors. Then, five convolution kernels of different sizes are fused to compensate for the tendency of traditional convolutional networks to lose sequence timing information and to ignore word context semantics and grammatical information. Samples are run in the Cuckoo sandbox, and the dynamic API sequences are extracted and deduplicated. The word vectors are pretrained using the word embedding method and fed into a multi-kernel fusion CNN to train a malware detection model. Finally, the model is evaluated on a test set, where accuracy reaches 98.1%. The results show that the proposed method can effectively detect malware.

Key words: malware, API sequence, word embedding, multi-kernel fusion, CNN
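To make the idea of fusing convolution kernels of different sizes concrete, here is a small forward-pass sketch in plain Python. It assumes a text-CNN style design: each kernel slides over the embedded API sequence, a ReLU is applied, the maximum response over time is kept, and the pooled values from all kernel sizes are concatenated. The abstract confirms only that five kernel sizes are fused; the specific sizes `(2, 3, 4, 5, 6)`, the max-over-time pooling, the embedding dimension, and the random weights are illustrative assumptions, not the authors' configuration.

```python
import random


def conv1d_max(embedded, kernel, bias=0.0):
    """Slide one kernel (k rows x dim weights) over the sequence, apply ReLU,
    and keep the largest response (max-over-time pooling)."""
    k = len(kernel)
    best = 0.0  # a ReLU output is never below zero
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        act = bias + sum(w * x
                         for w_row, x_row in zip(kernel, window)
                         for w, x in zip(w_row, x_row))
        best = max(best, act)
    return best


def multi_kernel_features(embedded, kernels_by_size):
    """Concatenate the pooled responses of every kernel of every size."""
    features = []
    for size in sorted(kernels_by_size):
        for kernel in kernels_by_size[size]:
            features.append(conv1d_max(embedded, kernel))
    return features


rng = random.Random(0)
dim, seq_len, filters_per_size = 8, 20, 4
kernel_sizes = (2, 3, 4, 5, 6)  # hypothetical: the abstract gives no sizes

# Random stand-ins for pretrained API word vectors and learned kernel weights.
embedded = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
kernels_by_size = {
    k: [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(k)]
        for _ in range(filters_per_size)]
    for k in kernel_sizes
}

features = multi_kernel_features(embedded, kernels_by_size)
print(len(features))  # 5 sizes x 4 filters = 20
```

In a trained model the fused feature vector would feed a classifier layer that outputs a malicious or benign label. Each kernel size looks at a different number of adjacent API calls, which is one way to capture short and longer call patterns at once.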