Journal of Information Security Research (信息安全研究), 2020, Vol. 6, Issue 2: 113-121.

• Academic Paper •

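Malware Detection Based on Word Embedding Features of Assembly Instructions

Yang Pin (杨频)1, Pan Yuelei (潘岳镭)1, Jia Peng (贾鹏)1, Liu Liang (刘亮)2

  1. School of Cyberspace Security, Sichuan University
  2. School of Cyberspace, Sichuan University
  • Received: 2020-02-08 Online: 2020-02-10 Published: 2020-02-08
  • Corresponding author: Yang Pin
  • About the authors: Yang Pin, born in 1967, Ph.D., professor; main research interest: software security; yangpin@scu.edu.cn. Pan Yuelei, born in 1994, master's student; main research interest: binary security; 475779554@qq.com. Jia Peng, born in 1988, Ph.D., assistant researcher; main research interests: mobile security, binary security, and complex networks; pengjia@scu.edu.cn. Liu Liang, born in 1982, Ph.D., senior engineer; main research interests: system and application security; liangzhai118@163.com.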

Abstract: Current machine-learning-based malware detection methods typically extract features through static and dynamic analysis and then train a machine learning classifier on them. The accuracy of such methods depends on the quality of the manually selected features, and valuable feature information can be lost during selection, which degrades classification performance. To address this problem, a malware detection model based on word embedding features of assembly instructions is proposed. First, a disassembly tool is used to extract the assembly instructions from the malware, and rules are defined to replace some instructions in order to reduce complexity. Then, a word embedding model from natural language processing is used to learn the similarity between instructions and obtain a vector representation of each instruction. Finally, executable files are classified with a hybrid model that combines a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM). This approach effectively addresses the poor feature quality and loss of important information associated with manual feature selection. Multiple sets of comparative experiments on the dataset show that the method achieves 98.8% classification accuracy and a 98.7% F1 score, clearly outperforming the comparison algorithms.

Key words: malware detection, assembly instruction, word vector, convolutional neural network, bidirectional long short-term memory
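
The first step described in the abstract, extracting assembly instructions and replacing parts of them by rule to shrink the vocabulary, can be illustrated with a minimal sketch. The abstract does not list the actual replacement rules, so everything below is an assumption made for illustration: the placeholder tokens `MEM`, `IMM`, and `ADDR`, the IDA-style `sub_`/`loc_` label pattern, and the function names `normalize_operand`, `normalize_instruction`, and `tokenize` are all hypothetical.

```python
import re

# Hypothetical normalization rules (assumptions; the paper's real rules
# are not given in the abstract).
MEMORY = re.compile(r"\[.*\]")                             # e.g. [ebp+8]
IMMEDIATE = re.compile(r"0x[0-9a-fA-F]+|[0-9][0-9a-fA-F]*h|\d+")
LABEL = re.compile(r"(sub|loc)_[0-9a-fA-F]+")              # IDA-style labels


def normalize_operand(operand: str) -> str:
    """Replace operand details that inflate the vocabulary with placeholders."""
    operand = operand.strip()
    if MEMORY.search(operand):
        return "MEM"
    if IMMEDIATE.fullmatch(operand):
        return "IMM"
    if LABEL.fullmatch(operand):
        return "ADDR"
    return operand.lower()  # registers and anything unrecognized stay as-is


def normalize_instruction(line: str) -> str:
    """Turn one disassembled instruction into a single vocabulary token."""
    line = line.split(";", 1)[0].strip()  # drop trailing comments
    if not line:
        return ""
    parts = line.split(None, 1)           # mnemonic, then the operand string
    mnemonic = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    operands = [normalize_operand(op) for op in rest.split(",") if op.strip()]
    return "_".join([mnemonic] + operands)


def tokenize(listing: str) -> list[str]:
    """Convert a disassembly listing into a sequence of instruction tokens."""
    tokens = (normalize_instruction(line) for line in listing.splitlines())
    return [token for token in tokens if token]
```

Under these assumed rules, `mov eax, [ebp+8]` becomes the token `mov_eax_MEM`, so instructions that differ only in a concrete offset or constant share one vocabulary entry. A token sequence of this kind is what a word embedding model would then be trained on before classification.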