Malware Detection Based on Word Embedding Features of Assembly Instruction

信息安全研究 (Journal of Information Security Research), 2020, Vol. 6, Issue 2: 113-121.


Yang Pin¹, Pan Yuelei¹, Jia Peng¹, Liu Liang²

1. School of Cyberspace Security, Sichuan University
2. School of Cyberspace, Sichuan University

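Received: 2020-02-08 Online: 2020-02-10 Published: 2020-02-08
Corresponding author: Yang Pin
About the authors:
Yang Pin, born in 1967, Ph.D., professor. Main research interest: software security. yangpin@scu.edu.cn
Pan Yuelei, born in 1994, master's student. Main research interest: binary security. 475779554@qq.com
Jia Peng, born in 1988, Ph.D., assistant researcher. Main research interests: mobile security, binary security, and complex networks. pengjia@scu.edu.cn
Liu Liang, born in 1982, Ph.D., senior engineer. Main research interest: system and application security. liangzhai118@163.com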


Abstract

Abstract: Current machine-learning-based malware detection methods extract features through static and dynamic analysis and then train a machine learning classifier on them. The accuracy of such methods depends on the quality of the manually selected features, and valuable feature information can be lost during selection, which degrades classification performance. To address this problem, a malware detection model based on word embedding features of assembly instructions is proposed. First, a disassembly tool is used to extract the assembly instructions of the malware, and rules are formulated to replace some instructions in order to reduce complexity. Then, a word embedding model from natural language processing learns the similarity between instructions and yields a vector representation of each instruction. Finally, executable files are classified using a hybrid model of a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM). This approach effectively addresses the poor feature quality and loss of important information associated with manual feature selection. Multiple sets of comparison experiments on the data set show that the method achieves 98.8% classification accuracy and a 98.7% F1 score, significantly better than the compared algorithms.

Key words: malware detection, assembly instruction, word embedding, convolutional neural network (CNN), bidirectional long short-term memory (BiLSTM)
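
The first two stages described in the abstract (rule-based instruction replacement, then preparing instruction sequences for a word embedding model) can be illustrated with a minimal sketch. The abstract does not state the actual replacement rules or name the embedding model, so everything here is an assumption for illustration: the placeholder tokens `MEM` and `IMM`, the function names, and the use of skip-gram style (center, context) pairs as training input.

```python
import re

# Hypothetical rule: numeric immediates (decimal, 0x-prefixed, or h-suffixed
# with a leading digit) are collapsed into one placeholder token.
IMMEDIATE = re.compile(r"^-?(0x[0-9a-f]+|[0-9][0-9a-f]*h|[0-9]+)$")


def normalize_operand(operand: str) -> str:
    """Return a placeholder for memory/immediate operands, else the operand."""
    operand = operand.strip().lower()
    if "[" in operand:            # hypothetical rule: any bracketed memory reference
        return "MEM"
    if IMMEDIATE.match(operand):  # hypothetical rule: any numeric constant
        return "IMM"
    return operand                # registers and symbols are kept as-is


def normalize_instruction(line: str) -> str:
    """Turn one disassembled line (assumed non-empty) into a single token."""
    parts = line.split(None, 1)
    mnemonic = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    operands = [normalize_operand(op) for op in rest.split(",") if op.strip()]
    return "_".join([mnemonic, *operands])


def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric context window."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


# Usage: normalize a short listing, then build training pairs from it.
listing = ["push ebp", "mov ebp, esp", "mov eax, [ebp+8]", "add eax, 0x10", "ret"]
tokens = [normalize_instruction(line) for line in listing]
pairs = list(skipgram_pairs(tokens, window=1))
```

Collapsing addresses and constants shrinks the vocabulary, which is one plausible reading of "reduce complexity" in the abstract. The resulting token sequences could then be fed to any word embedding trainer before the CNN and BiLSTM classification stage.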

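Yang Pin, Pan Yuelei, Jia Peng, Liu Liang. Malware Detection Based on Word Embedding Features of Assembly Instruction [J]. 信息安全研究 (Journal of Information Security Research), 2020, 6(2): 113-121.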
