Analysis on the Homology of Malware Families Based on Open-Set Recognition

Journal of Information Security Research, 2023, Vol. 9, Issue (8): 762-.


Liu Yaqian (刘亚倩)

(Beijing Topsec Network Security Technology Co., Ltd., Beijing 100085)

Publication date: 2023-08-01. Release date: 2023-09-05.
Corresponding author: Liu Yaqian, master's degree. Main research interests: machine learning and network security. 2395744091@qq.com

Analysis on the Homology of Malware Families Based on Open-Set Recognition


Abstract

Abstract: Current methods for analyzing the homology of malware families mostly treat the task as a closed-set classification problem, that is, they assume that every sample under test belongs to one of the known families. In real environments, however, malware families are numerous and unknown families usually account for the majority, so closed-set recognition cannot accurately identify the families encountered in practice. To address this problem, this paper proposes a homology analysis method for malware families based on open-set recognition. Malware executable files are converted into grayscale images using an N-gram sliding window and the Doc2vec sentence embedding method; features of the grayscale images are extracted with the convolutional neural network model MobileNet; and the Open Long-Tailed Recognition model is used to perform open-set recognition of malware families. In experiments on 9 known and 9 unknown malware families, the proposed method identifies samples from unknown families while maintaining high accuracy on both known and unknown families.
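The pipeline summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the paper's implementation: the abstract does not give the N-gram size, the embedding dimension, or the image size, so `n`, `stride`, and `side` below are assumptions, and all function names are hypothetical. The Doc2vec embedding and the MobileNet feature extractor are not reproduced; the sketch assumes an embedding or feature vector is already available. The nearest-centroid distance threshold is a simple stand-in for the Open Long-Tailed Recognition model, used only to show what rejecting a sample as "unknown family" means.

```python
import math
from typing import Dict, List, Optional, Sequence


def byte_ngrams(data: bytes, n: int = 3, stride: int = 1) -> List[str]:
    """Slide a window of n bytes over the file and emit each window as a hex token.

    The resulting token list is the kind of "document" a sentence-embedding
    model such as Doc2vec could be trained on (n and stride are assumptions).
    """
    return [data[i:i + n].hex() for i in range(0, len(data) - n + 1, stride)]


def to_grayscale(vector: Sequence[float], side: int) -> List[List[int]]:
    """Min-max scale an embedding to 0..255 and reshape it into a side x side image."""
    if len(vector) != side * side:
        raise ValueError("vector length must equal side * side")
    lo, hi = min(vector), max(vector)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant vector
    pixels = [round(255 * (v - lo) / span) for v in vector]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


def classify_open_set(
    feature: Sequence[float],
    centroids: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Return the nearest known family, or None when the sample is too far from all of them.

    Stand-in for the open-set step: a plain Euclidean distance threshold,
    not the Open Long-Tailed Recognition model named in the abstract.
    """
    best_family, best_dist = None, math.inf
    for family, centroid in centroids.items():
        dist = math.dist(feature, centroid)
        if dist < best_dist:
            best_family, best_dist = family, dist
    return best_family if best_dist <= threshold else None
```

A closed-set classifier would always return the nearest family. The only change needed for open-set behavior in this sketch is the final threshold test, which lets a sample fall outside every known family; in practice the threshold would have to be tuned on held-out data.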

Liu Yaqian. Analysis on the Homology of Malware Families Based on Open-Set Recognition[J]. Journal of Information Security Research, 2023, 9(8): 762-.
