Research on Malware Detection Based on Word Embedding and Feature Fusion

Journal of Information Security Research, 2025, Vol. 11, Issue (5): 412-.

Shi Zhibin1, Sun Wenqi2, Dou Jianmin3, and Yu Mengyang1

1(School of Computer Science and Technology, North University of China, Taiyuan 030051)
2(Third Research Institute of Ministry of Public Security, Shanghai 200031)
3(North Navigation Control Technology Co., Ltd., Beijing 100176)

Online: 2025-06-03 Published: 2025-06-03
Corresponding author: Sun Wenqi, PhD, associate research fellow. Main research interest: cyberspace security. sunwenqi@gass.ac.cn
About the authors:
Shi Zhibin, PhD, associate professor. Main research interest: network security. 1637350520@qq.com
Sun Wenqi, PhD, associate research fellow. Main research interest: cyberspace security. sunwenqi@gass.ac.cn
Dou Jianmin, master's degree. Main research interest: network security. 2753863392@qq.com
Yu Mengyang, master's student. Main research interest: network security. yumychn@163.com

Abstract

Existing traditional methods are limited in feature extraction and representation, cannot capture the spatial semantic features and the temporal features of API sequences at the same time, and fail to capture the key features that determine the target task. To address these problems, a malware detection method based on word embedding and feature fusion is proposed, combining word embedding from natural language processing with multi-model feature extraction and feature fusion. First, word embedding is used to encode API sequences and obtain their semantic feature representations. Then, multiple convolutional networks and a BiLSTM network extract the n-gram local spatial features and the temporal features of the API sequences, respectively. Finally, a self-attention mechanism deeply fuses the captured features according to their critical positions, and classification is performed on the resulting characterization of deep malicious behavior. Experimental results show that in the binary classification task the method reaches an accuracy of 94.79%, an average improvement of 12.37% over traditional machine learning methods and of 5.78% over deep learning methods. In the multi-class classification task the method reaches an accuracy of 91.95%, showing that it effectively improves malware detection accuracy.

Key words: malware detection, software call sequence, multiple convolutional networks, long short-term memory network, feature fusion
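The data flow described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the API traces are hypothetical, the function names (`build_vocab`, `encode`, `ngram_windows`, `self_attention`) are invented for illustration, and the attention step is plain scaled dot-product attention without learned projections, which may differ from the form used in the paper. The sketch shows how API names become integer IDs for an embedding lookup, which windows a width-n convolution would slide over, and how attention re-weights a set of feature vectors.

```python
import math

def build_vocab(sequences):
    """Map each distinct API name to an integer ID; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for seq in sequences:
        for api in seq:
            vocab.setdefault(api, len(vocab))
    return vocab

def encode(seq, vocab):
    """Turn a list of API names into the ID sequence an embedding layer would index."""
    return [vocab[api] for api in seq]

def ngram_windows(ids, n):
    """All contiguous windows of n IDs, i.e. what a width-n convolution kernel sees."""
    return [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]

def self_attention(features):
    """Scaled dot-product self-attention with no learned projections (a simplification)."""
    d = len(features[0])
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        m = max(scores)                       # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # softmax over positions
        out.append([sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)])
    return out

# Hypothetical API call traces, for illustration only.
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "ReadFile", "WriteFile", "CloseHandle"],
]
vocab = build_vocab(traces)
ids = encode(traces[1], vocab)
trigrams = ngram_windows(ids, 3)
fused = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a full model the IDs would index a trained embedding matrix, the convolutional and BiLSTM branches would each produce feature vectors, and the attention weights would be learned; the toy vectors passed to `self_attention` here stand in for those branch outputs.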

CLC number: TP309

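Shi Zhibin, Sun Wenqi, Dou Jianmin, Yu Mengyang. Research on Malware Detection Based on Word Embedding and Feature Fusion[J]. Journal of Information Security Research, 2025, 11(5): 412-.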

