Journal of Information Security Research ›› 2025, Vol. 11 ›› Issue (5): 412-.

• Academic Papers •

Research on Malware Detection Based on Word Embedding and Feature Fusion

Shi Zhibin1, Sun Wenqi2, Dou Jianmin3, and Yu Mengyang1

  1 (School of Computer Science and Technology, North University of China, Taiyuan 030051)
  2 (Third Research Institute of Ministry of Public Security, Shanghai 200031)
  3 (North Navigation Control Technology Co., Ltd., Beijing 100176)

  • Online: 2025-06-03 Published: 2025-06-03
  • Corresponding author: Sun Wenqi, Ph.D., associate research fellow. Main research interest: cyberspace security. sunwenqi@gass.ac.cn
  • About the authors: Shi Zhibin, Ph.D., associate professor. Main research interest: network security. 1637350520@qq.com Sun Wenqi, Ph.D., associate research fellow. Main research interest: cyberspace security. sunwenqi@gass.ac.cn Dou Jianmin, master's degree. Main research interest: network security. 2753863392@qq.com Yu Mengyang, master's student. Main research interest: network security. yumychn@163.com

Abstract: Traditional methods are limited in feature extraction and representation, cannot capture the spatial semantic features and the temporal features of API sequences at the same time, and fail to capture the key features that determine the target task. To address these problems, a malware detection method based on word embedding and feature fusion is proposed, drawing on word embedding from natural language processing, multi-model feature extraction, and feature fusion. First, word embedding is used to encode API sequences and obtain their semantic feature representations. Then, multiple convolutional networks and a BiLSTM network extract the n-gram local spatial features and the temporal features of the API sequences, respectively. Finally, a self-attention mechanism deeply fuses the captured features with respect to key position information, and classification is performed by characterizing deep malicious behavior features. Experimental results show that in the binary classification task the method reaches an accuracy of 94.79%, an average improvement of 12.37% over traditional machine learning methods and of 5.78% over deep learning methods. In the multi-class classification task its accuracy reaches 91.95%, showing that the method can effectively improve malware detection accuracy.

Key words: malware detection, software call sequence, multiple convolutional networks, long short-term memory network, feature fusion

CLC Number:
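
The pipeline summarized in the abstract (embedding, parallel n-gram convolution and BiLSTM branches, self-attention fusion, classification) can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation. Everything concrete in it is an assumption made for illustration: the API names in `vocab`, the dimensions, the n-gram sizes (2, 3, 4), the bias-free LSTM cell, the attention without learned projections, the mean pooling, and the random weights that stand in for trained parameters.

```python
import math
import random

def rand_matrix(rng, rows, cols, scale=0.1):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def ngram_conv(seq, filters, n):
    """Per-position ReLU response of each filter over n consecutive embeddings."""
    dim = len(seq[0])
    padded = seq + [[0.0] * dim] * (n - 1)  # right zero-padding keeps the length
    out = []
    for t in range(len(seq)):
        window = [x for vec in padded[t:t + n] for x in vec]
        out.append([max(0.0, dot(f, window)) for f in filters])
    return out

def lstm_pass(seq, weights, hidden):
    """One-directional LSTM without bias terms; returns the hidden state per step."""
    h, c, outs = [0.0] * hidden, [0.0] * hidden, []
    for x in seq:
        gates = [dot(row, x + h) for row in weights]  # 4 * hidden pre-activations
        i = [sigmoid(v) for v in gates[:hidden]]
        f = [sigmoid(v) for v in gates[hidden:2 * hidden]]
        o = [sigmoid(v) for v in gates[2 * hidden:3 * hidden]]
        g = [math.tanh(v) for v in gates[3 * hidden:]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outs.append(h)
    return outs

def self_attention(feats):
    """Scaled dot-product self-attention with Q = K = V = feats (no learned projections)."""
    scale = math.sqrt(len(feats[0]))
    fused, all_weights = [], []
    for q in feats:
        w = softmax([dot(q, k) / scale for k in feats])
        fused.append([sum(wj * v[d] for wj, v in zip(w, feats)) for d in range(len(q))])
        all_weights.append(w)
    return fused, all_weights

def init_params(vocab_size, dim=8, n_filters=3, hidden=4, ngrams=(2, 3, 4), seed=0):
    rng = random.Random(seed)
    return {
        "embed": rand_matrix(rng, vocab_size, dim),
        "conv": {n: rand_matrix(rng, n_filters, n * dim) for n in ngrams},
        "lstm_fwd": rand_matrix(rng, 4 * hidden, dim + hidden),
        "lstm_bwd": rand_matrix(rng, 4 * hidden, dim + hidden),
        "out": rand_matrix(rng, 1, n_filters * len(ngrams) + 2 * hidden)[0],
    }

def detect(api_calls, vocab, params, hidden=4):
    # Step 1: embed each API call; index 0 is reserved for unknown names.
    emb = [params["embed"][vocab.get(name, 0)] for name in api_calls]
    # Step 2a: n-gram convolution branches (local spatial features).
    conv = [ngram_conv(emb, params["conv"][n], n) for n in sorted(params["conv"])]
    # Step 2b: forward and backward LSTM passes (temporal features).
    fwd = lstm_pass(emb, params["lstm_fwd"], hidden)
    bwd = lstm_pass(emb[::-1], params["lstm_bwd"], hidden)[::-1]
    # Step 3: concatenate both branches per position, then fuse with self-attention.
    feats = [sum((branch[t] for branch in conv), []) + fwd[t] + bwd[t]
             for t in range(len(emb))]
    fused, weights = self_attention(feats)
    # Step 4: mean-pool over positions and apply a logistic output unit.
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    return sigmoid(dot(params["out"], pooled)), weights

# Hypothetical vocabulary and call trace, for illustration only.
vocab = {"<unk>": 0, "CreateFileW": 1, "WriteFile": 2,
         "RegSetValueExW": 3, "CreateRemoteThread": 4}
params = init_params(len(vocab))
calls = ["CreateFileW", "WriteFile", "RegSetValueExW", "CreateRemoteThread", "WriteFile"]
score, attn = detect(calls, vocab, params)
print(f"untrained score in (0, 1): {score:.3f}")
```

Because the weights are random, the printed score carries no meaning. The sketch only shows how data flows through the four stages and that each row of `attn` is a probability distribution over sequence positions. A practical version would use a deep learning framework and train all parameters end to end.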