Journal of Information Security Research ›› 2023, Vol. 9 ›› Issue (7): 687-.

Research on Vulnerability Text Feature Classification Technology Based on BERT

  

  • Online: 2023-07-01 Published: 2023-07-01

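Du Lin1,2, Xu Chuanqi1

  1. (National Computer Network Emergency Response Technical Team/Coordination Center of China, Tianjin Branch, Tianjin 300100)
  2. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044)
  • Corresponding author: Du Lin, master's student, engineer. Main research interests: network security and data mining. dulin@cert.org.cn
  • About the authors: Du Lin, master's student, engineer. Main research interests: network security and data mining. dulin@cert.org.cn Xu Chuanqi, master's degree, senior engineer. Main research interests: information security and social informatics. xuchuanqi@126.com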

Abstract: As informatization advances and networked applications multiply, many software and hardware products are affected by various types of cybersecurity vulnerabilities. Vulnerability analysis and management often require analysts to manually classify large volumes of vulnerability intelligence texts. To determine the category of the vulnerability described in such a text efficiently and accurately, this paper proposes a cybersecurity vulnerability classification model based on BERT (Bidirectional Encoder Representations from Transformers). First, a vulnerability classification dataset is constructed, and a pretrained model represents each vulnerability intelligence text as a feature vector. A classifier then assigns a category to each feature vector. Finally, the classification performance is evaluated on a test set. In the experiments, TextCNN, TextRNN, TextRNN_Att, fastText, and the proposed model are used to classify 48,000 vulnerability intelligence texts containing vulnerability descriptions. The results show that the proposed model achieves the highest scores on all classification evaluation metrics on the test set, and that it can be applied effectively to cybersecurity vulnerability classification tasks to reduce manual workload.

Key words: natural language processing systems, cybersecurity, feature extraction, classifier, deep learning
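
The last two stages described in the abstract (feature vector to category, then evaluation on a test set) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional feature vectors stand in for BERT outputs, and the linear-plus-softmax classifier, the weights, the two class labels, and the choice of accuracy and macro F1 as metrics are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature, weights, bias):
    """Linear layer followed by softmax; returns the most probable class index."""
    logits = [sum(w * x for w, x in zip(row, feature)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / num_classes

# Hypothetical stand-ins for pretrained-model feature vectors (assumption).
FEATURES = [[2.0, 0.1, 0.0], [0.0, 0.3, 1.5], [1.0, 0.0, 0.2], [0.2, 0.0, 0.1]]
LABELS = [0, 1, 0, 1]  # hypothetical vulnerability categories (assumption)

# Hypothetical classifier parameters; in practice these would be learned.
WEIGHTS = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
BIAS = [0.0, 0.0]

predictions = [classify(f, WEIGHTS, BIAS) for f in FEATURES]
accuracy = sum(p == t for p, t in zip(predictions, LABELS)) / len(LABELS)
score = macro_f1(LABELS, predictions, num_classes=2)
print(predictions, accuracy, round(score, 4))
```

In a real pipeline the feature vectors would come from the pretrained encoder and the classifier parameters would be trained on the labeled dataset; the sketch only fixes them by hand so that the prediction and scoring steps can be followed end to end.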
