Journal of Information Security Research ›› 2024, Vol. 10 ›› Issue (4): 294-.


Research on Source Code Vulnerability Detection Based on BERT Model

Luo Leqi, Zhang Yanshuo, Wang Zhiqiang, Wen Jin, and Xue Peiyang

  1. (Beijing Electronic Science and Technology Institute, Beijing 100070)

  • Online:2024-04-20 Published:2024-04-21

  • Corresponding author: Wang Zhiqiang, PhD, associate professor, master's supervisor. His main research interests include AI security, vulnerability discovery, and malware detection. wangzq@besti.edu.cn
  • About the authors:
  Luo Leqi, master's student. Main research interests: vulnerability mining, network attack and defense. 20211909@mail.besti.edu.cn
  Zhang Yanshuo, PhD, associate professor, master's supervisor, CCF senior member. Main research interests: cryptographic theory and its applications. zhang_yanshuo@163.com
  Wang Zhiqiang, PhD, associate professor, master's supervisor. Main research interests: AI security, vulnerability discovery, malware detection. wangzq@besti.edu.cn
  Wen Jin, master's student. Main research interests: AI security, action recognition. 1065253065@qq.com
  Xue Peiyang, master's student. Main research interests: cyberspace security and information security. 20212905@mail.besti.edu.cn

Abstract: Techniques such as code metrics, machine learning, and deep learning are commonly employed in source code vulnerability detection. However, these techniques suffer from problems such as failing to retain the syntactic and semantic information of the source code and requiring extensive expert knowledge to define vulnerability features. To address these problems, this paper proposes a source code vulnerability detection model based on BERT (bidirectional encoder representations from transformers). The model splits the source code under detection into multiple small samples, converts each small sample into a form approximating natural language, automatically extracts vulnerability features from the source code through the BERT model, and then trains a well-performing vulnerability classifier to detect multiple types of vulnerabilities in the Python language. The model achieves an average detection accuracy of 99.2%, precision of 97.2%, recall of 96.2%, and F1 score of 96.7% across vulnerability types, a performance improvement of 2% to 14% over existing vulnerability detection methods. The experimental results show that the model is a general, lightweight, and scalable vulnerability detection method.
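The abstract does not specify the exact preprocessing pipeline, but the step of "converting each small sample into a form approximating natural language" can be illustrated with a minimal sketch using Python's standard `tokenize` module: lexing a code snippet into a flat, space-separated token sequence that a BERT-style subword tokenizer could then consume. The function name and the choice of which token types to drop are assumptions for illustration, not the authors' actual method.

```python
import io
import tokenize

def code_to_pseudo_nl(source: str) -> str:
    """Flatten a Python snippet into a space-separated token sequence,
    approximating the 'natural language' form fed to a BERT tokenizer.
    (Hypothetical preprocessing; the paper's exact pipeline is not given.)"""
    # Token types that carry layout rather than content are dropped.
    skip = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT)
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

# A small sample containing a classic command-injection pattern:
sample = "import os\nos.system(user_input)"
print(code_to_pseudo_nl(sample))
# → import os os . system ( user_input )
```

Sequences like this would then be labeled by vulnerability type and used to fine-tune a BERT classifier, one small sample per training instance.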

Key words: vulnerability detection, deep learning, Python language, BERT model, natural language processing

