Research on Source Code Vulnerability Detection Based on Abstract Syntax Tree Compression Coding

Abstract

Source code vulnerability detection methods based on abstract syntax trees struggle to fully extract syntactic and structural features from large syntax trees, which leads to insufficient vulnerability characterization and low detection accuracy. To address this problem, a source code vulnerability detection method based on abstract syntax tree compressed coding (ASTCC) is proposed. First, the abstract syntax tree of a program is split into a group of subtrees, one per code statement, and each subtree is encoded by a recursive neural network to extract the syntactic information within the statement. Next, each subtree in the original syntax tree is replaced with its encoded node, which reduces the depth and the number of leaf nodes of the tree while retaining its structural features. Finally, a tree-based convolutional neural network with an attention mechanism is used to detect source code vulnerabilities. Experimental results on the public NVD and SARD datasets show that ASTCC reduces the size of the abstract syntax tree, enhances the model's ability to represent source code vulnerabilities, and effectively improves vulnerability detection accuracy.

Key words: vulnerability detection, abstract syntax tree, tree-based convolutional neural network, attention mechanism
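
The compression step described in the abstract (split the tree into statement subtrees, encode each one, then substitute the encoded node back into the tree) can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `is_statement` flag, the toy `embed` function, and the averaging encoder are all assumptions made for illustration, and a trained recursive neural network would take the place of `encode`.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

DIM = 4  # hypothetical embedding size, chosen only for this sketch


@dataclass
class Node:
    kind: str                                   # node type, e.g. "If", "Assign", "ID"
    children: List["Node"] = field(default_factory=list)
    is_statement: bool = False                  # True at the root of a statement subtree
    vector: Optional[List[float]] = None        # set only on encoded (compressed) nodes


def embed(kind: str) -> List[float]:
    """Deterministic toy embedding; stands in for a learned lookup table."""
    seed = sum(ord(c) for c in kind)
    return [math.sin(seed * (i + 1)) for i in range(DIM)]


def encode(node: Node) -> List[float]:
    """Bottom-up encoding: tanh(embed(node) + mean of the child encodings)."""
    vec = embed(node.kind)
    if node.children:
        child_vecs = [encode(c) for c in node.children]
        for i in range(DIM):
            vec[i] += sum(cv[i] for cv in child_vecs) / len(child_vecs)
    return [math.tanh(x) for x in vec]


def compress(node: Node) -> Node:
    """Replace every statement subtree with a single encoded leaf node."""
    if node.is_statement:
        return Node(kind="Enc:" + node.kind, vector=encode(node))
    return Node(node.kind, [compress(c) for c in node.children])


def depth(node: Node) -> int:
    return 1 + max((depth(c) for c in node.children), default=0)


def leaves(node: Node) -> int:
    return 1 if not node.children else sum(leaves(c) for c in node.children)


# A toy tree for: int x = 0; if (x < 10) x = x + 1;
tree = Node("FuncDef", [
    Node("Compound", [
        Node("Decl", [Node("ID"), Node("Constant")], True),
        Node("If", [
            Node("BinaryOp", [Node("ID"), Node("Constant")], True),
            Node("Assign", [
                Node("ID"),
                Node("BinaryOp", [Node("ID"), Node("Constant")]),
            ], True),
        ]),
    ]),
])

small = compress(tree)
print(depth(tree), leaves(tree))    # 6 7
print(depth(small), leaves(small))  # 4 3
```

In this toy tree the structural nodes (`FuncDef`, `Compound`, `If`) survive, while each statement collapses into one vector-carrying leaf, so both depth and leaf count drop. The compressed tree would then be the input to the tree-based convolutional network with attention.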

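Chen Chuantao, Pan Limin, Luo Senlin. Vulnerability Detection Method Based on Abstract Syntax Tree Compression Coding [J]. Journal of Information Security Research, 2022, 8(1): 35-.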
