Research on Data-Driven Cybersecurity Risk Incident Prediction Technology

Abstract

Abstract: The frequent occurrence of large-scale cybersecurity risk incidents has sounded an alarm for security researchers. As the variety and number of incidents grow, the way industry and academia understand and defend against cyber threats has been shifting from reactive detection toward proactive prediction. Proactive prediction, which uses the features of historical cybersecurity incident data to anticipate potential security risks in a network, is considered to have great potential for improving cyber resilience. In recent years, research institutions have begun to propose data-driven methods and techniques for cybersecurity risk incident prediction, which mine the correlations between cybersecurity incidents and multi-dimensional network features and use machine learning and deep learning algorithms to predict potential incidents. This paper focuses on the background, definition, and key technologies of cybersecurity risk incident prediction. In addition, because data imbalance is a major barrier to data-driven cybersecurity incident prediction, methods for addressing this problem are discussed.

Key words: cybersecurity, risk incident prediction, feature engineering, model training, model evaluation, dataset imbalance

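Kong Bin, Lv Qiujian, Wu Zhengrong. Research on data-driven cybersecurity risk incident prediction technology[J]. Journal of Information Security Research, 2019, 5(6): 477-487.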
