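Android Malware Detection and Malicious Behavior Analysis Based on Semi-supervised Learning

Journal of Information Security Research (信息安全研究), 2018, Vol. 4, Issue 3: 242-250.

Du Wei (杜炜), Li Jian (李剑)

School of Computer Science, Beijing University of Posts and Telecommunications

Received: 2018-03-21  Online: 2018-03-15  Published: 2018-03-21
Corresponding author: Du Wei
About the authors: Du Wei, born in 1992, is a master's student whose main research areas are information security and machine learning. Li Jian, born in 1976, holds a Ph.D. and is a professor and doctoral supervisor whose main research areas are intelligent network security and quantum cryptography.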

Abstract

To better detect Android malware and analyze its malicious behavior, this paper proposes a semi-supervised learning approach to Android malware detection and malicious behavior analysis. First, 16,179 benign Android applications and 31,964 Android malware samples were collected. The samples were then decompiled, and permissions, services, and sensitive APIs were extracted as static features; the dynamic analysis tool DroidBox was used to extract seven kinds of dynamic features. A malware family reflects the malicious behavior of its members, but different families may share the same malicious behavior. The 20 most prevalent malware families in the sample data were therefore examined through manual analysis and cluster analysis, which yielded five categories of malicious behavior. Only these 20 families were labeled with a malicious behavior; samples from all other families remain unlabeled. To make full use of the data, the paper proposes a semi-supervised learning algorithm named Co-RFGBDT, which combines the advantages of Random Forest and GBDT. Retraining with Co-RFGBDT on the labeled samples together with the unlabeled samples reaches an overall accuracy of 91.56%. Because new malicious behaviors keep emerging, the paper identifies unknown malicious behavior by setting a confidence threshold. Finally, compared with the baseline experiment, overall accuracy improves by 2%, which demonstrates that the proposed Co-RFGBDT semi-supervised learning algorithm performs better in this scenario.

Key words: Android, malware detection, Random Forest, semi-supervised learning, GBDT

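Du Wei, Li Jian. Android Malware Detection and Malicious Behavior Analysis Based on Semi-supervised Learning[J]. Journal of Information Security Research, 2018, 4(3): 242-250.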
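
The abstract names two mechanisms without giving their internals: two learners that exploit unlabeled samples together, and a confidence threshold that flags unknown malicious behavior. The following is a minimal sketch of a generic co-training-style loop with threshold-based rejection, not the published Co-RFGBDT algorithm. Everything in it is an assumption made for illustration: `CentroidClassifier` is a toy stand-in for Random Forest and GBDT that keeps the example dependency-free, the labels `behavior_A` and `behavior_B` are hypothetical, and the threshold of 0.9 is arbitrary.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


class CentroidClassifier:
    """Toy stand-in learner (NOT Random Forest or GBDT): softmax over
    negative distances to class centroids. `power` is a hypothetical knob
    that makes two instances behave differently."""

    def __init__(self, power: float) -> None:
        self.power = power
        self.centroids: Dict[str, List[float]] = {}

    def fit(self, X: Sequence[Vector], y: Sequence[str]) -> None:
        grouped: Dict[str, List[Vector]] = {}
        for x, label in zip(X, y):
            grouped.setdefault(label, []).append(x)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }

    def predict_proba(self, x: Vector) -> Dict[str, float]:
        scores = {
            label: -sum(abs(a - b) ** self.power for a, b in zip(x, c))
            for label, c in self.centroids.items()
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {label: math.exp(s - top) for label, s in scores.items()}
        total = sum(exps.values())
        return {label: e / total for label, e in exps.items()}


def most_confident(probs: Dict[str, float]) -> Tuple[str, float]:
    label = max(probs, key=lambda k: probs[k])
    return label, probs[label]


def co_train(a, b, X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Each round, a learner's confident pseudo-labels extend the OTHER
    learner's training set. Returns the samples that stayed unlabeled."""
    train_a = (list(X_lab), list(y_lab))
    train_b = (list(X_lab), list(y_lab))
    pool = list(X_unlab)
    for _ in range(rounds):
        a.fit(*train_a)
        b.fit(*train_b)
        remaining = []
        for x in pool:
            label_a, conf_a = most_confident(a.predict_proba(x))
            label_b, conf_b = most_confident(b.predict_proba(x))
            if conf_a >= threshold:  # learner a teaches learner b
                train_b[0].append(x)
                train_b[1].append(label_a)
            if conf_b >= threshold:  # learner b teaches learner a
                train_a[0].append(x)
                train_a[1].append(label_b)
            if conf_a < threshold and conf_b < threshold:
                remaining.append(x)
        if len(remaining) == len(pool):  # nothing new was pseudo-labeled
            break
        pool = remaining
    a.fit(*train_a)  # final fit on the enlarged training sets
    b.fit(*train_b)
    return pool


def predict_behavior(a, b, x, threshold=0.9) -> str:
    """Average both learners; below the threshold, report unknown behavior."""
    pa, pb = a.predict_proba(x), b.predict_proba(x)
    avg = {label: (pa[label] + pb[label]) / 2 for label in pa}
    label, conf = most_confident(avg)
    return label if conf >= threshold else "unknown"


# Hypothetical two-feature samples; real inputs would be the static and
# dynamic feature vectors described in the abstract.
X_lab = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (9.0, 10.0)]
y_lab = ["behavior_A", "behavior_A", "behavior_B", "behavior_B"]
X_unlab = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0)]

learner_a, learner_b = CentroidClassifier(power=1), CentroidClassifier(power=2)
leftover = co_train(learner_a, learner_b, X_lab, y_lab, X_unlab)
print(predict_behavior(learner_a, learner_b, (0.5, 0.5)))
print(predict_behavior(learner_a, learner_b, (5.0, 5.0)))
```

In this toy run the sample midway between the two classes never reaches the threshold, so it stays in the unlabeled pool and is reported as `unknown` at prediction time. In a realistic setting the two stand-ins would presumably be replaced by actual Random Forest and GBDT models exposing class probabilities, and the threshold would be tuned on held-out data.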

