Journal of Information Security Research ›› 2018, Vol. 4 ›› Issue (3): 242-250.

• Academic Papers •

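Android Malware Detection and Analysis of Malware Behavior Based on Semi-supervised Learning

Du Wei (杜炜), Li Jian (李剑)

  1. School of Computer Science, Beijing University of Posts and Telecommunications
  • Received: 2018-03-21 Online: 2018-03-15 Published: 2018-03-21
  • Corresponding author: Du Wei
  • About the authors: Du Wei, born in 1992, is a master's student whose main research interests are information security and machine learning. Li Jian, born in 1976, holds a Ph.D. and is a professor and doctoral supervisor whose main research interests are intelligent network security and quantum cryptography.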

Abstract: To better detect Android malware and analyze its malicious behavior, this paper proposes an approach to Android malware detection and malicious behavior analysis based on semi-supervised learning. First, 16,179 benign Android applications and 31,964 Android malware samples were collected. The applications were then decompiled, and permissions, services, and sensitive APIs were extracted as static features; the dynamic analysis tool DroidBox was used to extract seven kinds of dynamic features. A malware family reflects the malicious behavior of its members, but different families may exhibit the same malicious behavior. Therefore, the 20 most prevalent malware families in the sample data were examined by manual analysis and clustering analysis, which yielded five categories of malicious behavior. Because only these 20 families were labeled with a behavior category, the samples from all other families remain unlabeled. To make full use of the data, this paper proposes a semi-supervised learning algorithm named Co-RFGBDT, which combines the advantages of Random Forest and GBDT. Retraining with Co-RFGBDT on the labeled samples together with the unlabeled samples gives an overall accuracy of 91.56%. Since new malicious behaviors keep emerging, a confidence threshold is also set so that unknown malicious behaviors can be identified. Compared with the baseline experiment, the overall accuracy improves by 2%, which shows that the proposed Co-RFGBDT semi-supervised learning algorithm performs better in this scenario.

Key words: Android, Malware detection, Random Forest, Semi-supervised learning, GBDT
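
The abstract names two mechanisms without giving details: semi-supervised retraining with two learners (Co-RFGBDT) and a confidence threshold for flagging unknown malicious behavior. The sketch below shows one plausible reading and should not be taken as the authors' algorithm. It assumes a co-training-style scheme in which each learner hands its confident pseudo-labels to the other, and it assumes that a sample whose top class probability falls below the threshold is reported as unknown. The function names, the `UNKNOWN` label, and the default thresholds are all hypothetical.

```python
UNKNOWN = "unknown"  # hypothetical label for low-confidence samples


def confident_labels(model, X):
    """Return one (label, confidence) pair per row of X."""
    pairs = []
    for row in model.predict_proba(X):
        row = list(row)
        best = max(range(len(row)), key=row.__getitem__)
        pairs.append((model.classes_[best], row[best]))
    return pairs


def co_train(model_a, model_b, X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Let two learners exchange confident pseudo-labels (assumed scheme)."""
    pool_a = (list(X_lab), list(y_lab))  # training set of model_a
    pool_b = (list(X_lab), list(y_lab))  # training set of model_b
    remaining = list(X_unlab)
    for _ in range(rounds):
        model_a.fit(*pool_a)
        model_b.fit(*pool_b)
        if not remaining:
            break
        guesses_a = confident_labels(model_a, remaining)
        guesses_b = confident_labels(model_b, remaining)
        still_unlabeled = []
        for x, (lab_a, conf_a), (lab_b, conf_b) in zip(remaining, guesses_a, guesses_b):
            if conf_a >= threshold:  # model_a teaches model_b
                pool_b[0].append(x)
                pool_b[1].append(lab_a)
            if conf_b >= threshold:  # model_b teaches model_a
                pool_a[0].append(x)
                pool_a[1].append(lab_b)
            if conf_a < threshold and conf_b < threshold:
                still_unlabeled.append(x)
        if len(still_unlabeled) == len(remaining):
            break  # nothing was confident enough; stop early
        remaining = still_unlabeled
    model_a.fit(*pool_a)  # final fit on the enlarged pools
    model_b.fit(*pool_b)
    return model_a, model_b


def predict_with_rejection(model, X, threshold=0.6):
    """Predict a class per sample, or UNKNOWN below the confidence threshold."""
    return [lab if conf >= threshold else UNKNOWN
            for lab, conf in confident_labels(model, X)]
```

The loop only relies on `fit`, `predict_proba`, and `classes_`, so scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` could be passed as `model_a` and `model_b`. How the paper actually combines the two learners and chooses its threshold is not stated in the abstract.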