Journal of Information Security Research ›› 2022, Vol. 8 ›› Issue (8): 777-.

  • Online: 2022-08-08    Published: 2022-08-08

An Android Malware Detection Framework Based on Instruction Sequence Embedding

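Sun Caijun (孙才俊)¹, Bai Bing (白冰)¹, Wang Weizhong (王伟忠)², He Nengqiang (何能强)³, Wang Zhiyu (王之宇)¹, Sun Tianning (孙天宁)¹, Zhang Yipeng (张奕鹏)¹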

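  1. Research Institute of Intelligent Networks, Zhejiang Lab, Hangzhou 311121
  2. China Academy of Industrial Internet, Beijing 100102
  3. Zhejiang Branch, National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC), Hangzhou 310052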
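  • Corresponding author: Wang Weizhong, PhD, senior engineer. Research interests: industrial internet security, Internet of Vehicles security, and cryptography-related technologies. wangweizhong@china-aii.com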
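  • About the authors:
    Sun Caijun, PhD, assistant researcher. Research interests: network security, mobile security, and mimic defense. sun.cj@zhejianglab.com
    Bai Bing, PhD, assistant researcher. Research interests: cyberspace security and artificial intelligence. baibing@zhejianglab.com
    Wang Weizhong, PhD, senior engineer. Research interests: industrial internet security, Internet of Vehicles security, and cryptography-related technologies. wangweizhong@chinaaii.com
    He Nengqiang, PhD, senior engineer. Research interests: network security emergency response and data security assessment. hnq@cert.org.cn
    Wang Zhiyu, PhD, assistant researcher. Research interests: artificial intelligence and network security. wangzhy@zhejianglab.com
    Sun Tianning, master's degree, assistant engineer. Research interests: network security. tiannings@zhejianglab.com
    Zhang Yipeng, master's degree, engineer. Research interests: malicious code detection and binary security. z1pwn@protonmail.com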

Abstract: With the rapid growth of mobile applications and their users, application security has become a primary concern for users. Malware variants targeting the Android platform continue to multiply, creating an urgent need for efficient and effective detection methods that keep the Android app platform secure and reliable. To address this need, we present ISEDroid, a lightweight Android malware detection framework based on instruction sequence embedding (ISE). ISEDroid extracts instruction execution sequences from the Dalvik code fragments of Android apps; these sequences represent all executable and traceable paths of the malware at runtime. It then transforms each instruction sequence into a low-dimensional numerical vector using embedding methods from natural language processing, and generates a semantic summary of the sample's code behavior using average pooling. Finally, we evaluate different machine learning algorithms, adjust the dimension of the embedded vectors, and tune various hyperparameters to obtain the best classification performance. Extensive experiments show that the proposed method accurately identifies Android malware, achieving an F1 score of 0.952.
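
The embedding and average-pooling steps described in the abstract can be illustrated with a minimal Python sketch. The opcode names are real Dalvik mnemonics, but the embedding table, its values, and the function names are hypothetical and chosen only for illustration; this is not the authors' implementation, and the paper's actual embedding model and pooling granularity are not given in the abstract.

```python
from typing import Dict, List, Sequence

# Hypothetical toy embedding table: each Dalvik opcode maps to a 3-d vector.
# Real vectors would be learned by an embedding model; these values are made up.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "const-string": [1.0, 0.0, 0.0],
    "invoke-virtual": [0.0, 1.0, 0.0],
    "move-result": [0.0, 0.0, 1.0],
    "return-void": [1.0, 1.0, 1.0],
}


def embed_sequence(opcodes: Sequence[str],
                   table: Dict[str, List[float]]) -> List[List[float]]:
    """Look up the vector of each opcode, skipping opcodes missing from the table."""
    return [table[op] for op in opcodes if op in table]


def average_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of equal-length vectors (empty list if none)."""
    if not vectors:
        return []
    dim = len(vectors[0])
    count = len(vectors)
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def summarize_app(sequences: Sequence[Sequence[str]],
                  table: Dict[str, List[float]]) -> List[float]:
    """Pool the vectors of all opcodes in all sequences into one fixed-length vector."""
    all_vectors = [vec for seq in sequences for vec in embed_sequence(seq, table)]
    return average_pool(all_vectors)


if __name__ == "__main__":
    app_sequences = [
        ["const-string", "invoke-virtual"],
        ["move-result", "return-void"],
    ]
    print(summarize_app(app_sequences, TOY_EMBEDDINGS))
```

The resulting fixed-length vector is the kind of input a standard classifier could consume regardless of how many instructions an app contains, which is the practical appeal of pooling in a pipeline like this.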


Key words: Android malware detection, NLP, word embedding, paragraph embedding, Doc2vec

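Abstract (translated from the Chinese): As mobile applications and their users grow in number, the security of mobile applications has become the primary concern of all stakeholders. Malware variants on the Android platform are increasing, and efficient and effective malware detection methods are urgently needed to safeguard the security and reliability of mobile applications. To solve this problem, a lightweight Android malware detection framework, ISEDroid, is proposed based on instruction sequence embedding (ISE). ISEDroid extracts instruction execution sequences from the Dalvik code fragments of Android applications to represent all executable and traceable paths of the malware at runtime. It then converts the instruction sequences into low-dimensional numerical vectors using embedding methods from natural language processing, and generates a semantic summary of the sample's code behavior with the average pooling algorithm. Finally, different machine learning algorithms are evaluated, the embedding dimension of the instruction fragments is adjusted, and various machine learning hyperparameters are optimized to obtain the best classification performance. Extensive experiments show that the proposed method accurately identifies Android malware and achieves an F1 score of 0.952.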

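Key words (translated from the Chinese): Android malware detection, natural language processing, word embedding, paragraph embedding, Doc2vec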