Journal of Information Security Research (信息安全研究) ›› 2026, Vol. 12 ›› Issue (2): 151-.

• Academic Paper •

基于大语言模型的钓鱼邮件检测技术研究

袁斌1,2, 杨克涵1, 邹德清1, 刘勇3,4, 张乾坤1

  1. 华中科技大学网络空间安全学院, 武汉 430074
  2. 嵩山实验室, 郑州 452470
  3. 中关村实验室, 北京 100190
  4. 奇安信集团股份有限公司, 北京 100044
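  • Publication date: 2026-02-07 Release date: 2026-01-28
  • Corresponding author: Zhang Qiankun, Ph.D., associate research fellow. Research interests: artificial intelligence security and online dynamic programming. qiankun@hust.edu.cn
  • About the authors:
    Yuan Bin, Ph.D., associate professor. Research interests: logic vulnerability detection and Internet of Things security. yuanbin@hust.edu.cn
    Yang Kehan, M.S. Research interests: protocol security and artificial intelligence security. 2577261040@qq.com
    Zou Deqing, Ph.D., professor. Research interests: software security and cloud security. deqingzou@hust.edu.cn
    Liu Yong, Ph.D., research fellow. Research interests: cloud security, network security, and data security. liuyong@zgclab.edu.cn
    Zhang Qiankun, Ph.D., associate research fellow. Research interests: artificial intelligence security and online dynamic programming. qiankun@hust.edu.cn
  • Funding:
    National Natural Science Foundation of China (62372191); Natural Science Foundation of Hubei Province (2023AFB258); Songshan Laboratory Project (241110210200)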

Research on Phishing Email Detection Based on Large Language Model

Yuan Bin1,2, Yang Kehan1, Zou Deqing1, Liu Yong3,4, and Zhang Qiankun1   

  1. School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074
  2. Songshan Laboratory, Zhengzhou 452470
  3. Zhongguancun Laboratory, Beijing 100190
  4. Qi An Xin Technology Group Inc, Beijing 100044
  • Online:2026-02-07 Published:2026-01-28

Abstract: With the rapid growth in phishing email volume and the continuous evolution of adversarial techniques, traditional phishing email detection methods face serious challenges in both efficiency and accuracy. To address the low detection rates, high false-negative rates, and poor human-computer interaction of existing systems, the authors propose a phishing email detection method based on a large language model (LLM). They comprehensively analyze the key features of phishing emails (header fields, body content, URLs, QR codes, attachments, and HTML pages) and use feature insertion algorithms to construct a high-quality training dataset. Building on the pretrained LLaMA model, they apply low-rank adaptation (LoRA) fine-tuning, which transfers domain knowledge while updating only 0.72% of the model parameters (approximately 50 MB), to obtain a phishing email detection model. Experimental results show that, compared with traditional methods, the LLM-based approach significantly improves detection accuracy and robustness, reaching an overall accuracy of 94.5%. It also effectively reduces the false-positive rate, improves the classification and interpretation of phishing email features, and provides a more practical and reliable solution for phishing email detection.
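
The claim that LoRA updates only a small share of the model's parameters can be illustrated with a short calculation. The sketch below counts the parameters that rank-r adapters add to a set of linear layers and reports them as a fraction of the whole model. Every concrete value in it (hidden size, layer count, rank, which projections are adapted, base model size, 16-bit storage) is a hypothetical assumption for illustration; the abstract does not state the configuration behind its 0.72% and 50 MB figures, and this sketch does not attempt to reproduce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Shape of one frozen weight matrix W (d_out x d_in) that receives an adapter."""
    d_in: int
    d_out: int


def lora_params(shape: LinearShape, rank: int) -> int:
    """LoRA learns W + B @ A, with A of shape (rank x d_in) and B of shape (d_out x rank)."""
    return rank * (shape.d_in + shape.d_out)


def trainable_fraction(adapted: list[LinearShape], rank: int, base_params: int) -> float:
    """Share of all parameters (frozen base plus adapters) that is actually trained."""
    added = sum(lora_params(s, rank) for s in adapted)
    return added / (base_params + added)


# Hypothetical configuration, NOT taken from the paper.
HIDDEN = 4096                  # assumed hidden size
LAYERS = 32                    # assumed number of transformer blocks
RANK = 8                       # assumed LoRA rank
BASE_PARAMS = 7_000_000_000    # assumed size of the frozen base model

# Assume two square projections per block receive adapters.
adapted = [LinearShape(HIDDEN, HIDDEN)] * (2 * LAYERS)

added = sum(lora_params(s, RANK) for s in adapted)
fraction = trainable_fraction(adapted, RANK, BASE_PARAMS)
size_mib = added * 2 / 2**20   # 2 bytes per parameter at 16-bit precision

print(f"adapter parameters: {added:,}")
print(f"trainable fraction: {fraction:.4%}")
print(f"adapter size at 16-bit precision: {size_mib:.1f} MiB")
```

Raising the rank or adapting more projection matrices increases both the trainable fraction and the adapter file size linearly, which is the trade-off a LoRA configuration has to settle.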

Key words: phishing email, large language model, pretrained language model, low-rank adaptation, fine-tuning

CLC number: