Journal of Information Security Research (信息安全研究), 2024, Vol. 10, Issue 3: 233-.

• Research Article •

A Chinese Spam Detection Method Based on Content and ERNIE3.0-CapsNet

Shan Chenling (单晨棱)1, Zhang Xinyou (张新有)1,2, Xing Huanlai (邢焕来)1,2, and Feng Li (冯力)2

  1. Tangshan Graduate School, Southwest Jiaotong University, Tangshan, Hebei 063000
  2. School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756

  • Online: 2024-03-23 Published: 2024-03-08
  • Corresponding author: Zhang Xinyou, PhD, associate professor. Main research interests: distributed computing and applications, network security. xyzhang@swjtu.edu.cn
  • About the authors: Shan Chenling, master's student. Main research interests: natural language processing, network security. sclinghy@163.com; Zhang Xinyou, PhD, associate professor. Main research interests: distributed computing and applications, network security. xyzhang@swjtu.edu.cn; Xing Huanlai, PhD, associate professor. Main research interests: artificial intelligence, network security. hxx@home.swjtu.edu.cn; Feng Li, PhD, professor. Main research interests: artificial intelligence, network security. fengli@swjtu.edu.cn

Abstract: Deep-learning methods for Chinese spam detection suffer from inadequate word-vector representations and limited richness of extracted features. To address these problems, this paper proposes ERNIE3.0-CapsNet, an improved recognition model that combines the ERNIE3.0 pre-trained model with a capsule neural network. For the text content of Chinese spam emails, ERNIE3.0 generates a semantically rich word-vector matrix that draws on the model's strong capacity for memorizing and reasoning over knowledge; a capsule neural network then performs feature extraction and classification. The structure of the capsule network is improved, and GELU is adopted as the activation function in its dynamic routing. Comparative experiments were designed covering five groups of similar models and four groups of activation functions. On the open-source TREC06C Chinese email dataset, ERNIE3.0-CapsNet performs strongly overall, reaching an accuracy of 99.45%. The results show that ERNIE3.0-CapsNet outperforms methods such as ERNIE3.0-TextCNN and ERNIE3.0-RNN, demonstrating the model's effectiveness for Chinese spam detection.

Key words: Chinese spam, ERNIE3.0, capsule neural network, activation function, text classification
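
The abstract names dynamic routing with a GELU activation but does not spell out how the two are combined. The following is a minimal sketch of routing-by-agreement under one possible reading: GELU is applied element-wise to each output capsule's weighted sum, in place of the usual squash function. That reading, the function names (`gelu_vector`, `dynamic_routing`, `predict`), the three routing iterations, and the toy input are all assumptions for illustration, not the authors' implementation. The ERNIE3.0 embedding step and the learned transformation matrices are omitted; `u_hat` stands in for the prediction vectors they would produce.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_vector(s):
    """Assumed variant: GELU applied element-wise to a capsule vector."""
    return [gelu(x) for x in s]


def squash(s):
    """Standard capsule squashing non-linearity, kept for comparison."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, activation=gelu_vector, iterations=3):
    """Routing-by-agreement.

    u_hat[i][j] is the prediction vector from input capsule i to output
    capsule j. Returns one activated vector per output capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    v = []
    for _ in range(iterations):
        # Coupling coefficients: each input capsule distributes over outputs.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(activation(s_j))
        # Agreement update: raise the logit where prediction and output align.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


def predict(u_hat, activation=gelu_vector):
    """Pick the output capsule with the longest vector (0 = ham, 1 = spam here)."""
    v = dynamic_routing(u_hat, activation)
    lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
    return lengths.index(max(lengths))


if __name__ == "__main__":
    # Hypothetical toy input: 2 input capsules, 2 output capsules, 2 dimensions.
    toy = [[[0.1, 0.0], [1.0, 1.0]],
           [[0.0, 0.1], [1.0, 0.9]]]
    print(predict(toy), predict(toy, activation=squash))
```

Passing `activation=squash` gives the conventional routing for comparison. Note that with element-wise GELU the output length is no longer bounded below 1, so the lengths are compared directly rather than read as probabilities.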
