Research on Security Event Entity Recognition Based on Chinese Pre-training

Journal of Information Security Research, 2021, Vol. 7, Issue 7: 652-660.

Zhu Lei 1,2, Dong Linjing 1, Hei Xinhong 1,2, Wang Yichuan 1,2, Peng Wei 1, Liu Yanxiao 1, Pan Long 3

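1 (Xi'an University of Technology, Xi'an 710048)
2 (Shaanxi Key Laboratory for Network Computing and Security Technology, Xi'an University of Technology, Xi'an 710048)
3 (Shenzhen Tencent Computer Systems Co., Ltd., Shenzhen, Guangdong 518054)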

Online: 2021-07-09 Published: 2021-07-08
Corresponding author: Zhu Lei
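About the authors:
Zhu Lei, PhD, lecturer. Main research interest: data processing.
Dong Linjing, master's student. Main research interests: data processing and data security. 2191221084@stu.xaut.edu.cn
Hei Xinhong, professor, doctoral supervisor. Main research interests: rail transit informatization and data security. heixinhong@xaut.edu.cn
Liu Yanxiao, PhD, associate professor. Main research interests: image secret sharing and information hiding. liuyanxiao@xaut.edu.cn
Wang Yichuan, PhD, associate professor. Main research interests: blockchain and network security. chuan@xaut.edu.cn
Peng Wei, master's student. Main research interests: data processing and data security. 154396224@qq.com
Pan Long, engineer. Main research interest: system security. hydrapn@tencent.com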

Research on Chinese Named Entity Recognition Method Based on Pre-training Model for Public Safety Events

Online:2021-07-09 Published:2021-07-08

Summary/Abstract

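Summary: To improve the efficiency of Chinese named entity recognition in public safety events, this paper studies the Chinese Emergency Corpus (《中文突发事件语料库》) and, by optimizing the pre-training task and applying transfer learning on the training set, proposes a public safety event entity recognition method based on domain pre-training. First, the pre-trained model RoBERTa is optimized: the security-domain dictionary is updated for data augmentation, and the single-character masking mechanism for Chinese is replaced with whole-word masking, so that the model captures domain entity features and semantic information in public safety events. Next, domain pre-training is carried out on 100,000 online news items, producing RoBERTa+, a pre-trained model for the public safety domain that strengthens the downstream named entity recognition task. Finally, a bidirectional long short-term memory network (BiLSTM) extracts contextual features from the corpus text, and a conditional random field (CRF) performs sequence decoding and labeling, completing Chinese named entity recognition for the public safety domain. Experimental results show that on the Chinese Emergency Corpus the improved model reaches an average accuracy above 87%, with recall and F1 both above 80%, which demonstrates that domain pre-training can effectively improve the recognition of entity information in public safety events.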
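The difference between single-character masking and whole-word masking can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pre-segmented example sentence, and the masking probability are assumptions made for illustration, and the sketch only substitutes `[MASK]`, whereas real masked-language-model pre-training usually applies further replacement rules.

```python
import random

MASK = "[MASK]"

def char_level_mask(chars, mask_prob, rng):
    """Single-character masking: each character is masked independently."""
    return [MASK if rng.random() < mask_prob else c for c in chars]

def whole_word_mask(words, mask_prob, rng):
    """Whole-word masking: all characters of a selected word are masked together.

    `words` is a sentence that has already been segmented into words
    (hypothetically by a segmenter that uses the domain dictionary).
    """
    out = []
    for word in words:
        if rng.random() < mask_prob:
            out.extend([MASK] * len(word))   # mask every character of the word
        else:
            out.extend(list(word))           # keep the word intact
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical segmented sentence: "firefighters / arrive at / fire / scene"
    words = ["消防员", "赶到", "火灾", "现场"]
    print(char_level_mask(list("".join(words)), 0.3, rng))
    print(whole_word_mask(words, 0.3, rng))
```

With character-level masking a word such as 消防员 can be partly hidden, so the model may recover it from the remaining characters; whole-word masking hides the full word and forces the model to use the surrounding context.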

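Keywords: public safety events, Chinese entity recognition, domain pre-training, bidirectional long short-term memory network, conditional random field, RoBERTa pre-trained language model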

Abstract: To improve the efficiency of Chinese named entity recognition in public safety events, we study the "Chinese Emergency Corpus" and propose a novel named entity recognition model based on domain-adaptive pre-training, obtained by optimizing the pre-training subtasks and applying transfer learning on domain datasets. First, the dictionary of the pre-trained model RoBERTa is updated with terms from public safety events, and the single-character masking subtask of the Chinese RoBERTa model is replaced with Chinese whole-word masking, which lets the model learn more grammatical and semantic information about public safety events. Then, the model is further pre-trained on an unlabeled corpus of 100k online news items to strengthen its ability to identify downstream named entities, producing RoBERTa+, a Chinese pre-trained model for public security. A bidirectional long short-term memory network (BiLSTM) is employed to extract contextual features, and the entities are finally recognized by sequence decoding with a conditional random field (CRF). Experimental results show that the proposed model reaches an accuracy of 87% and a recall and F1 score of 81%, which indicates that domain-adaptive pre-training has considerable potential for natural language processing tasks.
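The CRF decoding step mentioned in the abstract is commonly carried out with the Viterbi algorithm. The following is a minimal sketch under stated assumptions, not the paper's code: the tag set, the emission scores (standing in for BiLSTM outputs), and the transition scores are all made-up illustrative values.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]:   score of tag j at position t (e.g. produced by a BiLSTM).
    transitions[i][j]: score of moving from tag i to tag j (learned by the CRF).
    """
    n_tags = len(tags)
    score = list(emissions[0])      # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[i] for i in path]

if __name__ == "__main__":
    tags = ["O", "B-LOC", "I-LOC"]            # hypothetical BIO tag set
    emissions = [[1.0, 0.9, 0.0],
                 [0.2, 0.0, 1.0],
                 [1.0, 0.0, 0.1]]
    transitions = [[0.0, 0.0, -1e4],          # O -> I-LOC is strongly penalized
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]]
    print(viterbi_decode(emissions, transitions, tags))
```

In this toy input, picking the highest emission at each position independently would give O, I-LOC, O, which is not a valid BIO sequence. The transition penalty steers the decoder to B-LOC, I-LOC, O instead, which is the kind of label consistency a CRF layer adds on top of per-token scores.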

Key words: public security, Chinese named entity recognition, domain pre-training, BiLSTM, CRF, pre-trained language model RoBERTa

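Zhu Lei, Dong Linjing, Hei Xinhong, Wang Yichuan, Peng Wei, Liu Yanxiao, Pan Long. Research on Chinese Named Entity Recognition Method Based on Pre-training Model for Public Safety Events[J]. Journal of Information Security Research, 2021, 7(7): 652-660.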
