Private Information Extraction Algorithm Incorporating Prior Structural Knowledge

Journal of Information Security Research ›› 2024, Vol. 10 ›› Issue (2): 139-.

Zhao Yuyuan1, Wang Bin2, Zhang Zedan2, Li Qingshan3, and Hu Jianbin4

1(School of Software and Microelectronics, Peking University, Beijing 102627)
2(Chinese Medicine Data Center, China Academy of Chinese Medical Sciences, Beijing 100700)
3(Boya RegChain Beijing Inc., Beijing 100037)
4(School of Computer Science, Peking University, Beijing 100871)

Online: 2024-02-21 Published: 2024-02-26

Original title (Chinese): 融入结构先验知识的隐私信息抽取算法

Authors (Chinese): 赵玉媛1, 王斌2, 张泽丹2, 李青山3, 胡建斌4

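About the authors:
Zhao Yuyuan, master's degree. Research interests: natural language processing and data mining. zhaoyuyuan@pku.edu.cn
Wang Bin, research fellow and master's supervisor, deputy director of the Chinese Medicine Data Center, China Academy of Chinese Medical Sciences. Research interests: Chinese medicine informatics, and methods for the collection, aggregation, quality control, and sharing and use of clinical research data. gam_wb@hotmail.com
Zhang Zedan, master's degree. Research interests: data sharing and utilization methods. zzdnj75@163.com
Li Qingshan, PhD. Research interests: network security and blockchain technology. liqs@pku.edu.cn
Hu Jianbin, associate professor. Research interests: network security. hujianbin@pku.edu.cn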

Abstract

As data anonymization technology continues to advance, accurately identifying private data has become a key challenge. Current privacy information extraction algorithms rely mainly on natural language processing techniques such as bidirectional recurrent neural networks and attention-based pretrained language models (for example, BERT and its variants). The strong contextual feature representations of these models overcome the limitations of earlier methods in representing polysemous words, but their accuracy in determining entity boundaries still leaves room for improvement. This study proposes a privacy information extraction algorithm that integrates structural prior knowledge through a structural knowledge enhancement mechanism for privacy data. The mechanism improves the model's understanding of sentence semantic structure and thereby the accuracy of privacy information boundary determination. The model is evaluated on multiple public datasets, and a detailed analysis of the experimental results demonstrates its effectiveness.
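To make "entity boundary determination" concrete, the following is a minimal sketch that assumes a standard BIO sequence-labeling setup; the abstract does not state the tagging scheme, so this is an illustration and not the authors' algorithm. The function names `bio_to_spans` and `boundary_errors`, and the example tags, are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a BIO tag sequence into (start, end, type) spans."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def boundary_errors(gold: List[Span], pred: List[Span]) -> List[Span]:
    """Predicted spans with the right type that overlap a gold span
    but do not match its boundaries exactly."""
    return [
        p for p in pred
        if p not in gold
        and any(p[2] == g[2] and p[0] < g[1] and g[0] < p[1] for g in gold)
    ]


# Hypothetical example: the prediction truncates a two-token name.
gold_tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-LOC", "I-LOC"]
pred_tags = ["O", "B-NAME", "O", "O", "O", "B-LOC", "I-LOC"]
errors = boundary_errors(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
```

Here the predicted name span covers only the first of two tokens, so it has the right entity type but the wrong boundary. Under exact-match evaluation such a span counts as an error, and a de-identification system that made it would leave part of the name exposed.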

Keywords: structural prior knowledge, structural enhancement mechanism, privacy information extraction algorithm, entity boundary determination, data desensitization, natural language processing

CLC Number: TP309.2

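Cite this article: Zhao Yuyuan, Wang Bin, Zhang Zedan, Li Qingshan, Hu Jianbin. Private Information Extraction Algorithm Incorporating Prior Structural Knowledge[J]. Journal of Information Security Research, 2024, 10(2): 139-.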
