Journal of Information Security Research ›› 2024, Vol. 10 ›› Issue (8): 706-.


A Differential Privacy Text Desensitization Method for Enhancing Semantic Consistency

Guan Yeli1, Luo Senlin1, Pan Limin1, Zhang Ji1, and Yu Jingwei2

  1(School of Information and Electronics, Beijing Institute of Technology, Beijing 100081)
  2(North Institute for Scientific and Technical Information, Beijing 100089)

  • Online: 2024-08-08 Published: 2024-08-08

强化语义一致性的差分隐私文本脱敏方法

关业礼1, 罗森林1, 潘丽敏1, 张笈1, 于经纬2


  


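  • Corresponding author: Guan Yeli, master's student. Main research interests: natural language processing and data privacy. bit1120161210@163.com
  • About the authors:
    Guan Yeli, master's student. Main research interests: natural language processing and data privacy. bit1120161210@163.com
    Luo Senlin, Ph.D., professor, doctoral supervisor. Main research interests: machine learning, medical data mining, and information security. luosenlin@bit.edu.cn
    Pan Limin, M.S., senior experimentalist. Main research interests: data mining and image processing, natural language processing, and machine learning. panlimin2016@gmail.com
    Zhang Ji, M.S., associate professor. Main research interests: network security, data mining, text security, and media security. Zhangji@bit.edu.cn
    Yu Jingwei, M.S., assistant researcher. Main research interests: data mining and image processing. 13811887504@163.com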

Abstract: Text desensitization is an important privacy protection technique, but balancing its privacy protection effect against semantic consistency with the original text remains difficult. Existing differential privacy desensitization methods select a substitute for each sensitive word using a similarity-based probability calculation. This can easily yield substitutes that are semantically inconsistent with, or even unrelated to, the original text, which seriously degrades how well the desensitized text preserves the original meaning. This paper proposes a differential privacy text desensitization method that enhances semantic consistency. It introduces a truncated distance metric that adjusts the selection probability of replacement words and restricts semantically irrelevant replacements. Experiments on real datasets show that the method effectively improves the semantic consistency between the desensitized text and the original text and has considerable practical value.

Key words: text desensitization, differential privacy protection, semantic consistency enhancement, word embedding, inference attack
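
The abstract does not give the truncated distance formula itself, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes Euclidean distance over toy word embeddings, selection weights of the form exp(-epsilon * d / 2), and a hypothetical cutoff `tau` beyond which a candidate receives zero weight. All names (`TOY_EMBEDDINGS`, `selection_probs`, `sample_substitute`, `tau`) are invented for this example, and the sketch makes no claim about the formal privacy guarantee of the truncated variant.

```python
import math
import random

# Toy 2-D embeddings; real systems would use pretrained word vectors (assumption).
TOY_EMBEDDINGS = {
    "doctor": (1.0, 0.0),
    "physician": (0.9, 0.1),
    "nurse": (0.8, 0.3),
    "banana": (-1.0, 2.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def selection_probs(word, embeddings, epsilon, tau=None):
    """Return {candidate: probability} for replacing `word`.

    Each candidate gets weight exp(-epsilon * d / 2), where d is its
    distance to `word`. If `tau` is given (hypothetical truncation
    threshold), candidates with d > tau get weight 0, so semantically
    distant words can no longer be selected.
    """
    target = embeddings[word]
    weights = {}
    for cand, vec in embeddings.items():
        d = euclidean(target, vec)
        if tau is not None and d > tau:
            weights[cand] = 0.0
        else:
            weights[cand] = math.exp(-epsilon * d / 2.0)
    # The word itself has d = 0 and weight 1, so the total is never zero.
    total = sum(weights.values())
    return {cand: w / total for cand, w in weights.items()}


def sample_substitute(word, embeddings, epsilon, tau=None, rng=random):
    """Draw one substitute word according to the selection probabilities."""
    probs = selection_probs(word, embeddings, epsilon, tau)
    cands = list(probs)
    return rng.choices(cands, weights=[probs[c] for c in cands], k=1)[0]


if __name__ == "__main__":
    for tau in (None, 0.5):
        print(tau, selection_probs("doctor", TOY_EMBEDDINGS, epsilon=1.0, tau=tau))
```

Without `tau`, the unrelated word "banana" keeps a nonzero chance of being chosen as the substitute for "doctor"; with `tau=0.5` its probability drops to zero and the remaining mass is redistributed over the nearby candidates. A hard cutoff is the simplest choice for illustration; a softer penalty on distant candidates would be another option.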


CLC Number: