Journal of Information Security Research ›› 2023, Vol. 9 ›› Issue (10): 980-.

Optimization and Application of Text Semantic Similarity Analysis Model Under Small Dataset

Dong Bo and Luo Senlin   

  1. (School of Information and Electronics, Beijing Institute of Technology, Beijing 100081)
  • Online: 2023-10-17  Published: 2023-10-28




Abstract: Data usage compliance is a key part of data security governance, and its main research focus is text traceability and intellectual property protection through text semantic similarity analysis. To address the limited availability of public data, a contrastive learning framework is introduced. However, the objective functions commonly used in contrastive learning contain an operator that couples positive and negative samples, which causes severe gradient decay during backpropagation. In addition, a small dataset provides few training batches, so the model struggles to converge to a local optimum. This paper proposes a contrastive learning method for text semantic similarity analysis on small datasets. The method computes the partial derivatives of the contrastive learning objective function with respect to the positive and negative samples during backpropagation and eliminates the common factor operator they share, which suppresses gradient decay and speeds up model convergence. Experimental results on public datasets show that the method improves training efficiency and the quality of text semantic similarity analysis on small datasets.
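The abstract does not state the objective function, so the following is only an illustrative sketch, not the authors' implementation. It assumes the widely used InfoNCE form of the contrastive loss for a single anchor, with cosine-style similarity scores and a temperature `tau`; the function names and the example values are hypothetical. Under that assumption, the gradients with respect to the positive and the negative similarities share one common factor, and dropping the positive term from the denominator (a decoupled variant) removes it.

```python
import math


def info_nce_loss(s_pos, s_negs, tau=0.1):
    """Assumed InfoNCE loss for one anchor; the positive term sits in the denominator."""
    z = math.exp(s_pos / tau) + sum(math.exp(s / tau) for s in s_negs)
    return -s_pos / tau + math.log(z)


def info_nce_grads(s_pos, s_negs, tau=0.1):
    """Analytic gradients of info_nce_loss w.r.t. each similarity score."""
    e_pos = math.exp(s_pos / tau)
    e_negs = [math.exp(s / tau) for s in s_negs]
    neg_sum = sum(e_negs)
    # Common factor shared by the positive and negative gradients.
    # It approaches 0 once the positive pair is already well separated.
    coupling = neg_sum / (e_pos + neg_sum)
    g_pos = -coupling / tau
    g_negs = [coupling * e / neg_sum / tau for e in e_negs]
    return g_pos, g_negs, coupling


def decoupled_loss(s_pos, s_negs, tau=0.1):
    """Decoupled variant (an assumption): positive term removed from the denominator."""
    return -s_pos / tau + math.log(sum(math.exp(s / tau) for s in s_negs))


def decoupled_grads(s_pos, s_negs, tau=0.1):
    """Analytic gradients of decoupled_loss; the common factor no longer appears."""
    e_negs = [math.exp(s / tau) for s in s_negs]
    neg_sum = sum(e_negs)
    g_pos = -1.0 / tau
    g_negs = [e / neg_sum / tau for e in e_negs]
    return g_pos, g_negs


# Hypothetical similarity scores: one easy positive and three negatives.
s_pos, s_negs = 0.9, [0.1, 0.0, -0.2]
g_pos, g_negs, coupling = info_nce_grads(s_pos, s_negs)
d_pos, d_negs = decoupled_grads(s_pos, s_negs)
print("common factor:", coupling)
print("InfoNCE positive gradient:", g_pos)
print("decoupled positive gradient:", d_pos)
```

In this sketch every InfoNCE gradient equals the decoupled gradient multiplied by the common factor, so a small factor shrinks the whole update. That is one plausible reading of the gradient decay the abstract describes, and it matters more when a small dataset offers few batches to recover from weak updates.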

Key words: similarity analysis, data security governance, data usage compliance, contrastive learning, limited data

