Journal of Information Security Research ›› 2026, Vol. 12 ›› Issue (5): 410-.


Research on Large Model Security Assessment Technology Based on Group Polarization Nested Jailbreak Templates

Wang Hongjie1, Sun Peiqi1, Du Yanhui1, and Liu Nan2   

  1. Institute of Information and Network Security, People’s Public Security University of China, Beijing 100038
  2. National Engineering Research Center for Cybersecurity Protection and Security Technology, Shanghai 201100
  • Online:2026-05-23 Published:2026-05-23

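  • Corresponding author: Du Yanhui, PhD, professor, doctoral supervisor. Main research interests: artificial intelligence and big data. dyh6889@126.com
  • About the authors: Wang Hongjie, master's student. Main research interests: artificial intelligence security and generative artificial intelligence. 860482975@qq.com. Sun Peiqi, master's student. Main research interests: Internet of Things security and machine learning. 2023211507@stu.ppsuc.edu.cn. Du Yanhui, PhD, professor, doctoral supervisor. Main research interests: artificial intelligence and big data. dyh6889@126.com. Liu Nan, master's degree, research intern. Main research interests: classified protection of cybersecurity, security protection of critical information infrastructure, and artificial intelligence security. liunan1@gass.ac.cn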

Abstract: As large models demonstrate excellent performance in natural language processing tasks, their security issues have become increasingly prominent. Jailbreak attacks bypass model security mechanisms, weaken value-alignment constraints, and induce models to generate harmful content. The resulting risks of model abuse, hijacking, and information leakage threaten the large language model application ecosystem. To evaluate large model security more comprehensively, a nested jailbreak template technique based on the group polarization psychological effect is proposed, which guides models to generate complex responses through progressively nested instructions. On this basis, the NesTHGA (nested template-hierarchical genetic algorithm) framework is constructed by integrating a hierarchical genetic algorithm. Experimental results show that the method achieves an average attack success rate of over 80% across 8 mainstream large models. Statistical tests confirm significant differences from existing methods, and ablation experiments verify the synergy among its components. Together, these results indicate that the method effectively evaluates the security and robustness of large models against complex attacks.

Key words: jailbreak attack, group polarization effect, nested instruction, hierarchical genetic algorithm, large model security assessment

