Research on Large Model Security Assessment Technology Based on Group Polarization Nested Jailbreak Templates


Abstract

Abstract: As large models demonstrate excellent performance on natural language processing tasks, their security issues have become increasingly prominent. Jailbreak attacks bypass a model's security mechanisms, weaken its value-alignment constraints, and induce it to generate harmful content. The resulting risks of model abuse, hijacking, and information leakage threaten the large language model application ecosystem. To evaluate large model security more comprehensively, this paper proposes a nested jailbreak template technique based on the group polarization effect from psychology, which guides models toward complex responses through progressively nested instructions. Building on this technique, the NesTHGA (nested template-hierarchical genetic algorithm) framework is constructed by integrating a hierarchical genetic algorithm. Experimental results show that the method achieves an average attack success rate above 80% across 8 mainstream large models; statistical tests confirm significant differences from existing methods, and ablation experiments verify the synergy among its components. The method thus effectively evaluates the security and robustness of large models against complex attacks.

Keywords: jailbreak attack, group polarization effect, nested instruction, hierarchical genetic algorithm, large model security assessment


CLC Number: TP18

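王红杰, 孙培淇, 杜彦辉, 刘楠. Research on Large Model Security Assessment Technology Based on Group Polarization Nested Jailbreak Templates[J]. Journal of Information Security Research, 2026, 12(5): 410-.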

