[1] Weidinger L, Mellor J, Rauh M, et al. Ethical and social risks of harm from language models[J]. Nature Machine Intelligence, 2023, 5(4): 277-291
[2] Li Nan, Ding Yidong, Jiang Haoyu, et al. A survey of jailbreak attacks on large language models[J]. Journal of Computer Research and Development, 2024, 61(5): 1156-1181 (in Chinese)
[3] Zhao Yue, He Jinwen, Zhu Shenchen, et al. Security of large language models: Current status and challenges[J]. Computer Science, 2024, 51(1): 68-71 (in Chinese)
[4] Subhash V, Bialas A, Pan W, et al. Why do universal adversarial attacks work on large language models: Geometry might be the answer[C]//Proc of the 2nd Workshop on New Frontiers in Adversarial Machine Learning. Menlo Park, CA: AAAI, 2023: 89-101
[5] Shin T, Razeghi Y, Logan IV R L, et al. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts[C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 4222-4235
[6] Zhang Yuqing. Security risks and privacy protection of artificial intelligence[J]. Journal of Information Security Research, 2023, 9(6): 498-499 (in Chinese)
[7] Liu F, Wang H, Chen Z, et al. JAILJUDGE: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework[C]//Proc of the 13th Int Conf on Learning Representations. Vienna: ICLR, 2025: 1203-1219
[8] Ding P, Kuang J, Ma D, et al. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily[C]//Proc of the 2024 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2024: 2136-2153
[9] Yao J, Yi X, Wang X, et al. From instructions to intrinsic human values: A survey of alignment goals for big models[J]. IEEE Trans on Technology and Society, 2023, 4(4): 354-367
[10] Xie Y, Yi J, Shao J, et al. Defending ChatGPT against jailbreak attack via self-reminders[J]. Nature Machine Intelligence, 2023, 5(12): 1486-1496
[11] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of Advances in Neural Information Processing Systems 30. Long Beach: NIPS Foundation, 2017: 5998-6008
[12] Wei A, Haghtalab N, Steinhardt J. Jailbroken: How does LLM safety training fail[J]. Advances in Neural Information Processing Systems, 2023, 36: 80079-80110
[13] Qiang Y. Hijacking large language models via adversarial in-context learning[D]. Detroit: Wayne State University, 2024
[14] Huang Y, Gupta S, Mohammed A, et al. Catastrophic jailbreak of open-source LLMs via exploiting generation[C]//Proc of the 12th Int Conf on Learning Representations. Vienna: ICLR, 2024: 567-583
[15] Yong Z X, Menghini C, Bach S H. Low-resource languages jailbreak GPT-4[J]. ACM Trans on Information Systems, 2024, 42(3): 1-28
[16] Chao P, Robey A, Chiang E, et al. Jailbreaking black box large language models in twenty queries[C]//Proc of the 41st Int Conf on Machine Learning. Vienna: ICML, 2024: 3864-3880
[17] Shah R, Feuillade-Montixi Q, Pour S, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation[J]. IEEE Trans on Neural Networks and Learning Systems, 2024, 35(4): 4712-4725
[18] Guo C, Sablayrolles A, Jégou H, et al. Gradient-based adversarial attacks against text transformers[C]//Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 5747-5757
[19] Jones E, Dragan A D, Raghunathan A, et al. Automatically auditing large language models via discrete optimization[C]//Proc of the 40th Int Conf on Machine Learning. New York: ACM, 2023: 15307-15329
[20] Zou A, Wang Z, Kolter J Z, et al. Universal and transferable adversarial attacks on aligned language models[C]//Proc of Advances in Neural Information Processing Systems 36. New Orleans: NIPS Foundation, 2023: 66321-66342
[21] Stoner J A F. A comparison of individual and group decisions involving risk[D]. Cambridge, MA: MIT Press, 1961
[22] Myers D G, Lamm H. The group polarization phenomenon[J]. Psychological Bulletin, 1976, 83(4): 602-627
[23] Aher G V, Arriaga R I, Kalai A T. Using large language models to simulate multiple humans and replicate human subject studies[C]//Proc of the 40th Int Conf on Machine Learning. New York: ACM, 2023: 337-371
[24] Yi S, Liu Y, Sun Z, et al. Jailbreak attacks and defenses against large language models: A survey[J]. ACM Computing Surveys, 2024, 56(5): 1-35
[25] Liu X, Xu N, Chen M, et al. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models[C]//Proc of the 12th Int Conf on Learning Representations. Vienna: ICLR, 2024: 732-748
[26] Chen Y, Gao H, Cui G, et al. Why should adversarial perturbations be imperceptible: Rethink the research paradigm in adversarial NLP[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 11222-11237
[27] Mehrotra A, Zampetakis M, Kassianik P, et al. Tree of attacks: Jailbreaking black-box LLMs automatically[C]//Proc of Advances in Neural Information Processing Systems 37. New Orleans: NIPS Foundation, 2024: 61065-61105
[28] Perez E, Huang S, Song F, et al. Red teaming language models with language models[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 3419-3448
[29] Wallace E, Feng S, Kandpal N, et al. Universal adversarial triggers for attacking and analyzing NLP[C]//Proc of the 2019 Conf on Empirical Methods in Natural Language Processing and the 9th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 2153-2162