[1] Yao J, Ning K, Liu Z, et al. LLM Lies: Hallucinations are not bugs, but features as adversarial examples[J]. arXiv preprint, arXiv:2310.01469, 2024
[2] Hughes S, Bae M, Li M, et al. Hallucination leaderboard[EB/OL]. 2025 [2025-09-30]. https://github.com/vectara/hallucination-leaderboard?tab=readme-ov-file
[3] Singh A, Schlesinger A, Fry A, et al. o3 and o4-mini system card[EB/OL]. 2025 [2025-09-30]. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
[4] Janet P, Freud S, Loftus E, et al. False memory[EB/OL]. Wikipedia, [2025-09-30]. https://en.wikipedia.org/wiki/False_memory
[5] Dan Y. The eyewitness memory effect[EB/OL]. MBA智库百科, [2025-09-30]. https://wiki.mbalib.com/wiki/证人的记忆效应
[6] Scheck B, Neufeld P. Innocence Project[EB/OL]. Wikipedia, [2025-09-30]. https://en.wikipedia.org/wiki/Innocence_Project
[7] Huang L, Yu W, Ma W, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[J]. ACM Trans on Information Systems, 2025, 43(2): 1-55
[8] Min S, Krishna K, Lyu X, et al. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation[C] //Proc of the Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 12076-12100
[9] Dhuliawala S, Komeili M, Xu J, et al. Chain-of-verification reduces hallucination in large language models[C] //Proc of Findings of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024: 3563-3578
[10] Fabbri A R, Wu C S, Liu W, et al. QAFactEval: Improved QA-based factual consistency evaluation for summarization[J]. arXiv preprint, arXiv:2112.08542, 2021
[11] Laban P, Kryściński W, Agarwal D, et al. LLMs as factual reasoners: Insights from existing benchmarks and beyond[J]. arXiv preprint, arXiv:2305.14540, 2023
[12] Adlakha V, BehnamGhader P, Lu X H, et al. Evaluating correctness and faithfulness of instruction-following models for question answering[J]. Trans of the Association for Computational Linguistics, 2024, 12: 681-699
[13] Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[C] //Proc of the 36th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2022: 24824-24837
[14] Nye M, Andreassen A J, Gur-Ari G, et al. Show your work: Scratchpads for intermediate computation with language models[J]. arXiv preprint, arXiv:2112.00114, 2021
[15] Li C, Liang J, Zeng A, et al. Chain of code: Reasoning with a language model-augmented code emulator[C] //Proc of the 41st Int Conf on Machine Learning. Cambridge, MA: JMLR, 2024: 28259-28277
[16] Chen W, Ma X, Wang X, et al. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks[J]. arXiv preprint, arXiv:2211.12588, 2022
[17] Gao L, Madaan A, Zhou S, et al. PAL: Program-aided language models[C] //Proc of the 40th Int Conf on Machine Learning. Cambridge, MA: JMLR, 2023: 10764-10799
[18] Wen J, Guan J, Wang H, et al. CodePlan: Unlocking reasoning potential in large language models by scaling code-form planning[J]. arXiv preprint, arXiv:2409.12452, 2024
[19] Taylor F. The Principles of Scientific Management[M]. New York: Harper and Brothers, 1911
[20] Gawande A. The Checklist Manifesto[M]. New York: Metropolitan Books, 2009
[21] Goldreich O. P, NP, and NP-Completeness: The Basics of Computational Complexity[M]. New York: Cambridge University Press, 2010
[22] Rafailov R, Sharma A, Mitchell E, et al. Direct preference optimization: Your language model is secretly a reward model[C] //Proc of the 37th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2023: 53728-53741
[23] Luo Y, Yang Z, Meng F, et al. An empirical study of catastrophic forgetting in large language models during continual fine-tuning[J]. IEEE Trans on Audio, Speech and Language Processing, 2025, 33: 3776-3786