Survey on Hallucinations in Large Language Models

Authors: 刘泽垣, 王鹏江, 宋晓斌, 张欣, 江奔奔

Funding: National Key R&D Program of China (2022YFE0197600)

    Abstract:

    With the development of deep learning technologies such as pre-trained models, represented by the Transformer, large language models (LLMs) have shown excellent comprehension and creativity. They not only have an important impact on downstream tasks such as abstractive summarization, dialogue generation, machine translation, and data-to-text generation, but also show promising applications in multimodal fields such as image captioning and visual storytelling. While LLMs offer significant performance advantages, deep learning-based LLMs are susceptible to hallucinations, which may reduce system performance and even seriously undermine the trustworthiness and broad applicability of LLMs. The accompanying legal and ethical risks have become the main obstacles to their further development and deployment. Therefore, this survey provides an extensive investigation and technical review of hallucinations in LLMs. First, the hallucinations in LLMs are systematically summarized, and their origins and causes are analyzed. Second, a systematic overview of hallucination evaluation and mitigation is provided, in which the evaluation and mitigation methods for different tasks are categorized and thoroughly compared. Finally, future challenges and research directions concerning hallucinations in LLMs are discussed from the perspectives of evaluation and mitigation.
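    One family of evaluation methods mentioned above checks a model's output against stochastically re-sampled outputs, on the intuition that hallucinated statements are poorly supported across samples (in the spirit of sampling-based self-consistency checks such as SelfCheckGPT). The following is a minimal, self-contained Python sketch of that idea; the unigram-overlap scoring function, the 0.5 threshold, and the example sentences are illustrative assumptions, not a method proposed in this survey (real systems use NLI models or QA-based scoring rather than lexical overlap).

```python
def consistency_score(sentence: str, samples: list[str]) -> float:
    """Mean unigram overlap between a candidate sentence and
    re-sampled model outputs (a crude proxy for support)."""
    tokens = set(sentence.lower().split())
    if not tokens or not samples:
        return 0.0
    overlaps = []
    for s in samples:
        sample_tokens = set(s.lower().split())
        overlaps.append(len(tokens & sample_tokens) / len(tokens))
    return sum(overlaps) / len(overlaps)


def flag_hallucinations(answer_sentences: list[str],
                        samples: list[str],
                        threshold: float = 0.5) -> list[str]:
    """Return the sentences that are poorly supported by the samples."""
    return [s for s in answer_sentences
            if consistency_score(s, samples) < threshold]


# Hypothetical example: three re-sampled answers about the same question.
samples = [
    "the eiffel tower is in paris and was completed in 1889",
    "the eiffel tower located in paris opened in 1889",
    "paris is home to the eiffel tower finished in 1889",
]
answer = [
    "the eiffel tower is in paris",      # consistent across samples
    "it was designed by thomas edison",  # unsupported claim
]
print(flag_hallucinations(answer, samples))
# → ['it was designed by thomas edison']
```

    The same two-stage shape (sample several times, then score each sentence against the samples) carries over directly when the overlap score is replaced by an entailment probability from an NLI model.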

How to Cite

刘泽垣, 王鹏江, 宋晓斌, 张欣, 江奔奔. Survey on hallucinations in large language models. 软件学报 (Journal of Software), 2025, 36(3): 1152–1185.
History
  • Received: 2024-01-23
  • Revised: 2024-05-03
  • Published online: 2024-12-10
Copyright: Institute of Software, Chinese Academy of Sciences. Email: jos@iscas.ac.cn