国家自然科学基金(62272100); 中国工程院院地合作项目(JS2021ZT05); 中国工程院咨询项目(2023-XY-09)
事实验证旨在检查一个文本陈述是否被给定的证据所支持. 由于表格结构上具有依赖性、内容上具有隐含性, 以表格作为证据的事实验证任务仍面临很多挑战. 现有工作或者利用逻辑表达式来解析基于表格证据的陈述, 或者设计表格感知神经网络来编码陈述-表格对, 以此实现基于表格的事实验证任务. 但是, 这些方法没有充分利用陈述背后隐含的表格信息, 从而导致模型的推理性能下降, 并且基于表格证据的中文陈述具有更加复杂的语法和语义, 也给模型推理带来更大的困难. 为此, 提出基于胶囊异构图注意力网络(CapsHAN)的中文表格型数据事实验证方法, 所提方法能充分理解陈述的结构和语义, 进而挖掘和利用陈述所隐含的表格信息, 有效提升基于表格的事实验证任务准确性. 具体而言, 首先通过对陈述进行依存句法分析和命名实体识别来构建异构图, 接着对该图采用异构图注意力网络和胶囊图神经网络进行学习和理解, 然后将得到的陈述文本表示与经过编码的表格文本表示进行拼接, 最后完成结果的预测. 更进一步, 针对现有中文表格型事实验证数据集匮乏而难以支持基于表格的事实验证方法性能评价的难题, 首先对主流TABFACT和INFOTABS表格事实验证英文数据集进行中文转化, 并且专门针对中文表格型数据的特点构建了基于UCL国家标准的数据集UCLDS, 该数据集将维基百科信息框作为人工注释的自然语言陈述的证据, 并被标记为蕴含、反驳或中立3类. UCLDS在同时支持单表和多表推理方面比传统TABFACT和INFOTABS数据集更胜一筹. 在上述3个中文基准数据集上的实验结果表明, 所提模型的表现均优于基线模型, 证明该模型在基于中文表格的事实验证任务上的优越性.
Fact verification is intended to check whether a textual statement is supported by a given piece of evidence. Due to the structural dependence and implicit content of tables, the task of fact verification with tables as the evidence still faces many challenges. Existing literature has either used logical expressions to parse statements based on tabular evidence or designed table-aware neural networks to encode statement-table pairs and thereby accomplish table-based fact verification tasks. However, these approaches fail to fully utilize the implicit tabular information behind the statements, which leads to the degraded inference performance of the model. Moreover, Chinese statements based on tabular evidence have more complex syntax and semantics, which also adds to the difficulties in model inference. For this reason, the study proposes a method of fact verification with Chinese tabular data based on the capsule heterogeneous graph attention network (CapsHAN). This method can fully understand the structure and semantics of statements. On this basis, the tabular information implied by the statements is mined and utilized to effectively improve the accuracy of table-based fact verification tasks. Specifically, a heterogeneous graph is constructed by performing syntactic dependency parsing and named entity recognition of statements. Subsequently, the graph is learned and understood by the heterogeneous graph attention network and the capsule graph neural network, and the obtained textual representation of the statements is sliced together with the textual representation of the encoded tables. Finally, the result is predicted. Further, this study also attempts to address the problem that the datasets of fact verification based on Chinese tables are scarce and thus unable to support the performance evaluation of table-based fact verification methods. For this purpose, the study transforms the mainstream English table-based fact verification datasets TABFACT and INFOTABS into Chinese and constructs a dataset that is based on the uniform content label (UCL) national standard and specifically tailored to the characteristics of Chinese tabular data. This dataset, namely, UCLDS, takes Wikipedia infoboxes as evidence of manually annotated natural language statements and labels them into three classes: entailed, contradictory, and neutral. UCLDS outperforms the traditional datasets TABFACT and INFOTABS in supporting both single-table and multi-table inference. The experimental results on the above three Chinese benchmark datasets show that the proposed model outperforms the baseline model invariably, demonstrating its superiority for Chinese table-based fact verification tasks.