Text and Table Numerical Question-answering Model Based on Multi-granularity Cell Contrast
CSTR:
Author:
Affiliation:

CLC Number: TP18

Fund Project: National Natural Science Foundation of China (62376120, 61936012, 62206126)

    Abstract:

    In numerical question answering over texts and tables, a model must perform numerical reasoning over a given text and table. The goal is to generate a computational program consisting of multi-step numerical calculations, and the result of executing the program is taken as the answer to the question. To model texts and tables, existing work linearizes the table into a series of cell sentences through templates and then designs a generator that produces the computational program from the text and the cell sentences. However, this approach faces a specific problem: the cell sentences generated by templates differ only slightly from one another, which makes it difficult for the generator to distinguish the cell sentences that are essential for answering the question (supporting cell sentences) from those that are irrelevant to it (distracting cell sentences). As a result, the model generates incorrect computational programs based on distracting cell sentences. To tackle this issue, this study proposes a multi-granularity cell semantic contrast (MGCC) approach for the generator. Its main purpose is to increase the distance between the representations of supporting and distracting cell sentences, thereby helping the generator tell them apart. Specifically, MGCC combines coarse-grained cell semantic contrast with fine-grained contrast of the constituent elements of cell semantics, namely row names, column names, and cell values. Experimental results show that MGCC enables the generator to outperform the baseline models on the FinQA and MultiHiertt numerical reasoning datasets. On FinQA, it improves answer accuracy by up to 3.38%. Notably, on the more challenging hierarchical table dataset MultiHiertt, it raises the generator's accuracy by 7.8%. Compared with the large language model GPT-3 combined with chain of thought (CoT), the MGCC-based generator improves answer accuracy by 5.44% and 1.69% on FinQA and MultiHiertt, respectively. Subsequent analytical experiments further verify that MGCC helps the generator distinguish supporting from distracting cell sentences.
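
    The template-based linearization described in the abstract, and the reason templated cell sentences are hard to tell apart, can be illustrated with a minimal sketch. The template string, the helper names `linearize_table` and `token_jaccard`, and the toy table values are all assumptions made for illustration; they are not taken from the paper, and token overlap is used here only as a rough stand-in for similarity between learned representations.

```python
from itertools import combinations


def linearize_table(header, rows):
    """Turn every table cell into one templated sentence.

    The template below is a hypothetical example, not the paper's own.
    """
    sentences = []
    for row in rows:
        row_name, values = row[0], row[1:]
        for col_name, value in zip(header[1:], values):
            sentences.append(f"the {row_name} of {col_name} is {value} ;")
    return sentences


def token_jaccard(a, b):
    """Jaccard overlap between the whitespace-token sets of two sentences."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


# Toy table; all names and numbers are invented for illustration.
header = ["item", "2019", "2020"]
rows = [["revenue", "100", "120"], ["net income", "10", "15"]]

cell_sentences = linearize_table(header, rows)
for sentence in cell_sentences:
    print(sentence)

# Pairwise overlap: sentences from the same row share most of their tokens,
# which hints at why a generator may confuse supporting and distracting cells.
for a, b in combinations(cell_sentences, 2):
    print(f"{token_jaccard(a, b):.2f} | {a} | {b}")
```

    In this toy case, two sentences from the same row differ only in the column name and the value, so most of their tokens coincide. A contrastive objective such as the one the abstract describes would aim to push the representations of such near-identical sentences apart when only one of them supports the answer.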

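Cite this article

琚江舟,毛云麟,吴震,陈宇飞,戴新宇,陈家骏. Text and Table Numerical Question-answering Model Based on Multi-granularity Cell Contrast. 软件学报 (Journal of Software),,():1-22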

History
  • Received: 2023-12-21
  • Last revised: 2024-03-01
  • Accepted:
  • Published online: 2024-06-20
  • Publication date: