Text and Table Numerical Question-answering Model Based on Multi-granularity Cell Contrast (多粒度单元格对比的文本和表格数值问答模型)

doi:10.13328/j.cnki.jos.007206

Journal of Software, 2025, Vol. 36, No. 5, pp. 2167-2187

CSTR: 32375.14.jos.007206
Authors: JU Jiang-Zhou (琚江舟), MAO Yun-Lin (毛云麟), WU Zhen (吴震), CHEN Yu-Fei (陈宇飞), DAI Xin-Yu (戴新宇), CHEN Jia-Jun (陈家骏)
Affiliation: State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023, China
CLC number: TP18
Funding: National Natural Science Foundation of China (62376120, 61936012, 62206126)

Abstract:

In numerical question answering over text and tables, a model must perform numerical reasoning over the given text and tables. The goal is to generate a computational program consisting of multi-step numerical calculations and to return the program's result as the answer to the question. To model text and tables jointly, existing work linearizes each table into a series of cell sentences through templates and then designs a generator that produces the computational program from the text and the cell sentences. However, this approach has a specific weakness: the cell sentences generated by templates differ only slightly from one another, so the generator struggles to distinguish the cell sentences needed to answer the question (supporting cell sentences) from those irrelevant to it (distracting cell sentences). As a result, the model generates incorrect computational programs based on distracting cell sentences. To address this issue, this study proposes a multi-granularity cell semantic contrast (MGCC) approach for the generator. Its main purpose is to increase the distance between the representations of supporting and distracting cell sentences, thereby helping the generator tell them apart. The approach combines coarse-grained cell semantic contrast with fine-grained contrast over the constituent elements of cell semantics, namely row names, column names, and cell values. Experimental results show that MGCC enables the generator to outperform the baseline model on the FinQA and MultiHiertt numerical reasoning datasets. On FinQA, it improves answer accuracy by up to 3.38%. On the more challenging hierarchical table dataset MultiHiertt, it raises the generator's accuracy by 7.8%. Compared with the large language model GPT-3 combined with chain of thought (CoT), the MGCC-based generator improves answer accuracy by 5.44% and 1.69% on FinQA and MultiHiertt, respectively. Follow-up analytical experiments further verify that MGCC helps the generator distinguish supporting cell sentences from distracting ones.

Key words: table and text learning; numerical question-answering model; multi-granularity contrastive learning
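The abstract's core observation, that template-generated cell sentences differ only slightly, can be illustrated with a minimal sketch. The template string, the function names `linearize_table` and `token_overlap`, and the toy table values below are all assumptions made for illustration; the abstract does not specify the paper's actual template or encoder.

```python
def linearize_table(header, rows):
    """Turn every data cell into one templated sentence.

    header: column names; header[0] labels the row-name column.
    rows:   lists whose first item is the row name, followed by cell values.
    The template below is a hypothetical one, not taken from the paper.
    """
    sentences = []
    for row in rows:
        row_name = row[0]
        for col_name, value in zip(header[1:], row[1:]):
            sentences.append(f"the {row_name} of {col_name} is {value} ;")
    return sentences


def token_overlap(a, b):
    """Jaccard overlap between the whitespace-token sets of two sentences."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Toy table with made-up values, for illustration only.
header = ["item", "2019", "2018"]
rows = [["revenue", "100", "90"], ["net income", "20", "15"]]

sentences = linearize_table(header, rows)
for s in sentences:
    print(s)

# Two cells from the same row share the template words and the row name,
# so their sentences differ only in the column name and the value.
print(token_overlap(sentences[0], sentences[1]))
```

If a question needs only the 2019 revenue, the 2018 sentence is a distractor that shares most of its tokens with the supporting sentence. This is the kind of near-duplicate pair that a contrast over row names, column names, and cell values is meant to separate.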

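Cite this article:

JU Jiang-Zhou, MAO Yun-Lin, WU Zhen, CHEN Yu-Fei, DAI Xin-Yu, CHEN Jia-Jun. Text and Table Numerical Question-answering Model Based on Multi-granularity Cell Contrast. Journal of Software (软件学报), 2025, 36(5): 2167-2187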

History

Received: 2023-12-21
Last revised: 2024-03-01
Published online: 2024-06-20
Publication date: 2025-05-06
