Interpretability of Entity Matching Based on Pre-trained Language Model

Authors: 梁峥, 王宏志, 戴加佳, 邵心玥, 丁小欧, 穆添愉

About the authors:

  • 梁峥 (1998-), male, Ph.D. candidate, CCF student member; research interests: data integration, entity resolution, and anomaly detection.
  • 邵心玥 (1996-), female, Ph.D. candidate; research interests: black-box algorithm interpretability and counterfactual explanation.
  • 王宏志 (1978-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member; research interests: database management systems, big data analytics and governance.
  • 丁小欧 (1993-), female, Ph.D., assistant professor, CCF professional member; research interests: data quality, data cleaning, and time series data management.
  • 戴加佳 (2000-), female, master's student; research interests: data quality and entity resolution.
  • 穆添愉 (1998-), male, Ph.D. candidate, CCF student member; research interests: automated machine learning, automatic model selection, and hyperparameter optimization.

Corresponding author:

王宏志, wangzh@hit.edu.cn

Funding:

National Key Research and Development Program of China (2021YFB3300502); National Natural Science Foundation of China (62232005, 62202126); CCF-Huawei Populus Grove Fund, Database Track (CCF-HuaweiDB202204); Heilongjiang Provincial Postdoctoral Foundation (LBH-Z21137)



Abstract:

Entity matching determines whether two records from different datasets refer to the same real-world entity, and it is indispensable for tasks such as big data integration, social network analysis, and web semantic data management. Pre-trained language models, deep learning techniques that have achieved great success in natural language processing and computer vision, have also outperformed traditional methods on entity matching tasks and have attracted considerable research attention. However, entity matching based on pre-trained language models suffers from unstable performance and unexplainable matching results, which brings great uncertainty to the application of this technology in big data integration. Moreover, existing interpretation methods for entity matching models are mainly model-agnostic methods designed for conventional machine learning models, and their applicability to pre-trained language models is limited. Therefore, taking BERT-based entity matching models such as Ditto and JointBERT as examples, this study proposes three interpretation methods for entity matching based on pre-trained language models: (1) since the serialization step is sensitive to the order of relational attributes, dataset meta-features and attribute similarity are used to generate attribute-order counterfactuals for misclassified samples; (2) as a complement to traditional attribute-importance measurement, the attention weights of the pre-trained language model are used to measure and visualize the attribute associations the model relies on when processing data; (3) based on the serialized sentence vectors, k-nearest neighbor search is used to recall well-interpretable samples similar to a misclassified sample, so as to enhance low-confidence predictions of the pre-trained language model. Experimental results on real public datasets show that the enhancement method improves model performance, and the generated counterfactuals reach 68.8% of the fidelity upper bound in the attribute-order search space, providing new perspectives, such as attribute-order counterfactuals and attribute-association understanding, for explaining the decisions of entity matching models based on pre-trained language models.
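To make the serialization sensitivity behind method (1) concrete, the sketch below serializes a record pair in the Ditto style ([COL]/[VAL] tags) and enumerates attribute-order permutations that flip the model's decision. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` stands for any trained matcher, the helper names are hypothetical, and exhaustive permutation is shown only to define the search space.

```python
from itertools import permutations
from typing import Callable, Dict, List

def serialize(record: Dict[str, str], order: List[str]) -> str:
    # Ditto-style serialization: each attribute becomes "[COL] name [VAL] value".
    return " ".join(f"[COL] {a} [VAL] {record[a]}" for a in order)

def pair_input(left: Dict[str, str], right: Dict[str, str], order: List[str]) -> str:
    # A record pair is joined into one sequence for a BERT-style matcher.
    return serialize(left, order) + " [SEP] " + serialize(right, order)

def order_counterfactuals(left, right, predict: Callable[[str], int],
                          base_order: List[str]) -> List[List[str]]:
    # Attribute orders under which the matcher's decision flips relative to the
    # base order; for a misclassified pair these are counterfactual orders.
    base_label = predict(pair_input(left, right, base_order))
    return [list(order) for order in permutations(base_order)
            if predict(pair_input(left, right, list(order))) != base_label]
```

The factorial search space makes exhaustive enumeration infeasible beyond a handful of attributes, which is precisely why the paper guides the search with dataset meta-features and attribute similarity instead.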
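Method (2) reads attribute associations out of the encoder's self-attention. The snippet below shows one standard way to extract token-level attention matrices from a HuggingFace BERT encoder; averaging over heads and aggregating over each attribute's token span is an illustrative reading, not necessarily the paper's exact aggregation. Ditto registers [COL]/[VAL] as special tokens; with the stock vocabulary used here they are simply split into subwords, which is fine for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

# A serialized record pair (see the previous sketch).
text = "[COL] title [VAL] iphone 6s [SEP] [COL] title [VAL] apple iphone 6s"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
# Average the last layer over heads to get a token-to-token association matrix.
attn = outputs.attentions[-1].mean(dim=1)[0]  # (seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Row i shows how strongly token i attends to every other token; summing rows and
# columns over an attribute's token span yields an attribute-level association
# weight that can be visualized as a heat map.
print(tokens[:8], attn.shape)
```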
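Method (3) retrieves interpretable neighbors of a low-confidence sample in the space of serialized sentence vectors. Below is a minimal sketch using [CLS] embeddings and scikit-learn's NearestNeighbors; the choice of the [CLS] vector as the sentence embedding and the majority-vote enhancement rule are assumptions for illustration.

```python
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors

def cls_embeddings(texts, tokenizer, model):
    # Encode serialized pairs and return their [CLS] vectors as sentence embeddings.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0, :].numpy()  # (n, hidden_size)

def knn_enhance(query_vec, bank_vecs, bank_labels, k=5):
    # Recall the k nearest well-interpreted samples from a reference bank and use
    # their majority label to support or override a low-confidence prediction.
    nn = NearestNeighbors(n_neighbors=k).fit(bank_vecs)
    _, idx = nn.kneighbors(query_vec.reshape(1, -1))
    neighbor_labels = np.asarray(bank_labels)[idx[0]]
    majority = int(neighbor_labels.mean() >= 0.5)
    return majority, idx[0]  # enhanced label and the supporting neighbor ids
```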

Cite this article:

梁峥, 王宏志, 戴加佳, 邵心玥, 丁小欧, 穆添愉. Interpretability of entity matching based on pre-trained language model. 软件学报 (Journal of Software), 2023, 34(3): 1087-1108.

History:
  • Received: 2022-05-16
  • Revised: 2022-07-29
  • Published online: 2022-10-26
  • Published: 2023-03-06