Interpretability of Entity Matching Based on Pre-trained Language Model
Authors:

About the authors:

LIANG Zheng (1998-), male, PhD candidate, CCF student member; his research interests include data integration, entity resolution, and anomaly detection. SHAO Xinyue (1996-), female, PhD candidate; her research interests include black-box model interpretability and counterfactual explanation. WANG Hongzhi (1978-), male, PhD, professor, doctoral supervisor, CCF distinguished member; his research interests include database management systems and big data analytics and governance. DING Xiaoou (1993-), female, PhD, assistant professor, CCF professional member; her research interests include data quality, data cleaning, and time series data management. DAI Jiajia (2000-), female, master's student; her research interests include data quality and entity resolution. MU Tianyu (1998-), male, PhD candidate, CCF student member; his research interests include automated machine learning, automatic model selection, and hyperparameter optimization.

Corresponding author:

WANG Hongzhi, wangzh@hit.edu.cn

Funding:

National Key R&D Program of China (2021YFB3300502); National Natural Science Foundation of China (62232005, 62202126); CCF-Huawei Populus Grove Fund, Database Track (CCF-HuaweiDB202204); Heilongjiang Postdoctoral Foundation (LBH-Z21137)


Abstract:

Entity matching determines whether records in two datasets refer to the same real-world entity, and it is indispensable for tasks such as big data integration, social network analysis, and web semantic data management. Pre-trained language models, a deep learning technology with wide success in natural language processing and computer vision, have also outperformed traditional methods on entity matching and attracted considerable research attention. However, entity matching based on pre-trained language models is unstable, and its matching results are hard to explain, which brings great uncertainty to the application of this technology in big data integration. Meanwhile, existing interpretation methods for entity matching models are mainly model-agnostic methods designed for classical machine learning, and their applicability to pre-trained language models is limited. Therefore, taking BERT-based entity matching models such as Ditto and JointBERT as examples, this study proposes three interpretation methods tailored to pre-trained language model entity matching. (1) Since serialization makes the model sensitive to the order of relational attributes, dataset meta-features and attribute similarity are used to generate attribute-order counterfactuals for misclassified samples. (2) As a complement to traditional attribute-importance measurement, the attention weights of the pre-trained language model are used to measure and visualize the attribute associations the model relies on when processing data. (3) Based on the serialized sentence vectors, k-nearest-neighbor search is used to recall well-interpretable samples similar to a misclassified sample, enhancing low-confidence predictions of the pre-trained language model. Experiments on real public datasets show that the enhancement method improves model performance and that the generated counterfactuals reach 68.8% of the fidelity upper bound in the attribute-order search space, providing new perspectives, such as attribute-order counterfactuals and attribute-association understanding, for explaining the decisions of pre-trained language entity matching models.
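The first method above searches the attribute-order space for a permutation that flips a misprediction. A minimal brute-force sketch, assuming a Ditto-style `[COL]/[VAL]` serialization; the paper's actual method prunes this space with dataset meta-features and attribute similarity, and `predict` here merely stands in for the fine-tuned matcher:

```python
from itertools import permutations

def serialize(record, attr_order):
    """Ditto-style serialization: '[COL] attr [VAL] value' per attribute."""
    return " ".join(f"[COL] {a} [VAL] {record[a]}" for a in attr_order)

def order_counterfactual(record_a, record_b, attrs, predict, observed_label):
    """Return the first attribute permutation whose serialized pair makes
    the matcher disagree with its observed (mis)prediction, i.e. an
    attribute-order counterfactual; None if no permutation flips it."""
    for order in permutations(attrs):
        pair = serialize(record_a, order) + " [SEP] " + serialize(record_b, order)
        if predict(pair) != observed_label:
            return order
    return None
```

Because the search space is factorial in the number of attributes, a heuristic ordering of candidate permutations (as in the paper) matters in practice.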
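The second method reads attribute associations off the self-attention weights. A toy sketch, assuming the token-level attention matrix has already been extracted from the BERT encoder (e.g. averaged over heads and layers) and each token is tagged with the index of the attribute it was serialized from:

```python
def attribute_attention(attn, token2attr, n_attrs):
    """Average token-to-token attention weights over the token spans of
    each (attribute, attribute) pair, yielding an attribute-level
    association matrix that can be visualized as a heat map."""
    assoc = [[0.0] * n_attrs for _ in range(n_attrs)]
    count = [[0] * n_attrs for _ in range(n_attrs)]
    for i, ai in enumerate(token2attr):
        for j, aj in enumerate(token2attr):
            assoc[ai][aj] += attn[i][j]
            count[ai][aj] += 1
    return [[assoc[a][b] / max(count[a][b], 1) for b in range(n_attrs)]
            for a in range(n_attrs)]
```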
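The third method enhances low-confidence predictions by recalling nearest neighbors among the serialized training sentences. A simplified sketch, assuming sentence vectors are already available (e.g. from the encoder's [CLS] embedding); the confidence threshold `tau` and plain majority voting are illustrative choices, not the paper's exact settings:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_enhance(query_vec, train_vecs, train_labels, match_prob, k=3, tau=0.9):
    """Keep the PLM's prediction when it is confident; otherwise recall
    the k most similar training sentences and majority-vote their labels."""
    if max(match_prob, 1 - match_prob) >= tau:
        return int(match_prob >= 0.5)
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    votes = sum(train_labels[i] for i in ranked[:k])
    return int(2 * votes >= k)
```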

Cite this article:

LIANG Zheng, WANG Hongzhi, DAI Jiajia, SHAO Xinyue, DING Xiaoou, MU Tianyu. Interpretability of entity matching based on pre-trained language model. Journal of Software, 2023, 34(3): 1087-1108 (in Chinese).
History
  • Received: 2022-05-16
  • Revised: 2022-07-29
  • Published online: 2022-10-26
  • Published in print: 2023-03-06