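Construction and Optimization of Co-occurrence-attribute-interaction Model for Column Semantic Recognition

Author biographies:

Gao Shan (高珊, born 1997), female, master's degree; main research interests: natural language processing and data standardization. Wang Lan (王兰, born 1993), female, master's degree; main research interest: natural language processing. Yuan Wanzhu (袁宛竹, born 1998), female, master's student; main research interest: natural language processing. Zhang Jing (张静, born 1973), female, Ph.D., associate professor, doctoral supervisor, CCF professional member; main research interests: data mining and natural language processing. Lu Wei (卢卫, born 1981), male, Ph.D., professor, doctoral supervisor, CCF professional member; main research interests: database fundamentals, big data system development, query processing in spatio-temporal contexts, and cloud database systems and applications. Du Xiaoyong (杜小勇, born 1963), male, Ph.D., professor, doctoral supervisor, CCF fellow; main research interests: intelligent information retrieval, high-performance databases, and unstructured data management.

Corresponding authors:

Lu Wei, lu-wei@ruc.edu.cn; Du Xiaoyong, duyong@ruc.edu.cn

Funding:

National Key Research and Development Program of China (2020YFB2104101)
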
    Abstract:

    Government data governance is moving from "physical data aggregation" to a new phase of "logical semantic unification". Long-term "autonomy" of government information silos has led to a wide spectrum of metadata curation issues, such as missing metadata, attributes with the same name but different meanings, and attributes with different names but the same meaning. Instead of rebuilding or modifying legacy information systems, or physically aggregating data from isolated information systems, logical semantic unification addresses these issues by unifying the semantic expression of the metadata across government information silos, so that the metadata become semantically interoperable. This work semantically aligns the metadata of each government information silo to the existing standard metadata. Specifically, the standard metadata names are treated as semantic labels, and the column projections of the relations in each silo are semantically identified, which establishes a semantic alignment between column names and standard metadata and achieves standardized governance of silo metadata. Existing semantic recognition techniques based on column projections cannot capture the column-order independence of relational data or the correlations among the semantic labels of attributes. To address this problem, a two-stage model consisting of a prediction stage and a correction stage is proposed. In the prediction stage, a co-occurrence-attribute-interaction (CAI) model is proposed, which uses a parallelized self-attention mechanism to guarantee that the interaction among co-occurring attributes is independent of column order. In the correction stage, the co-occurrence of semantic labels is exploited through a correction mechanism that refines the predictions of the CAI model. Experiments were conducted on a government benchmark dataset and on several public English datasets, including Magellan. The results show that the two-stage model with the correction mechanism improves on the best existing model by up to 20.03% in macro average and 13.36% in weighted average.
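The column-order independence attributed to self-attention in the abstract can be illustrated with a small sketch. This is not the CAI model itself: it is a single-head self-attention layer with identity projections and no positional encoding, and the names `self_attention`, `cols`, and `perm` are hypothetical, introduced only for this illustration. It checks that shuffling the columns of a table shuffles the outputs in the same way, so each column's representation does not depend on where the column sits.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(columns):
    """Single-head self-attention without positional encoding.

    `columns` is a list of equal-length embedding vectors, one per table
    column. Identity query/key/value projections are an assumption made to
    keep the sketch short.
    """
    d = len(columns[0])
    scale = math.sqrt(d)
    out = []
    for q in columns:
        # Scaled dot-product score of this column against every column.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in columns]
        weights = softmax(scores)
        # Weighted sum of all column vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, columns))
                    for i in range(d)])
    return out


random.seed(0)
cols = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
perm = [2, 0, 1]                      # a hypothetical reordering of columns
shuffled = [cols[i] for i in perm]

original_out = self_attention(cols)
shuffled_out = self_attention(shuffled)

# Position j of the shuffled table holds original column perm[j]; its output
# should match that column's output in the original order (up to rounding).
order_independent = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for j, i in enumerate(perm)
    for a, b in zip(shuffled_out[j], original_out[i])
)
print(order_independent)
```

Adding a positional encoding to the column embeddings would break this property, which is why a sketch of order-independent interaction omits it.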

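Cite this article

Gao Shan, Yuan Wanzhu, Lu Wei, Wang Lan, Zhang Jing, Du Xiaoyong. Construction and Optimization of Co-occurrence-attribute-interaction Model for Column Semantic Recognition. Ruan Jian Xue Bao/Journal of Software, 2023, 34(3): 1010-1026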

History
  • Received: 2022-05-15
  • Last revised: 2022-07-29
  • Published online: 2022-10-26
  • Publication date: 2023-03-06