Construction and Optimization of Co-occurrence-Attribute-Interaction Model for Column Semantic Recognition

Corresponding authors:

Lu Wei (卢卫), E-mail: lu-wei@ruc.edu.cn; Du Xiaoyong (杜小勇), E-mail: duyong@ruc.edu.cn

Fund project:

National Key Research and Development Program of China (2020YFB2104101)

    Abstract:

    Government data governance is moving from a phase of "physical data aggregation" to one of "logical semantic convergence". Logical semantic convergence addresses the problems that arise from the long-term "autonomy" of siloed government systems: missing metadata, identical names with different meanings, and different names with the same meaning. Without rebuilding or modifying the original system code, and without physically aggregating the government data, it unifies the semantic expression of the metadata of each siloed information system through technical means, so that the metadata become semantically interconnected. This paper aligns the metadata of each siloed information system to existing standard metadata. Specifically, standard metadata names are treated as semantic labels, and semantic recognition is performed on the column projections of siloed relational data, thereby establishing a semantic alignment between column names and standard metadata and achieving standardized governance of siloed metadata. Existing semantic recognition techniques based on column projection cannot capture the column-order independence of relational data or the correlations between attribute semantic labels. To address this problem, this paper proposes a two-phase model consisting of a prediction phase and a correction phase. In the prediction phase, the Co-occurrence-Attribute-Interaction (CAI) model is proposed, which uses a parallelized self-attention mechanism to ensure that co-occurring attributes interact independently of column order. In the correction phase, a correction mechanism exploits the co-occurrence of semantic labels to refine the predictions of the CAI model. Experiments on a government benchmark dataset and several public English datasets, such as Magellan, show that the two-phase model with the correction mechanism improves the macro average and the weighted average by up to 20.03% and 13.36%, respectively, over the best existing model.
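
The column-order independence attributed to self-attention in the abstract can be illustrated with a small sketch. This is not the CAI model itself: it is a generic single-head scaled dot-product self-attention with no positional encoding, and every name in it (`self_attention`, `rand_matrix`, the embedding size, the random weights) is an assumption made for illustration. It checks only the general property that permuting the input columns permutes the outputs in the same way, so each column's representation does not depend on where the column sits in the table.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, no positional encoding.

    X holds one embedding per table column (hypothetical column embeddings).
    """
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(K[0]))
    K_t = [list(col) for col in zip(*K)]  # transpose of K
    scores = matmul(Q, K_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V)


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix; stands in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_cols, dim = 5, 8  # assumed: 5 table columns, 8-dimensional embeddings
X = rand_matrix(n_cols, dim, rng)
Wq, Wk, Wv = (rand_matrix(dim, dim, rng) for _ in range(3))

# Shuffle the column order and run attention on both orderings.
perm = list(range(n_cols))
rng.shuffle(perm)
X_perm = [X[i] for i in perm]

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X_perm, Wq, Wk, Wv)

# Row j of the permuted output should equal row perm[j] of the original output.
max_diff = max(
    abs(a - b)
    for j, i in enumerate(perm)
    for a, b in zip(out_perm[j], out[i])
)
order_independent = max_diff < 1e-9
```

The comparison uses a tolerance rather than exact equality because reordering the columns also reorders the floating-point sums inside the softmax and the matrix products.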

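Cite this article

Gao Shan (高珊), Yuan Wanzhu (袁宛竹), Lu Wei (卢卫), Wang Lan (王兰), Zhang Jing (张静), Du Xiaoyong (杜小勇). Construction and Optimization of Co-occurrence-Attribute-Interaction Model for Column Semantic Recognition. Journal of Software (软件学报), 2023, 34(3): 0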

History
  • Received: 2022-05-15
  • Last revised: 2022-07-29
  • Published online: 2022-10-26