Dynamically Transfer Entity Span Information for Cross-domain Chinese Named Entity Recognition

Pattern Recognition and Artificial Intelligence

Corresponding author:

Wang Yongji, ywang@itechs.iscas.ac.cn

CLC number:

TP391

Fund Project:

National Key Research and Development Program of China (2017YFB1002303)

    Abstract:

    Chinese text has no delimiters between words, which makes the boundaries of Chinese named entities hard to identify. In addition, fully annotated corpora are scarce in vertical domains such as the medical and financial domains, which makes Chinese named entity recognition (NER) there more challenging. To address these issues, this study proposes TES-NER, a cross-domain Chinese NER model that dynamically transfers entity span information. Entity span information, which represents the extent of a Chinese named entity, is shared across domains; it is transferred from a general domain (source domain) with sufficient corpus to the Chinese NER model of a vertical domain (target domain) through a dynamic fusion layer based on a gate mechanism. Specifically, TES-NER first builds a cross-domain shared entity span recognition module from a bidirectional long short-term memory (BiLSTM) layer and a fully connected network (FCN); this module identifies the shared entity span information that determines the boundaries of Chinese named entities. Then, a Chinese NER module is built from an independent character-based BiLSTM with a conditional random field (BiLSTM-CRF) to identify domain-specific Chinese named entities. Finally, a dynamic fusion layer uses the gate mechanism to decide dynamically how much of the span information extracted by the span recognition module is transferred to the domain-specific NER model. The general-domain (source) dataset is the news-domain dataset MSRA, which has sufficient labeled corpus; the vertical-domain (target) datasets are the mixed-domain dataset OntoNotes 5.0, which integrates six different vertical domains, the financial-domain dataset Resume, and the medical-domain dataset CCKS 2017. Experimental results show that the F1 scores of the proposed model on OntoNotes 5.0, Resume, and CCKS 2017 are 2.18%, 1.68%, and 0.99% higher, respectively, than those of the BiLSTM-CRF model.
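
The two ideas the abstract relies on, domain-shared span labels and a gated fusion of span features into NER features, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract gives neither the tagging scheme nor the gate equation, so the BIO tag format, the function names `to_span_tags` and `gated_fusion`, the per-dimension (diagonal) gate weights, and the additive form `h_ner + gate * h_span` are hypothetical and are not the authors' formulation.

```python
import math
from typing import List


def to_span_tags(typed_tags: List[str]) -> List[str]:
    """Strip the entity type from BIO tags, keeping only the boundary part.

    Assumption: tags look like "B-PER", "I-ORG", or "O". Untyped tags of
    this kind can be shared across domains whose entity types differ.
    """
    return [tag.split("-", 1)[0] for tag in typed_tags]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    h_ner: List[float],
    h_span: List[float],
    w_ner: List[float],
    w_span: List[float],
    bias: List[float],
) -> List[float]:
    """Fuse one character's NER features with its span features.

    Hypothetical formulation, computed per dimension:
        gate  = sigmoid(w_ner * h_ner + w_span * h_span + bias)
        fused = h_ner + gate * h_span
    A gate near 0 passes almost no span information; a gate near 1
    passes almost all of it.
    """
    fused = []
    for hn, hs, wn, ws, b in zip(h_ner, h_span, w_ner, w_span, bias):
        gate = sigmoid(wn * hn + ws * hs + b)
        fused.append(hn + gate * hs)
    return fused


if __name__ == "__main__":
    # Typed source-domain labels reduced to shared span labels.
    print(to_span_tags(["B-PER", "I-PER", "O", "B-ORG"]))
    # Zero weights and bias give a gate of 0.5 in every dimension.
    print(gated_fusion([1.0, 2.0], [2.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
```

In a trained model the gate weights would be learned jointly with the two BiLSTM encoders, so the amount of transferred span information could vary per character and per dimension; the fixed weights here only show the mechanics.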

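Cite this article

Wu Bingchao, Deng Chenglong, Guan Bei, Chen Xiaolin, Zan Daoguang, Chang Zhijun, Xiao Zunyan, Qu Dacheng, Wang Yongji. Dynamically Transfer Entity Span Information for Cross-domain Chinese Named Entity Recognition. Journal of Software (软件学报), 2022, 33(10): 3776-3792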

History
  • Received: 2020-10-16
  • Last revised: 2020-12-15
  • Published online: 2021-02-07