Self-training Approach for Low-resource Relation Extraction
CLC number: TP18

Fund project: National Natural Science Foundation of China (62376177, 61936010)


    Abstract:

    Self-training is a common way to mitigate the scarcity of annotated data: a teacher model produces automatically labeled data, and its high-confidence predictions are kept as reliable data. In low-resource relation extraction (RE), however, the teacher model generalizes poorly, and the task contains relation categories that are easily confused with one another. As a result, it is hard to identify reliable data among the automatically labeled instances, and a large amount of low-confidence noisy data that is difficult to use is generated. This paper therefore proposes ST-LRE (self-training approach for low-resource relation extraction), a self-training method that makes effective use of low-confidence data. First, a paraphrase-augmented prediction method strengthens the teacher model's ability to select reliable data. Second, a partial-labeling scheme distills usable ambiguous data from the low-confidence data. Based on the candidate category set of each ambiguous instance, a negative training method over the set of negative labels is proposed. Finally, to train on reliable data and ambiguous data together, a joint method that supports both positive and negative training is proposed. In experiments on low-resource settings of two widely used RE datasets, SemEval2010 Task-8 and Re-TACRED, ST-LRE achieves significant and consistent improvements.
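    The abstract does not give formulas or hyperparameters, so the following is only a minimal sketch of how the described pipeline could fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the averaging of teacher predictions over paraphrases, the thresholds `high` and `mass`, the rule that builds the candidate set from the top-ranked labels, and the negative-training loss of the form -log(1 - p_k).

```python
import math
from typing import List, Sequence, Tuple


def average_over_paraphrases(dists: Sequence[Sequence[float]]) -> List[float]:
    """Average the teacher's class distributions for a sentence and its paraphrases.

    Assumption: simple averaging stands in for paraphrase-augmented prediction.
    """
    n = len(dists)
    return [sum(column) / n for column in zip(*dists)]


def split_prediction(
    probs: Sequence[float], high: float = 0.9, mass: float = 0.9
) -> Tuple[str, List[int]]:
    """Classify one teacher prediction as reliable or ambiguous.

    Assumed rule: if the top probability reaches `high`, keep the top label as
    reliable data. Otherwise keep the smallest set of top-ranked labels whose
    total probability reaches `mass` as the candidate set (a partial label).
    """
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    if probs[ranked[0]] >= high:
        return "reliable", [ranked[0]]
    candidates: List[int] = []
    total = 0.0
    for k in ranked:
        candidates.append(k)
        total += probs[k]
        if total >= mass:
            break
    return "ambiguous", candidates


def positive_loss(student_probs: Sequence[float], label: int) -> float:
    """Standard cross-entropy on a reliable pseudo-label."""
    return -math.log(max(student_probs[label], 1e-12))


def negative_loss(student_probs: Sequence[float], candidates: Sequence[int]) -> float:
    """Push probability away from labels outside the candidate set.

    Assumption: the negative label set is the complement of the candidate set,
    and each negative label k contributes -log(1 - p_k).
    """
    candidate_set = set(candidates)
    negatives = [k for k in range(len(student_probs)) if k not in candidate_set]
    if not negatives:
        return 0.0
    terms = [-math.log(max(1.0 - student_probs[k], 1e-12)) for k in negatives]
    return sum(terms) / len(negatives)


def joint_loss(student_probs: Sequence[float], kind: str, labels: Sequence[int]) -> float:
    """Positive training for reliable data, negative training for ambiguous data."""
    if kind == "reliable":
        return positive_loss(student_probs, labels[0])
    return negative_loss(student_probs, labels)


# Usage: the teacher's outputs for a sentence and one paraphrase (made-up numbers).
teacher = average_over_paraphrases([[0.55, 0.40, 0.03, 0.02], [0.45, 0.50, 0.03, 0.02]])
kind, labels = split_prediction(teacher)
loss = joint_loss([0.30, 0.30, 0.20, 0.20], kind, labels)
```

    In this sketch a single instance contributes either a positive or a negative term, so reliable and ambiguous data can share one training loop. How ST-LRE actually weights or combines the two kinds of training is not stated in the abstract.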

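Cite this article

Yu Junjie, Wang Xing, Chen Wenliang, Zhang Min. Self-training Approach for Low-resource Relation Extraction. Journal of Software (软件学报),,():1-17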

History
  • Received: 2023-10-10
  • Last revised: 2024-01-18
  • Published online: 2024-07-03