篇章视角的汉语零指代语料库构建
作者:
作者单位:

作者简介:

孔芳(1977-),女,博士,教授,博士生导师,主要研究领域为自然语言理解,篇章分析.
葛海柱(1994-),男,硕士,主要研究领域为自然语言理解,篇章分析.
周国栋(1967-),男,博士,教授,博士生导师,CCF杰出会员,主要研究领域为自然语言理解,机器学习.

通讯作者:

周国栋,E-mail:gdzhou@suda.edu.cn

中图分类号:

TP18

基金项目:

国家自然科学基金(61876118,61751206);江苏高校优势学科建设工程


Corpus Construction for Chinese Zero Anaphora from Discourse Perspective
Author:
Affiliation:

Fund Project:

National Natural Science Foundation of China (61876118, 61751206); A Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    零指代是汉语中普遍存在的一个现象,在汉英机器翻译、文本摘要以及阅读理解等众多自然语言处理任务中都起着重要作用,目前已成为自然语言处理领域的一个研究热点.提出了篇章视角的汉语零指代表示体系,从服务于篇章分析的角度出发,首先以基本篇章单元为考察对象,判别其是否包含零元素;再根据零元素在基本篇章单元中承担的角色将零元素划分成主干类和修饰类两类;接着以段落对应的篇章修辞结构树为考察指代关系的基本单元,依据先行词与零元素间的位置关系将指代关系分成基本篇章单元内和基本篇章单元间两种,并针对基本篇章单元间的指代关系,根据零元素对应的先行词的状况将指代关系分成实体类、事件类、组合类和其他等4类;最后,基于篇章视角的汉语零指代表示体系,选取汉语树库CTB、连接词驱动的汉语篇章树库CDTB和OntoNotes语料中重叠的325篇文本进行了汉语零指代的标注,构建了服务于篇章分析的汉语零指代语料库.一方面,借助系统检测来说明所提出的表示体系合理有效,构造的语料库质量上乘;另一方面构建了完整的汉语零指代消解基准平台,从可计算的角度验证了所构建的汉语零指代语料库能够为篇章视角的汉语零指代研究提供必要的支撑.

    Abstract:

    As a common phenomenon in Chinese, zero anaphora plays an important role in many natural language processing tasks, such as machine translation, text summarization and machine reading comprehension. Currently, it has become a research hotspot in the field of natural language processing. Towards better discourse analysis, this study proposes a representation architecture for Chinese zero anaphora from the discourse perspective. Firstly, the elementary discourse unit is taken as the investigation object to determine whether it contains zero elements. Secondly, according to the roles of zero elements in the elementary discourse unit, the zero elements are divided into two categories: the core type and the modifier type. Thirdly, the discourse rhetorical tree of the paragraph is used as the basic unit to evaluate the Chinese zero coreferential relationship. According to the positional relationship between the antecedent and the zero element, the coreferential relationship is classified into two types, i.e., Intra-EDU and Inter-EDU. After that, for Inter-EDU type, the coreferential relationship is furtherly divided into four categories according to the status of the antecedent, i.e., entity, event, union, and others. Finally, this study selects the overlapped 325 texts of the Chinese treebank (CTB), the connective-driven Chinese discourse treebank (CDTB), and the OntoNotes corpus to annotate the Chinese zero anaphora. System evaluation shows the high quality of the constructed corpus for Chinese zero anaphora. Moreover, a complete zero anaphor resolution baseline system is constructed to show the appropriateness and the effectiveness of the proposed representation architecture for Chinese zero anaphora from computability perspective.

    参考文献
    相似文献
    引证文献
引用本文

孔芳,葛海柱,周国栋.篇章视角的汉语零指代语料库构建.软件学报,2021,32(12):3782-3801

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2020-05-15
  • 最后修改日期:2020-06-22
  • 录用日期:
  • 在线发布日期: 2021-12-02
  • 出版日期: 2021-12-06
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号