Automatic Term Alignment Based on Advanced Multi-Strategy and Giza++ Integration

Fund Project: National Defense Basic Scientific Research Program (Q172011A001)

    Abstract:

    Cross-phylum (cross-language-family) term alignment is often of poor quality because it depends on low-quality term extraction and alignment. This paper proposes AGiza, a term alignment method that fuses multiple strategies with GIZA++. To improve term extraction, rules on the first and last parts of speech of candidate strings are designed to raise recall, and three steps raise precision: (1) an independence filter; (2) a stop-word filter; and (3) recognition of terms that co-occur in a sentence pair. To improve alignment accuracy, six attributes are computed for each candidate term pair: (1) the independence correspondence degree, derived from the degree of independence; (2) the stopping correspondence degree, derived from the degree of stop-word usage; (3) the semantic correspondence degree, which combines the seed-pair correspondence and the word-pair association by probability addition; (4) the first-last correspondence degree, which reflects whether the first and last parts of the two terms align and is used to discard pairs whose value is zero; (5) the part-of-speech similarity degree, built from the part-of-speech patterns that define the morphosyntactic structure of terms; and (6) the value g computed by GIZA++. After the correlation coefficients among the attributes are analyzed, the six attributes are combined by multiplication into the term alignment degree a. Finally, the candidate pairs whose a exceeds the term alignment threshold are selected as aligned term pairs; the threshold is set from the recall rate so that the loss in recall stays below 1%. Experiments on Chinese-English term alignment show that AGiza handles cross-phylum term alignment effectively and outperforms GIZA++, the Dice coefficient, the F2 coefficient, the log-likelihood ratio (LLR), K-VEC and DKVEC.
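
The scoring and filtering stage summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, "probability addition" is read here as the addition rule for independent events (an assumption), and the threshold rule is one possible way to bound the loss in recall using a set of known-correct pairs. The six attribute values are taken as given inputs, and the sketch keeps pairs with a at or above the threshold.

```python
from math import prod
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def semantic_correspondence(seed_corr: float, word_assoc: float) -> float:
    # Assumption: "probability addition" is read as P(A or B) for
    # independent events; the paper may define it differently.
    return seed_corr + word_assoc - seed_corr * word_assoc


def alignment_degree(attributes: Sequence[float]) -> float:
    # Multiplicative combination of the attribute scores: a single
    # zero-valued attribute drives the whole degree to zero.
    return prod(attributes)


def pick_threshold(gold_scores: Sequence[float], tolerance: float = 0.01) -> float:
    # Hypothetical rule: the highest threshold that still keeps at least
    # (1 - tolerance) of the known-correct pairs under "a >= threshold".
    if not gold_scores:
        raise ValueError("need at least one scored gold pair")
    ranked = sorted(gold_scores, reverse=True)
    allowed_misses = int(tolerance * len(ranked))
    return ranked[len(ranked) - allowed_misses - 1]


def select_pairs(scored: Sequence[Tuple[Pair, float]], threshold: float) -> List[Pair]:
    # Keep the candidate pairs whose alignment degree reaches the threshold.
    return [pair for pair, a in scored if a >= threshold]
```

Because the attributes are multiplied, a pair with a first-last correspondence of zero is removed regardless of its other scores, which matches the zero-value filtering described in the abstract.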

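Cite this article

Liu Shengqi (刘胜奇), Zhu Donghua (朱东华). Automatic term alignment based on advanced multi-strategy and Giza++ integration. Journal of Software (软件学报), 2015, 26(7): 1650-1661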

History
  • Received: 2013-11-03
  • Revised: 2014-04-09
  • Published online: 2015-07-02