An Automatic Chinese Part-of-Speech Tagging Algorithm Combining Statistics and Rules
Authors: Zhang Min (张民), Li Sheng (李生), Zhao Tiejun (赵铁军), Zhang Yanfeng (张艳风)
Funding:

This research was supported by the National 863 High-Tech Program of China.

    Abstract:

    This paper proposes and implements an algorithm for automatically tagging the part of speech (POS) of Chinese words. The algorithm combines statistical and rule-based techniques and gives priority to quantitative statistical analysis. It introduces confidence intervals from parameter estimation, so that the high-accuracy statistical technique is applied to the corpus first. Rules are then used to tag the part of the corpus left untagged and to correct some of the statistical tagging errors. In closed and open tests, the algorithm achieves accuracies of 98.9% and 98.1% respectively, without considering unknown words or word segmentation errors.
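    The statistics-first, rules-second control flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which interval, confidence level, tag set, or rules the authors use, so everything below is an assumption made for illustration: the Wilson score interval at 95%, the per-word toy counts in `COUNTS`, and the single hypothetical rule in `rule_tag`. The rule-based correction of statistical errors is omitted.

```python
import math
from collections import Counter

Z_95 = 1.96  # normal quantile for a 95% interval (assumed confidence level)


def wilson_interval(successes, trials, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0, 1.0  # no evidence: the proportion could be anything
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def statistical_tag(tag_counts):
    """Return the most frequent tag only when the statistics are decisive.

    The tag is accepted if the lower bound of its interval lies above the
    upper bound of the runner-up's interval; otherwise None is returned
    and the decision is left to the rules.
    """
    ranked = tag_counts.most_common(2)
    if not ranked:
        return None  # word never seen in the training counts
    best_tag, best_n = ranked[0]
    if len(ranked) == 1:
        return best_tag  # only one tag ever observed for this word
    total = sum(tag_counts.values())
    low_best, _ = wilson_interval(best_n, total)
    _, high_second = wilson_interval(ranked[1][1], total)
    return best_tag if low_best > high_second else None


def rule_tag(word, prev_tag):
    """Hypothetical fallback rule: after a verb prefer a noun, else a verb."""
    return "n" if prev_tag == "v" else "v"


def tag_sentence(words, counts):
    """Tag each word statistically first, falling back to rules."""
    tags = []
    for word in words:
        prev_tag = tags[-1] if tags else "<s>"
        tag = statistical_tag(counts.get(word, Counter()))
        if tag is None:
            tag = rule_tag(word, prev_tag)
        tags.append(tag)
    return tags


# Toy counts, invented for this example (r = pronoun, v = verb, n = noun).
COUNTS = {
    "他": Counter({"r": 50}),
    "喜欢": Counter({"v": 95, "n": 5}),
    "研究": Counter({"v": 6, "n": 5}),
}

print(tag_sentence(["他", "喜欢", "研究"], COUNTS))
```

    In this toy run the first two words are tagged statistically, while the intervals for 研究 overlap, so the fallback rule decides its tag from the preceding verb.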

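Cite this article

Zhang Min, Li Sheng, Zhao Tiejun, Zhang Yanfeng. An automatic Chinese part-of-speech tagging algorithm combining statistics and rules. Journal of Software (软件学报), 1998, 9(2): 134-138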

History
  • Received: 1996-08-21
  • Last revised: 1997-03-20