    Abstract:

    Character-based approaches have greatly improved the performance of Chinese word segmentation in recent years. With the help of powerful machine learning strategies, extracting words by combining characters has become the focus of Chinese word segmentation research. Although character-based approaches are very good at discovering out-of-vocabulary words, they are weaker than word-based approaches at segmenting in-vocabulary words, because some internal and external information about the words is lost. This paper proposes a joint decoding strategy that combines a character-based conditional random field model with a word-based bigram language model to segment Chinese character sequences. The experimental results show that the approach performs well and that the two sub-models are well integrated: the joint character-and-word model improves the performance of Chinese word segmentation systems more effectively than either single model, which makes it suitable for many applications in Chinese information processing.
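The idea of decoding with a character model and a word model together can be illustrated with a minimal sketch. It is not the paper's formulation: the log-linear weighting (`weight`), the per-character tag scores standing in for the conditional random field, the fixed floor for unseen bigrams, and every number in the toy tables are assumptions made purely for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

TagScores = List[Dict[str, float]]  # per-character log-scores for tags B, M, E, S


def word_tag_score(tag_scores: TagScores, start: int, end: int) -> float:
    """Character-model score of treating chars[start:end] as a single word."""
    if end - start == 1:
        return tag_scores[start]["S"]
    score = tag_scores[start]["B"] + tag_scores[end - 1]["E"]
    return score + sum(tag_scores[i]["M"] for i in range(start + 1, end - 1))


def joint_decode(chars: str, tag_scores: TagScores,
                 bigram_logprob: Callable[[str, str], float],
                 weight: float = 0.5, max_word_len: int = 4) -> List[str]:
    """Dynamic programming over (end position, last word).

    Path score = weight * character score + (1 - weight) * bigram LM score.
    This interpolation is an assumed combination, not taken from the paper.
    """
    n = len(chars)
    # best[end][word] = (score, start of word, previous word)
    best: List[Dict[str, Tuple[float, int, str]]] = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, -1, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = chars[start:end]
            char_part = word_tag_score(tag_scores, start, end)
            for prev_word, (prev_score, _, _) in best[start].items():
                score = (prev_score + weight * char_part
                         + (1 - weight) * bigram_logprob(prev_word, word))
                if word not in best[end] or score > best[end][word][0]:
                    best[end][word] = (score, start, prev_word)
    # Backtrack from the best final word to the start symbol.
    word = max(best[n], key=lambda w: best[n][w][0])
    end, words = n, []
    while end > 0:
        _, start, prev_word = best[end][word]
        words.append(word)
        end, word = start, prev_word
    return words[::-1]


# Toy data (invented): tag probabilities for the four characters of "研究生命".
TOY_TAG_PROBS = [
    {"B": 0.90, "M": 0.03, "E": 0.03, "S": 0.04},  # 研
    {"B": 0.02, "M": 0.60, "E": 0.35, "S": 0.03},  # 究
    {"B": 0.35, "M": 0.03, "E": 0.60, "S": 0.02},  # 生
    {"B": 0.03, "M": 0.02, "E": 0.35, "S": 0.60},  # 命
]
toy_tag_scores = [{t: math.log(p) for t, p in row.items()} for row in TOY_TAG_PROBS]

# Toy bigram probabilities (invented); unseen bigrams get a fixed floor.
TOY_BIGRAMS = {
    ("<s>", "研究"): 0.4, ("研究", "生命"): 0.5,
    ("<s>", "研究生"): 0.2, ("研究生", "命"): 0.01,
}


def toy_bigram_logprob(prev_word: str, word: str) -> float:
    return math.log(TOY_BIGRAMS.get((prev_word, word), 1e-6))


char_only = joint_decode("研究生命", toy_tag_scores, toy_bigram_logprob, weight=1.0)
joint = joint_decode("研究生命", toy_tag_scores, toy_bigram_logprob, weight=0.5)
print(char_only)  # ['研究生', '命']
print(joint)      # ['研究', '生命']
```

With these toy numbers, the character scores alone prefer 研究生/命, while adding the word-level bigram scores recovers the in-vocabulary reading 研究/生命. A real system would replace the tables with a trained conditional random field and a smoothed bigram model.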


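Citation: Song Y, Cai DF, Zhang GP, Zhao H. An approach to Chinese word segmentation based on character-word joint decoding. Journal of Software (软件学报), 2009,20(9):2366-2375.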

History
  • Received: September 12, 2008
  • Revised: March 05, 2009