    Abstract:

    New-word recognition and ambiguity resolution have a vital effect on information retrieval precision. This paper presents an adaptive Chinese word segmentation algorithm based on a statistical model, and a new word segmentation system, BUAASEISEG, is designed and implemented using this algorithm. BUAASEISEG can recognize new words in various domains, resolve ambiguities, and segment words of arbitrary length. It uses an iterative bigram method to perform word segmentation. Candidate word selection and disambiguation are carried out through online statistical analysis of the target article, combined with either an offline word-frequency dictionary or the inverted index of the search engine. On top of the statistical methods, post-processing with a stopword list, a quantity-suffix word list, and a surname list is applied to further improve precision. A comparative evaluation against the well-known Chinese word segmentation system ICTCLAS, using news articles and papers as test text, shows that BUAASEISEG outperforms ICTCLAS in new-word recognition and disambiguation.
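The abstract does not give the details of the iterative bigram method, so the following is only a rough sketch of the general idea of building longer words from statistics gathered on the target text itself. It assumes a simple greedy rule: start from single characters and repeatedly merge the most frequent adjacent pair, as long as that pair occurs at least `min_count` times. The function name `iterative_bigram_segment`, the parameters `min_count` and `max_rounds`, and the merge rule are all hypothetical and are not taken from BUAASEISEG. The dictionary lookup, inverted-index lookup, and post-processing lists mentioned in the abstract are not modeled.

```python
from collections import Counter


def iterative_bigram_segment(text, min_count=2, max_rounds=10):
    """Hypothetical sketch, not the BUAASEISEG algorithm.

    Starts from single characters and, in each round, merges the most
    frequent adjacent token pair if it occurs at least min_count times.
    """
    # Initial tokens: one per non-whitespace character.
    tokens = [ch for ch in text if not ch.isspace()]
    for _ in range(max_rounds):
        # Online statistics: count adjacent token pairs in the target text.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break  # no pair is frequent enough to be a word candidate
        # Merge every non-overlapping occurrence of the chosen pair.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


if __name__ == "__main__":
    # Repeated substrings in the input tend to be merged into longer tokens.
    print(iterative_bigram_segment("信息检索信息检索信息"))
```

Because tokens grow by one merge per round, this sketch can produce words of arbitrary length, and it needs no predefined vocabulary. A real system would still need a way to reject frequent but meaningless pairs, which is where the dictionary, index, and word-list checks described in the abstract would come in.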

Get Citation

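Cao Yonggang, Cao Yuzhong, Jin Maozhong, Liu Chao. An adaptive Chinese word segmentation system for information retrieval. Journal of Software, 2006, 17(3):356-363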

History
  • Received: August 02, 2005
  • Revised: October 11, 2005