New word recognition and ambiguity resolution have a vital effect on the precision of information retrieval. This paper presents an adaptive Chinese word segmentation algorithm based on a statistical model, and a new word segmentation system, BUAASEISEG, designed and implemented using this algorithm. BUAASEISEG can recognize new words in various domains, resolve ambiguities, and segment words of arbitrary length. It performs segmentation with an iterative bigram method. Candidate word selection and disambiguation are done through online statistical analysis of the target article, combined with either an offline word frequency dictionary or the inverted index of the search engine. On top of the statistical methods, post-processing with a stopword list, a quantity suffix word list, and a surname list further improves precision. A comparative evaluation against the well-known Chinese word segmentation system ICTCLAS, using news articles and papers as test text, shows that BUAASEISEG outperforms ICTCLAS in new word recognition and disambiguation.
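The abstract does not spell out the iterative bigram method, so the following is only a minimal sketch of one plausible reading: start from single characters, count adjacent pairs in the target article itself, merge pairs that recur often enough, and repeat so that words of arbitrary length can emerge. The function name `iterative_bigram_segment` and the parameters `min_count` and `max_rounds` are hypothetical, and the sketch omits the dictionary or inverted-index lookup and the list-based post-processing.

```python
from collections import Counter


def iterative_bigram_segment(text, min_count=2, max_rounds=10):
    """Illustrative sketch only, not the BUAASEISEG implementation.

    Repeatedly merges adjacent token pairs that occur at least
    `min_count` times in `text` (an assumed threshold).
    """
    tokens = list(text)  # start from single characters
    for _ in range(max_rounds):
        # Online statistics: count adjacent token pairs in the target text.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
        if not frequent:
            break  # nothing left to merge
        merged, i = [], 0
        while i < len(tokens):
            # Greedy left-to-right merge of frequent pairs.
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


result = iterative_bigram_segment("abcxabcyabc")
```

On the toy input above, the recurring string "abc" is built up over two rounds ("ab" first, then "abc"), while the one-off characters stay separate. A system like the one described would additionally check candidates against word frequencies before accepting them.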