Open Domain New Word Detection Using Conditional Random Field Method

doi:10.3724/SP.J.1001.2013.04254

Journal of Software, 2013, Vol. 24, No. 5, pp. 1051-1060

Authors:

CHEN Fei
State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084, China; Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

LIU Yi-Qun
State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084, China; Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

WEI Chao
State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084, China; Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

ZHANG Yun-Liang
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

ZHANG Min
State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084, China; Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

MA Shao-Ping
State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084, China; Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Funding: National Natural Science Foundation of China (60903107, 61073071); National High Technology Research and Development Program of China (863 Program) (2011AA01A205)

Abstract:

Open-domain new word detection is important for improving the performance of Chinese natural language processing. Because a conditional random field (CRF) can label sequential input, this paper recasts new word detection as the problem of predicting whether each boundary between already-segmented words is a new-word boundary. Based on an analysis of a massive Chinese Web corpus, the paper proposes a series of statistical features that distinguish new-word boundaries, and combines them with the CRF method to implement an open-domain new word detection algorithm. It also compares how three discretization methods (K-means clustering, equal frequency, and information gain) affect the detection results. New word detection experiments on SogouT, a large-scale Chinese Web corpus, show that the proposed method is effective.

Key words: new word detection; conditional random field (CRF); Chinese word segmentation
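
The abstract names equal-frequency and information-gain discretization without giving details. The following is a minimal sketch of what such discretizers could look like for one continuous boundary feature, not the authors' implementation. The function names, the toy scores, and the toy labels are all hypothetical, and the information-gain variant shown here finds only a single binary cut point.

```python
import math
from collections import Counter


def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 cut points so each bin holds roughly the same count."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def to_bin(value, edges):
    """Map a continuous value to the index of the bin it falls in."""
    return sum(value >= edge for edge in edges)


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_information_gain_cut(values, labels):
    """Return (cut, gain) for the binary split that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([lab for _, lab in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point exists between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain


# Hypothetical feature values for eight word boundaries and toy labels
# (1 = new-word boundary, 0 = not); these numbers are illustrative only.
scores = [0.1, 0.2, 0.3, 0.4, 2.0, 2.5, 3.0, 9.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

edges = equal_frequency_edges(scores, 4)
bins = [to_bin(s, edges) for s in scores]
cut, gain = best_information_gain_cut(scores, labels)
```

The resulting bin indices, rather than the raw continuous values, are the kind of discrete symbols a CRF feature template would typically consume. Equal-frequency binning ignores the labels, whereas the information-gain cut uses them to place the boundary where the classes separate best.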

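Cite this article

CHEN Fei, LIU Yi-Qun, WEI Chao, ZHANG Yun-Liang, ZHANG Min, MA Shao-Ping. Open Domain New Word Detection Using Conditional Random Field Method. Journal of Software, 2013, 24(5): 1051-1060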

History

Received: 2011-09-20
Last revised: 2012-04-23
Published online: 2013-05-07
