Clustering-Based PU Active Text Classification Method

doi:10.3724/SP.J.1001.2013.04467


Volume 24, Issue 11, 2013, pp. 2571-2583

Authors:

LIU Lu
College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA

PENG Tao
College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA; Key Laboratory of Symbol Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China

ZUO Wan-Li
College of Computer Science and Technology, Jilin University, Changchun 130012, China; Key Laboratory of Symbol Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China

DAI Yao-Kang
College of Computer Science and Technology, Jilin University, Changchun 130012, China

Abstract:

Text classification is a key technology in information retrieval. Two important problems in automatic text classification are collecting more reliable negative examples and building effective, efficient classifiers. Most existing methods, however, collect only a small number of reliable negative examples, which keeps their classifiers from reaching high accuracy. This paper proposes a clustering-based method for automatic PU (positive and unlabeled) text classification, enhanced by SVM active learning. In contrast to traditional methods, the approach relies on a clustering technique that exploits the property that positive and negative examples should share as few words as possible. It obtains more reliable negative examples by removing as many probable positive examples as possible from the unlabeled set. To build the classifier, a term weighting scheme, TFIPNDF (term frequency inverse positive-negative document frequency, an improved TFIDF), is adopted. An improved Rocchio algorithm, used in conjunction with SVM active learning, significantly improves classification performance. Experimental results on three datasets (RCV1, Reuters-21578, and 20 Newsgroups) show that the proposed clustering-based method extracts more reliable negative examples than the baseline algorithms, with very low error rates, and that SVM active learning also significantly improves classification accuracy.
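
The idea that positive and negative examples should share as few words as possible can be illustrated with a small sketch. The abstract does not specify the clustering procedure, so the code below is not the authors' algorithm: it is a simplified stand-in that keeps an unlabeled document as a reliable negative only when few of its words occur in the positive set. The function name `reliable_negatives` and the threshold `max_overlap` are hypothetical and chosen for illustration.

```python
def reliable_negatives(positive_docs, unlabeled_docs, max_overlap=0.2):
    """Return unlabeled documents that share few words with the positive set.

    Simplified illustration only: the paper uses a clustering technique whose
    details are not given in the abstract. `max_overlap` is an assumed
    threshold, not a value from the paper.
    """
    # Collect every word that appears in at least one positive document.
    positive_vocab = set()
    for doc in positive_docs:
        positive_vocab.update(doc.lower().split())

    negatives = []
    for doc in unlabeled_docs:
        words = set(doc.lower().split())
        if not words:
            continue  # an empty document carries no evidence either way
        # Fraction of the document's distinct words that also occur in positives.
        overlap = len(words & positive_vocab) / len(words)
        if overlap <= max_overlap:
            negatives.append(doc)
    return negatives


positive = ["stock market shares rise", "market prices fall"]
unlabeled = [
    "shares fall on the market",     # probable positive, removed
    "football team wins cup final",  # shares no words with positives, kept
    "market cup",                    # half its words are positive, removed
]
print(reliable_negatives(positive, unlabeled))
```

Lowering `max_overlap` removes more probable positives from the unlabeled set at the cost of fewer negatives, which mirrors the trade-off the abstract describes between the number of reliable negative examples and their error rate.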

Key words: positive and unlabeled (PU) text classification; clustering; TFIPNDF (term frequency inverse positive-negative document frequency); active learning; reliable negative example; improved Rocchio

Citation:

LIU Lu, PENG Tao, ZUO Wan-Li, DAI Yao-Kang. Clustering-based PU active text classification method. Journal of Software (软件学报), 2013, 24(11): 2571-2583.


History

Received: February 28, 2013
Revised: July 16, 2013
Online: November 1, 2013
