Abstract:Text classification is a key technology in information retrieval. Collecting more reliable negative examples, and building effective and efficient classifiers are two important problems for automatic text classification. However, the existing methods mostly collect a small number of reliable negative examples, keeping the classifiers from reaching high accuracy. In this paper, a clustering-based method for automatic PU (positive and unlabeled) text classification enhanced by SVM active learning is proposed. In contrast to traditional methods, this approach is based on the clustering technique which employs the characteristic that positive and negative examples should share as few words as possible. It finds more reliable negative examples by removing as many probable positive examples from unlabeled set as possible. In the process of building classifier, a term weighting scheme TFIPNDF (term frequency inverse positive-negative document frequency, improved TFIDF) is adopted. An additional improved Rocchio, in conjunction with SVMs active learning, significantly improves the performance of classifying. Experimental results on three different datasets (RCV1, Reuters-21578, 20 Newsgroups) show that the proposed clustering- based method extracts more reliable negative examples than the baseline algorithms with very low error rates and implementing SVM active learning also improves the accuracy of classification significantly.