Clustering-Based PU Active Text Classification Method
Authors: Liu Lu, Peng Tao, Zuo Wanli, Dai Yaokang

    Abstract:

    Text classification is a key technology in information retrieval. Two important problems in automatic text classification are collecting reliable negative examples and building effective, efficient classifiers. Most existing methods, however, collect only a small number of reliable negative examples, which keeps the classifiers from reaching high accuracy. This paper proposes a clustering-based method for automatic PU (positive and unlabeled) text classification, enhanced by SVM active learning. In contrast to traditional methods, the approach uses clustering and exploits the property that positive and negative examples should share as few words as possible. It finds more reliable negative examples by removing as many probable positive examples as possible from the unlabeled set. To build the classifier, the term weighting scheme TFIPNDF (term frequency inverse positive-negative document frequency, an improved TFIDF) is adopted. In addition, an improved Rocchio method, used in conjunction with SVM active learning, significantly improves classification performance. Experimental results on three datasets (RCV1, Reuters-21578, and 20 Newsgroups) show that the proposed clustering-based method extracts more reliable negative examples than the baseline algorithms, with very low error rates, and that applying SVM active learning also significantly improves classification accuracy.
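
    The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. The first part clusters the unlabeled documents and keeps as reliable negatives the clusters that share few words with the positive set. The second part picks the examples closest to an SVM decision boundary for labeling, which is the usual margin-based form of SVM active learning. The greedy leader clustering, the "word occurs in at least half of the positive documents" rule, the thresholds, and all function names are hypothetical choices made for illustration, not the paper's procedure. TFIPNDF and the improved Rocchio step are not shown because the abstract does not define them.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leader_cluster(docs, threshold):
    """Greedy single-pass clustering (an assumed stand-in for the paper's
    clustering step): a document joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, words in enumerate(docs):
        for cluster in clusters:
            if jaccard(words, docs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def reliable_negatives(positive, unlabeled, sim_threshold=0.2, max_overlap=0.1):
    """Return indices of unlabeled documents kept as reliable negatives.

    Clusters whose vocabulary overlaps the positive feature words by more
    than max_overlap are treated as probable positives and dropped.
    Both thresholds are illustrative assumptions.
    """
    pos_docs = [set(d.lower().split()) for d in positive]
    unl_docs = [set(d.lower().split()) for d in unlabeled]

    # Assumed rule: a positive feature word occurs in at least half
    # of the positive documents.
    counts = Counter(w for d in pos_docs for w in d)
    pos_words = {w for w, c in counts.items() if 2 * c >= len(pos_docs)}

    negatives = []
    for cluster in leader_cluster(unl_docs, sim_threshold):
        vocab = set().union(*(unl_docs[i] for i in cluster))
        if not vocab:
            continue  # empty documents carry no evidence either way
        overlap = len(vocab & pos_words) / len(vocab)
        if overlap <= max_overlap:
            negatives.extend(cluster)
    return sorted(negatives)


def most_uncertain(scores, k):
    """Margin-based query selection: indices of the k examples whose
    classifier decision scores are closest to zero."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:k]
```

    In a full pipeline the scores passed to `most_uncertain` would come from an SVM trained on the positive set plus the extracted negatives, and the selected documents would be labeled and added to the training set before retraining.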

Get Citation

Liu L, Peng T, Zuo WL, Dai YK. Clustering-based PU active text classification method. Journal of Software (软件学报), 2013, 24(11): 2571-2583.

History
  • Received: February 28, 2013
  • Revised: July 16, 2013
  • Online: November 1, 2013