    Abstract:

    This paper proposes an information-theoretic approach to latent concept extraction and text clustering that models the fuzzy relations among words, latent concepts, texts, and topics. A latent concept variable and a topic variable are introduced to capture these relations, and a global objective function is defined in the framework of rate-distortion theory. An annealing-like algorithm extracts a hierarchical tree of latent concepts and, at the same time, groups the texts under the corresponding concept hierarchy. The number of concepts and the final text clustering are then determined by a concept selection method based on the minimum description length criterion. The approach is a soft co-clustering method; in experiments it outperforms clustering methods based on the word space as well as existing hard co-clustering of texts based on latent concepts.
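The abstract does not spell out the objective function or the update rules, so the following is only a rough sketch of the general family of techniques it names: soft clustering of texts by their word distributions, using information-bottleneck-style self-consistent updates under an increasing inverse temperature. It is not the authors' algorithm. It clusters texts only, with no latent concept variable, no concept hierarchy, and no description-length-based selection of the number of clusters. The names `soft_cluster`, `kl`, `betas`, and `eps`, as well as the schedule values, are assumptions made for illustration.

```python
import math
import random


def kl(p, q):
    """KL divergence D(p || q) in nats; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_cluster(docs, k, betas=(2.0, 5.0, 10.0, 20.0), iters=50, seed=0):
    """Softly assign documents to k clusters.

    docs  : list of word distributions p(w|d), each summing to 1
    betas : increasing inverse temperatures (an assumed schedule)
    Returns p(t|d) as a list of rows, one row per document.
    """
    rng = random.Random(seed)
    n, v = len(docs), len(docs[0])
    p_d = [1.0 / n] * n          # uniform prior over documents
    eps = 1e-12                  # smoothing so logs and ratios stay finite

    # Random soft initialisation of p(t|d).
    assign = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        assign.append([x / total for x in row])

    for beta in betas:
        for _ in range(iters):
            # Cluster prior p(t) = sum_d p(d) p(t|d).
            p_t = [sum(p_d[d] * assign[d][t] for d in range(n))
                   for t in range(k)]

            # Cluster word distributions p(w|t), smoothed and normalised.
            centroids = []
            for t in range(k):
                row = [sum(p_d[d] * assign[d][t] * docs[d][w]
                           for d in range(n)) + eps
                       for w in range(v)]
                total = sum(row)
                centroids.append([x / total for x in row])

            # p(t|d) proportional to p(t) * exp(-beta * KL(p(w|d) || p(w|t))).
            for d in range(n):
                logits = [math.log(p_t[t] + eps) - beta * kl(docs[d], centroids[t])
                          for t in range(k)]
                top = max(logits)
                weights = [math.exp(x - top) for x in logits]
                total = sum(weights)
                assign[d] = [x / total for x in weights]
    return assign


# Toy input: two texts concentrated on words 0-1, two on words 2-3.
toy_docs = [
    [0.5, 0.5, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.4, 0.6],
]
memberships = soft_cluster(toy_docs, k=2)
```

Each row of `memberships` is a probability distribution over clusters, which is what makes the assignment soft; raising the inverse temperature step by step sharpens those distributions, in the spirit of the deterministic annealing scheme of reference [14].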

Get Citation

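Li XG, Yu G, Wang DL, Bao YB. Latent concept extraction and text clustering based on information theory. Journal of Software, 2008,19(9):2276-2284 (in Chinese).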

History
  • Received: December 28, 2006
  • Revised: August 03, 2007