Semi-Supervised Web Mining Based on Bayes Latent Semantic Model
Authors: Gong Xiujun, Shi Zhongzhi
Funding: National Natural Science Foundation of China (60073019, 69803010)
    Abstract:

    With the growth of information on the Internet, Web mining has become one of the focal points of data mining research. Web page classification predicts the category of a page by learning from a large number of labeled training examples, and labeling those examples by hand is tedious and expensive. Web page clustering groups related pages together under some similarity measure. However, traditional clustering algorithms search the solution space blindly and lack semantic features. This paper proposes a two-stage semi-supervised text learning strategy. In the first stage, a Bayes latent semantic model is used to label the pages that contain latent class topic-word variables. In the second stage, building on the labels from the first stage, a Naive Bayes model labels the documents that do not contain latent class topic-word variables by means of the EM (expectation maximization) algorithm. Experimental results show that the algorithm achieves high precision and recall.
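
The two-stage idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: stage 1 is replaced by a simple stand-in that hard-labels any document containing words from exactly one class of a known topic-word map (the paper uses a Bayes latent semantic model for this step), and the names `train_nb`, `posterior`, `one_hot`, `two_stage_label`, `topic_words` and the toy documents are all hypothetical. Stage 2 is a multinomial Naive Bayes classifier with Laplace smoothing whose parameters are refined by EM over the documents that stage 1 left unlabeled.

```python
import math
from collections import Counter


def train_nb(docs, resp, vocab, n_classes, alpha=1.0):
    """M-step: estimate class priors and word likelihoods from soft labels."""
    priors, word_probs = [], []
    for c in range(n_classes):
        class_weight = sum(r[c] for r in resp)
        priors.append((class_weight + alpha) / (len(docs) + alpha * n_classes))
        counts = Counter()
        for doc, r in zip(docs, resp):
            for word in doc:
                counts[word] += r[c]  # fractional count weighted by P(c | doc)
        denom = sum(counts.values()) + alpha * len(vocab)
        word_probs.append({w: (counts[w] + alpha) / denom for w in vocab})
    return priors, word_probs


def posterior(doc, priors, word_probs):
    """E-step for one document: P(class | doc) under the current model."""
    logs = [math.log(p) + sum(math.log(wp[w]) for w in doc)
            for p, wp in zip(priors, word_probs)]
    top = max(logs)  # subtract the max before exponentiating, for stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, n_classes):
    return [1.0 if c == label else 0.0 for c in range(n_classes)]


def two_stage_label(docs, topic_words, n_iter=10):
    """topic_words: hypothetical map from a class-indicating word to a class index."""
    n_classes = max(topic_words.values()) + 1
    vocab = sorted({w for doc in docs for w in doc})

    # Stage 1 stand-in (assumption, not the paper's latent semantic model):
    # hard-label documents whose topic words all point to a single class.
    fixed = {}
    for i, doc in enumerate(docs):
        hits = {topic_words[w] for w in doc if w in topic_words}
        if len(hits) == 1:
            fixed[i] = hits.pop()

    # Initial Naive Bayes model built from the stage-1 labels only.
    seed_docs = [docs[i] for i in fixed]
    seed_resp = [one_hot(fixed[i], n_classes) for i in fixed]
    priors, word_probs = train_nb(seed_docs, seed_resp, vocab, n_classes)

    # Stage 2: EM. Stage-1 labels stay fixed; the rest get soft labels.
    for _ in range(n_iter):
        resp = [one_hot(fixed[i], n_classes) if i in fixed
                else posterior(doc, priors, word_probs)
                for i, doc in enumerate(docs)]
        priors, word_probs = train_nb(docs, resp, vocab, n_classes)

    labels = []
    for i, doc in enumerate(docs):
        if i in fixed:
            labels.append(fixed[i])
        else:
            post = posterior(doc, priors, word_probs)
            labels.append(max(range(n_classes), key=lambda c: post[c]))
    return labels


# Toy data (made up for illustration): two documents carry a topic word.
docs = [
    ["football", "match", "goal"],
    ["stock", "market", "price"],
    ["goal", "match", "team"],
    ["price", "market", "bank"],
]
topic_words = {"football": 0, "stock": 1}
labels = two_stage_label(docs, topic_words)
print(labels)  # [0, 1, 0, 1]
```

In this toy run the third and fourth documents contain no topic word, so they receive their labels only through the words they share with the stage-1 labeled documents, which is the role the abstract assigns to the second stage.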

Cite this article

Gong Xiujun, Shi Zhongzhi. Semi-supervised Web mining based on Bayes latent semantic model. Journal of Software, 2002, 13(8): 1508-1514

History
  • Received: 2001-06-04
  • Last revised: 2001-09-06