A Web Site Representation and Mining Algorithm Using the Multiscale Tree Model
DOI:
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    With the exponential growth of both the amount and the diversity of the web information, web site mining is highly desirable for automatically discovering and classifying topic-specific web resources from the World Wide Web. Nevertheless, existing web site mining methods have not yet handled adequately how to make use of all the correlative contextual semantic clues and how to denoise the content of web sites effectually so as to obtain a better classification accuracy. This paper circumstantiates three issues to be solved for designing an effective and efficient web site mining algorithm, i.e., the sampling size, the analysis granularity, and the representation structure of web sites. On the basis, this paper proposes a novel multiscale tree representation model of web sites, and presents a multiscale web site mining approach that contains an HMT-based two-phase classification algorithm, a context-based interscale fusion algorithm, a two-stage text-based denoising procedure, and an entropy-base pruning strategy. The proposed model and algorithms may be used as a starting-point for further investigating some related issues of web sites, such as query optimization of multiple sites and web usage mining. Experiments also show that the approach achieves in average 16% improvement in classification accuracy and 34.5% reduction in processing time over the baseline system.

    Reference
    Related
    Cited by
Get Citation

田永鸿,黄铁军,高文.基于多粒度树模型的Web站点描述及挖掘算法.软件学报,2004,15(9):1393-1404

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:June 02,2003
  • Revised:July 08,2003
  • Adopted:
  • Online:
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063