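[Keywords]
[Abstract]
As the information on the WWW grows ever richer, the need for efficient information gathering (IG) tools is becoming increasingly urgent. Because network resources are very expensive, information gathering is a resource-bounded task. The main goal is to design an efficient information gathering method for a specific domain. A method is proposed for inferring a page's content without downloading the page, different control strategies are designed, and several quantitative measures of page download priority are defined. An information gathering system, TH-Gatherer, was built, and a range of experiments was carried out to test the method. The experiments show that the content of candidate pages can be approximately inferred without actually downloading them, and that priority-based gathering with a hybrid measure is more than 4 times higher in gathering efficiency than the breadth-first method commonly used by many current information gathering tools (including crawlers and offline browsing tools). The method is suitable for domain-specific information gathering under resource-bounded conditions.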
[Keywords]
[Abstract]
With the information available through the World Wide Web becoming overwhelming, efficient information gathering (IG) tools are necessary. Since network resources are expensive, IG is a resource-bounded task. The main purpose of this paper is to find an efficient gathering method for a specific topic. This paper presents methods for predicting a page's content without downloading it, designs different controlling strategies, and defines several kinds of page downloading priority measures. An IG system, TH-Gatherer, was built to test the methods, and different experiments were carried out. The experiments show that the content of candidate pages can be predicted approximately without downloading them. When the priority-based gathering strategy and the hybrid measure are used, the gathering efficiency is four times that of the breadth-first search (BFS) strategy, which is used by many current IG tools (including crawlers and off-line browsing tools). The method presented in this paper is suitable for resource-bounded, topic-specific information gathering.
[CLC number]
[Funding]
This project is supported by the Foundation of IBM China Research Laboratory (IBM中国研究中心基金).