Using Classifiers to Find Domain-Specific Online Databases Automatically
Authors: Wang Hui, Liu Yanwei, Zuo Wanli
Funding:

Supported by the National Natural Science Foundation of China under Grant No. 60373099 and the Science and Technology Development Program of Jilin Province of China under Grant No. 20070533.

    Abstract:

    In the deep (hidden) Web domain, general-purpose search engines (e.g., Google and Yahoo) have notable shortcomings. Each covers less than one-third of the total data in the deep Web, and, unlike on the surface Web, combining several search engines leaves the amount of data covered roughly unchanged. Many deep Web sites provide large amounts of high-quality information, and the deep Web is becoming one of the most important information sources. This paper proposes a three-classifier framework to automatically identify domain-specific deep Web entries. Once the query interfaces are obtained, they can be integrated into a unified interface that is presented to users for querying. Eight groups of large-scale experiments demonstrate that the proposed method finds domain-specific deep Web entries accurately and efficiently.

Cite this article

Wang H, Liu YW, Zuo WL. Using classifiers to find domain-specific online databases automatically. Journal of Software, 2008,19(2):246-256

History
  • Received: 2007-08-02
  • Last revised: 2007-11-06