    Abstract:

    This paper discusses new techniques for improving the classification precision of deep Web databases: using the content text of the HTML page that contains a database's entry form as context, and unifying database attribute labels. An algorithm for locating the content text in an HTML page is developed based on multiple statistical characteristics of the page's text blocks. Unification replaces semantically close attribute labels with a common delegate label. The domain and language knowledge found in the learning samples is represented as hierarchical fuzzy sets, and a unification algorithm is proposed on top of this representation. Building on this pre-computation, a k-NN (k nearest neighbors) algorithm is given for deep Web database classification, in which the semantic distance between two databases is calculated from both the distance between the content texts of their HTML pages and the distance between the database forms embedded in those pages. Classification experiments compare the algorithm with and without the pre-computation in terms of precision, recall, and F1.
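
    The combined distance and the k-NN vote described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the actual distance functions, so cosine distance over word counts, Jaccard distance over unified label sets, the weight `alpha`, and the flat `DELEGATES` table (standing in for the hierarchical fuzzy sets) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical delegate table: maps an attribute label to its delegate.
# The paper derives this knowledge from hierarchical fuzzy sets; a flat
# dictionary is used here only to keep the sketch short.
DELEGATES = {
    "writer": "author",
    "isbn number": "isbn",
    "departure city": "from",
    "leaving from": "from",
}

def unify(labels):
    """Replace each label with its delegate, or keep it if none is known."""
    return {DELEGATES.get(label.lower(), label.lower()) for label in labels}

def text_distance(text_a, text_b):
    """Cosine distance between word-count vectors (assumed measure)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def form_distance(labels_a, labels_b):
    """Jaccard distance between unified label sets (assumed measure)."""
    a, b = unify(labels_a), unify(labels_b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def semantic_distance(db_a, db_b, alpha=0.5):
    """Weighted mix of page-text distance and form distance."""
    return (alpha * text_distance(db_a["text"], db_b["text"])
            + (1.0 - alpha) * form_distance(db_a["labels"], db_b["labels"]))

def knn_classify(query, training, k=3, alpha=0.5):
    """Return the majority domain among the k nearest training databases."""
    nearest = sorted(training,
                     key=lambda d: semantic_distance(query, d, alpha))[:k]
    return Counter(d["domain"] for d in nearest).most_common(1)[0][0]

# Toy training set; texts, labels, and domains are invented for illustration.
TRAINING = [
    {"text": "search books by title and author",
     "labels": ["Title", "Author", "ISBN"], "domain": "books"},
    {"text": "find used books and textbooks",
     "labels": ["Title", "Writer"], "domain": "books"},
    {"text": "cheap flights and airline tickets",
     "labels": ["Departure city", "Destination"], "domain": "airfares"},
    {"text": "book flights online",
     "labels": ["Leaving from", "Going to"], "domain": "airfares"},
]

query = {"text": "browse new books", "labels": ["Writer", "ISBN number"]}
predicted = knn_classify(query, TRAINING, k=3)
```

    In this toy run the query's labels "Writer" and "ISBN number" only match the training forms after unification maps them to "author" and "isbn", which is the effect the pre-computation step is meant to have; `predicted` comes out as "books".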

Get Citation

Ma J, Song L, Han XH, Yan P. Deep Web database classification based on Web page context. Journal of Software, 2008,19(2):267-274 (in Chinese).

History
  • Received: August 31, 2007
  • Revised: November 19, 2007