Classification of Deep Web Databases Based on the Context of Web Pages

Volume 19, Issue 2, 2008, pp. 267-274

Authors: MA Jun, SONG Ling, HAN Xiao-Hui, YAN Po

Abstract:

This paper discusses new techniques for improving the classification precision of deep Web databases: using the content text of the HTML pages that contain the database entry forms as context, and unifying the database attribute labels. An algorithm for locating the content text in an HTML page is developed based on multiple statistical characteristics of the page's text blocks. The unification step replaces attribute labels that are semantically close with a single representative label. The domain and language knowledge found in the learning samples is represented as hierarchical fuzzy sets, and a unification algorithm is proposed based on this representation. Building on this pre-computation, a k-NN (k nearest neighbors) algorithm is given for deep Web database classification, in which the semantic distance between two databases is calculated from both the distance between the content texts of their HTML pages and the distance between the database forms embedded in those pages. Classification experiments compare the algorithm with and without the pre-computation in terms of precision, recall, and F1.
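
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: a plain dictionary lookup (`REPRESENTATIVES`) stands in for the hierarchical fuzzy sets, cosine distance over word counts stands in for the paper's semantic distances, the weight `alpha` that mixes the two distances is hypothetical, and the toy training data is invented.

```python
from collections import Counter
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def unify(labels, representatives):
    """Replace each attribute label with its representative, if one is known.

    Assumption: a flat lookup table replaces the paper's hierarchical fuzzy sets.
    """
    return [representatives.get(label.lower(), label.lower()) for label in labels]


def database_distance(x, y, representatives, alpha=0.5):
    """Mix the page-text distance and the form-label distance (alpha is assumed)."""
    text_d = cosine_distance(Counter(x["text"].lower().split()),
                             Counter(y["text"].lower().split()))
    form_d = cosine_distance(Counter(unify(x["labels"], representatives)),
                             Counter(unify(y["labels"], representatives)))
    return alpha * text_d + (1 - alpha) * form_d


def knn_classify(query, training, representatives, k=3, alpha=0.5):
    """Return the majority domain among the k nearest training databases."""
    ranked = sorted(training,
                    key=lambda t: database_distance(query, t, representatives, alpha))
    votes = Counter(t["domain"] for t in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical label table and toy training set, invented for this sketch.
REPRESENTATIVES = {
    "writer": "author",
    "author name": "author",
    "departure city": "from",
    "leaving from": "from",
}

TRAINING = [
    {"text": "search books by title or author",
     "labels": ["Title", "Writer"], "domain": "books"},
    {"text": "find used books and textbooks",
     "labels": ["Author Name", "ISBN"], "domain": "books"},
    {"text": "book cheap flights and airline tickets",
     "labels": ["Departure City", "Destination"], "domain": "airfares"},
    {"text": "compare flights from many airlines",
     "labels": ["Leaving From", "Date"], "domain": "airfares"},
]

QUERY = {"text": "browse books from our catalog", "labels": ["Author", "Title"]}

print(knn_classify(QUERY, TRAINING, REPRESENTATIVES, k=3))  # prints "books"
```

In this toy run the unification step is what makes the first training form match the query exactly on labels ("Writer" and "Author" collapse to the same representative), which is the effect the pre-computation is meant to have on the form distance.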

Key words: deep Web; hidden Web; database classification; content text extraction; semantic classification

Citation:

MA Jun, SONG Ling, HAN Xiao-Hui, YAN Po. Classification of deep Web databases based on the context of Web pages. Journal of Software (软件学报), 2008, 19(2): 267-274.

History

Received:August 31,2007
Revised:November 19,2007
