Overlapping Deep Web Data Source Selection Based on Stratified Sampling
Author:
Affiliation:

Clc Number:

Fund Project:

National Natural Science Foundation of China (61232002, 61202035); Science and Technology Support Program of Hubei Province (2015BAA127)

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Many Web applications, such as multimedia data integration and online business data aggregation, require deep Web querying to integrate information from many data sources on the Web. The success of such applications is largely determined by the efficiency and effectiveness of querying methods over relevant sources. Existing studies on multiple data source integration have focused on ranking the relevance of queries w.r.t data sources without considering the impact of overlap among the sources over data source selection, resulting in not only query processing overhead but also increased workloads on data sources. In order to improve query efficiency on overlapping data sources, this work proposes a tuple-level stratified sampling approach for overlapping data source selection. The approach has two stages: the offline stage and the online stage. In the offline stage, tuple-level stratified sampling is applied to obtain sample tuples. In the online stage, samples are used to estimate query coverage and overlap among multiple data sources. A heuristic method is also designed to discover data sources with low overlap. Experimental results show that the proposed approach is more efficient and effective than the state of the art methods for selecting overlapping data sources.

    Reference
    Related
    Cited by
Get Citation

江俊彦,彭智勇,吴小莹,彭承晨,王敏.基于分层抽样的重叠深网数据源选择.软件学报,2017,28(5):1271-1295

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:January 31,2016
  • Revised:April 21,2016
  • Adopted:
  • Online: August 06,2016
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063