Abstract: Many Web applications, such as multimedia data integration and online business data aggregation, require deep Web querying to integrate information from many data sources on the Web. The success of such applications is largely determined by the efficiency and effectiveness of the methods used to query the relevant sources. Existing studies on multiple data source integration have focused on ranking data sources by their relevance to a query, without considering how overlap among the sources affects data source selection. This results in both query processing overhead and increased workloads on the data sources. To improve query efficiency on overlapping data sources, this work proposes a tuple-level stratified sampling approach for overlapping data source selection. The approach has two stages, an offline stage and an online stage. In the offline stage, tuple-level stratified sampling is applied to obtain sample tuples. In the online stage, the samples are used to estimate query coverage and overlap among multiple data sources. A heuristic method is also designed to discover data sources with low overlap. Experimental results show that the proposed approach is more efficient and effective than state-of-the-art methods for selecting overlapping data sources.
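To make the two-stage idea concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that tuples are hashable values, that each source can be partitioned by a caller-supplied stratum function, and that a greedy marginal-gain rule stands in for the heuristic that finds low-overlap sources. The names `stratified_sample`, `estimate_coverage`, and `greedy_select` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(tuples, stratum_of, rate, seed=0):
    """Offline stage (sketch): sample each stratum at the same rate."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in tuples:
        strata[stratum_of(t)].append(t)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional allocation, with at least one tuple per stratum.
        k = min(len(members), max(1, round(rate * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


def estimate_coverage(sample, source_size, predicate):
    """Online stage (sketch): scale the matching sample fraction
    up to the full source size to estimate query coverage."""
    if not sample:
        return 0.0
    hits = sum(1 for t in sample if predicate(t))
    return source_size * hits / len(sample)


def greedy_select(samples, predicate, budget):
    """Assumed heuristic: repeatedly pick the source whose sample adds
    the most not-yet-covered matching tuples, so that sources which
    mostly duplicate earlier picks are skipped."""
    matched = {name: {t for t in s if predicate(t)}
               for name, s in samples.items()}
    chosen, covered = [], set()
    while len(chosen) < budget:
        candidates = [n for n in matched if n not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda n: len(matched[n] - covered))
        if not matched[best] - covered:
            break  # every remaining source only repeats covered tuples
        chosen.append(best)
        covered |= matched[best]
    return chosen
```

In this sketch, overlap is measured only on the sampled tuples, so the quality of the selection depends on the sampling rate and on how the strata are chosen.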