高速流环境下近似连续k代表轮廓查询算法
作者:
作者单位:

作者简介:

朱睿(1982-),男,博士,副教授,CCF高级会员,主要研究领域为流数据管理,查询处理与优化;杨晓春(1973-),女,博士,教授,博士生导师,CCF杰出会员,主要研究领域为数据库理论与系统,文本与时序大数据管理;宋栿尧(1995-),男,硕士,主要研究领域为流数据管理;张安珍(1990-),女,博士,讲师,CCF专业会员,主要研究领域为大数据质量管理,近似查询处理库;王斌(1973-),男,博士,教授,CCF专业会员,主要研究领域为数据质量管理,文本数据管理;夏秀峰(1965-),男,博士,教授,CCF高级会员,主要研究领域为管理信息系统,数据库.

通讯作者:

王斌,wangbin@neu.edu.cn

中图分类号:

TP311

基金项目:

国家自然科学基金(62102271,62072088,61991404);国家重点研发计划(2020YFB1707901);沈阳市创新人才项目(RC200439)


Approximate Continuous k Representative Skyline Query Algorithm over High-Speed Streaming Data Environment
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    k代表轮廓查询是从传统轮廓查询中衍生出来的一类查询.给定多维数据集合D,轮廓查询从D中找到所有不被其他对象支配的对象,将其返回给用户,便于用户结合自身偏好选择高质量对象.然而,轮廓对象规模通常较大,用户需要从大量数据中进行选择,导致选择速度和质量无法得到保证.与传统轮廓查询相比,k代表轮廓查询从所有轮廓对象中选择“代表性”最强的k个对象返回给用户,有效地解决了传统轮廓查询存在的这一问题.给定滑动窗口W和连续查询qq监听窗口中的数据.当窗口滑动时,查询q返回窗口中,组合支配面积最大的k个对象.现有算法的核心思想是:实时监测当前窗口中的轮廓对象集合,当轮廓对象集合更新时,算法更新k代表轮廓.然而,实时监测窗口中,轮廓集合的计算代价通常较大.此外,当轮廓集合规模较大时,从中选择k代表轮廓的计算代价是同样巨大的,导致已有算法无法在高速流环境下使用.针对上述问题,提出了r-近似k代表轮廓查询.为了支持该查询,提出了查询处理框架PAKRS (predict-based approximate k representative skyline).首先,PAKRS利用高速流的特性对当前窗口进行划分,根据划分结果构建未来窗口预测结果集,用其预测新流入窗口数据成为轮廓对象的最早时间.其次,提出了索引r-GRID.它帮助PAKRS在2维和d维(d>2)环境下,分别以O (k/s+k/m)和O (2Ld/m+2Ld/s)的增量维护代价下筛选近似k代表轮廓,L是一个小于k的正整数.由理论分析证明可知,PAKRS的计算复杂度小于前人所提的算法计算复杂度.最后,通过大量实验对所提算法性能进行评估.结果表明,PAKRS的运行时间是PBA (prefix-based algorithm)算法的1/4、GA (greedy algorithm)算法的1/6、e-GA (e-constraint greedy algorithm)算法的1/3.

    Abstract:

    k representative skyline query is a type of query derived from traditional skyline query. Given a set of d-dimensional dataset D, a skyline query finds all objects in D that are not dominated by other ones, which helps users to select high-quality objects based on their preference. However, the scale of skyline objects may be large in many cases, users have to choose target objects from a large number of objects, leading that both the selection speed and quality cannot be guaranteed. Compared with traditional skyline query, k representative skyline query chooses the most "representative" k objects from all skyline objects, which effectively solves such problem causes by traditional skyline query. Given the sliding window W and a continuous query q, q monitors objects in the window. When the window slides, q returns k skyline objects with the largest group dominance size in the window. The key behind existing algorithms is to monitor skyline objects in the current window. When the skyline set is updated, the algorithm updates k representative skyline set. However, the cost of monitoring skyline set is usually high. When the skyline set scale is large, the computational cost of choosing k representative skyline objects is also high. Thus, existing algorithms cannot efficiently work under high-speed stream environment. This study proposes a query named r-approximate k representative skyline query. In order to support this type of queries, a novel framework is proposed named PAKRS (predict-based approximate k representative skyline). Firstly, PAKRS partitions the current window into a group of sub-windows. Next, the predicted result sets of a few future windows are constructed according to the partition result. In this way, the earliest moment can be predicted when new arrived objects may become skyline objects. Secondly, an index is proposed named r-GRID, which can help PAKRS to select r-approximate k representative skyline with O(k/s+k/m) computational cost under 2-dimensional space, and O(2Ld/m+2Ld/s) computational cost under d-dimensional space (d>2), where L is a little integer smaller than k. Theoretical analysis shows that the computational complexity of PAKRS is lower than the state-of-the-art efforts. Extensive experiments have been conducted to confirm the efficiency and effectiveness of the proposed algorithms. Experimental results show that the running time of PAKRS is about 1/4 times of PBA (prefix-based algorithm), algorithm 1/6 times of GA (greedy algorithm) and about 1/3 times of e-GA (e-constraint greedy algorithm).

    参考文献
    相似文献
    引证文献
引用本文

朱睿,宋栿尧,王斌,杨晓春,张安珍,夏秀峰.高速流环境下近似连续k代表轮廓查询算法.软件学报,2023,34(3):1425-1450

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2021-11-25
  • 最后修改日期:2022-04-27
  • 录用日期:
  • 在线发布日期: 2023-03-10
  • 出版日期: 2023-03-06
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号