Incremental Optimization Method for Periodic Query in Data Warehouse

Fund Project:

National High-Tech R&D Program of China (863) (2015AA011505); National Natural Science Foundation of China (61303053, 61402445, 61402303, 61521092)

    Abstract:

    Analytical queries are an important means of extracting value from big data in a data warehouse. To keep track of how analysis results change, the same query must be executed periodically, which inevitably repeats computation on historical data. Existing incremental analysis techniques reduce this redundant computation by reusing intermediate results computed over historical data, but they have three shortcomings: 1) they are not transparent to users; 2) their choice of where to store and reuse historical results is not intelligent; and 3) their optimization gains for periodic incremental queries are therefore limited. Aiming at both user transparency and optimization gain, this paper designs an incremental optimization method guided by semantic rules. The method extends the query grammar so that increments can be described. It first chooses where to store and merge historical results according to the operational semantics and output semantics of the query operators, and then adjusts those positions according to a cost model and the points at which the plan is divided into physical query tasks. Finally, it generates a DAG of optimized physical query tasks that can be scheduled periodically on a distributed computing framework such as MapReduce. A prototype, HiveInc, is implemented on top of Apache Hive. Experiments on the TPC-H benchmark, extended with the incremental grammar, show that HiveInc achieves an average speed-up of 2.93x and a maximum speed-up of 5.78x over unoptimized execution, and speed-ups of 1.69x and 1.61x over the classical optimization techniques IncMR and DryadInc, respectively.
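The core idea of reusing intermediate results over historical data can be illustrated with a small sketch. This is not HiveInc's implementation; it is a minimal example, assuming a periodic "average value per key" query, and all function names (`partial_aggregate`, `merge`, `averages`) are hypothetical. It also hints at why the storage position matters: the stored state is the (sum, count) pair produced before the final division, because finished averages cannot be merged with new data.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Aggregate (key, value) rows into {key: [sum, count]}."""
    state = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        state[key][0] += value
        state[key][1] += 1
    return dict(state)


def merge(history, delta):
    """Merge the stored historical state with the state of the new batch."""
    merged = {key: list(pair) for key, pair in history.items()}
    for key, (total, count) in delta.items():
        if key in merged:
            merged[key][0] += total
            merged[key][1] += count
        else:
            merged[key] = [total, count]
    return merged


def averages(state):
    """Final operator: turn mergeable (sum, count) pairs into averages."""
    return {key: total / count for key, (total, count) in state.items()}


# Hypothetical data: rows seen in earlier periods and rows newly arrived.
history_rows = [("A", 10.0), ("B", 4.0), ("A", 20.0)]
delta_rows = [("A", 30.0), ("C", 6.0)]

# Stored once after the previous run; only the delta is scanned this period.
stored = partial_aggregate(history_rows)
incremental = averages(merge(stored, partial_aggregate(delta_rows)))

# Baseline: recompute everything from scratch.
full = averages(partial_aggregate(history_rows + delta_rows))
```

Both paths yield the same averages, but the incremental path touches only the new rows plus a small stored state.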

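Cite this article

Kang Yanli, Li Feng, Wang Lei. Incremental Optimization Method for Periodic Query in Data Warehouse. Journal of Software (软件学报), 2017, 28(8): 2126-2147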

History
  • Received: 2016-03-31
  • Last revised: 2016-05-12
  • Published online: 2017-08-15