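Efficient Collaborative Query Processing Technique Based on Declarative Inference
Authors: Qiu Zhilin (邱志林), Shou Lidan (寿黎但), Chen Ke (陈珂), Jiang Dawei (江大伟), Luo Xinyuan (骆歆远), Chen Gang (陈刚)
Author biographies: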

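Qiu Zhilin (邱志林, 1997-), male, M.S. Main research interest: optimization of in-database machine learning.
Shou Lidan (寿黎但, 1976-), male, Ph.D., professor, doctoral supervisor, CCF senior member. Main research interests: unstructured data management, mobile social media data management, multimedia mining.
Chen Ke (陈珂, 1977-), female, Ph.D., associate researcher, CCF professional member. Main research interests: unstructured data management, data mining, privacy protection.
Jiang Dawei (江大伟, 1982-), male, Ph.D., researcher, doctoral supervisor. Main research interests: distributed data management, cloud data management, big data management.
Luo Xinyuan (骆歆远, 1988-), male, Ph.D., assistant researcher. Main research interests: big data management, big data intelligent computing, information retrieval.
Chen Gang (陈刚, 1973-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member. Main research interests: databases, big data management systems, big data intelligent computing.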

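Corresponding author:

Chen Ke, E-mail: chenk@zju.edu.cn

CLC number:

TP311

Funding:

National Key Research and Development Program of China (2022YFB3304100); Fundamental Research Funds for the Central Universities (2021FZZX001-24)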


Abstract:

Advances in deep learning have driven growing interest in extending relational databases with collaborative query processing (CQP) techniques, so that a single system can answer advanced analytical queries over both structured and unstructured data. State-of-the-art CQP methods wrap deep neural network (NN) models in user-defined functions (UDFs) to process unstructured data and use relational operators to process structured data. UDF-based approaches simplify query writing, since users can submit an analytical query as a single SQL statement. However, in ad hoc data analysis they require users to manually select a suitable and efficient model for the desired performance metrics, which is a heavy burden. To address this problem, this work proposes a CQP technique based on declarative inference functions (DIFs) and builds a complete CQP framework that optimizes several aspects of query execution, including model selection, execution strategy, and device binding. Using the cost model and optimization rules designed in this work, the query processor estimates the cost of candidate query plans and automatically selects the optimal physical plan. Experiments on four datasets confirm the effectiveness and efficiency of the proposed DIF-based CQP approach.
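
The abstract does not spell out the cost model, so the following Python sketch is only a rough illustration of the kind of decision it describes. It enumerates hypothetical (model, device, batch size) combinations, discards models below a requested accuracy, and keeps the cheapest remaining plan. The `Model` fields, the per-row latencies, the overhead constants, and the cost formula are all assumptions made for this example; they are not the paper's cost model or optimization rules.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Optional


@dataclass(frozen=True)
class Model:
    """A candidate NN model (all numbers are hypothetical)."""
    name: str
    accuracy: float        # assumed validation accuracy in [0, 1]
    cpu_ms_per_row: float  # assumed inference latency per row on CPU
    gpu_ms_per_row: float  # assumed inference latency per row on GPU


@dataclass(frozen=True)
class Plan:
    """One physical choice: which model, on which device, in what batches."""
    model: str
    device: str
    batch_size: int
    cost_ms: float


# Assumed fixed overhead paid once per batch on each device.
BATCH_OVERHEAD_MS = {"cpu": 0.2, "gpu": 5.0}


def plan_cost(model: Model, device: str, batch_size: int, rows: int) -> float:
    """Toy cost: per-row compute plus a per-batch overhead."""
    batches = -(-rows // batch_size)  # ceiling division
    per_row = model.gpu_ms_per_row if device == "gpu" else model.cpu_ms_per_row
    return rows * per_row + batches * BATCH_OVERHEAD_MS[device]


def choose_plan(models: Iterable[Model], rows: int, min_accuracy: float,
                devices=("cpu", "gpu"), batch_sizes=(1, 64)) -> Optional[Plan]:
    """Return the cheapest plan whose model meets min_accuracy, or None."""
    best = None
    for model, device, batch in product(models, devices, batch_sizes):
        if model.accuracy < min_accuracy:
            continue  # the declared accuracy requirement rules this model out
        cost = plan_cost(model, device, batch, rows)
        if best is None or cost < best.cost_ms:
            best = Plan(model.name, device, batch, cost)
    return best


# Hypothetical model zoo for illustration only.
MODELS = [
    Model("small", accuracy=0.85, cpu_ms_per_row=1.0, gpu_ms_per_row=0.3),
    Model("large", accuracy=0.95, cpu_ms_per_row=8.0, gpu_ms_per_row=0.5),
]

if __name__ == "__main__":
    print(choose_plan(MODELS, rows=10, min_accuracy=0.9))
    print(choose_plan(MODELS, rows=2, min_accuracy=0.8))
```

In this toy setting the user states only a requirement (here, a minimum accuracy) and the selection logic picks the model, device, and batching. That is the division of labor the abstract attributes to declarative inference functions, in contrast to UDF-based queries where the user names a specific model.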

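Cite this article

Qiu Zhilin, Shou Lidan, Chen Ke, Jiang Dawei, Luo Xinyuan, Chen Gang. Efficient Collaborative Query Processing Technique Based on Declarative Inference. Journal of Software (软件学报), 2024, 35(12): 5558-5581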

History
  • Received: 2023-04-12
  • Last revised: 2023-06-05
  • Published online: 2024-01-17
  • Publication date: 2024-12-06