In-database AI Model Optimization

Authors:

NIU Ze-Ping (1997-), male, Ph.D. candidate, CCF student member; his research interests lie at the intersection of databases and machine learning.
LI Guo-Liang (1981-), male, Ph.D., professor, Ph.D. supervisor, CCF distinguished member; his research interests include databases, big data analytics and mining, and crowd computing.

Corresponding author:

LI Guo-Liang, E-mail: liguoliang@tsinghua.edu.cn

Fund project:

National Natural Science Foundation of China (61925205, 61632016)




    Abstract:

    Among large volumes of changing data, data analysts often care only about the small subset whose prediction results take specific values. However, because machine learning libraries assume that data are organized in a single table, users must first retrieve all the data via SQL before the inference step, even though much of it will be discarded during model inference. This study points out that if information can be extracted from the model in advance, unnecessary data can be quickly eliminated during the data-acquisition phase, reducing the costs of multi-table joins, inter-process communication, and model prediction, and thereby accelerating the whole workflow. Taking the decision tree as an example, a pre-filtering plus validation execution method is first proposed to optimize query processing; an offline algorithm is then given for extracting pre-filtering predicates from the decision tree; finally, the method is evaluated on a real-world dataset. Experiments show that the proposed method effectively accelerates SQL queries that contain predicates on decision tree prediction results.
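The pre-filtering idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's offline extraction algorithm: it assumes a trained scikit-learn DecisionTreeClassifier and uses hypothetical column names (a, b) and a hypothetical table name (data). Each root-to-leaf path whose leaf predicts the target class is turned into a conjunction of range predicates, and the union of those conjunctions becomes a SQL WHERE clause that excludes rows which cannot yield the target prediction before any model inference runs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_predicates(tree, feature_names, target_class):
    """Return a SQL-style filter matching rows the tree predicts as target_class."""
    t = tree.tree_
    target_idx = list(tree.classes_).index(target_class)
    clauses = []

    def walk(node, conds):
        if t.children_left[node] == -1:  # leaf node
            # keep the path only if this leaf's majority class is the target
            if np.argmax(t.value[node]) == target_idx:
                clauses.append(" AND ".join(conds) if conds else "TRUE")
            return
        name = feature_names[t.feature[node]]
        thr = t.threshold[node]
        # left child takes feature <= threshold, right child takes feature > threshold
        walk(t.children_left[node], conds + [f"{name} <= {thr:.6g}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.6g}"])

    walk(0, [])
    return " OR ".join(f"({c})" for c in clauses)

# Toy example: the label depends only on column "a".
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
where = extract_predicates(clf, ["a", "b"], target_class=1)
print("SELECT * FROM data WHERE " + where)  # e.g. SELECT * FROM data WHERE (a > 0.5)
```

For a single tree the union of target-class paths is exact, so the filter alone suffices; the paper's pre-filtering plus validation workflow becomes necessary once the extracted predicates are simplified into a cheaper over-approximation, with the model then validating the surviving rows.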

Cite this article:

Niu ZP, Li GL. In-database AI model optimization. Ruan Jian Xue Bao/Journal of Software, 2021, 32(3): 622-635 (in Chinese with English abstract).
History
  • Received: 2020-07-20
  • Revised: 2020-09-03
  • Published online: 2021-01-21
  • Published in print: 2021-03-06