





In-database AI Model Optimization
Fund Project: National Natural Science Foundation of China (61925205, 61632016)




    In large volumes of changing data, data analysts often care only about the small fraction of data that receives a specific prediction result. However, users must first retrieve all the data with SQL before the inference step, even though most of it will then be discarded, because machine learning libraries generally assume that the data is organized in a single table. This study points out that if hints can be obtained from the model in advance, unnecessary data can be eliminated early, in the data acquisition phase, which reduces the cost of multi-table joins, inter-process communication, and model prediction. This work takes one specific kind of machine learning model, the decision tree, as an example. First, a pre-filtering and validation execution workflow is proposed. Then, an offline algorithm is used to extract pre-filtering predicates from the decision tree. Finally, the algorithm is evaluated on a real-world dataset. Experiments show that the proposed method accelerates the execution of SQL queries that contain predicates on decision tree prediction results.
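    The pre-filtering and validation idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the tree representation (`Leaf`, `Split`), the function names, and the `users`/`incomes` data are all hypothetical, and the extraction shown is only one simple way to obtain a necessary condition from a decision tree. Each leaf carrying the target label yields a box of per-column intervals; restricting those boxes to the columns of one table gives a predicate that can only discard rows the model would never label with the target.

```python
import math
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str

@dataclass
class Split:
    feature: str       # column name tested at this node
    threshold: float   # go left if row[feature] <= threshold, else right
    left: "Node"
    right: "Node"

Node = Union[Leaf, Split]

def predict(node, row):
    """Ordinary decision tree inference on a complete row."""
    while isinstance(node, Split):
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.label

def target_boxes(node, target, bounds=None):
    """Offline step: one {column: (low, high]} box per leaf labelled `target`."""
    bounds = bounds or {}
    if isinstance(node, Leaf):
        return [dict(bounds)] if node.label == target else []
    lo, hi = bounds.get(node.feature, (-math.inf, math.inf))
    t = node.threshold
    boxes = []
    if lo < t:  # left branch (x <= t) is reachable
        boxes += target_boxes(node.left, target,
                              {**bounds, node.feature: (lo, min(hi, t))})
    if t < hi:  # right branch (x > t) is reachable
        boxes += target_boxes(node.right, target,
                              {**bounds, node.feature: (max(lo, t), hi)})
    return boxes

def may_match(boxes, partial_row):
    """Pre-filter: a necessary condition using only the columns present."""
    return any(
        all(lo < partial_row[f] <= hi
            for f, (lo, hi) in box.items() if f in partial_row)
        for box in boxes
    )

def to_sql(boxes, columns):
    """Render the same pre-filter over `columns` as a SQL boolean expression."""
    clauses = []
    for box in boxes:
        terms = []
        for f, (lo, hi) in box.items():
            if f not in columns:
                continue
            if lo != -math.inf:
                terms.append(f"{f} > {lo}")
            if hi != math.inf:
                terms.append(f"{f} <= {hi}")
        if not terms:
            return "TRUE"  # some target path does not constrain these columns
        clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(clauses) if clauses else "FALSE"

# Hypothetical model: predicts "yes" only when age > 30 and income > 50000.
TREE = Split("age", 30,
             Leaf("no"),
             Split("income", 50000, Leaf("no"), Leaf("yes")))
BOXES = target_boxes(TREE, "yes")

# Pre-filter a hypothetical users table on age alone, before any join.
users = [{"id": 1, "age": 25}, {"id": 2, "age": 40}, {"id": 3, "age": 52}]
incomes = {1: 90000, 2: 42000, 3: 61000}
survivors = [u for u in users if may_match(BOXES, u)]

# Validation: join only the survivors, then run the real model on full rows.
hits = [u["id"] for u in survivors
        if predict(TREE, {**u, "income": incomes[u["id"]]}) == "yes"]
```

    In this sketch the pre-filter never changes the query result, because the full model is still applied to every surviving row; it only reduces how many rows reach the join and the prediction step.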



  • Received: 2020-07-20
  • Last revised: 2020-09-03
  • Published online: 2021-01-21
  • Publication date: 2021-03-06