Survey on Data Management Technology for Machine Learning
Authors:

CUI Jian-Wei (1986-), male, Ph.D. candidate, CCF student member; his research interests include deep learning and natural language processing.
ZHAO Zhe (1992-), male, Ph.D. candidate; his research interests include deep learning and natural language processing.
DU Xiao-Yong (1963-), male, Ph.D., professor, Ph.D. supervisor, CCF fellow; his research interests include databases and big data systems.

Corresponding author:

DU Xiao-Yong, E-mail: duyong@ruc.edu.cn

Fund project:

National Natural Science Foundation of China (62072458)




    Abstract:

    Applications drive innovation: database technology has advanced by supporting mainstream applications with better quality, lower cost, and higher efficiency, from OLTP and OLAP to today's online machine learning modeling. Machine learning, which extracts knowledge and enables predictive analysis by modeling data, is currently the main route by which artificial intelligence technology is put into practice. This work deconstructs and models the training process of machine learning from the perspective of data management, surveys the application of data management techniques to data selection, data storage, data access, automatic optimization, and system implementation, and analyzes their advantages and disadvantages. On this basis, it identifies several key technical challenges for data management technology in support of online machine learning.
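The decomposition named in the abstract (data selection, then storage, then access for training) can be illustrated with a minimal, hypothetical pipeline sketch. All function names, the scoring rule, and the serialization here are illustrative assumptions, not techniques taken from the survey itself; the byte-record stage merely stands in for record formats such as TFRecord.

```python
import random

# Illustrative sketch of the stages the survey decomposes:
# data selection -> data storage -> data access (batched iteration).

def select(corpus, score, budget):
    """Data selection: keep the `budget` highest-scoring examples."""
    return sorted(corpus, key=score, reverse=True)[:budget]

def store(examples):
    """Data storage: serialize examples into byte records
    (a stand-in for on-disk record formats such as TFRecord)."""
    return [repr(e).encode("utf-8") for e in examples]

def batches(records, batch_size):
    """Data access: iterate over shuffled mini-batches for training."""
    order = list(range(len(records)))
    random.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield [records[j] for j in order[i:i + batch_size]]

# Toy corpus of (text, score) pairs; the score is an arbitrary stand-in
# for a data-quality signal used during selection.
corpus = [("sample %d" % i, i % 7) for i in range(100)]
selected = select(corpus, score=lambda e: e[1], budget=32)
records = store(selected)
n_batches = sum(1 for _ in batches(records, batch_size=8))
print(len(selected), n_batches)  # 32 selected examples in 4 batches
```

Each stage is deliberately independent, mirroring how the survey treats selection, storage, and access as separately optimizable steps of the training pipeline.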

Cite this article:

Cui JW, Zhao Z, Du XY. Survey on data management technology for machine learning. Ruan Jian Xue Bao/Journal of Software, 2021,32(3):604-621 (in Chinese with English abstract).

History
  • Received: 2020-07-20
  • Revised: 2020-09-03
  • Published online: 2021-01-21
  • Published: 2021-03-06