Survey on Data Management Technology for Machine Learning
Author: Cui Jianwei, Zhao Zhe, Du Xiaoyong
Affiliation:

Fund Project: National Natural Science Foundation of China (62072458)

    Abstract:

    Applications drive innovation. Database technology advances by supporting mainstream applications effectively and efficiently, and OLTP, OLAP, and today's machine learning modeling all follow this trend. Machine learning, which extracts knowledge and enables predictive analysis by building models over data, is the main approach of artificial intelligence. This survey studies the training process of machine learning from the perspective of data management, summarizes data management techniques for data selection, data storage, data access, automatic optimization, and system implementation, and analyzes the advantages and disadvantages of these techniques. Based on this analysis, key challenges for data management technology in support of machine learning are identified.
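
    To make the data storage and data access techniques named above concrete, the following minimal sketch (our own illustration, not code from the survey) writes a few toy training samples into the TFRecord format listed in reference [23] and streams them back for training through TensorFlow's tf.data API; the file name samples.tfrecord and the toy feature values are hypothetical.

        import tensorflow as tf

        # Serialize two toy (feature, label) pairs into a TFRecord file.
        with tf.io.TFRecordWriter("samples.tfrecord") as writer:
            for features, label in [([1.0, 2.0], 0), ([3.0, 4.0], 1)]:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "x": tf.train.Feature(float_list=tf.train.FloatList(value=features)),
                    "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                }))
                writer.write(example.SerializeToString())

        # Parse each serialized record back into fixed-length tensors.
        def parse(record):
            return tf.io.parse_single_example(record, {
                "x": tf.io.FixedLenFeature([2], tf.float32),
                "y": tf.io.FixedLenFeature([1], tf.int64),
            })

        # Stream the stored samples as mini-batches, as a training loop would.
        dataset = tf.data.TFRecordDataset("samples.tfrecord").map(parse).batch(2)
        for batch in dataset:
            print(batch["x"].numpy(), batch["y"].numpy())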

    References
    [1] Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. Imagenet:A large-scale hierarchical image database. In:Proc. of the 2009 IEEE Conf. on Computer Vision and Pattern Recognition. IEEE, 2009. 248-255.
    [2] Lian Z, Li Y, Tao J, Huang J. Improving speech emotion recognition via transformer-based predictive coding through transfer learning. arXiv preprint arXiv:1811. 2018.
    [3] Devlin J, Chang MW, Lee K, Toutanova K. Bert:Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
    [4] Du XY, Lu W, Zhang F. History, present, and future of big data management systems. Ruan Jian Xue Bao/Journal of Software, 2019,30(1):127-141 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5644.htm [doi:10.13328/j.cnki.jos.005644]
    [5] Strubell E, Ganesh A, McCallum A. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243. 2019.
    [6] Weizenbaum J. ELIZA-A computer program for the study of natural language communication between man and machine. Communications of the ACM, 1966,9(1):36-45.
    [7] SVM. 2020. https://en.wikipedia.org/wiki/Support_vector_machine
    [8] CRF. 2020. https://en.wikipedia.org/wiki/Conditional_random_field
    [9] LDA. 2020. https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
    [10] Liu TY. Learning to Rank for Information Retrieval. Springer Science & Business Media, 2011.
    [11] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
    [12] Boutsidis C, Drineas P, Magdon-Ismail M. Near-Optimal coresets for least-squares regression. IEEE Trans. on Information Theory, 2013,59(10):6880-6892.
    [13] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
    [14] Le Q, Mikolov T. Distributed representations of sentences and documents. In:Proc. of the Int'l Conf. on Machine Learning. 2014. 1188-1196.
    [15] Pu Y, Gan Z, Henao R, Yuan X, Li C, Stevens A, Carin L. Variational autoencoder for deep learning of images, labels and captions. In:Proc. of the Advances in Neural Information Processing Systems. 2016. 2352-2360.
    [16] Axelrod A, He X, Gao J. Domain adaptation via pseudo in-domain data selection. In:Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing. 2011. 355-362.
    [17] Moore RC, Lewis W. Intelligent selection of language model training data. In:Proc. of the ACL 2010 Conf. Short Papers. 2010. 220-224.
    [18] Perplexity. 2020. https://en.wikipedia.org/wiki/Perplexity
    [19] Chen B, Huang F. Semi-Supervised convolutional networks for translation adaptation with tiny amount of in-domain data. In:Proc. of the 20th SIGNLL Conf. on Computational Natural Language Learning. 2016. 314-323.
    [20] Active learning. 2020. https://en.wikipedia.org/wiki/Active_learning_(machine_learning)
    [21] Wei K, Iyer R, Bilmes J. Submodularity in data subset selection and active learning. In:Proc. of the Int'l Conf. on Machine Learning. 2015. 1954-1963.
    [22] Wang R, Utiyama M, Sumita E. Dynamic sentence sampling for efficient training of neural machine translation. arXiv preprint arXiv:1805.00178. 2018.
    [23] TFRecord. 2020. https://www.tensorflow.org/tutorials/load_data/tfrecord
    [24] Protobuf. 2020. https://developers.google.com/protocol-buffers
    [25] ONNX. 2020. https://en.wikipedia.org/wiki/Open_Neural_Network_Exchange
    [26] Vartak M, Subramanyam H, Lee WE, Viswanathan S, Husnoo S, Madden S, Zaharia M. ModelDB:A system for machine learning model management. In:Proc. of the Workshop on Human-in-the-loop Data Analytics. 2016. 1-3.
    [27] Zhang Z, Sparks ER, Franklin MJ. Diagnosing machine learning pipelines with fine-grained lineage. In:Proc. of the 26th Int'l Symp. on High-Performance Parallel and Distributed Computing. 2017. 143-153.
    [28] George L. HBase:The Definitive Guide:Random Access to Your Planet-Size Data. O'Reilly Media, Inc., 2011.
    [29] AWS S3. 2020. https://aws.amazon.com/s3/
    [30] Bhattacherjee S, Chavan A, Huang S, Deshpande A, Parameswaran A. Principles of dataset versioning:Exploring the recreation/storage tradeoff. Proc. of the VLDB Endowment, 2015,8(2):1346.
    [31] Bhardwaj A, Bhattacherjee S, Chavan A, Deshpande A, Elmore AJ, Madden S, Parameswaran AG. Datahub:Collaborative data science & dataset version management at scale. arXiv preprint arXiv:1409.0798. 2014.
    [32] Miao H, Li A, Davis LS, Deshpande A. Modelhub:Towards unified data and lifecycle management for deep learning. arXiv preprint arXiv:1611.06224. 2016.
    [33] Stonebraker M, Brown P, Poliakov A, Raman S. The architecture of SciDB. In:Proc. of the Int'l Conf. on Scientific and Statistical Database Management. Berlin, Heidelberg:Springer-Verlag, 2011. 1-16.
    [34] Snappy. 2020. https://en.wikipedia.org/wiki/Snappy_(compression)
    [35] Elgohary A, Boehm M, Haas PJ, Reiss FR, Reinwald B. Compressed linear algebra for large-scale machine learning. Proc. of the VLDB Endowment, 2016,9(12):960-971.
    [36] Run-Length_Encoding. 2020. https://en.wikipedia.org/wiki/Run-length_encoding
    [37] Li F, Chen L, Kumar A, Naughton JF, Patel JM, Wu X. When lempel-ziv-welch meets machine learning:A case study of accelerating machine learning using coding. arXiv preprint arXiv:1702.06943. 2017.
    [38] Tabei Y, Saigo H, Yamanishi Y, Puglisi SJ. Scalable partial least squares regression on grammar-compressed data matrices. In:Proc. of the 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2016. 1875-1884.
    [39] Zhang C, Kumar A, Ré C. Materialization optimizations for feature selection workloads. ACM Trans. on Database Systems (TODS), 2016,41(1):1-32.
    [40] Deshpande A, Madden S. MauveDB:Supporting model-based user views in database systems. In:Proc. of the 2006 ACM SIGMOD Int'l Conf. on Management of Data. 2006. 73-84.
    [41] Nikolic M, ElSeidy M, Koch C. LINVIEW:Incremental view maintenance for complex analytical queries. In:Proc. of the 2014 ACM SIGMOD Int'l Conf. on Management of Data. 2014. 253-264.
    [42] Anderson MR, Cafarella M. Input selection for fast feature engineering. In:Proc. of the 2016 IEEE 32nd Int'l Conf. on Data Engineering (ICDE). IEEE, 2016. 577-588.
    [43] Zhang Y, Munagala K, Yang J. Storing matrices on disk:Theory and practice revisited. Proc. of the VLDB Endowment, 2011,4(11):1075-1086.
    [44] Sparks ER, Venkataraman S, Kaftan T, Franklin MJ, Recht B. Keystoneml:Optimizing pipelines for large-scale advanced analytics. In:Proc. of the 2017 IEEE 33rd Int'l Conf. on Data Engineering (ICDE). IEEE, 2017. 535-546.
    [45] Transfer learning. 2020. https://en.wikipedia.org/wiki/Transfer_learning
    [46] Boehm M, Burdick DR, Evfimievski AV, Reinwald B, Reiss FR, Sen P, Tatikonda S, Tian Y. SystemML's optimizer:Plan generation for large-scale machine learning programs. IEEE Data Engineering Bulletin, 2014,37(3):52-62.
    [47] Sujeeth AK, Lee H, Brown KJ, Rompf T, Chafi H, Wu M, Atreya AR, Odersky M, Olukotun K. OptiML:An implicitly parallel domain-specific language for machine learning. In:Proc. of the ICML. 2011.
    [48] Kumar A, McCann R, Naughton J, Patel JM. Model selection management systems:The next frontier of advanced analytics. ACM SIGMOD Record, 2016,44(4):17-22.
    [49] Zoph B, Le QV. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. 2016.
    [50] Xie L, Yuille A. Genetic CNN. In:Proc. of the IEEE Int'l Conf. on Computer Vision. 2017. 1379-1388.
    [51] Baker B, Gupta O, Naik N, Raskar R. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. 2016.
    [52] Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In:Proc. of the Advances in Neural Information Processing Systems. 2014. 3320-3328.
    [53] Kumar A, Boehm M, Yang J. Data management in machine learning:Challenges, techniques, and systems. In:Proc. of the 2017 ACM Int'l Conf. on Management of Data. 2017. 1717-1722.
    [54] Cohen J, Dolan B, Dunlap M, Hellerstein JM, Welton C. MAD skills:New analysis practices for big data. Proc. of the VLDB Endowment, 2009,2(2):1481-1492.
    [55] Hellerstein J, Ré C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A. The MADlib analytics library or MAD skills, the SQL. arXiv preprint arXiv:1208.4165. 2012.
    [56] Feng X, Kumar A, Recht B, Ré C. Towards a unified architecture for in-RDBMS analytics. In:Proc. of the 2012 ACM SIGMOD Int'l Conf. on Management of Data. 2012. 325-336.
    [57] Cheng Y, Qin C, Rusu F. GLADE:Big data analytics made easy. In:Proc. of the 2012 ACM SIGMOD Int'l Conf. on Management of Data. 2012. 697-700.
    [58] Rusu F, Dobra A. GLADE:A scalable framework for efficient analytics. ACM SIGOPS Operating Systems Review, 2012,46(1):12-18.
    [59] Luo S, Gao ZJ, Gubanov M, Perez LL, Jermaine C. Scalable linear algebra on a relational database system. IEEE Trans. on Knowledge and Data Engineering, 2018,31(7):1224-1238.
    [60] Eigen. 2020. http://eigen.tuxfamily.org/
    [61] Mahajan D, Kim JK, Sacks J, Ardalan A, Kumar A, Esmaeilzadeh H. In-RDBMS hardware acceleration of advanced analytics. arXiv preprint arXiv:1801.06027. 2018.
    [62] Cai Z, Vagena Z, Perez L, Arumugam S, Haas PJ, Jermaine C. Simulation of database-valued Markov chains using SimSQL. In:Proc. of the 2013 ACM SIGMOD Int'l Conf. on Management of Data. 2013. 637-648.
    [63] Kara K, Eguro K, Zhang C, Alonso G. ColumnML:Column-store machine learning with on-the-fly data transformation. Proc. of the VLDB Endowment, 2018,12(4):348-361.
    [64] Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S. SystemML:Declarative machine learning on MapReduce. In:Proc. of the 2011 IEEE 27th Int'l Conf. on Data Engineering. IEEE, 2011. 231-242.
    [65] Boehm M, Dusenberry MW, Eriksson D, Evfimievski AV, Manshadi FM, Pansare N, Reinwald B, Reiss FR, Sen P, Surve AC, Tatikonda S. SystemML:Declarative machine learning on Spark. Proc. of the VLDB Endowment, 2016,9(13):1425-1436.
    [66] Huang B, Babu S, Yang J. Cumulon:Optimizing statistical data analysis in the cloud. In:Proc. of the 2013 ACM SIGMOD Int'l Conf. on Management of Data. 2013. 1-12.
    [67] Brown PG. Overview of SciDB:Large scale array storage, processing and analysis. In:Proc. of the 2010 ACM SIGMOD Int'l Conf. on Management of Data. 2010. 963-968.
    [68] Xin RS, Rosen J, Zaharia M, Franklin MJ, Shenker S, Stoica I. Shark:SQL and rich analytics at scale. In:Proc. of the 2013 ACM SIGMOD Int'l Conf. on Management of Data. 2013. 13-24.
    [69] Stonebraker M, Madden S, Dubey P. Intel "big data" science and technology center vision and execution plan. ACM SIGMOD Record, 2013,42(1):44-49.
    [70] Zhang Y, Zhang W, Yang J. I/O-Efficient statistical computing with RIOT. In:Proc. of the 2010 IEEE 26th Int'l Conf. on Data Engineering (ICDE 2010). IEEE, 2010. 1157-1160.
    [71] Zhou X, Chai C, Li G, Sun J. Database meets artificial intelligence:A survey. IEEE Trans. on Knowledge and Data Engineering, 2020.
    [72] Lee Y, Scolari A, Chun BG, Santambrogio MD, Weimer M, Interlandi M. PRETZEL:Opening the black box of machine learning prediction serving systems. In:Proc. of the 13th USENIX Symp. on Operating Systems Design and Implementation (OSDI 2018). 2018. 611-626.
    [73] Wang W, Wang S, Gao J, Zhang M, Chen G, Ng TK, Ooi BC. Rafiki:Machine learning as an analytics service system. arXiv preprint arXiv:1804.06087. 2018.
    [74] Smith MJ, Sala C, Kanter JM, Veeramachaneni K. The machine learning bazaar:Harnessing the ML ecosystem for effective system development. In:Proc. of the 2020 ACM SIGMOD Int'l Conf. on Management of Data. 2020. 785-800.
    Chinese reference:
    [4] 杜小勇,卢卫,张峰.大数据管理系统的历史、现状与未来.软件学报,2019,30(1):127-141. http://www.jos.org.cn/1000-9825/5644.htm [doi:10.13328/j.cnki.jos.005644]
Citation: Cui JW, Zhao Z, Du XY. Survey on data management technology for machine learning. Ruan Jian Xue Bao/Journal of Software, 2021,32(3):604-621 (in Chinese with English abstract).

History
  • Received: July 20, 2020
  • Revised: September 3, 2020
  • Online: January 21, 2021
  • Published: March 6, 2021