Survey on Data Management Technology for Machine Learning
Authors:

CUI Jian-Wei (1986-), male, Ph.D. candidate, CCF student member; his research interests include deep learning and natural language processing.
ZHAO Zhe (1992-), male, Ph.D. candidate; his research interests include deep learning and natural language processing.
DU Xiao-Yong (1963-), male, Ph.D., professor, Ph.D. supervisor, CCF fellow; his research interests include databases and big data systems.

Corresponding author:

DU Xiao-Yong, E-mail: duyong@ruc.edu.cn

Fund project:

National Natural Science Foundation of China (62072458)




    Abstract:

    Applications drive innovation: database technology has advanced by supporting mainstream applications with better quality, lower cost, and higher efficiency, from OLTP and OLAP to today's online machine learning modeling. Machine learning, which extracts knowledge and enables predictive analysis by modeling data, is currently the main route by which artificial intelligence technology is put into practice. This work deconstructs and models the training process of machine learning from the perspective of data management, surveys the application of data management techniques to data selection, data storage, data access, automatic optimization, and system implementation, and analyzes their advantages and disadvantages. On this basis, it identifies several key technical challenges for data management technology in support of online machine learning.
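The decomposition named in the abstract (data selection, then storage, then access for training) can be illustrated with a minimal, hypothetical pipeline sketch. All function names, the scoring rule, and the serialization here are illustrative assumptions, not techniques taken from the survey itself; the byte-record stage merely stands in for record formats such as TFRecord.

```python
import random

# Illustrative sketch of the stages the survey decomposes:
# data selection -> data storage -> data access (batched iteration).

def select(corpus, score, budget):
    """Data selection: keep the `budget` highest-scoring examples."""
    return sorted(corpus, key=score, reverse=True)[:budget]

def store(examples):
    """Data storage: serialize examples into byte records
    (a stand-in for on-disk record formats such as TFRecord)."""
    return [repr(e).encode("utf-8") for e in examples]

def batches(records, batch_size):
    """Data access: iterate over shuffled mini-batches for training."""
    order = list(range(len(records)))
    random.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield [records[j] for j in order[i:i + batch_size]]

# Toy corpus of (text, score) pairs; the score is an arbitrary stand-in
# for a data-quality signal used during selection.
corpus = [("sample %d" % i, i % 7) for i in range(100)]
selected = select(corpus, score=lambda e: e[1], budget=32)
records = store(selected)
n_batches = sum(1 for _ in batches(records, batch_size=8))
print(len(selected), n_batches)  # 32 selected examples in 4 batches
```

Each stage is deliberately independent, mirroring how the survey treats selection, storage, and access as separately optimizable steps of the training pipeline.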

Cite this article:

Cui JW, Zhao Z, Du XY. Survey on data management technology for machine learning. Ruan Jian Xue Bao/Journal of Software, 2021,32(3):604-621 (in Chinese with English abstract).

History
  • Received: 2020-07-20
  • Revised: 2020-09-03
  • Published online: 2021-01-21
  • Published: 2021-03-06