Research Progress on Distributed Matrix Computation Systems for Big Data Analysis
Author: Chen ZH, Xu C, Qian WN, Zhou AY
Affiliation:

    Abstract:

    As an essential part of big data governance applications, data analysis is time-consuming and demands substantial hardware resources, so optimizing its execution efficiency is essential. In the past, data analysts could run analysis algorithms with traditional matrix computation tools. With the explosive growth of data volume, however, these traditional tools can no longer meet the performance requirements of applications, and distributed matrix computation systems for big data analysis have emerged in response. This study reviews the progress of distributed matrix computation systems from both a technical and a system perspective. First, it analyzes the challenges these systems face in four dimensions: programming interface, compilation optimization, execution engine, and data storage, drawing on experience from the mature field of data management. Second, it discusses and summarizes the key technologies in each of the four dimensions. Finally, it explores future research and development directions for distributed matrix computation systems.
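    The core idea shared by the surveyed systems is to partition matrices into blocks so that block-level products can be computed in parallel across workers (as in SUMMA [80] or Spark's distributed matrices [10]). The following is a minimal single-machine sketch of that idea in Python/NumPy; the function and parameter names are illustrative and not taken from any system surveyed here:

```python
import numpy as np

def block_multiply(A, B, bs):
    """Multiply A (m x k) by B (k x n) block by block.

    Each (i, j) output block accumulates products of row blocks of A
    with column blocks of B. In a distributed engine, each such
    block-level product would be a task scheduled on a worker that
    holds (or fetches) the two input blocks.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(0, m, bs):          # row blocks of the result
        for j in range(0, n, bs):      # column blocks of the result
            for l in range(0, k, bs):  # shared (inner) dimension
                C[i:i+bs, j:j+bs] += A[i:i+bs, l:l+bs] @ B[l:l+bs, j:j+bs]
    return C
```

NumPy slicing clips out-of-range indices, so dimensions that are not multiples of `bs` are handled by smaller edge blocks, mirroring the ragged boundary blocks of real block-partitioned layouts.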

    References
    [1] Sylvester J. On a new class of theorems. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1850, 37: 363-370.
    [2] Cayley A. A memoir on the theory of matrices. Philosophical Trans. of the Royal Society of London, 1858, 31(148): 17-37.
    [3] Gu R, Qiu HJ, Yang WJ, Hu W, Yuan CF, Huang YH. Goldfish: A large scale semantic data store and query system based on Boolean matrix factorization. Chinese Journal of Computers, 2017, 40(10): 2212-2230 (in Chinese with English abstract).
    [4] Shen XW, Ye XC, Wang D, Zhang H, Wang F, Tan X, Zhang ZM, Fan DR, Tang ZM, Sun NH. Optimizing dataflow architecture for scientific applications. Chinese Journal of Computers, 2017, 40(9): 2181-2196 (in Chinese with English abstract).
    [5] Big data analytics market. https://www.fortunebusinessinsights.com/big-data-analytics-market-106179
    [6] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M. TensorFlow: A system for large-scale machine learning. In: Proc. of the 12th USENIX Symp. on Operating Systems Design and Implementation. Savannah: USENIX, 2016. 265-283.
    [7] PyTorch. https://github.com/pytorch/pytorch
    [8] Hellerstein J, Re C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A. The MADlib analytics library or MAD skills, the SQL. Proc. of the VLDB Endowment, 2012, 5(12): 1700-1711.
    [9] Boehm M, Dusenberry MW, Eriksson D, Evfimievski AV, Manshadi FM, Pansare N, Reinwald B, Reiss FR, Sen P, Surve AC, Tatikonda S. SystemML: Declarative machine learning on Spark. Proc. of the VLDB Endowment, 2016, 9(13): 1425-1436.
    [10] Bosagh Zadeh R, Meng X, Ulanov A, Yavuz B, Pu L, Venkataraman S, Sparks ER, Staple A, Zaharia M. Matrix computations and optimization in Apache Spark. In: Proc. of the 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. San Francisco: ACM, 2016. 31-38.
    [11] Schelter S, Palumbo A, Quinn S, Marthi S, Musselman A. Samsara: Declarative machine learning on distributed dataflow systems. In: Proc. of the 30th Conf. on Neural Information Processing Systems. Barcelona: Curran Associates Inc., 2016.
    [12] Chen L, Kumar A, Naughton J, Patel JM. Towards linear algebra over normalized data. Proc. of the VLDB Endowment, 2016, 10(11): 1214-1225.
    [13] Gan J, Liu T, Li L, Zhang J. Non-negative matrix factorization: A survey. The Computer Journal, 2021, 64(7): 1080-1092.
    [14] Guo P, Wang K, Luo AL, Xue MZ. Computational intelligence for big data analysis: Current status and future prospect. Ruan Jian Xue Bao/Journal of Software, 2015, 26(11): 3010-3025 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4900.htm [doi: 10.13328/j.cnki.jos.004900]
    [15] Qian Z, Chen X, Kang N, Chen M, Yu Y, Moscibroda T, Zhang Z. MadLINQ: Large-scale distributed matrix computation for the cloud. In: Proc. of the 7th ACM European Conf. on Computer Systems. Bern: ACM, 2012. 197-210.
    [16] Venkataraman S, Bodzsar E, Roy I, AuYoung A, Schreiber RS. Presto: Distributed machine learning and graph processing with sparse matrices. In: Proc. of the 8th ACM European Conf. on Computer Systems. Prague: ACM, 2013. 197-210.
    [17] Bosilca G, Bouteiller A, Danalis A, Faverge M, Haidar A, Herault T, Kurzak J, Langou J, Lemarinier P, Ltaief H, Luszczek P, YarKhan A, Dongarra J. Flexible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA. In: Proc. of the IEEE Int'l Symp. on Parallel and Distributed Processing Workshops and PhD Forum. Anchorage: IEEE, 2011. 1432-1441.
    [18] Chen Z, Xu C, Soto J, Markl V, Qian W, Zhou A. Hybrid evaluation for distributed iterative matrix computation. In: Proc. of the ACM Int'l Conf. on Management of Data. Xi'an: ACM, 2021. 300-312.
    [19] Das S, Sismanis Y, Beyer KS, Gemulla R, Haas PJ, McPherson J. Ricardo: Integrating R and Hadoop. In: Proc. of the ACM Int'l Conf. on Management of Data. Indianapolis: ACM, 2010. 987-998.
    [20] Venkataraman S, Yang Z, Liu D, Liang E, Falaki H, Meng X, Xin R, Ghodsi A, Franklin M, Stoica I, Zaharia M. SparkR: Scaling R programs with Spark. In: Proc. of the ACM Int'l Conf. on Management of Data. San Francisco: ACM, 2016. 1099-1104.
    [21] RHadoop. https://github.com/RevolutionAnalytics/RHadoop
    [22] Programming with big data in R. https://pbdr.org/
    [23] Poulson J, Marker B, Van de Geijn RA, Hammond JR, Romero NA. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. on Mathematical Software, 2013, 39(2): 1-24.
    [24] Boehm M, Burdick DR, Evfimievski AV, Reinwald B, Reiss F, Sen P, Tatikonda S, Tian Y. SystemML's optimizer: Plan generation for large-scale machine learning programs. IEEE Data Engineering Bulletin, 2014, 37(3): 52-62.
    [25] Boehm M, Antonov I, Baunsgaard S, Dokter M, Ginthor R, Innerebner K, Lindstaedt SN, Phani A, Rath B, Reinwald B, Siddiqui S, Wrede SB. SystemDS: A declarative machine learning system for the end-to-end data science lifecycle. In: Proc. of the 10th Biennial Conf. on Innovative Data Systems Research. 2020.
    [26] Wang J, Wu J, Li M, Gu J, Das A, Zaniolo C. Formal semantics and high performance in declarative machine learning using Datalog. The VLDB Journal, 2021, 30: 859-881.
    [27] DistArrays. http://docs.enthought.com/distarray/
    [28] Huang CC, Chen Q, Wang Z, Power R, Ortiz J, Li J, Xiao Z. Spartan: A distributed array framework with smart tiling. In: Proc. of the USENIX Annual Technical Conf. Santa Clara: USENIX, 2015. 1-15.
    [29] Sparks ER, Talwalkar A, Smith V, Kottalam J, Pan X, Gonzalez JE, Franklin MJ, Jordan MI, Kraska T. MLI: An API for distributed machine learning. In: Proc. of the IEEE 13th Int'l Conf. on Data Mining. Dallas: IEEE, 2013. 1187-1192.
    [30] Yu L, Shao Y, Cui B. Exploiting matrix dependency for efficient distributed matrix computation. In: Proc. of the ACM Int'l Conf. on Management of Data. Melbourne: ACM, 2015. 93-105.
    [31] Gates M, Kurzak J, Charara A, YarKhan A, Dongarra J. SLATE: Design of a modern distributed and accelerated linear algebra library. In: Proc. of the Int'l Conf. for High Performance Computing, Networking, Storage and Analysis. Denver: ACM, 2019. 26:1-26:18.
    [32] Agullo E, Aumage O, Faverge M, Furmento N, Pruvost F, Sergent M, Thibault SP. Achieving high performance on supercomputers with a sequential task-based programming model. IEEE Trans. on Parallel and Distributed Systems, 2017.
    [33] Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M. Spark SQL: Relational data processing in Spark. In: Proc. of the ACM Int'l Conf. on Management of Data. Melbourne: ACM, 2015. 1383-1394.
    [34] MRQL. https://cwiki.apache.org/confluence/display/mrql/
    [35] Brown PG. Overview of SciDB: Large scale array storage, processing and analysis. In: Proc. of the ACM Int'l Conf. on Management of Data. Indianapolis: ACM, 2010. 963-968.
    [36] Alotaibi R, Cautis B, Deutsch A, Manolescu I. HADAD: A lightweight approach for optimizing hybrid complex analytics queries. In: Proc. of the ACM Int'l Conf. on Management of Data. Xi'an: ACM, 2021. 23-35.
    [37] Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM. Distributed GraphLab: A framework for machine learning in the cloud. Proc. of the VLDB Endowment, 2012, 5(8): 716-727.
    [38] Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In: Proc. of the ACM Int'l Conf. on Management of Data. Indianapolis: ACM, 2010. 135-146.
    [39] Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4): 28-38.
    [40] Ewen S, Tzoumas K, Kaufmann M, Markl V. Spinning fast iterative data flows. Proc. of the VLDB Endowment, 2012, 5(12): 1268-1279.
    [41] Mihaylov SR, Ives ZG, Guha S. REX: Recursive, delta-based data-centric computation. Proc. of the VLDB Endowment, 2012, 5(11): 1280-1291.
    [42] McSherry F, Murray DG, Isaacs R, Isard M. Differential dataflow. In: Proc. of the 6th Biennial Conf. on Innovative Data Systems Research. 2013.
    [43] Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: A timely dataflow system. In: Proc. of the 24th ACM Symp. on Operating Systems Principles. Farmington: ACM, 2013. 439-455.
    [44] Nikolic M, Elseidy M, Koch C. LINVIEW: Incremental view maintenance for complex analytical queries. In: Proc. of the ACM Int'l Conf. on Management of Data. Snowbird: ACM, 2014. 253-264.
    [45] Nikolic M, Olteanu D. Incremental view maintenance with triple lock factorization benefits. In: Proc. of the ACM Int'l Conf. on Management of Data. Houston: ACM, 2018. 365-380.
    [46] Kunft A, Katsifodimos A, Schelter S, Breß S, Rabl T, Markl V. An intermediate representation for optimizing machine learning pipelines. Proc. of the VLDB Endowment, 2019, 12(11): 1553-1567.
    [47] Yu Y, Tang M, Aref WG, Malluhi QM, Abbas MM, Ouzzani M. In-memory distributed matrix computation processing and optimization. In: Proc. of the IEEE 33rd Int'l Conf. on Data Engineering. San Diego: IEEE, 2017. 1047-1058.
    [48] Wang YR, Hutchison S, Suciu D, Howe B, Leang J. SPORES: Sum-product optimization via relational equality saturation for large scale linear algebra. Proc. of the VLDB Endowment, 2020, 13(11): 1919-1932.
    [49] Green TJ, Karvounarakis G, Tannen V. Provenance semirings. In: Proc. of the 26th ACM SIGACT-SIGART Symp. on Principles of Database Systems. Santa Barbara: ACM, 2007. 31-40.
    [50] Tate R, Stepp M, Tatlock Z, Lerner S. Equality saturation: A new approach to optimization. In: Proc. of the 36th Annual ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages. Savannah: ACM, 2009. 264-276.
    [51] Chen Z, Xu C, Qian W, Zhou A. Redundancy elimination in distributed matrix computation. In: Proc. of the ACM Int'l Conf. on Management of Data. Philadelphia: ACM, 2022. 573-586.
    [52] Karsavuran MO, Akbudak K, Aykanat C. Locality-aware parallel sparse matrix-vector and matrix-transpose-vector multiplication on many-core processors. IEEE Trans. on Parallel and Distributed Systems, 2015, 27(6): 1713-1726.
    [53] Han D, Lee J, Kim M. FuseME: Distributed matrix computation engine based on cuboid-based fused operator and plan generation. In: Proc. of the ACM Int'l Conf. on Management of Data. Philadelphia: ACM, 2022. 1891-1904.
    [54] Huang B, Babu S, Yang J. Cumulon: Optimizing statistical data analysis in the cloud. In: Proc. of the ACM Int'l Conf. on Management of Data. New York: ACM, 2013. 1-12.
    [55] Belter G, Jessup ER, Karlin I, Siek JG. Automating the generation of composed linear algebra kernels. In: Proc. of the Conf. on High Performance Computing Networking, Storage and Analysis. Portland: ACM, 2009. 1-12.
    [56] Sujeeth AK, Lee HJ, Brown KJ, Rompf T, Chafi H, Wu M, Atreya AR, Odersky M, Olukotun K. OptiML: An implicitly parallel domain-specific language for machine learning. In: Proc. of the 28th Int'l Conf. on Machine Learning. Washington: Omnipress, 2011. 609-616.
    [57] Zhang M, Wu Y, Chen K, Ma T, Zheng W. Measuring and optimizing distributed array programs. Proc. of the VLDB Endowment, 2016, 9(12): 912-923.
    [58] Crotty A, Galakatos A, Dursun K, Kraska T, Binnig C, Cetintemel U, Zdonik S. An architecture for compiling UDF-centric workflows. Proc. of the VLDB Endowment, 2015, 8(12): 1466-1477.
    [59] Palkar S, Thomas JJ, Shanbhag A, Narayanan D, Pirk H, Schwarzkopf M, Amarasinghe S, Zaharia M. Weld: A common runtime for high performance data analytics. In: Proc. of the 8th Biennial Conf. on Innovative Data Systems Research. 2017.
    [60] Vasilache N, Zinenko O, Theodoridis T, Goyal P, DeVito Z, Moses WS, Verdoolaege S, Adams A, Cohen A. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv:1802.04730, 2018.
    [61] Elgamal T, Luo S, Boehm M, Evfimievski AV, Tatikonda S, Reinwald B, Sen P. SPOOF: Sum-product optimization and operator fusion for large-scale machine learning. In: Proc. of the 8th Biennial Conf. on Innovative Data Systems Research. 2017.
    [62] Boehm M, Reinwald B, Hutchison D, Sen P, Evfimievski AV, Pansare N. On optimizing operator fusion plans for large-scale machine learning in SystemML. Proc. of the VLDB Endowment, 2018, 11(12): 1755-1768.
    [63] MATLAB mmtimes function. https://ww2.mathworks.cn/matlabcentral/fileexchange/27950-mmtimes-matrix-chain-product
    [64] Sommer J, Boehm M, Evfimievski AV, Reinwald B, Haas PJ. MNC: Structure-exploiting sparsity estimation for matrix expressions. In: Proc. of the Int'l Conf. on Management of Data. Amsterdam: ACM, 2019. 1607-1623.
    [65] Ghoting A, Krishnamurthy R, Pednault EPD, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S. SystemML: Declarative machine learning on MapReduce. In: Proc. of the 27th IEEE Int'l Conf. on Data Engineering. Hannover: IEEE, 2011. 231-242.
    [66] Liao X, Li SG, Lu YT, Yang CQ. New 2.5D parallel matrix multiplication algorithm based on BLACS. Chinese Journal of Computers, 2020, 44(5): 1037-1050 (in Chinese with English abstract).
    [67] Feng J, Ni M, Zhao JB. Algorithm of distributed matrix multiplication based on Hadoop. Computer Systems & Applications, 2013, 22(12): 149-154 (in Chinese with English abstract).
    [68] Gu R, Tang Y, Tian C, Zhou H, Li G, Zheng X, Huang Y. Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms. IEEE Trans. on Parallel and Distributed Systems, 2017, 28(9): 2539-2552.
    [69] Kwasniewski G, Kabic M, Ben-Nun T, Ziogas AN, Saether JE, Gaillard A, Schneider T, Besta M, Kozhevnikov A, VandeVondele J, Hoefler T. On the parallel I/O optimality of linear algebra kernels: Near-optimal matrix factorizations. In: Proc. of the Int'l Conf. for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 70:1-70:15.
    [70] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.
    [71] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D. Dryad: Distributed data-parallel programs from sequential building blocks. In: Proc. of the 2nd ACM SIGOPS/EuroSys European Conf. on Computer Systems. Lisbon: ACM, 2007. 59-72.
    [72] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proc. of the 9th USENIX Symp. on Networked Systems Design and Implementation. San Jose: USENIX, 2012. 15-28.
    [73] H2O. https://h2o.ai/
    [74] PostgreSQL. https://www.postgresql.org/
    [75] MySQL. https://www.mysql.com/
    [76] Greenplum. https://tanzu.vmware.com/greenplum
    [77] Zhang Y, Herodotou H, Yang J. RIOT: I/O-efficient numerical computing without SQL. In: Proc. of the 4th Biennial Conf. on Innovative Data Systems Research. 2009.
    [78] ScaLAPACK. http://www.netlib.org/scalapack/
    [79] PETSc. https://petsc.org/release/
    [80] Van De Geijn RA, Watts J. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 1997, 9(4): 255-274.
    [81] Ballard G, Demmel J, Holtz O, Lipshitz B, Schwartz O. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In: Proc. of the 24th Annual ACM Symp. on Parallelism in Algorithms and Architectures. Pittsburgh: ACM, 2012. 193-204.
    [82] Demmel J, Eliahu D, Fox A, Kamil S, Lipshitz B, Schwartz O, Spillinger O. Communication-optimal parallel recursive rectangular matrix multiplication. In: Proc. of the IEEE 27th Int'l Symp. on Parallel and Distributed Processing. Cambridge: IEEE, 2013. 261-272.
    [83] Li YY, Xue W, Chen DX, Wang XL, Xu P, Zhang WS, Yang GW. Performance optimization for sparse matrix-vector multiplication on Sunway architecture. Chinese Journal of Computers, 2020, 43(6): 1010-1024 (in Chinese with English abstract).
    [84] Long GP, Fan DR. Parallelization of LU decomposition on the Godson-Tv1 many-core architecture. Chinese Journal of Computers, 2009, 32(11): 2157-2167 (in Chinese with English abstract).
    [85] Kunft A, Katsifodimos A, Schelter S, Rabl T, Markl V. BlockJoin: Efficient matrix partitioning through joins. Proc. of the VLDB Endowment, 2017, 10(13): 2061-2072.
    [86] Han D, Nam Y, Lee J, Park K, Kim H, Kim M. DistME: A fast and elastic distributed matrix computation engine using GPUs. In: Proc. of the Int'l Conf. on Management of Data. Amsterdam: ACM, 2019. 759-774.
    [87] Thomas A, Kumar A. A comparative evaluation of systems for scalable linear algebra-based analytics. Proc. of the VLDB Endowment, 2018, 11(13): 2168-2182.
    [88] Han B, Chen Z, Xu C, Zhou A. Efficient matrix computation for SGD-based algorithms on Apache Spark. In: Proc. of the 27th Int'l Conf. on Database Systems for Advanced Applications. Springer, 2022. 309-324.
    [89] Parger M, Winter M, Mlakar D, Steinberger M. SpECK: Accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. In: Proc. of the 25th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. San Diego: ACM, 2020. 362-375.
Get Citation

Chen ZH, Xu C, Qian WN, Zhou AY. Research progress on distributed matrix computation systems for big data analysis. Ruan Jian Xue Bao/Journal of Software, 2023, 34(3): 1236-1258 (in Chinese).
History
  • Received: May 15, 2022
  • Revised: July 29, 2022
  • Online: October 26, 2022
  • Published: March 06, 2023
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4