Distributed Optimization and Implementation of Graph Embedding Algorithms
Authors:
About the authors:

Zhang Wentao (1994-), male, Ph.D., whose research interests include graph computing, automated machine learning, and distributed systems.
Yuan Bin (1993-), male, M.S., whose research interests include databases, big data management and analytics, and machine learning.
Zhang Zhipeng (1993-), male, Ph.D., whose research interests include big data processing and distributed machine learning.
Cui Bin (1975-), male, Ph.D., professor, doctoral supervisor, CCF Distinguished Member, whose research interests include databases and big data management and analytics.

Corresponding author:

Cui Bin, E-mail: bin.cui@pku.edu.cn

Funding:

National Key Research and Development Program of China (2018YFB1004403); National Natural Science Foundation of China (61832001); PKU-Tencent Joint Research Lab



    Abstract:

    With the advent of the artificial intelligence era, graph embedding techniques are increasingly used to mine information from graphs. However, real-world graphs are usually large, so distributed graph embedding has drawn wide attention. Distributed graph embedding faces two main challenges. (1) Graph embedding methods are diverse, and no general framework exists that can describe most of them. (2) Existing distributed implementations of graph embedding scale poorly and perform badly on large graphs. To tackle these two challenges, a general framework for distributed graph embedding is first proposed; specifically, the sampling process and the training process are decoupled so that the framework can express a variety of graph embedding algorithms. Second, a parameter-server-based model-partitioning strategy is proposed: the model is partitioned across both workers and servers, and a data-shuffling operation guarantees that no model exchange occurs among workers, thereby reducing the communication overhead of distributed computation. A prototype system is implemented on a parameter server, and extensive experiments show that, without loss of accuracy, the partitioning-based strategy achieves better performance than the baseline systems.
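    The decoupling idea can be made concrete with a small sketch. The Python code below is illustrative only; the names (RandomWalkSampler, EdgeSampler, train) are assumptions, not the paper's actual API. Because pair sampling is separated from training, swapping the sampler switches between DeepWalk-style and LINE-style embedding while the skip-gram trainer stays unchanged.

        import math
        import random
        from typing import Dict, Iterator, List, Tuple

        Pair = Tuple[int, int]  # (source node, context node)

        class RandomWalkSampler:
            # DeepWalk-style sampling: random walks expanded into skip-gram windows.
            # `adj` must map every node id to its (possibly empty) neighbor list.
            def __init__(self, adj: Dict[int, List[int]], walk_len: int = 10, window: int = 2):
                self.adj, self.walk_len, self.window = adj, walk_len, window

            def sample(self) -> Iterator[Pair]:
                for start in self.adj:
                    walk = [start]
                    while len(walk) < self.walk_len and self.adj[walk[-1]]:
                        walk.append(random.choice(self.adj[walk[-1]]))
                    for i, u in enumerate(walk):
                        for j in range(max(0, i - self.window), min(len(walk), i + self.window + 1)):
                            if j != i:
                                yield u, walk[j]

        class EdgeSampler:
            # LINE-style sampling: each edge is itself a training pair.
            def __init__(self, edges: List[Pair]):
                self.edges = edges

            def sample(self) -> Iterator[Pair]:
                yield from self.edges

        def train(sampler, emb: Dict[int, List[float]], ctx: Dict[int, List[float]],
                  lr: float = 0.025, negatives: int = 5) -> None:
            # Skip-gram with negative sampling; independent of how pairs were sampled.
            nodes = list(emb)
            for u, v in sampler.sample():
                targets = [(v, 1.0)] + [(random.choice(nodes), 0.0) for _ in range(negatives)]
                for tgt, label in targets:
                    eu, cv = emb[u][:], ctx[tgt][:]  # snapshot both rows before updating
                    p = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(eu, cv))))
                    g = lr * (label - p)
                    emb[u] = [a + g * c for a, c in zip(eu, cv)]
                    ctx[tgt] = [c + g * a for a, c in zip(eu, cv)]

        # Toy usage: a 4-node cycle with 8-dimensional embeddings.
        adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
        emb = {n: [random.uniform(-0.5, 0.5) for _ in range(8)] for n in adj}
        ctx = {n: [0.0] * 8 for n in adj}
        train(RandomWalkSampler(adj), emb, ctx)

    The model-partitioning strategy can be sketched in the same spirit (again an assumption, not the paper's implementation, reusing Pair and List from above): source-node embedding rows are hash-partitioned across workers, and each training pair is shuffled to the worker that owns its source node, so workers never exchange model rows with one another; only the context rows held on the parameter servers are pulled and pushed.

        def owner(node: int, num_workers: int) -> int:
            # Hypothetical hash partitioner: which worker holds this node's
            # input-embedding row.
            return node % num_workers

        def shuffle_pairs(pairs: List[Pair], num_workers: int) -> List[List[Pair]]:
            # Route each (source, context) pair to the worker owning the source
            # node, so every worker only touches its own shard of input embeddings.
            buckets: List[List[Pair]] = [[] for _ in range(num_workers)]
            for u, v in pairs:
                buckets[owner(u, num_workers)].append((u, v))
            return buckets

    Under this routing, the only cross-machine traffic during training is worker-to-server pull/push of context rows, which is how the shuffle eliminates the worker-to-worker model exchange described in the abstract.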

Cite this article:

Zhang WT, Yuan B, Zhang ZP, Cui B. Distributed optimization and implementation of graph embedding algorithms. Journal of Software, 2021, 32(3): 636-649.

History:
  • Received: 2020-08-23
  • Revised: 2020-09-03
  • Online: 2021-01-21
  • Published: 2021-03-06