面向大规模二部图的分布式Tip分解算法
作者:
作者简介:

周旭(1983-),女,博士,副教授,主要研究领域为并行计算,图数据管理,图计算;
李博仁(1998-),男,硕士生,主要研究领域为图计算,社区搜索;
翁同峰(1992-),男,博士生,CCF学生会员,主要研究领域为分布式图计算;
张吉(1977-),男,教授,主要研究领域为数据挖掘,图数据管理;
杨志邦(1984-),男,博士,副教授,CCF专业会员,主要研究领域为并行计算,图数据管理;
李肯立(1971-),男,博士,教授,CCF会士,主要研究领域为并行分布式处理,超级计算与云计算,面向大数据和人工智能的高效能计算.

通讯作者:

翁同峰,E-mail:wengtongfeng@hnu.edu.cn

基金项目:

国家自然科学基金(61772182,61802032,69189338,62172146,62172157);之江实验室开放课题(2021KD0AB02);国防科技大学信息系统工程重点实验室基金


Distributed Algorithm for Tip Decomposition on Large Bipartite Graphs
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [33]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    Tip分解作为图数据管理领域的热点研究问题,已被广泛应用于文档聚类和垃圾邮件组检测等实际场景中.随着图数据规模的爆炸式增长,单机内存已无法满足其存储需求,亟需研究分布式环境下Tip分解技术.现有分布式图计算系统的通信模式无法适用于二部图,为此,首先提出一种基于中继的通信模式,以实现分布式环境下处理二部图时消息的有效传递;其次,提出分布式butterfly计数算法(DBC)和tip分解算法(DTD),特别地,为解决处理大规模二部图时DBC面临的内存溢出问题,提出了一种可控的并行顶点激活策略;最后,引入基于顶点优先级的消息剪枝策略和消息有效性剪枝策略,通过减少冗余通信和计算开销,进一步提高算法效率.实验平台部署于国家超算中心高性能分布式集群上,在多个真实数据集上的实验结果验证了所提算法的有效性和高效性.

    Abstract:

    Tip decomposition has a pivotal role in mining cohesive subgraph in bipartite graphs. It is a popular research topic with wide applications in document clustering, spam group detection, and analysis of affiliation networks. With the explosive growth of bipartite graph data scale in these scenarios, it is necessary to use distributed method to realize its effective storage. For this reason, this work studies the problem of tip decomposition on a bipartite graph in the distributed environment for the first time. Firstly, a new relay based communication mode is proposed to realize effective message transmission when the given bipartite graph is decomposed in distributed environment. Secondly, the distributed butterfly counting algorithm (DBC) and tip decomposition algorithm (DTD) are designed. In particular, a controllable parallel vertex activation strategy is proposed to solve the problem of memory overflow when DBC decomposes large-scale bipartite graphs. Finally, the message pruning strategy based on vertex priority and message validity pruning strategy are introduced to further improve the efficiency of the algorithm by reducing redundant communication and computing overhead. The experiment is deployed on the high performance distributed computing platform of National Supercomputing Center. The effectiveness and efficiency of the proposed algorithms are verified by experiments on several real datasets.

    参考文献
    [1] Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning. In:Proc. of the 7th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2001. 269-274.
    [2] Wang K, Lin X, Qin L, et al. Efficient bitruss decomposition for large-scale bipartite graphs. In:Proc. of the 2020 IEEE 36th Int'l Conf. on Data Engineering (ICDE). IEEE, 2020. 661-672.
    [3] Gibson D, Kumar R, Tomkins A. Discovering large dense subgraphs in massive graphs. In:Proc. of the 31st Int'l Conf. on Very Large Data Bases. 2005. 721-732.
    [4] Seidman SB. Network structure and minimum degree. Social Networks, 1983, 5(3):269-287.
    [5] Malliaros FD, Giatsidis C, Papadopoulos AN, et al. The core decomposition of networks:Theory, algorithms and applications. The VLDB Journal, 2020, 29(1):61-92.
    [6] Cohen J. Trusses:Cohesive subgraphs for social network analysis. National Security Agency Technical Report, 2008.
    [7] Newman MEJ. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 2001, 64(1): No.016131.[doi:10.1103/PhysRevE.64.016131]
    [8] Sarıyüce AE, Pinar A. Peeling bipartite networks for dense subgraph discovery. In:Proc. of the 11th ACM Int'l Conf. on Web Search and Data Mining. 2018. 504-512.
    [9] Shi J, Shun J. Parallel algorithms for butterfly computations. In:Proc. of the Symp. on Algorithmic Principles of Computer Systems. Society for Industrial and Applied Mathematics, 2020. 16-30.
    [10] Wang K, Lin X, Qin L, et al. Vertex priority-based butterfly counting for large-scale bipartite networks. Proc. of the VLDB Endowment, 2019, 12(10):1139-1152.
    [11] Sanei-Mehri SV, Sariyuce AE, Tirthapura S. Butterfly counting in bipartite networks. In:Proc. of the 24th ACM SIGKDD Int'l Conf. on Knowledge Discovery & Data Mining. 2018. 2150-2159.
    [12] Lakhotia K, Kannan R, Prasanna V, et al. RECEIPT:REfine CoarsE-grained independent tasks for parallel tip decomposition of bipartite graphs. arXiv preprint arXiv:2010.08695, 2020.
    [13] Malewicz G, Austern MH, Bik AJC, et al. Pregel:A system for large-scale graph processing. In:Proc. of the 2010 ACM SIGMOD Int'l Conf. on Management of Data. New York:Association for Computing Machinery, 2010. 135-146.
    [14] Sakr S, Orakzai FM, Abdelaziz I, et al. Large-scale Graph Processing using Apache Giraph. Springer, 2016.
    [15] Gonzalez JE, Xin RS, Dave A, et al. GraphX:Graph processing in a distributed dataflow framework. In:Proc. of the 11th USENIX Symp. on Operating Systems Design and Implementation. 2014. 599-613.
    [16] Yan D, Cheng J, Lu Y, et al. Effective techniques for message reduction and load balancing in distributed graph computation. In: Proc. of the 24th Int'l Conf. on World Wide Web. Florence, 2015. 1307-1317.
    [17] Yu G, Gu Y, Bao YB, Wang ZG. Large scale graph data processing on cloud computing environments. Chinese Journal of Computers, 2011, 34(10):1753-1767(in Chinese with English abstract).
    [18] Lu L, Hua B. A platform-and-workload aware online graph partitioning algorithm. Chinese Journal of Computers, 2020, 43(7): 1230-1245(in Chinese with English abstract).
    [19] Zhang CB, Li Y, Jia T. Survey of state-of-the-art fault tolerance for distributed graph processing jobs. Ruan Jian Xue Bao/Journal of Software, 2021, 32(7):2078-2102(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6269.htm[doi:10.13328/j.cnki.jos.006269]
    [20] Engelhardt N, So HKH. Gravf:A vertex-centric distributed graph processing framework on FPGAS. In:Proc. of the 201626th Int'l Conf. on Field Programmable Logic and Applications (FPL). IEEE, 2016. 1-4.
    [21] Weng T, Zhou X, Li K, et al. Efficient distributed approaches to core maintenance on large dynamic graphs. IEEE Trans. on Parallel and Distributed Systems, 2021, 33(1):129-143.
    [22] Luo W, Zhou X, Yang J, et al. Efficient approaches to top-r influential community search. IEEE Internet of Things Journal, 2020, 8(16):12650-12657.
    [23] Ma C, Cheng R, Lakshmanan LVS, et al. LINC:A motif counting algorithm for uncertain graphs. Proc. of the VLDB Endowment, 2019, 13(2):155-168.
    [24] Fang Y, Cheng R, Luo S, et al. Effective community search for large attributed graphs. Proc. of the VLDB Endowment, 2016, 9(12):1233-1244.
    [25] Ma C, Fang Y, Cheng R, et al. Efficient algorithms for densest subgraph discovery on large directed graphs. In:Proc. of the 2020 ACM SIGMOD Int'l Conf. on Management of Data. 2020. 1051-1066.
    [26] Chen R, Shi JX, Chen HB, et al. Bipartite-oriented distributed graph partitioning for big learning. Journal of Computer Science and Technology, 2015, 30(1):20-29.
    [27] Liu B, Yuan L, Lin X, et al. Efficient (α,β)-core computation in bipartite graphs. The VLDB Journal, 2020, 29(5):1075-1099.
    [28] Chiba N, Nishizeki T. Arboricity and subgraph listing algorithms. SIAM Journal on Computing, 1985, 14(1):210-223.
    [29] Yan D, Cheng J, Özsu MT, et al. A general-purpose query-centric framework for querying big graphs. Proc. of the VLDB Endowment, 2016, 9(7):564-575.
    附中文参考文献:
    [17] 于戈, 谷峪, 鲍玉斌, 王志刚. 云计算环境下的大规模图数据处理技术. 计算机学报, 2011, 34(10):1753-1767.
    [18] 陆李, 华蓓. 平台和负载特征感知的在线图分割算法. 计算机学报, 2020, 43(7):1230-1245.
    [19] 张程博, 李影, 贾统. 面向分布式图计算作业的容错技术研究综述. 软件学报, 2021, 32(7):2078-2102. http://www.jos.org.cn/1000-9825/6269.htm[doi:10.13328/j.cnki.jos.006269]
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

周旭,翁同峰,杨志邦,李博仁,张吉,李肯立.面向大规模二部图的分布式Tip分解算法.软件学报,2022,33(3):1043-1056

复制
分享
文章指标
  • 点击次数:1234
  • 下载次数: 4656
  • HTML阅读次数: 2668
  • 引用次数: 0
历史
  • 收稿日期:2021-06-30
  • 最后修改日期:2021-07-31
  • 在线发布日期: 2021-10-21
  • 出版日期: 2022-03-06
文章二维码
您是第19700975位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号