[关键词]
[摘要]
Tip分解作为图数据管理领域的热点研究问题,已被广泛应用于文档聚类和垃圾邮件组检测等实际场景中.随着图数据规模的爆炸式增长,单机内存已无法满足其存储需求,亟需研究分布式环境下Tip分解技术.现有分布式图计算系统的通信模式无法适用于二部图,为此,首先提出一种基于中继的通信模式,以实现分布式环境下处理二部图时消息的有效传递;其次,提出分布式butterfly计数算法(DBC)和tip分解算法(DTD),特别地,为解决处理大规模二部图时DBC面临的内存溢出问题,提出了一种可控的并行顶点激活策略;最后,引入基于顶点优先级的消息剪枝策略和消息有效性剪枝策略,通过减少冗余通信和计算开销,进一步提高算法效率.实验平台部署于国家超算中心高性能分布式集群上,在多个真实数据集上的实验结果验证了所提算法的有效性和高效性.
[Key word]
[Abstract]
Tip decomposition has a pivotal role in mining cohesive subgraph in bipartite graphs. It is a popular research topic with wide applications in document clustering, spam group detection, and analysis of affiliation networks. With the explosive growth of bipartite graph data scale in these scenarios, it is necessary to use distributed method to realize its effective storage. For this reason, this work studies the problem of tip decomposition on a bipartite graph in the distributed environment for the first time. Firstly, a new relay based communication mode is proposed to realize effective message transmission when the given bipartite graph is decomposed in distributed environment. Secondly, the distributed butterfly counting algorithm (DBC) and tip decomposition algorithm (DTD) are designed. In particular, a controllable parallel vertex activation strategy is proposed to solve the problem of memory overflow when DBC decomposes large-scale bipartite graphs. Finally, the message pruning strategy based on vertex priority and message validity pruning strategy are introduced to further improve the efficiency of the algorithm by reducing redundant communication and computing overhead. The experiment is deployed on the high performance distributed computing platform of National Supercomputing Center. The effectiveness and efficiency of the proposed algorithms are verified by experiments on several real datasets.
[中图分类号]
[基金项目]
国家自然科学基金(61772182,61802032,69189338,62172146,62172157);之江实验室开放课题(2021KD0AB02);国防科技大学信息系统工程重点实验室基金