[关键词]
[摘要]
随着互联网信息技术的发展, 社交网络、计算机网络及生物信息网络等领域涌现海量大规模图数据. 鉴于传统图数据管理技术在处理大规模图时存在存储及性能方面的局限, 大规模图的分布式处理技术已成为图数据库领域的研究热点, 并得到工业界和学术界的广泛关注. 图的核分解用于计算图中所有顶点的核值, 有助于挖掘重要图结构信息, 在社区搜索、蛋白质结构分析和网络结构可视化等诸多应用中发挥着关键作用. 当前以顶点为中心计算模式的分布式核分解算法中采用一种广播的消息传递机制, 一方面, 存在大量的冗余通信及计算开销; 另一方面, 处理大规模图核分解过程中易产生内存溢出问题. 为此, 分别提出基于全局激活和层次剥离计算框架, 并提出分布式核分解新算法, 通过引入基于顶点核值局部性特点的消息剪枝策略和以计算节点为中心的计算新模式, 保证算法有效性的同时提升其性能. 在国家超级计算长沙中心分布式集群上, 分别针对大规模真实和合成数据集, 算法总耗时性能提升比例为37%–98%, 验证所提模型和算法的有效性和高效性.
[Key word]
[Abstract]
With the development of Internet information technology, large-scale graphs have widely emerged in social networks, computer networks, and biological information networks. In view of the storage and performance limitations of traditional graph data management technology when dealing with large-scale graphs, distributed management technology has become a hotspot in industry and academia fields. The core decomposition is adopted to get core numbers of vertices in a graph and plays a key role in many applications, including community search, protein structure analysis, and network structure visualization. The existing distributed core decomposition algorithm applied a broadcast message delivery mechanism based on the vertex-centric mode, which may generate a large amount of redundant communication and computation overhead and lead to memory overflow when processing large-scale graphs. To address these issues, this study proposes novel distributed core decomposition algorithms based on global activation and peeling calculation frameworks, respectively. In addition, there are several strategies designed to improve algorithm performance. Based on the locality of the vertex core number, the study proposes a new message-pruning strategy and a new worker-centric computing mode, thereby improving the efficiency of our algorithms. To verify those strategies, this study deploys the proposed models and algorithms on the distributed cluster of the National Supercomputing Center in Changsha, and the effectiveness and efficiency of the proposed methods are evaluated through a large number of experiments on real and synthetic data sets. The total time performance of the algorithm is improved by 37% to 98%.
[中图分类号]
TP301
[基金项目]
国家自然科学基金(62172146, 62102143); 湖南省自然科学基金(2023JJ10016, 2022JJ30009, 2023JJ30083); 湖南省教育厅科学研究重点项目(22A0592)