GPPR:跨域分布式个性化PageRank算法
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

国家重点研发计划(2022YFB2702100);国家自然科学基金(61932004,62225203,U21A20516);中央高校基本科研业务专项资金(N232405-16)


GPPR: Cross-geo-distributed Personalized PageRank Algorithm
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    个性化PageRank作为大图分析中的基本算法,在搜索引擎、社交推荐、社区检测等领域具有广泛的应用,一直是研究者们关注的热点问题.现有的分布式个性化PageRank算法均假设所有数据位于同一地理位置,且数据所在的计算节点之间具有相同的网络环境.然而在现实世界中,这些数据可能分布在跨洲的多个数据中心中,这些跨域分布(cross-geo-distributed)的数据中心之间通过广域网连接,存在网络带宽异构、硬件差异巨大、通信费用高昂等特点.分布式个性化PageRank算法需要多轮迭代,并在全局图上进行随机游走.因此,现有的分布式个性化PageRank算法不适用于跨域环境.针对此问题,提出了GPPR (cross-geo-distributed personalized PageRank)算法.该算法首先对跨域环境中的大图数据进行预处理,采用启发式算法映射图数据,以降低网络带宽异构对算法迭代速度的影响;其次,GPPR改进了随机游走方式,提出了基于概率的Push算法,通过减少工作节点之间传输数据的带宽负载,进一步减少算法所需的迭代次数.基于Spark框架实现了GPPR算法,并在阿里云中构建真实的跨域环境,在8个开源大图数据上,与现有的多个代表性分布式个性化PageRank算法进行了对比实验.结果显示,GPPR的通信数据量在跨域环境中比其他算法平均减少30%.在算法运行效率方面,GPPR比其他算法平均提升了2.5倍.

    Abstract:

    Personalized PageRank, as a basic algorithm in large graph analysis, has a wide range of applications in search engines, social recommendation, community detection, and other fields, and has been a hot problem of interest to researchers. The existing distributed personalized PageRank algorithms assume that all data are located in the same geographic location and the network environment is the same among the computing nodes where the data are located. However, in the real world, these data may be distributed in multiple data centers across continents, and these cross-geo-distributed data centers are connected to each other through WANs, which are characterized by heterogeneous network bandwidth, huge hardware differences, and high communication costs. The distributed personalized PageRank algorithm requires multiple iterations and random wandering on the global graph. Therefore, the existing distributed personalized PageRank algorithms are not applicable to the cross-geo-distributed environment. To address this problem, the GPPR (cross-geo- distributed personalized PageRank) algorithm is proposed in this study. The algorithm first preprocesses the big graph data in the cross-geo-distributed environment and maps the graph data by using a heuristic algorithm to reduce the impact of network bandwidth heterogeneity on the iteration speed of the algorithm. Secondly, GPPR improves the random wandering approach and proposes a probability-based push algorithm to further reduce the number of iterations required by the algorithm by reducing the bandwidth load of transmitting data between working nodes. The GPPR algorithm is implemented based on the Spark framework and a real cross-geo-distributed environment in AliCloud is built to conduct experiments on eight open-source big graph data compared with several existing representative distributed personalized PageRank algorithms. The results show that the communication data volume of GPPR is reduced by 30% on average in the cross-geo-distributed environment compared with other algorithms. In terms of algorithm running efficiency, GPPR improves by an average of 2.5 times compared to other algorithms.

    参考文献
    相似文献
    引证文献
引用本文

陈子俊,马德龙,王一舒,袁野. GPPR:跨域分布式个性化PageRank算法.软件学报,2024,35(3):1090-1106

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2023-07-17
  • 最后修改日期:2023-09-05
  • 录用日期:
  • 在线发布日期: 2023-11-08
  • 出版日期: 2024-03-06
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号