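[Keywords (translated from Chinese)]
[Abstract (translated from Chinese)]
Distributed graph computing is currently the mainstream technique for processing large graph data, but it has a number of unavoidable problems; for example, load balancing in distributed computation and the debugging and optimization of distributed implementations remain very difficult. On the other hand, research results from recent years show that, with well-designed data structures and processing models, large-graph computation on a single PC backed by a large-capacity disk can often achieve processing performance comparable to that of distributed graph computing. For example, the processing performance of GraphChi on a single machine is almost the same as that of Spark on 50 nodes. Combining accumulative iterative computation with single-machine parallel processing techniques, ASP, a streaming asynchronous computation model, is proposed. It achieves fully sequential disk access and allows asynchronous update computation to proceed while the structure data is loaded sequentially as a stream. Based on the ASP model, a streaming asynchronous graph processing framework, S-Maiter, is proposed, which realizes efficient external-memory-based large-graph processing on a single machine; optimization techniques such as I/O thread optimization, memory resource monitoring, and shard-level priority scheduling improve the system's performance on large graph data. Experimental results show that, when processing large graph data (13 million vertices, 500 million edges), S-Maiter using the computing resources of only one PC performs almost as well as distributed Maiter running on 16 PCs. Moreover, S-Maiter is 1.5 times faster than GraphChi, another popular single-machine large-graph processing system.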
[Keywords]
[Abstract]
Distributed graph processing is the mainstream approach to handling large graphs, but it suffers from several unavoidable issues, such as workload imbalance and the difficulty of debugging and optimizing distributed programs. On the other hand, recent research shows that, with a well-designed data structure and processing model, disk-based graph processing on a single PC can achieve performance comparable to that of systems running on a large number of machines. For example, GraphChi on a single PC achieves nearly the same performance as Spark on 50 nodes. In this paper, a streamlined asynchronous graph processing model, ASP, is proposed, based on the accumulative iterative computation model and external-storage-based parallel computing techniques on a single machine. ASP accesses the disk strictly sequentially and allows asynchronous update computations to run while the graph structure data is being streamed in. Based on ASP, a streamlined graph processing framework, S-Maiter, is designed and implemented to provide high-performance graph processing on a single PC. Optimizations to the I/O threads, memory resource monitoring, and shard-level priority scheduling further improve the performance of S-Maiter. Experimental results on a large graph dataset (13 million nodes and 500 million edges) show that S-Maiter on a single node achieves performance comparable to that of distributed Maiter on 16 nodes. Furthermore, S-Maiter is 1.5 times faster than GraphChi, a popular single-PC graph processing system.
[Chinese Library Classification (CLC) number]
TP311
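[Funding]
National Natural Science Foundation of China (61672141, 61528203); Open Project of the State Key Laboratory of Computer Architecture (CARCH201610); Fundamental Research Funds for the Central Universities (N161604008)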