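Parallelizing Compilation Framework for Heterogeneous Many-core Processors

Author biographies:

Li Yanbing (李雁冰) (1989-), male, born in Longxi, Gansu, Ph.D. candidate. Main research areas: high-performance computing, parallelizing compilation optimization.
Zhao Jie (赵捷) (1987-), male, Ph.D., lecturer, CCF professional member. Main research areas: high-performance computing, parallelizing compilation optimization.
Zhao Rongcai (赵荣彩) (1957-), male, Ph.D., professor, Ph.D. supervisor, CCF distinguished member. Main research areas: high-performance computing, parallelizing compilation, decompilation.
Xu Jinlong (徐金龙) (1985-), male, Ph.D., lecturer. Main research areas: high-performance computing, parallelizing compilation optimization.
Han Lin (韩林) (1978-), male, Ph.D., associate professor, CCF professional member. Main research areas: high-performance computing, parallelizing compilation optimization.
Li Yingying (李颖颖) (1984-), female, lecturer, CCF professional member. Main research areas: high-performance computing, parallelizing compilation optimization.

Corresponding author:

Li Yanbing, E-mail: li.yanbing@outlook.com

Fund project:

National Natural Science Foundation of China (61702546); National High Technology Research and Development Program of China (863) (2014AA01A300)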

    Abstract:

    Heterogeneous many-core processors have become an important trend in processor development for high-performance computing, but their more complex architecture makes programming considerably harder. To address this problem, this study proposes a parallelizing compilation framework for heterogeneous many-core processors, built on the open-source Open64 compiler, that automatically transforms a sequential program into a heterogeneous parallel program. The framework consists of four main modules: a task partitioning module, which identifies program regions suitable for accelerated computation and implements multi-dimensional parallelism recognition for nested loops; a data layout module, which places data between main memory and the scratchpad memory (SPM) and implements array boundary analysis and pointer range analysis; a transfer optimization module, which implements data transfer merging, transfer hoisting, packed transfer, and array transposition; and a benefit evaluation module, which builds a cost model and implements a hybrid static-dynamic benefit evaluation method on top of it. The framework is implemented for the Sunway SW26010 processor and evaluated on a number of benchmarks. The experimental results show that the framework can transform these programs into parallel code for the heterogeneous many-core architecture and obtain good speedups.
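    As a rough illustration of the transfer hoisting optimization named in the abstract, the toy model below moves a loop-invariant transfer out of a loop. It is only a sketch under stated assumptions, not the framework's implementation: a list copy stands in for a main-memory-to-SPM transfer, and `TransferCounter`, `dot_rows_naive`, and `dot_rows_hoisted` are hypothetical names introduced here for illustration.

```python
"""Toy model of transfer hoisting; list copies stand in for SPM transfers."""


class TransferCounter:
    """Counts simulated main-memory -> SPM copies (hypothetical helper)."""

    def __init__(self):
        self.count = 0

    def to_spm(self, data):
        # A list copy stands in for a real data transfer into the SPM.
        self.count += 1
        return list(data)


def dot_rows_naive(coeff, rows, counter):
    """Transfer the loop-invariant `coeff` again on every iteration."""
    results = []
    for row in rows:
        spm_coeff = counter.to_spm(coeff)  # invariant, yet re-sent each time
        spm_row = counter.to_spm(row)
        results.append(sum(c * x for c, x in zip(spm_coeff, spm_row)))
    return results


def dot_rows_hoisted(coeff, rows, counter):
    """Transfer `coeff` once, before the loop (the hoisted form)."""
    spm_coeff = counter.to_spm(coeff)  # hoisted out of the loop
    results = []
    for row in rows:
        spm_row = counter.to_spm(row)  # still varies per iteration
        results.append(sum(c * x for c, x in zip(spm_coeff, spm_row)))
    return results


if __name__ == "__main__":
    coeff = [1, 2, 3]
    rows = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
    naive, hoisted = TransferCounter(), TransferCounter()
    same = dot_rows_naive(coeff, rows, naive) == dot_rows_hoisted(coeff, rows, hoisted)
    print("same results:", same)
    print("transfers (naive, hoisted):", naive.count, hoisted.count)
```

    Both versions compute the same values; for n loop iterations the naive form issues 2n simulated transfers and the hoisted form n + 1. A compiler can only apply this rewrite when it can show that the transferred data is not modified inside the loop.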

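Cite this article

Li YB, Zhao RC, Han L, Zhao J, Xu JL, Li YY. Parallelizing compilation framework for heterogeneous many-core processors. Ruan Jian Xue Bao/Journal of Software, 2019,30(4):981-1001 (in Chinese with English abstract).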

History
  • Received: 2016-12-13
  • Last revised: 2017-01-23
  • Published online: 2019-04-01