一种六边形循环分块的Jacobi计算优化方法

doi:10.13328/j.cnki.jos.006945

微信服务号

微信订阅号

2025年8月5日 1:08 星期二

首页 > 过刊浏览>2024年第35卷第8期 >3721-3738. DOI:10.13328/j.cnki.jos.006945

PDF HTML阅读 XML下载导出引用引用提醒

一种六边形循环分块的Jacobi计算优化方法
DOI:
                        10.13328/j.cnki.jos.006945
                    
CSTR:
                        
                    
作者:
                        屈彬屈彬
西安交通大学 计算机科学与技术学院, 陕西 西安 710049
在期刊界中查找
在百度中查找
在本站中查找
刘松刘松
西安交通大学 计算机科学与技术学院, 陕西 西安 710049
在期刊界中查找
在百度中查找
在本站中查找
张增源张增源
西安交通大学 计算机科学与技术学院, 陕西 西安 710049
在期刊界中查找
在百度中查找
在本站中查找
马洁马洁
西安交通大学 计算机科学与技术学院, 陕西 西安 710049
在期刊界中查找
在百度中查找
在本站中查找
伍卫国伍卫国
西安交通大学 计算机科学与技术学院, 陕西 西安 710049
在期刊界中查找
在百度中查找
在本站中查找

                    
作者单位:
作者简介:屈彬(1996－), 男, 硕士, 主要研究领域为程序性能优化, 并行计算, 编译优化;刘松(1987－), 男, 博士, 副教授, CCF专业会员, 主要研究领域为程序性能优化, 计算机体系结构, 编译优化;张增源(1998－), 男, 硕士, 主要研究领域为程序性能优化, 并行计算;马洁(1997－), 女, 硕士, 主要研究领域为并行计算, 任务调度;伍卫国(1963－), 男, 博士, 教授, 博士生导师, CCF高级会员, 主要研究领域为高性能计算架构, 海量存储系统, 计算机网络, 嵌入式系统.
通讯作者:伍卫国, E-mail: wgwu@xjtu.edu.cn
中图分类号:TP301
基金项目:国家自然科学基金(62002279); 陕西省自然科学基础研究计划一般项目(青年) (2020JQ-077)

Hexagonal Loop Tiling for Jacobi Computation Optimization Method

Author:

QU Bin
QU Bin
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
在期刊界中查找
在百度中查找
在本站中查找
LIU Song
LIU Song
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
在期刊界中查找
在百度中查找
在本站中查找
ZHANG Zeng-Yuan
ZHANG Zeng-Yuan
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
在期刊界中查找
在百度中查找
在本站中查找
MA Jie
MA Jie
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
在期刊界中查找
在百度中查找
在本站中查找
WU Wei-Guo
WU Wei-Guo
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
在期刊界中查找
在百度中查找
在本站中查找

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献 [35]

相似文献 [20]

引证文献

资源附件

文章评论

摘要:

Jacobi计算是一种模板计算, 在科学计算领域具有广泛的应用. 围绕Jacobi计算的性能优化是一个经典的课题, 其中循环分块是一种较有效的优化方法. 现有的循环分块主要关注分块对并行通信和程序局部性的影响, 缺少对负载均衡和向量化等其他因素的考虑. 面向多核计算架构, 分析比较不同分块方法, 并选择一种先进的六边形分块作为加速Jacobi计算的主要方法. 在分块大小选择上, 综合考虑分块对程序向量化效率、局部性和计算核负载均衡等多方面的影响, 提出一种六边形分块大小选择算法Hexagon_TSS. 实验结果表明所提算法相对于原始串行程序计算方法, 最好情况可将L1数据缓存失效率降低至其5.46%, 最大加速比可达24.48, 并且具有良好的可扩展性.

关键词:Jacobi计算;六边形分块方法;分块大小选择;性能优化;多核架构

Abstract:

Jacobi computation is a kind of stencil computation, which has been widely applied in the field of scientific computing. The performance optimization of Jacobi computation is a classic topic, where loop tiling is an effective optimization method. The existing loop tiling methods mainly focus on the impact of tiling on parallel communication and program locality and fail to consider other factors such as load balancing and vectorization. This study analyzes and compares several tiling methods based on multi-core computing architecture and chooses an advanced hexagonal tiling as the main method to accelerate Jacobi computation. For tile size selection, this study proposes a hexagonal tile size selection algorithm called Hexagon_TSS by comprehensively considering the impact of tiling on load balancing, vectorization efficiency, and locality. The experimental results show that the L1 data cache miss rate can be reduced to 5.46% of original serial program computation in the best case by Hexagon_TSS, and the maximum speedup reaches 24.48. The proposed method also has excellent scalability.

Key words:Jacobi computation;hexagonal tiling method;tile size selection;performance optimization;multi-core architecture

参考文献

[1] Stoltzfus L, Hagedorn B, Steuwer M, Gorlatch S, Dubach C. Tiling optimizations for stencil computations using rewrite rules in Lift. ACM Transactions on Architecture and Code Optimization, 2019, 16(4): 52. [doi: 10.1145/3368858]

[2] Levchenko V, Zakirov A, Perepelkina A. GPU implementation of conetorre algorithm for fluid dynamics simulation. In: Proc. of the 15th Int’l Conf. on Parallel Computing Technologies. Almaty: Springer, 2019. 199–213.

[3] Pouchet LN, Bondhugula U, Bastoul C, Cohen A, Ramanujam J, Sadayappan P, Vasilache N. Loop transformations: Convexity, pruning and optimization. In: Proc. of the 38th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages. Austin: ACM, 2011. 549–562.

[4] 刘松, 伍卫国, 赵博, 蒋庆. 面向局部性和并行优化的循环分块技术. 计算机研究与发展, 2015, 52(5): 1160–1176. [doi: 10.7544/issn1000-1239.2015.20131387]

Liu S, Wu WG, Zhao B, Jiang Q. Loop tiling for optimization of locality and parallelism. Journal of Computer Research and Development, 2015, 52(5): 1160–1176 (in Chinese with English abstract). [doi: 10.7544/issn1000-1239.2015.20131387]

[5] Pop S, Cohen A, Bastoul C, Girbal S, Silber GA, Vasilache N. GRAPHITE: Polyhedral analyses and optimizations for GCC. In: Proc. of the 4th GCC Developper’s Summit. Ottawa, 2006. 1–18.

[6] Sun XH, Liu YH. Utilizing concurrency: A new theory for memory wall. In: Proc. of the 29th Int’l Workshop on Languages and Compilers for Parallel Computing. Rochester: Springer, 2017. 18–23.

[7] Xue JL. Loop Tiling for Parallelism. 2nd ed., New York: Springer, 2012.

[8] Neumaier A. Introduction to Numerical Analysis. Cambridge: Cambridge University Press, 2001.

[9] 陈国良. 并行计算: 结构·算法·编程. 第3版, 北京: 高等教育出版社, 2011.

Chen GL. Parallel Computing: Architecture, Algorithm and Programming. 3rd ed., Beijing: Higher Education Press, 2011 (in Chinese).

[10] 赵捷, 李颖颖, 赵荣彩. 基于多面体模型的编译“黑魔法”. 软件学报, 2018, 29(8): 2371-2396. http://www.jos.org.cn/1000-9825/5563.htm

Zhao J, Li YY, Zhao RC. “Black magic” of polyhedral compilation. Ruan Jian Xue Bao/Journal of Software, 2018, 29(8): 2371-2396 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5563.htm

[11] Bondhugula U, Hartono A, Ramanujam J, Sadayappan P. A practical automatic polyhedral parallelizer and locality optimizer. In: Proc. of the 29th ACM SIGPLAN Conf. on Programming Language Design and Implementation. Tucson: ACM, 2008. 101–113.

[12] Grosser T, Cohen A, Kelly PHJ, Ramanujam J, Sadayappan P, Verdoolaege S. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In: Proc. of the 6th Workshop on General Purpose Processor Using Graphics Processing Units. Houston: ACM, 2013. 24–31.

[13] Grosser T, Cohen A, Holewinski J, Sadayappan P, Verdoolaege S. Hybrid hexagonal/classical tiling for GPUs. In: Proc. of the 2014 Annual IEEE/ACM Int’l Symp. on Code Generation and Optimization. Orlando: ACM, 2014. 66–75.

[14] Grosser T, Verdoolaege S, Cohen A, Sadayappan P. The relation between diamond tiling and hexagonal tiling. Parallel Processing Letters, 2014, 24(3): 1441002. [doi: 10.1142/S0129626414410023]

[15] Bondhugula U, Bandishti V, Pananilath I. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems, 2017, 28(5): 1285–1298. [doi: 10.1109/TPDS.2016.2615094]

[16] Wolf ME, Lam MS. A data locality optimizing algorithm. In: Proc. of the 1991 ACM SIGPLAN Conf. on Programming Language Design and Implementation. Toronto: ACM, 1991. 30–44.

[17] 刘松, 赵博, 蒋庆, 伍卫国. 一种面向循环优化和非规则代码段的粗粒度半自动并行化方法. 计算机学报, 2017, 40(9): 2127–2147. [doi: 10.11897/SP.J.1016.2017.02127]

Liu S, Zhao B, Jiang Q, Wu WG. A semi-automatic coarse-grained parallelization approach for loop optimization and irregular code sections. Chinese Journal of Computers, 2017, 40(9): 2127–2147 (in Chinese with English abstract). [doi: 10.11897/SP.J.1016.2017.02127]

[18] Whaley RC, Petitet A, Dongarra JJ. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 2001, 27(1–2): 3–35.

[19] Wang Z, O’Boyle M. Machine learning in compiler optimization. Proceedings of the IEEE, 2018, 106(11): 1879–1901. [doi: 10.1109/JPROC.2018.2817118]

[20] 刘慧, 徐金龙, 赵荣彩, 姚金阳. 学习模型指导的编译器优化顺序选择方法. 计算机研究与发展, 2019, 56(9): 2012–2026. [doi: 10.7544/issn1000-1239.2019.20180789]

Liu H, Xu JL, Zhao RC, Yao JY. Compiler optimization sequence selection method based on learning model. Journal of Computer Research and Development, 2019, 56(9): 2012–2026 (in Chinese with English abstract). [doi: 10.7544/issn1000-1239.2019.20180789]

[21] Xiang XY, Ding C, Luo H, Bao B. HOTL: A higher order theory of locality. In: Proc. of the 18th Int’l Conf. on Architectural Support for Programming Languages and Operating Systems. Houston: ACM, 2013. 343–356.

[22] Sabarimuthu JM, Venkatesh TG. Analytical miss rate calculation of L2 cache from the RD profile of L1 cache. IEEE Transactions on Computers, 2018, 67(1): 9–15. [doi: 10.1109/TC.2017.2723878]

[23] 狄鹏, 胡长军, 李建江. GPU上高效Jacobi迭代算法的研究与实现. 小型微型计算机系统, 2012, 33(9): 1962–1967. [doi: 10.3969/j.issn.1000-1220.2012.09.018]

Di P, Hu CJ, Li JJ. Research and implementation of effective Jacobi iteration algorithms on GPU. Journal of Chinese Computer Systems, 2012, 33(9): 1962–1967 (in Chinese with English abstract). [doi: 10.3969/j.issn.1000-1220.2012.09.018]

[24] Liu S, Cui YZ, Zou NJ, Zhu WH, Zhang D, Wu WG. Revisiting the parallel strategy for DOACROSS loops. Journal of Computer Science and Technology, 2019, 34(2): 456–475. [doi: 10.1007/s11390-019-1919-7]

[25] Li YZ, Schwiebert L. Memory-optimized wavefront parallelism on GPUs. International Journal of Parallel Programming, 2020, 48(6): 1008–1031. [doi: 10.1007/s10766-020-00658-y]

[26] Assis ÍAS, Fernandes JB, Barros T, Xavier-De-Souza S. Auto-tuning of dynamic scheduling applied to 3D reverse time migration on multicore systems. IEEE Access, 2020, 8: 145115–145127. [doi: 10.1109/ACCESS.2020.3015045]

[27] Huang PH, Chen YS, Liao JH. QT-adaptation engine: Adaptive QoS-aware scheduling and governing in thermally constrained mobile devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(3): 585–598. [doi: 10.1109/TCAD.2019.2897697]

[28] Guo Y, Zhao JS, Cave V, Sarkar V. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In: Proc. of the 15th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Bangalore: ACM, 2010. 341–342.

[29] Karimov J, Rabl T, Markl V. PolyBench: The first benchmark for polystores. In: Proc. of the 10th TPC Technology Conf. on Performance Evaluation and Benchmarking. Rio de Janeiro: Springer, 2019. 24–41.

引用本文

屈彬,刘松,张增源,马洁,伍卫国.一种六边形循环分块的Jacobi计算优化方法.软件学报,2024,35(8):3721-3738

复制

文章指标

点击次数:699
下载次数: 1834
HTML阅读次数: 962
引用次数: 0

历史

收稿日期:2021-07-18
最后修改日期:2022-02-23
录用日期:
在线发布日期: 2023-09-13
出版日期: 2024-08-06

微信服务号

微信订阅号

引用本文

相关视频

分享

文章指标

历史

文章二维码

微信服务号

微信订阅号

引用本文

相关视频

分享

微信扫一扫：分享

文章指标

历史

文章二维码