Automatic Mixed Precision Optimization for Stencil Computation
Authors: Song Guanghui, Guo Shaozhong, Zhao Jie, Tao Xiaohan, Li Fei, Xu Jinchen
About the authors:

  • Song Guanghui (1997-), male, master's student. His research interests include high-performance computing and advanced compilation techniques.
  • Guo Shaozhong (1964-), female, professor, CCF senior member. Her research interests include high-performance computing and distributed processing.
  • Zhao Jie (1987-), male, lecturer, CCF professional member. His research interests include advanced compilation techniques.
  • Tao Xiaohan (1996-), male, Ph.D. candidate. His research interests include advanced compilation techniques.
  • Li Fei (1996-), male, master's student. His research interests include high-performance computing.
  • Xu Jinchen (1987-), male, lecturer. His research interests include high-performance computing.

Corresponding author:

Xu Jinchen, E-mail: atao728208@126.com

CLC number:

TP18

Funding:

National Natural Science Foundation of China (U20A20226)



    Abstract:

    Mixed precision has driven many advances in deep learning and in precision tuning and optimization, and extensive research shows that mixed-precision optimization for stencil computation is likewise a challenging direction. Meanwhile, the series of results obtained by the polyhedral model in automatic parallelization indicates that the model provides a sound mathematical abstraction for loop nests, on top of which a range of loop transformations can be performed. This study designs and implements an automatic mixed-precision optimizer for stencil computation based on polyhedral compilation techniques. By performing iteration-space partitioning, data-flow analysis, and schedule-tree transformations at the intermediate representation level, it achieves source-to-source automatic generation of mixed-precision code for stencil computation for the first time. Experiments demonstrate that the automatically optimized code can fully exploit its parallel potential and improve program performance while reducing precision redundancy. Taking high-precision computation as the baseline, the maximum speedup is 1.76 and the geometric mean speedup is 1.15 on the x86 platform; on the new-generation domestic Sunway platform, the maximum speedup is 1.64 and the geometric mean speedup is 1.20.

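    To make the description above more concrete, the sketch below shows, for a PolyBench-style 1D Jacobi stencil, the kind of source-to-source output that iteration-space partitioning with precision demotion could produce: one sub-range of the iteration space stays in double, while the other is computed in float. It is a hand-written illustration under assumed names and sizes (N, STEPS, CUT), not code generated by the optimizer described in this paper.

        /* Hand-written illustration (not output of the paper's optimizer):
         * the iteration space of a 1D Jacobi-style stencil is split so that
         * the sub-range [1, CUT) stays in double while [CUT, N-1) is demoted
         * to float. N, STEPS, and CUT are assumed values. */
        #include <stdio.h>

        #define N     1024
        #define STEPS 100
        #define CUT   128

        static double A[N], B[N];

        int main(void) {
            for (int i = 0; i < N; i++)
                A[i] = (double)i / N;

            for (int t = 0; t < STEPS; t++) {
                /* High-precision partition: computed entirely in double. */
                for (int i = 1; i < CUT; i++)
                    B[i] = 0.33333 * (A[i - 1] + A[i] + A[i + 1]);

                /* Low-precision partition: operands demoted to float for the
                 * stencil update, result promoted back into the double array. */
                for (int i = CUT; i < N - 1; i++) {
                    float a0 = (float)A[i - 1], a1 = (float)A[i], a2 = (float)A[i + 1];
                    B[i] = (double)(0.33333f * (a0 + a1 + a2));
                }

                /* Advance one time step; boundary cells A[0] and A[N-1] stay fixed. */
                for (int i = 1; i < N - 1; i++)
                    A[i] = B[i];
            }

            printf("A[N/2] = %f\n", A[N / 2]);
            return 0;
        }

    In the tool itself, the split point and the set of demoted statements would presumably be derived from the data-flow analysis and expressed as schedule-tree transformations rather than as the fixed constants used in this sketch.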
Cite this article

Song GH, Guo SZ, Zhao J, Tao XH, Li F, Xu JC. Automatic mixed precision optimization for stencil computation. Ruan Jian Xue Bao/Journal of Software, 2023, 34(12): 5704-5723 (in Chinese with English abstract).

History
  • Received: 2022-04-01
  • Revised: 2022-06-11
  • Published online: 2023-02-22
  • Publication date: 2023-12-06