Automatic Generation of Pipeline Parallel Code for Regular DOACROSS Loops
Author: Liu Xiaoxian, Zhao Rongcai, Zhao Jie, Xu Jinlong

    Abstract:

    To extract as much performance as possible from sequential applications, it is important to exploit the parallelism hidden in DOACROSS loops and to find good schemes for their parallel execution. Pipelining is one such parallelization method, and it works well for regular DOACROSS loops. However, automatically generating high-performance pipelined code is difficult, so parallelizing compilers usually treat DOACROSS loops conservatively and serialize them, which discards their inherent parallelism and hurts the performance of the generated parallel programs. To address this problem, this paper implements automatic generation of pipeline parallel code for regular DOACROSS loops on multicore platforms, based on OpenMP. First, a heuristic is proposed for choosing the partition loop and the tiling loop of a regular DOACROSS loop nest. Second, a formula based on a pipelining cost model is derived to compute the optimal tile size. Finally, synchronization between threads is implemented with counter semaphores. Measured on the wavefront loops of finite difference relaxation, the representative loops of the finite difference time domain method, and the Poisson, LU, and Jacobi procedures, the pipeline parallel loops generated automatically by the proposed method run efficiently on a multicore platform, achieving on average up to 89% of the optimal speedup obtained by hand tuning.
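    The scheme described in the abstract (one loop partitioned across threads, the other loop tiled, with counter semaphores synchronizing adjacent threads) can be illustrated on a small wavefront recurrence. The sketch below is a hypothetical reconstruction in Python, not the paper's OpenMP implementation: the recurrence, the grid sizes, and names such as `TILE` and `wavefront_pipeline` are illustrative assumptions, and Python's `threading.Semaphore` stands in for the counter semaphores.

```python
import threading

N, M = 8, 16    # grid: N rows (partition loop), M columns (tiling loop) -- hypothetical sizes
T = 4           # number of pipeline threads; assumes T divides N
TILE = 4        # tile size for the inner loop; assumes TILE divides M

def wavefront_seq():
    """Sequential reference: A[i][j] = A[i-1][j] + A[i][j-1], boundaries fixed at 1."""
    A = [[1] * M for _ in range(N)]
    for i in range(1, N):
        for j in range(1, M):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A

def wavefront_pipeline():
    """Pipelined version: rows block-partitioned over T threads, columns tiled."""
    A = [[1] * M for _ in range(N)]
    # Counter semaphores: sems[t] counts the tiles thread t has finished, so the
    # successor thread can run any number of tiles behind without busy-waiting.
    sems = [threading.Semaphore(0) for _ in range(T)]
    rows = N // T

    def worker(t):
        lo, hi = t * rows, (t + 1) * rows
        for k in range(M // TILE):            # one pipeline stage per column tile
            if t > 0:
                sems[t - 1].acquire()         # wait until the predecessor finished tile k
            for i in range(max(lo, 1), hi):
                for j in range(max(k * TILE, 1), (k + 1) * TILE):
                    A[i][j] = A[i - 1][j] + A[i][j - 1]
            sems[t].release()                 # let the successor start its tile k

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(T)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return A

if __name__ == "__main__":
    assert wavefront_pipeline() == wavefront_seq()
    print("pipeline result matches sequential wavefront")
```

    The tile size controls exactly the trade-off the paper's cost-model formula resolves: smaller tiles let downstream threads start earlier (shorter pipeline fill), while larger tiles amortize the semaphore operations over more iterations.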

Get Citation

Liu XX, Zhao RC, Zhao J, Xu JL. Automatic generation of pipeline parallel code for regular DOACROSS loops. Ruan Jian Xue Bao/Journal of Software, 2014,25(6):1154-1168 (in Chinese).

History
  • Received: December 10, 2012
  • Revised: March 07, 2013
  • Online: May 30, 2014
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4