Automatic Generation of Pipeline Parallel Code for Regular DOACROSS Loops
Author: Liu Xiaoxian, Zhao Rongcai, Zhao Jie, Xu Jinlong

    Abstract:

    To extract as much performance as possible from sequential applications, it is important to exploit the parallelism hidden in DOACROSS loops and to find good schemes for their parallel execution. Pipelining is one such parallelization method, and it works well for regular DOACROSS loops. However, automatically generating high-performance pipelined code is difficult, so parallelizing compilers usually treat DOACROSS loops conservatively and serialize them, which discards their inherent parallelism and hurts the performance of the generated parallel programs. To address this problem, this paper implements automatic generation of pipeline parallel code for regular DOACROSS loops on multicore platforms, based on OpenMP. First, a heuristic is proposed for choosing the partition loop and the tiling loop of a regular DOACROSS loop nest. Second, a formula based on a pipelining cost model is derived to compute the optimal tile size. Finally, synchronization between threads is implemented with counter semaphores. Measured on the wavefront loops of finite difference relaxation, the representative loops of the finite difference time domain method, and the Poisson, LU, and Jacobi procedures, the pipeline parallel loops generated automatically by the proposed method run efficiently on a multicore platform, achieving on average up to 89% of the optimal speedup obtained by hand tuning.
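    The scheme described in the abstract (one loop partitioned across threads, the other loop tiled, with counter semaphores synchronizing adjacent threads) can be illustrated on a small wavefront recurrence. The sketch below is a hypothetical reconstruction in Python, not the paper's OpenMP implementation: the recurrence, the grid sizes, and names such as `TILE` and `wavefront_pipeline` are illustrative assumptions, and Python's `threading.Semaphore` stands in for the counter semaphores.

```python
import threading

N, M = 8, 16    # grid: N rows (partition loop), M columns (tiling loop) -- hypothetical sizes
T = 4           # number of pipeline threads; assumes T divides N
TILE = 4        # tile size for the inner loop; assumes TILE divides M

def wavefront_seq():
    """Sequential reference: A[i][j] = A[i-1][j] + A[i][j-1], boundaries fixed at 1."""
    A = [[1] * M for _ in range(N)]
    for i in range(1, N):
        for j in range(1, M):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A

def wavefront_pipeline():
    """Pipelined version: rows block-partitioned over T threads, columns tiled."""
    A = [[1] * M for _ in range(N)]
    # Counter semaphores: sems[t] counts the tiles thread t has finished, so the
    # successor thread can run any number of tiles behind without busy-waiting.
    sems = [threading.Semaphore(0) for _ in range(T)]
    rows = N // T

    def worker(t):
        lo, hi = t * rows, (t + 1) * rows
        for k in range(M // TILE):            # one pipeline stage per column tile
            if t > 0:
                sems[t - 1].acquire()         # wait until the predecessor finished tile k
            for i in range(max(lo, 1), hi):
                for j in range(max(k * TILE, 1), (k + 1) * TILE):
                    A[i][j] = A[i - 1][j] + A[i][j - 1]
            sems[t].release()                 # let the successor start its tile k

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(T)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return A

if __name__ == "__main__":
    assert wavefront_pipeline() == wavefront_seq()
    print("pipeline result matches sequential wavefront")
```

    The tile size controls exactly the trade-off the paper's cost-model formula resolves: smaller tiles let downstream threads start earlier (shorter pipeline fill), while larger tiles amortize the semaphore operations over more iterations.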

Get Citation

Liu XX, Zhao RC, Zhao J, Xu JL. Automatic generation of pipeline parallel code for regular DOACROSS loops. Ruan Jian Xue Bao/Journal of Software, 2014,25(6):1154-1168 (in Chinese).

History
  • Received: December 10, 2012
  • Revised: March 07, 2013
  • Online: May 30, 2014
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4