Because of the widening speed gap between processors and memory, cache hierarchies have been built into memory systems, but they introduce additional latency (cache penalty). This paper presents an algorithm named PCPLPU (prevent cache penalty by loop partition-unrolling), which prevents cache penalty in loops by combining loop partitioning with loop unrolling. Experimental results show that PCPLPU prevents cache penalty and improves program performance.