Optimization to Prevent Cache Penalty by Loop Partition and Loop Unrolling



Authors:
Liu Li (刘利), Chen Yu (陈彧), Qiao Lin (乔林), Tang Zhizhong (汤志忠)
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

                    
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60573100

Optimization to Prevent Cache Penalty by Loop Partition and Loop Unrolling

Abstract:

The speed gap between processors and memory systems keeps widening. Cache hierarchies were introduced into the memory system to narrow this gap, but they also add memory latency of their own, referred to as cache penalty. This paper presents PCPLPU (prevent cache penalty by loop partition-unrolling), an algorithm that combines loop partition and loop unrolling to prevent cache penalty in loops. Experimental results show that PCPLPU effectively prevents cache penalty and improves program performance.

Key words: loop partition; loop unrolling; cache penalty; bank conflict
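The abstract names the two transformations without showing them, so the following is a minimal generic sketch of what loop partition and loop unrolling look like on a simple reduction. It is not the PCPLPU algorithm, whose details are not given on this page. The function names `sum_baseline` and `sum_partition_unrolled` are hypothetical, and the choice of two partitions with an unroll factor of 2 is an assumption made only for illustration.

```c
#include <stddef.h>

/* Baseline: a single sequential pass over the array. */
long sum_baseline(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Illustrative only (not the paper's algorithm).
 * Loop partition: the iteration space [0, n) is split into two halves.
 * Loop unrolling: the loop body is replicated so that each trip performs
 * one iteration from each half, so the two memory accesses in a trip
 * touch addresses that are half an array apart. */
long sum_partition_unrolled(const int *a, size_t n) {
    size_t half = n / 2;
    long s0 = 0, s1 = 0;
    for (size_t i = 0; i < half; i++) {
        s0 += a[i];          /* iteration i of the first partition  */
        s1 += a[half + i];   /* iteration i of the second partition */
    }
    if (n % 2)
        s1 += a[n - 1];      /* leftover element when n is odd */
    return s0 + s1;
}
```

Both functions return the same sum for any input. Whether the transformed access pattern actually avoids a cache penalty such as a bank conflict depends on the target cache organization, which is the question a compiler algorithm like PCPLPU has to answer for a given processor.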


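Cite this article

Liu L, Chen Y, Qiao L, Tang ZZ. Optimization to prevent cache penalty by loop partition and loop unrolling. Journal of Software, 2008,19(9):2228-2242.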


History

Received: 2005-10-08
Last revised: 2006-07-10
