Preventing Cache Penalty by Loop Partition and Loop Unrolling
Authors: Liu Li, Chen Yu, Qiao Lin, Tang Zhizhong
Funding:

Supported by the National Natural Science Foundation of China under Grant No.60573100

    Abstract:

    As the speed gap between the memory system and the processor keeps widening, memory systems have adopted cache hierarchies, but these introduce additional memory latency (cache penalty). This paper presents PCPLPU (prevent cache penalty by loop partition-unrolling), an algorithm that prevents cache penalty in loops by combining loop partition with loop unrolling. Experimental results show that PCPLPU effectively prevents cache penalty and improves program performance.
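
    The two loop transformations named in the abstract can be illustrated with a minimal sketch. This is not the PCPLPU algorithm, whose details are not given on this page. It only shows, under assumed names (`UNROLL`, `scale_add_baseline`, `scale_add_partitioned_unrolled` are all hypothetical), how an iteration range might be partitioned into a main part and a remainder, with the main part unrolled. Python is used for readability; any latency benefit would only appear in compiled code scheduled by an optimizing compiler.

```python
UNROLL = 4  # hypothetical unroll factor, chosen only for illustration


def scale_add_baseline(a, b, k):
    """Reference loop: computes a[i] + k * b[i], one element per iteration."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + k * b[i]
    return out


def scale_add_partitioned_unrolled(a, b, k):
    """Same computation after loop partition followed by loop unrolling."""
    n = len(a)
    out = [0] * n
    # Partition point: the largest multiple of UNROLL not exceeding n.
    main_end = n - n % UNROLL

    # Partition 1: trip count is a multiple of UNROLL, so the body can be
    # replicated UNROLL times with no bounds check between the copies.
    for i in range(0, main_end, UNROLL):
        out[i] = a[i] + k * b[i]
        out[i + 1] = a[i + 1] + k * b[i + 1]
        out[i + 2] = a[i + 2] + k * b[i + 2]
        out[i + 3] = a[i + 3] + k * b[i + 3]

    # Partition 2: the remaining n % UNROLL iterations, left rolled.
    for i in range(main_end, n):
        out[i] = a[i] + k * b[i]
    return out
```

    Both functions return the same list for any equal-length inputs; the partitioned version only changes the order in which loop control and loop body are arranged, which is what gives a compiler room to schedule loads further apart.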

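Cite this article

Liu Li, Chen Yu, Qiao Lin, Tang Zhizhong. Preventing cache penalty by loop partition and loop unrolling. Journal of Software (软件学报), 2008,19(9):2228-2242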

History
  • Received: 2005-10-08
  • Last revised: 2006-07-10