基于RISC-V VLIW架构的混合指令调度算法
作者:
通讯作者:

赵家程,E-mail:zhaojiacheng@ict.ac.cn

中图分类号:

TP314

基金项目:

新一代人工智能国家科技重大专项(2021ZD0110504); 国家自然科学基金(U23B2020, 62090024, 62302479); 中国科学院计算技术研究所创新课题(E361010, E261110)


Hybrid Instruction Scheduling Algorithm for RISC-V VLIW Architecture
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [45]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    指令级并行是处理器体系结构研究的经典难题. VLIW架构是数字信号处理器领域中提升指令级并行的一种常用架构. VLIW架构的指令发射顺序是由编译器决定的, 因此其指令级并行的性能强依赖于编译器的指令调度. 为了探索RISC-V VLIW架构的扩展潜力, 丰富RISC-V生态, 研究RISC-V VLIW架构的指令调度算法优化. 针对单个调度区域, 整数线性规划调度算法能够得到调度最优解但复杂度较高, 表调度算法复杂度较低但无法得到调度最优解. 为了结合两种调度算法的优点, 提出了一种IPC理论模型指导的混合指令调度算法, 即通过IPC理论模型定位到表调度未达最优解的调度区域, 再对该调度区域进一步实施整数线性规划调度算法. 该理论模型基于数据流分析技术协同考虑指令依赖和硬件资源, 能够以线性复杂度给出IPC的理论上界. 混合调度的核心在于IPC理论模型的准确性, 理论模型准确率为95.74%. 在给定的测评基准上, 提出的理论模型应用于混合指令调度时, 能够平均认定94.62%的调度区域在表调度下已达最优解, 因此仅有5.38%的调度区域需再进行整数线性规划调度. 该混合调度算法能够以接近表调度的复杂度达到整数线性规划调度的调度效果.

    Abstract:

    Instruction-level parallelism is a fundamental challenge in processor architecture research. Very long instruction word (VLIW) architecture is widely used in the field of digital signal processing to enhance instruction-level parallelism. In VLIW architecture, the instruction issue order is determined by the compiler, making its performance highly dependent on the compiler’s instruction scheduling. To explore the potential of RISC-V VLIW architecture and further enrich the RISC-V ecosystem, this study focuses on optimizing instruction scheduling algorithms for RISC-V VLIW architecture. For a single scheduling region, the integer linear programming (ILP) scheduling can achieve optimal solutions but suffers from high computational complexity, whereas list scheduling offers lower complexity at the cost of potentially suboptimal solutions. To leverage the strengths of both approaches, this study proposes a hybrid instruction scheduling algorithm. The scheduling region where the list scheduling has not reached the optimal solution can be located with the IPC theoretical model, and then the integer linear programming scheduling algorithm further processes the located scheduling region. The theoretical model is based on data flow analysis, accounting for both instruction dependencies and hardware resources, and provides a theoretical upper bound for IPC with linear complexity. The accuracy of the IPC theoretical model is a critical factor for the success of hybrid scheduling and achieves 95.74% accuracy in this study. On the given benchmark, the IPC model identifies that 94.62% of scheduling regions has reached optimal solution with list scheduling, leaving only 5.38% requiring further refinement with ILP scheduling. The proposed hybrid scheduling algorithm achieves the scheduling quality of ILP scheduling while maintaining a complexity comparable to that of list scheduling.

    参考文献
    [1] Waterman A, Asanovic K. The RISC-V Instruction Set Manual Volume I: Unprivileged IAS. Document Version 20191213. 2024. https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf
    [2] 包云岗, 孙凝晖. 开源芯片生态技术体系构建面临的机遇与挑战. 中国科学院院刊, 2022, 37(1): 24–29.
    Bao YG, Sun NH. Opportunities and challenges of building CPU ecosystem with open-source mode. Bulletin of Chinese Academy of Sciences, 2022, 37(1): 24–29 (in Chinese with English abstract).
    [3] Celio C, Patterson DA, Asanovi? K. The Berkeley out-of-order machine (BOOM): An industry-competitive, synthesizable, parameterized RISC-V processor. Berkeley: University of California, 2015.
    [4] 李晓霖, 韩萌, 郝凯, 薛海韵, 卢圣健, 张昆明, 祁楠, 牛星茂, 肖利民, 郝沁汾. 面向100 Gbps网络应用的RISC-V CPU设计与实现. 计算机辅助设计与图形学学报, 2021, 33(6): 956–962.
    Li XL, Han M, Hao K, Xue HY, Lu SJ, Zhang KM, Qi N, Niu XM, Xiao LM, Hao QF. Design of RISC-V CPU for 100 Gbps network application. Journal of Computer-aided Design & Computer Graphics, 2021, 33(6): 956–962 (in Chinese with English abstract).
    [5] 胡振波. 手把手教你设计CPU-RISC-V处理器篇. 北京: 人民邮电出版社, 2018.
    Hu ZB. Teach You to Design CPU by hand. RISC-V Processor. Beijing: Posts and Telecommunications Press, 2018 (in Chinese).
    [6] Tine B, Yalamarthy KP, Elsabbagh F, Hyesoon K. Vortex: Extending the RISC-V ISA for GPGPU and 3D-graphics. In: Proc. of the 54th Annual IEEE/ACM Int’l Symp. on Microarchitecture (MICRO-54). Virtual Event: ACM, 2021. 754–766. [doi: 10.1145/3466752.3480128]
    [7] Fisher JA. Very long instruction word architectures and the ELI-512. In: Proc. of the 10th Annual Int’l Symp. on Computer Architecture. Stockholm: ACM, 1983. 140–150. [doi: 10.1145/800046.801649]
    [8] Qualcomm. The Qualcomm Hexagon SDK. 2024. https://www.qualcomm.com/developer/software/hexagon-npu-sdk
    [9] Texas Instruments. TMS320C64x Technical Overview. 2024. https://www.ti.com/lit/ug/spru395b/spru395b.pdf
    [10] 王向前, 洪一, 王昊, 郑启龙. 魂芯DSP的编译器设计与优化. 电子学报, 2015, 43(8): 1656–1661.
    Wang XQ, Hong Y, Wang H, Zheng QL. Compiler design and optimization for BWDSP. Acta Electronica Sinica, 2015, 43(8): 1656–1661 (in Chinese with English abstract).
    [11] Michael Larabel. Kalray VLIW processor family (KVX). 2024. https://www.phoronix.com/news/Kalray-KVX-Linux-Port
    [12] Qui NM, Lin CH, Chen P. Design and implementation of a 256-bit RISC-V-based dynamically scheduled very long instruction word on FPGA. IEEE Access, 2020, 8: 172996–173007.
    [13] Hennessy JL, Gross T. Postpass code optimization of pipeline constraints. ACM Trans. on Programming Languages & Systems, 1983, 5(3): 422–448.
    [14] Bernstein D, Gertner I. Scheduling expressions on a pipelined processor with a maximal delay of one cycle. ACM Trans. on Programming Languages and Systems (TOPLAS), 1989, 11(1): 57–66.
    [15] Auyeung A, Gondra I, Dai HK. Multi-heuristic list scheduling genetic algorithm for task scheduling. In: Proc. of the 2003 ACM Symp. on Applied Computing. Melbourne: ACM, 2003. 721–724. [doi: 10.1145/952532.952673]
    [16] 黄磊, 冯晓兵. 结合的指令调度与寄存器分配技术. 计算机应用研究, 2008, 25(4): 979–982.
    Huang L, Feng XB. Survey on techniques of integrated instruction scheduling and register allocation. Application Research of Computers, 2008, 25(4): 979–982 (in Chinese with English abstract).
    [17] Deng C, Chen ZY, Shi Y, Ma YM, Wen M, Luo L. Optimizing VLIW instruction scheduling via a two-dimensional constrained dynamic programming. ACM Trans. on Design Automation of Electronic Systems, 2024, 29(5): 83.
    [18] Fisher JA. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, 1981, C-30(7): 478–490.
    [19] Colwell RP, Nix RP, O'Donnell JJ, Papworth DB, Rodman PK. A VLIW architecture for a trace scheduling compiler. ACM SIGARCH Computer Architecture News, 1987, 15(5): 180–192.
    [20] Hwu WMW, Mahlke SA, Chen WY, Chang PP, Warter NJ, Bringmann RA, Ouellette RG, Hank RE, Kiyohara T, Haab GE, Holm JG, Lavery DM. The superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 1993, 7(1): 229–248.
    [21] Mahlke SA, Lin DC, Chen WY, Hank RE, Bringmann RA. Effective compiler support for predicated execution using the hyperblock. ACM SIGMICRO Newsletter, 1992, 23(1–2): 45–54.
    [22] Giesemann F, Payá-Vayá G, Gerlach L, Blume H, Pflug F, von Voigt G. Using a genetic algorithm approach to reduce register file pressure during instruction scheduling. In: Proc. of the 2017 Int’l Conf. on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). Pythagorion: IEEE, 2017. 179–187. [doi: 10.1109/SAMOS.2017.8344626]
    [23] Giesemann F, Gerlach L, Payá-Vayá G. Evolutionary algorithms for instruction scheduling, operation merging, and register allocation in VLIW compilers. Journal of Signal Processing Systems, 2020, 92(7): 655–678.
    [24] Stuckmann F, Payá-Vayá G. A graph neural network approach to improve list scheduling heuristics under register-pressure. In: Proc. of the 13th Int’l Conf. on Modern Circuits and Systems Technologies (MOCAST). Sofia: IEEE, 2024. 1–6. [doi: 10.1109/MOCAST61810.2024.10615463]
    [25] Six C, Boulmé S, Monniaux D. Certified and efficient instruction scheduling: Application to interlocked VLIW processors. Proc. of the ACM on Programming Languages, 2020, 4: 129.
    [26] Six C, Gourdin L, Boulmé S, Monniaux D, Fasse J, Nardino N. Formally verified superblock scheduling. In: Proc. of the 11th ACM SIGPLAN Int’l Conf. on Certified Programs and Proofs. Philadelphia: ACM, 2022. 40–54. [doi: 10.1145/3497775.3503679]
    [27] Yang ZT, Shirako J, Sarkar V. Fully Verified Instruction Scheduling. Proc. of the ACM on Programming Languages, 2024, 8(OOPSLA2): 791–816.
    [28] Herklotz Y, Wickerson J. Hyperblock scheduling for verified high-level synthesis. Proc. of the ACM on Programming Languages, 2024, 8(PLDI攩愺爠挱根′漹渓?猹漵昳琮眼慢牲放?栲愹牝搠睨懗狄攬?捕潎?漠瀠瓶榛洬椠穨懭琬椠潙湉?戮愠猀旍搨?漆渇?噌??埓?憄犌援梛槏琕旼指球疦犗旕?嬠?华??呦梥攨珪椶珑嵦???攠椲樰椰游本??唸渨椱瘰攩爺猠椱琶礴″漓昱??栶椮渼敢獲放??捯慵搠敚浘礬?潈晥?午挬椠敚湨捡敮獧????㈠???楧渠??栠楓湵敮猠教?眮椠瑔桷??湄杩汭楥獮桳?慯扮獡瑬爠慦捯瑲??戭牤?孲??嵴??礠湣浬極???浲瀠汳散浨敥湤瑵慬瑩楮潧渠?潬晧??汩整硨乭攠瑦?睲椠瑴桨?????び????桤琠瑖灌獉???杲楣瑨桩畴扥?捴潵浲??礠湊浯極??污敬砠乯敦琠?扳物?ghua University (Science and Technology) 2008, 48(10): 1643–1646 (in Chinese with English abstract).
    [30] Desoli G. Instruction assignment for clustered VLIW DSP compilers: A new approach. Palo Alto: Hewlett Packard Laboratories, 1998.
    [31] Porpodas V, Cintra M. CAeSaR: Unified cluster-assignment scheduling and communication reuse for clustered VLIW processors. In: Proc. of the 2013 Int’l Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES). Montreal: IEEE, 2013. 1–10. [doi: 10.1109/CASES.2013.6662513]
    [32] Park JCH, Schlansker M. On predicated execution. Palo Alto: Hewlett-Packard Laboratories, 1991.
    [33] Traber A, Zaruba F, Stucki S, Pullini A, Haugou G, Flamand E, Gürkaynak FK, Benini L. PULPino: A small single-core RISC-V SoC. In: Proc. of the 3rd RISC-V Workshop. 2016.
    [34] RISCV-Collab/RISCV-GNU-toolchain. 2024. https://github.com/riscv-collab/riscv-gnu-toolchain
    [35] The LLVM Compiler Infrastructure. 2024. https://github.com/llvm/llvm-project
    [36] Spike RISC-V ISA Simulator. 2024. https://github.com/riscv-software-src/riscv-isa-sim
    [37] Bellard F. QEMU, a fast and portable dynamic translator. In: Proc. of the 2005 USENIX Annual Technical Conf. Anaheim: USENIX Association, 2005. 41–46.
    [38] Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1–7.
    [39] CoreMark is an industry-standard benchmark that measures the performance of central processing units (CPU) and embedded microcrontrollers (MCU). 2024. https://github.com/eembc/coremark
    [40] 查永权. 基于VLIW架构的软硬件协同优化研究 [硕士学位论文]. 北京: 中国科学院大学, 2024.
    Zha YQ. Res
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

李奕瑾,杜绍敏,赵家程,王雪莹,查永权,崔慧敏.基于RISC-V VLIW架构的混合指令调度算法.软件学报,2025,36(9):1-17

复制
相关视频

分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2024-08-23
  • 最后修改日期:2024-10-15
  • 在线发布日期: 2024-12-10
文章二维码
您是第20049469位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号