LI Yi-Jin, DU Shao-Min, ZHAO Jia-Cheng, WANG Xue-Ying, ZHA Yong-Quan, CUI Hui-Min
2025, 36(9):0-0. DOI: 10.13328/j.cnki.jos.007357
Abstract:Instruction-level parallelism is a classic problem in processor architecture research, and the VLIW architecture is widely used to exploit instruction-level parallelism in digital signal processors. Because the instruction issue order of a VLIW architecture is determined by the compiler, the achieved instruction-level parallelism depends strongly on the compiler's instruction scheduling. To explore the performance potential of the RISC-V VLIW architecture and enrich the RISC-V ecosystem, this paper studies the optimization of the instruction scheduling algorithm for the RISC-V VLIW architecture. For a single scheduling region, integer linear programming scheduling obtains the optimal schedule at high complexity, while list scheduling has low complexity but cannot guarantee an optimal schedule. To combine the advantages of the two algorithms, this paper proposes a hybrid instruction scheduling algorithm guided by an IPC theoretical model. The IPC model locates the scheduling regions where list scheduling has not reached the optimal solution, and the integer linear programming scheduler then further processes only those regions. The model is based on data-flow analysis, considers both instruction dependencies and hardware resources, and gives a theoretical upper bound of IPC in linear complexity. The effectiveness of hybrid scheduling hinges on the accuracy of the IPC model, which reaches 95.74% in this paper. On the given benchmark, the IPC model identifies that 94.62% of the scheduling regions have already reached the optimal solution under list scheduling, so only 5.38% of the regions need to be further scheduled by integer linear programming. The hybrid scheduling algorithm thus achieves the scheduling quality of integer linear programming scheduling at a complexity close to that of list scheduling.
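The abstract describes the guiding idea only; the following minimal sketch (not the authors' implementation) shows how such an IPC-model check might drive the hybrid decision: each region's cycle lower bound is derived in linear time from its critical path and per-functional-unit instruction counts, the list schedule is accepted when it already meets that bound, and only the remaining regions are handed to a placeholder ILP scheduler. The `Region` summary and `ilp_schedule` fallback are hypothetical.

```cpp
// Minimal sketch of IPC-model-guided hybrid scheduling (not the paper's code).
// Assumptions: a region is summarized by its instruction count, critical-path
// length, per-resource instruction counts, and the list-scheduled cycle count.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Region {
    int num_insts;                    // instructions in the scheduling region
    int critical_path;                // longest dependency chain, in cycles
    std::vector<int> per_unit_insts;  // instructions demanded per functional unit
    int list_cycles;                  // cycle count produced by list scheduling
};

// Linear-complexity lower bound: the schedule can be no shorter than the
// critical path and no shorter than what any single functional unit allows.
int cycle_lower_bound(const Region& r) {
    int bound = r.critical_path;
    for (int insts_on_unit : r.per_unit_insts)
        bound = std::max(bound, insts_on_unit);  // one instruction per unit per cycle
    return bound;
}

// Placeholder for the expensive integer-linear-programming scheduler.
int ilp_schedule(const Region& r) { return cycle_lower_bound(r); }  // hypothetical

int hybrid_schedule(const Region& r) {
    if (r.list_cycles <= cycle_lower_bound(r))
        return r.list_cycles;         // list schedule already provably optimal
    return ilp_schedule(r);           // escalate only the few remaining regions
}

int main() {
    Region r{8, 3, {5, 3}, 5};        // list scheduling took 5 cycles
    std::printf("bound=%d cycles, hybrid=%d cycles\n",
                cycle_lower_bound(r), hybrid_schedule(r));
}
```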
HAN Jin-Chi, WANG Zhi-Dong, MA Hao, SONG Wei
2025, 36(9):0-0. DOI: 10.13328/j.cnki.jos.007358
Abstract:Cache simulators play an indispensable role in exploring cache architectures and researching cache side channels. Spike, the reference implementation of the RISC-V instruction set, provides a complete environment for RISC-V-based cache research. However, Spike's cache model has several issues, including low simulation granularity and significant differences from the cache structures of real processors. To address these issues, this paper modifies and extends Spike's cache model, naming the modified model FlexiCAS (Flexible Cache Architectural Simulator) and the modified Spike Spike-FlexiCAS. FlexiCAS supports various cache architectures, offers flexible configuration and easy extensibility, and allows arbitrary combinations of cache features such as coherence protocols and implementation methods. In addition, FlexiCAS can simulate cache behavior independently of Spike. Performance tests show that FlexiCAS has a significant performance advantage over the cache model of ZSim, currently the fastest execution-driven simulator.
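As a rough illustration of what a standalone, parameterizable cache model looks like (a generic sketch, not FlexiCAS code), the class below models a set-associative cache whose geometry is chosen at construction time; a coherence protocol or a different replacement policy would plug in behind the same `access` interface.

```cpp
// Generic standalone cache-model sketch (not FlexiCAS). Only LRU lookup is
// shown; geometry (sets, ways, line size) is configured at construction.
#include <cstdint>
#include <cstdio>
#include <list>
#include <vector>

class SetAssocCache {
public:
    SetAssocCache(int sets, int ways, int line_bytes)
        : sets_(sets), ways_(ways), line_bytes_(line_bytes), lru_(sets) {}

    // Returns true on hit; on miss, fills the line and evicts the LRU way.
    bool access(uint64_t addr) {
        uint64_t line = addr / line_bytes_;
        int set = static_cast<int>(line % sets_);
        auto& ways = lru_[set];
        for (auto it = ways.begin(); it != ways.end(); ++it) {
            if (*it == line) {                 // hit: move to MRU position
                ways.erase(it);
                ways.push_front(line);
                return true;
            }
        }
        ways.push_front(line);                 // miss: fill the line
        if (static_cast<int>(ways.size()) > ways_) ways.pop_back();  // evict LRU
        return false;
    }

private:
    int sets_, ways_, line_bytes_;
    std::vector<std::list<uint64_t>> lru_;     // per-set MRU-ordered line list
};

int main() {
    SetAssocCache l1(64, 8, 64);               // 32 KiB, 8-way, 64 B lines
    std::printf("first access hit? %d\n", l1.access(0x1000));
    std::printf("second access hit? %d\n", l1.access(0x1000));
}
```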
LI Chuan-Dong, YI Ran, LUO Ying-Wei, WANG Xiao-Lin, WANG Zhen-Lin
2025, 36(9):0-0. DOI: 10.13328/j.cnki.jos.007359
Abstract:As a key component of virtualization technology, memory virtualization directly affects the performance of virtual machines. However, current memory virtualization methods always trade off between the overhead of two-dimensional address translation and the overhead of page table synchronization. The traditional shadow paging method uses an extra page table maintained by software to achieve native address translation performance, but synchronizing the shadow page table through write protection causes frequent VM-exits, which seriously degrades performance. The nested paging method uses hardware-assisted virtualization: the process page table of applications and the nested page table of the VM are loaded directly into the MMU, avoiding the overhead of page table synchronization, but the two-dimensional page table walk seriously degrades address translation performance. Based on the privilege model and hardware features of the RISC-V architecture, this paper presents Lazy Shadow Paging (LSP), which reduces the overhead of page table synchronization while maintaining the address translation efficiency of shadow page tables. LSP first analyzes how the guest OS accesses process page table pages and couples shadow page table synchronization with TLB flushes, deferring the software synchronization overhead to the first access after a flush. LSP also designs a fast path for VM-exits based on the RISC-V privilege level model. Experiments show that on a basic RISC-V platform, LSP reduces VM-exits by 50% compared with traditional shadow paging in micro-benchmarks. For typical applications in the SPEC2006 benchmark suite, LSP reduces the number of VM-exits by up to 25% compared with traditional shadow paging and incurs 12 fewer memory accesses per TLB miss than nested paging.
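The following conceptual sketch (not the paper's hypervisor code) illustrates the lazy synchronization idea: guest page-table writes are merely recorded, and the shadow page table is resynchronized in one batch at the first guest access after a TLB flush, so a single VM-exit replaces one exit per write-protected update. The types and bookkeeping are hypothetical simplifications.

```cpp
// Conceptual sketch of lazy shadow page table synchronization (not LSP's
// actual implementation inside a hypervisor).
#include <cstdio>
#include <unordered_set>

struct LazyShadow {
    std::unordered_set<unsigned long> dirty_gpt_pages;  // guest PT pages written
    bool shadow_stale = false;
    int vm_exits = 0;

    // Guest writes a page-table page: just remember it (cheap, no sync yet).
    void on_guest_pte_write(unsigned long gpt_page) {
        dirty_gpt_pages.insert(gpt_page);
    }

    // Guest executes sfence.vma: the shadow must be refreshed before it is
    // used again, but the work itself is still deferred.
    void on_guest_tlb_flush() { shadow_stale = true; }

    // First guest access after the flush: take one VM-exit, resynchronize all
    // recorded page-table pages in a single pass, then continue.
    void on_guest_access() {
        if (shadow_stale && !dirty_gpt_pages.empty()) {
            ++vm_exits;                       // one exit instead of one per write
            std::printf("sync %zu page-table pages\n", dirty_gpt_pages.size());
            dirty_gpt_pages.clear();
            shadow_stale = false;
        }
    }
};

int main() {
    LazyShadow lsp;
    lsp.on_guest_pte_write(0x80001);          // traditional shadowing would exit here,
    lsp.on_guest_pte_write(0x80002);          // here,
    lsp.on_guest_pte_write(0x80001);          // and here
    lsp.on_guest_tlb_flush();
    lsp.on_guest_access();                    // lazy shadowing: one VM-exit here
    std::printf("VM-exits taken: %d\n", lsp.vm_exits);
}
```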
HAN Liu-Tong, ZHANG Hong-Bin, XING Ming-Jie, WU Yan-Jun, ZHAO Chen
2025, 36(9):0-0. DOI: 10.13328/j.cnki.jos.007360
Abstract:Performance acceleration of high-performance libraries on CPUs can be achieved by efficiently leveraging SIMD hardware through vectorization. Implementing vectorization depends on programming methods specific to the target SIMD hardware, yet the programming models and methods of different SIMD extensions vary significantly. To avoid re-implementing algorithm optimizations for each platform and to improve the maintainability of algorithm libraries, high-performance libraries often introduce a hardware abstraction layer (HAL). Since existing SIMD extension instruction sets are designed around fixed-length vector registers, most HALs only support fixed-length vector types and operations. Such fixed-length vector representations, however, cannot accommodate the variable vector register length introduced by the RISC-V vector extension, and treating the RISC-V vector extension as fixed-length vectors within existing HAL designs introduces unnecessary overhead and degrades performance. To address this problem, this paper proposes a HAL design method that is compatible with both variable-length vector extension platforms and fixed-length SIMD extension platforms. Based on this method, the OpenCV universal intrinsics have been redesigned and optimized to better support devices with the RISC-V vector extension while maintaining compatibility with existing SIMD platforms. Experiments comparing the OpenCV library optimized with this approach against the original version show that the redesigned universal intrinsics efficiently integrate the RISC-V vector extension into the HAL optimization framework and achieve a 3.93x performance improvement in core modules, significantly enhancing the execution performance of the library on RISC-V devices and validating the effectiveness of the proposed method. The work has been open-sourced and merged into the OpenCV source code, demonstrating its practicality and application value.
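The sketch below illustrates the design point rather than OpenCV's actual universal-intrinsic API: the HAL exposes a vector type whose lane count is queried at run time (as an RVV `vsetvl` would provide) instead of being fixed at compile time, so a kernel written once against the HAL works for any register length. All names here (`VecF32`, `vl`, `v_load`, ...) are illustrative, and the "hardware" is simulated in software.

```cpp
// Sketch of a length-agnostic HAL: kernels never assume a fixed lane count.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical HAL vector type: the register width is queried, never hard-coded.
struct VecF32 {
    std::vector<float> lanes;
};
static std::size_t vl() { return 8; }                 // stands in for an RVV vsetvl query

static VecF32 v_load(const float* p, std::size_t n) {
    VecF32 v; v.lanes.assign(p, p + std::min(n, vl())); return v;
}
static void v_store(float* p, const VecF32& v) {
    for (std::size_t i = 0; i < v.lanes.size(); ++i) p[i] = v.lanes[i];
}
static VecF32 v_add(const VecF32& a, const VecF32& b) {
    VecF32 r; r.lanes.resize(a.lanes.size());
    for (std::size_t i = 0; i < r.lanes.size(); ++i) r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// Kernel written once against the HAL: correct for any vector length.
void add_arrays(const float* a, const float* b, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; i += vl()) {
        std::size_t rem = n - i;                      // tail handled by the load itself
        v_store(dst + i, v_add(v_load(a + i, rem), v_load(b + i, rem)));
    }
}

int main() {
    float a[10], b[10], c[10];
    for (int i = 0; i < 10; ++i) { a[i] = static_cast<float>(i); b[i] = 2.0f * i; }
    add_arrays(a, b, c, 10);
    std::printf("c[9] = %.1f\n", c[9]);               // expect 27.0
}
```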
XU Xue-Zheng, YANG De-Heng, WANG Lu, WANG Tao, HUANG An-Wen, LI Qiong
2025, 36(9):1-18. DOI: 10.13328/j.cnki.jos.007292
Abstract:The memory consistency model defines constraints on memory access orders for parallel programs on multi-core systems and is an important architectural specification followed jointly by software and hardware. Sequential consistency (SC) per location is one of the classic axioms of memory consistency models: it requires that all memory operations to the same address in a multi-core system follow sequential consistency. It is widely used in the axiomatic memory models of classic architectures such as x86/TSO, Power, and ARM, and plays an important role in chip memory consistency verification, system software, and parallel program development. As an open-source architectural specification, the RISC-V memory model is defined by the global memory order, the preserved program order, and three axioms (the load value axiom, the atomicity axiom, and the progress axiom); it does not include SC per location as an axiom directly, which poses challenges for existing memory model verification tools and system software development. In this paper, we formalize SC per location as a theorem derived from the axioms and rules defined in the RISC-V memory model. The proof abstracts the construction of arbitrary same-address memory access sequences into deterministic finite automata and proceeds by induction. This work is a theoretical supplement to the formal methods for RISC-V memory consistency.
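The abstract only outlines the proof strategy; as a toy illustration of the automaton abstraction (not the paper's formal development), the checker below treats the value of the latest same-address store as the DFA state and accepts exactly the single-address traces in which every load observes that latest store, which is the SC-per-location requirement under this simplification.

```cpp
// Toy DFA over same-address load/store events; the state is the value of the
// most recent store, and a load transition is legal only if it observes it.
#include <cstdio>
#include <vector>

struct Event { bool is_store; int value; };   // same-address loads and stores

// Run the DFA over the sequence; returns false on the first load that does not
// observe the latest store (init_value models the initial memory contents).
bool sc_per_location(const std::vector<Event>& trace, int init_value = 0) {
    int state = init_value;                   // DFA state: last stored value
    for (const Event& e : trace) {
        if (e.is_store)      state = e.value;      // store: move to a new state
        else if (e.value != state) return false;   // load must see latest store
    }
    return true;
}

int main() {
    std::vector<Event> ok  = {{true, 1}, {false, 1}, {true, 2}, {false, 2}};
    std::vector<Event> bad = {{true, 1}, {true, 2}, {false, 1}};  // stale load
    std::printf("ok trace: %d, bad trace: %d\n",
                sc_per_location(ok), sc_per_location(bad));
}
```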