Abstract:Task parallelism is one of the fundamental patterns for designing parallel algorithms. Due to algorithm complexity and distinctive hardware features, however, implementation of algorithms in task parallelism often remains to be challenging. On the newly SW26010 many-core CPU platform, a general runtime framework, SWAN, which supports nested task parallelism is proposed in this study. SWAN provides high-level abstractions for programmers to implement task parallelism so that they can focus mainly on the algorithm itself, enjoying an enhanced productivity. In the aspect of performance, the shared resources and information manipulated by SWAN are partitioned in a fine-grained manner to avoid fierce contention among working threads. The core data structures within SWAN take advantage of the high-bandwidth memory access mechanism, fast on-chip scratchpad cache as well as atomic operations of the platform to reduce the overhead of SWAN itself. Besides, SWAN provides dynamic load-balancing strategies in runtime to ensure a full occupation of the threads. In the experiment, a set of recursive algorithms in nested parallelism, including the N-queens problem, binary-tree traversal, quick sort, and convex hull, are implemented using SWAN on the target platform. The experimental results reveal that each of the algorithms can gain a significant speedup, from 4.5x to 32x, against its serial counterpart, which suggests that SWAN has a high usability and performance.