Abstract:As the scale of business data increases, distributed online analytical processing (OLAP) is widely performed in business intelligence (BI), enterprise resource planning (ERP), user behavior analysis, and other application scenarios to support large-scale data analysis. Moreover, distributed OLAP overcomes the limitations of single-machine storage and stores data in memory to improve the performance of OLAP. However, after the in-memory distributed OLAP eliminates disk input/output (I/O), the join operation becomes one of its new performance bottlenecks. As a common practice in OLAP, the join operation involves a huge amount of data accessing and computation operations. By analyzing existing methods for the join operation, this study presents a graph structure indexing method that can accelerate the join operation and a new join method called LinkJoin based on it. Graph structure indexing stores the in-memory position of data in the form of a graph structure according to the join relationship specified by users. The join method based on graph structure indexing reduces the amount of data accessing and computation operations with a low complexity equivalent to that of the hash join. This study expands the state-of-the-art open-source in-memory OLAP system called MonetDB from a single-machine system to a distributed one and designs and implements a join method based on graph structure indexing on it. A series of designs and optimizations are also conducted in the aspects of graph indexing structure, columnar storage, and distributed execution engine to improve the distributed OLAP performance of the system. The test results show that in the TPC-H benchmark tests, the join operation based on graph structure indexing improves the performance on queries with join operations by 1.64 times on average and 4.1 times at most. For the join operation part of these queries, it enhances the performance by 9.8–22.1 times.