Abstract: In recent years, research on big-data processing has had to contend with growing data scale and high data complexity. The frequent use of high-dimensional data poses challenges in applications, such as efficient querying and fast database access. Hence, it is critical to design an effective high-dimensional index that increases query throughput and decreases memory footprint. Kraska et al. proposed the learned index, which has proved superior on real-world low-dimensional datasets. With the wide adoption of machine learning and deep learning in database management systems, more and more researchers aim to build learned indexes on high-dimensional datasets to improve query efficiency. However, current solutions fail to effectively utilize the distribution information of the data, and sometimes incur high overhead in initializing complex deep learning models. In this work, an improved high-dimensional learned index (IHDL index) is proposed based on data space division and dimension reduction. Specifically, the index fits multiple linear models to the dataset, decreasing the initialization overhead while maintaining high query accuracy. Experiments on a synthetic dataset and the OSM dataset verify its superiority in terms of initialization overhead, query throughput, and memory footprint.