Middle Layer Based Scalable Learned Index Scheme

doi:10.13328/j.cnki.jos.005910


Journal of Software, 2020, 31(3): 620-633

Author biographies: GAO Yuan-Ning (高远宁, born 1994), male, from Pingyin, Shandong, Ph.D. candidate; research interests: data engineering and database indexing. GAO Xiao-Feng (高晓沨, born 1982), female, Ph.D., professor, doctoral supervisor, CCF professional member; research interests: database indexing, algorithm design and optimization. YE Jin-Biao (叶金标, born 1998), male, undergraduate student; research interests: database indexing and machine learning. CHEN Gui-Hai (陈贵海, born 1963), male, Ph.D., professor, doctoral supervisor, CCF fellow; research interests: cloud computing, distributed systems, and wireless networks. YANG Nian-Zu (杨念祖, born 1999), male, undergraduate student; research interests: database indexing and machine learning.
Corresponding author: GAO Xiao-Feng, E-mail: gao-xf@cs.sjtu.edu.cn

Middle Layer Based Scalable Learned Index Scheme

Authors:

GAO Yuan-Ning, YE Jin-Biao, YANG Nian-Zu, GAO Xiao-Feng, CHEN Gui-Hai

Affiliation:

Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai 200240, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Fund Project:

National Key Research and Development Program of China (2018YFB1004700); National Natural Science Foundation of China (61872238, 61972254, 61832005); Shanghai Science and Technology Fund (17510740200); CCF-Huawei Database System Innovation Research Plan (CCF-Huawei DBIR2019002A)


Abstract:

In the era of big data and cloud computing, data access speed is an important metric of the performance of a large-scale storage system. Designing a lightweight and efficient index structure that meets the system's demand for high throughput and low memory footprint is therefore one of the research hotspots in the database field. Kraska et al. proposed replacing traditional B-tree indexes with machine learning models and achieved good results on real data sets. However, their model assumes that the workload is static and read-only, and it offers no good solution to the index update problem. This study proposes Dabble, a middle layer based scalable learned index model, to address the model retraining caused by index updates. Dabble first uses the K-means clustering algorithm to divide the data set into K regions and trains K neural networks, each learning the data distribution of one region. During the training phase, it integrates information about data access hotspots into the neural networks, which improves the prediction accuracy of the model for hot data. For data insertion, it borrows the delayed update idea of the LSM tree, which improves the data write speed. In the index update phase, a middle layer based mechanism is proposed to decouple the models, which eases the model update problem caused by data insertion. Dabble is evaluated on two data sets, the Lognormal data set and the real-world Weblogs data set. The experimental results show that, compared with state-of-the-art methods, Dabble performs very well in both query and index update.

Key words: learned index; clustering; neural network; dynamic update
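The pipeline outlined in the abstract (partition the keys into K regions, fit one model per region, absorb inserts in a delayed-update buffer) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `kmeans_cuts`, `Region`, and `ClusteredLearnedIndex` are hypothetical, a least-squares line stands in for each neural network, access-hotspot weighting is omitted, and keeping a separate array per region is only a simple stand-in for decoupling, not the paper's middle-layer mechanism.

```python
import bisect


def kmeans_cuts(keys, k, iters=20):
    """1-D Lloyd's k-means on sorted keys; returns the k-1 cut points."""
    n = len(keys)
    # Start from evenly spaced quantiles of the sorted keys.
    centroids = [keys[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # In one dimension, clusters are intervals split at centroid midpoints.
        cuts = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums, counts = [0.0] * k, [0] * k
        for x in keys:
            r = bisect.bisect_right(cuts, x)
            sums[r] += x
            counts[r] += 1
        centroids = [s / c if c else old
                     for s, c, old in zip(sums, counts, centroids)]
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]


class Region:
    """One cluster: sorted keys, a linear position model, a delta buffer."""

    def __init__(self, keys):
        self.buffer = []  # recent inserts, kept sorted (delayed update)
        self.train(keys)

    def train(self, keys):
        # Least-squares fit of position ~ slope * key + intercept
        # (a stand-in for the per-region neural network).
        self.keys = keys
        self.slope, self.intercept, self.err = 0.0, 0.0, 0
        n = len(keys)
        if n == 0:
            return
        mx, my = sum(keys) / n, (n - 1) / 2
        var = sum((x - mx) ** 2 for x in keys)
        if var > 0:
            self.slope = sum((x - mx) * (i - my)
                             for i, x in enumerate(keys)) / var
        self.intercept = my - self.slope * mx
        # Largest prediction error on the training keys bounds the search.
        self.err = max(abs(self.predict(x) - i) for i, x in enumerate(keys))

    def predict(self, key):
        return round(self.slope * key + self.intercept)

    def contains(self, key):
        i = bisect.bisect_left(self.buffer, key)
        if i < len(self.buffer) and self.buffer[i] == key:
            return True
        guess = self.predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        if lo >= hi:
            return False
        # Binary search restricted to the model's error window.
        j = bisect.bisect_left(self.keys, key, lo, hi)
        return j < hi and self.keys[j] == key

    def insert(self, key, limit):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= limit:
            # Merge the buffer and retrain this region's model only.
            self.train(sorted(self.keys + self.buffer))
            self.buffer = []


class ClusteredLearnedIndex:
    def __init__(self, keys, k=4, buffer_limit=32):
        keys = sorted(keys)
        self.limit = buffer_limit
        self.cuts = kmeans_cuts(keys, k)
        starts = ([0] + [bisect.bisect_left(keys, c) for c in self.cuts]
                  + [len(keys)])
        self.regions = [Region(keys[a:b]) for a, b in zip(starts, starts[1:])]

    def _region(self, key):
        return self.regions[bisect.bisect_right(self.cuts, key)]

    def contains(self, key):
        return self._region(key).contains(key)

    def insert(self, key):
        self._region(key).insert(key, self.limit)


index = ClusteredLearnedIndex([i * i for i in range(1, 500)], k=4)
index.insert(50)
print(index.contains(49), index.contains(50), index.contains(51))
# True True False
```

Because each region in this sketch owns its own array and model, a buffer merge retrains a single model rather than all K, which is the kind of cost the abstract says Dabble aims to reduce.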

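Cite this article

GAO Yuan-Ning, YE Jin-Biao, YANG Nian-Zu, GAO Xiao-Feng, CHEN Gui-Hai. Middle Layer Based Scalable Learned Index Scheme. Journal of Software, 2020, 31(3): 620-633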


History

Received: 2019-07-20
Revised: 2019-11-25
Published online: 2020-01-10
Published: 2020-03-06
