Jiang Jiawei, Fu Fangcheng, Shao Yingxia, Cui Bin. Distributed gradient boosting decision tree algorithm for high-dimensional and multi-classification problems. Journal of Software, 2019, 30(3): 784-798
Distributed Gradient Boosting Decision Tree Algorithm for High-dimensional and Multi-classification Problems
Received: 2018-07-19  Revised: 2018-09-20
DOI:10.13328/j.cnki.jos.005690
Keywords: gradient boosting decision tree; data parallel; feature parallel; system implementation; performance comparison
Funding: National Natural Science Foundation of China (61832001, 61702015, 61702016); National Key Research and Development Program of China (2018YFB1004403)
Author  Affiliation  E-mail
Jiang Jiawei  Key Laboratory of High Confidence Software Technologies, Ministry of Education (Peking University), Beijing 100871
Fu Fangcheng  Key Laboratory of High Confidence Software Technologies, Ministry of Education (Peking University), Beijing 100871
Shao Yingxia  School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876
Cui Bin  Key Laboratory of High Confidence Software Technologies, Ministry of Education (Peking University), Beijing 100871  bin.cui@pku.edu.cn
Abstract views: 413
Full-text downloads: 295
Abstract:
      The gradient boosting decision tree (GBDT) algorithm is widely used in classification, regression, and ranking tasks owing to its high accuracy and strong interpretability. With the explosive growth of data volumes, distributed GBDT algorithms have become an important research topic. Although a series of distributed GBDT implementations already exists, they perform poorly on high-dimensional and multi-classification tasks: the data-parallel strategy they adopt requires transmitting gradient histograms, and this communication overhead becomes the bottleneck in many high-dimensional and multi-classification workloads. This study addresses that problem by seeking a parallel strategy better suited to such workloads. Data-parallel and feature-parallel strategies are first compared using a cost model, and it is proved theoretically that feature parallelism is more suitable for high-dimensional and multi-classification tasks. Based on this analysis, a feature-parallel distributed GBDT algorithm, named FP-GBDT, is proposed. FP-GBDT employs an efficient distributed dataset-transposition method to convert the row-partitioned training dataset into a column-partitioned representation; during gradient-histogram construction, it uses a sparsity-aware method to accelerate histogram building; and when splitting tree nodes, it uses a bitmap-compression method to transmit the placement of instances, thereby reducing communication overhead. Extensive experiments compare the performance of distributed GBDT algorithms under different parallel strategies. The effectiveness of the optimizations in FP-GBDT is first verified; FP-GBDT is then compared with XGBoost, a representative data-parallel implementation. On various datasets, FP-GBDT proves more efficient for high-dimensional and multi-classification tasks, achieving up to a 6x speedup over data-parallel implementations.
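The bitmap-compression idea described in the abstract can be sketched as follows. When a tree node splits, each worker only needs to tell the others which child (left or right) every training instance is routed to, so one bit per instance suffices instead of an integer index. This is a minimal illustrative sketch of that encoding; the function names and representation are assumptions for illustration, not FP-GBDT's actual implementation.

```python
def encode_placement(goes_left):
    """Pack a list of booleans (True = instance goes to the left child)
    into a bytearray, one bit per instance (~32x smaller than sending
    a 32-bit index per instance)."""
    bitmap = bytearray((len(goes_left) + 7) // 8)
    for i, left in enumerate(goes_left):
        if left:
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap

def decode_placement(bitmap, n_instances):
    """Unpack the bitmap back into a list of booleans."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(n_instances)]

# Example: 9 instances need only 2 bytes on the wire.
placement = [True, False, False, True, True, False, True, False, True]
wire = encode_placement(placement)
assert decode_placement(wire, len(placement)) == placement
```

Workers receiving the bitmap can then rebuild each instance's node assignment locally, which is what keeps the per-split communication cost independent of feature dimensionality and class count.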
Sponsors: Institute of Software, Chinese Academy of Sciences; China Computer Federation
Editorial office: +86-10-62562563  E-mail: jos@iscas.ac.cn