Distributed Gradient Boosting Decision Tree Algorithm for High-dimensional and Multi-classification Problems
Author: Jiang Jiawei, Fu Fangcheng, Shao Yingxia, Cui Bin
Affiliation:

Fund Project:

National Natural Science Foundation of China (61702423, 61532021, U1501252, 61402180); National Key Research and Development Program of China (2016YFB1000905)

    Abstract:

    Gradient boosting decision tree (GBDT) is widely used in various tasks, such as classification, regression, and ranking, owing to its high accuracy and strong interpretability. With the explosive growth of data volume, distributed GBDT algorithms have become an important research issue. Although a series of distributed GBDT implementations exists, they perform poorly on high-dimensional and multi-classification tasks. The data-parallel strategy they adopt requires the transmission of gradient histograms, and this communication overhead becomes the bottleneck in many high-dimensional and multi-classification tasks. This study targets this problem and seeks a parallel strategy that is more efficient for such workloads. Data-parallel and feature-parallel strategies are first compared based on a cost model, and it is proved theoretically that feature parallelism is more suitable for high-dimensional and multi-classification tasks. Based on this analysis, this paper proposes a feature-parallel distributed gradient boosting decision tree algorithm, named FP-GBDT. FP-GBDT designs an efficient distributed dataset transposition method to partition the training dataset by column. During the construction of gradient histograms, FP-GBDT uses a sparsity-aware method to accelerate histogram construction. When splitting tree nodes, FP-GBDT develops a bitmap compression method to transmit the placement of instances, thereby reducing the communication overhead. The performance of distributed GBDT under different parallel strategies is then compared through extensive experiments. First, the effectiveness of the proposed optimization methods in FP-GBDT is verified. Then, FP-GBDT is compared with XGBoost, a representative data-parallel implementation. On various datasets, FP-GBDT is shown to be more efficient on high-dimensional and multi-classification tasks, achieving up to a 6x performance improvement over data-parallel implementations.
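
    To make the optimizations described in the abstract more concrete, below is a minimal single-machine sketch in Python/NumPy of two of the ideas: sparsity-aware gradient histogram construction (visit only the non-zero entries of a feature column and fold the remaining instances into a default bin) and bitmap compression of instance placement (one bit per instance instead of a full instance identifier). It is illustrative only, not the FP-GBDT implementation; the function names and the binning scheme are assumptions. It also hints at the cost-model argument: data-parallel GBDT exchanges gradient histograms whose size grows roughly with #features x #bins x #classes, whereas feature-parallel GBDT mainly exchanges instance placement, whose bitmap size depends only on #instances.

```python
# Illustrative sketch only (not the authors' FP-GBDT code); all names are hypothetical.
import numpy as np

def build_histogram(col_indices, col_values, bin_edges, grad, hess, num_bins):
    """Sparsity-aware gradient histogram for one feature column.

    col_indices/col_values: CSC-style non-zero entries of the column.
    bin_edges: num_bins - 1 sorted candidate split points (e.g., from a quantile sketch).
    grad/hess: first- and second-order gradients of all instances.
    """
    hist_grad = np.zeros(num_bins)
    hist_hess = np.zeros(num_bins)
    # Visit non-zero entries only: cost is O(nnz), not O(num_instances).
    bins = np.searchsorted(bin_edges, col_values, side="right")
    np.add.at(hist_grad, bins, grad[col_indices])
    np.add.at(hist_hess, bins, hess[col_indices])
    # Fold every remaining (zero-valued) instance into the bin that contains zero.
    default_bin = np.searchsorted(bin_edges, 0.0, side="right")
    hist_grad[default_bin] += grad.sum() - grad[col_indices].sum()
    hist_hess[default_bin] += hess.sum() - hess[col_indices].sum()
    return hist_grad, hist_hess

def encode_placement(goes_left):
    """Compress instance placement (left/right child) into a bitmap: 1 bit per instance."""
    return np.packbits(goes_left.astype(np.uint8))

def decode_placement(bitmap, num_instances):
    """Recover the boolean placement vector from the bitmap."""
    return np.unpackbits(bitmap)[:num_instances].astype(bool)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    grad, hess = rng.normal(size=n), np.ones(n)
    nz = rng.choice(n, size=50, replace=False)      # a sparse feature: 5% non-zero
    vals = rng.random(50)
    edges = np.linspace(0.0, 1.0, num=31)           # 31 edges -> 32 bins
    hg, hh = build_histogram(nz, vals, edges, grad, hess, num_bins=32)
    assert abs(hh.sum() - n) < 1e-9                 # every instance lands in some bin
    bitmap = encode_placement(rng.random(n) < 0.5)
    print(bitmap.nbytes)                            # 1000 instances -> 125 bytes
```

    As a rough sense of scale, with 10 million training instances the placement bitmap is about 1.25 MB per split, versus roughly 40 MB if 32-bit instance identifiers were transmitted instead; this is the kind of communication reduction the abstract attributes to bitmap compression.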

    Reference
    [1] Friedman J, Hastie T, Tibshirani R, et al. Additive logistic regression:A statistical view of boosting. The Annals of Statistics, 2000, 28(2):337-407.
    [2] Friedman JH. Greedy function approximation:A gradient boosting machine. The Annals of Statistics, 2001,29(5):1189-1232.
    [3] Li P. Robust logitboost and adaptive base class (abc) logitboost. arXiv:1203.3491. arXiv Preprint, 2012.
    [4] Burges CJ. From RankNet to LambdaRank to LambdaMART:An overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
    [5] Tyree S, Weinberger KQ, Agrawal K, Paykin J. Parallel boosted regression trees for Web search ranking. In:Proc. of the 20th Int'l Conf. on World Wide Web. 2011. 387-396.
    [6] He X, Pan J, Jin O, Xu T, Liu B, Xu T, Shi Y, Atallah A, Herbrich R, Bowers S, et al. Practical lessons from predicting clicks on ads at facebook. In:Proc. of the 8th Int'l Workshop on Data Mining for Online Advertising. 2014. 1-9.
    [7] Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, et al. Mllib:Machine learning in apache spark. The Journal of Machine Learning Research, 2016,17(1):1235-1241.
    [8] Chen TQ, Guestrin C. Xgboost:A scalable tree boosting system. In:Proc. of the 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2016. 785-794.
    [9] Meng Q, Ke GL, Wang TF, Chen W, Ye QW, Ma ZM, Liu TY. A communication-efficient parallel algorithm for decision tree. In:Proc. of the Advances in Neural Information Processing Systems. 2016. 1279-1287.
    [10] Ke GL, Meng Q, Finley T, Wang TF, Chen W, Ma WD, Ye QW, Liu TY. Lightgbm:A highly efficient gradient boosting decision tree. In:Proc. of the Advances in Neural Information Processing Systems. 2017. 3149-3157.
    [11] Karnin Z, Lang K, Liberty E. Optimal quantile approximation in streams. In:Proc. of the 57th IEEE Annual Symp. on Foundations of Computer Science (FOCS). 2016. 71-78.
    [12] Greenwald M, Khanna S. Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 2001,30:58-66.
    [13] Yahoo. Data sketches. https://datasketches.github.io/
    [14] Jiang J, Jiang JW, Cui B, Zhang C. Tencentboost:A gradient boosting tree system with parameter server. In:Proc. of the 2017 IEEE 33rd Int'l Conf. on Data Engineering. 2017. 281-284.
    [15] Jiang JW, Cui B, Zheng C, Fu FC. DimBoost:Boosting gradient boosting decision tree to higher dimensions. In:Proc. of the 2018 Int'l Conf. on Management of Data. 2018. 1363-1376.
    [16] Abadi M, Barham P, Chen JM, Chen ZF, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. Tensorflow:A system for large-scale machine learning. OSDI, 2016,16:265-283.
    [17] Jiang JW, Cui B, Zhang C, Yu LL. Heterogeneity-aware distributed parameter servers. In:Proc. of the 2017 ACM Int'l Conf. on Management of Data. 2017. 463-478.
    [18] Li M, Andersen DG, Park JW, Smola AJ, Ahmed A, Josifovski V, Long J, Shekita EJ, Su BY. Scaling distributed machine learning with the parameter server. OSDI, 2014,14:583-598.
    [19] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing. In:Proc. of the 9th USENIX Conf. on Networked Systems Design and Implementation. 2012. 2.
    [20] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, et al. Apache hadoop yarn:Yet another resource negotiator. In:Proc. of the 4th Annual Symp. on Cloud Computing. 2013. 5.
    [21] RCV1 dataset. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary
    [22] RCV1-Multi dataset. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#rcv1.multiclass
    [23] Lewis DD, Yang YM, Rose TG, Li F. Rcv1:A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004,5:361-397.
Get Citation

Jiang JW, Fu FC, Shao YX, Cui B. Distributed gradient boosting decision tree algorithm for high-dimensional and multi-classification problems. Ruan Jian Xue Bao/Journal of Software, 2019,30(3):784-798 (in Chinese).

Article Metrics
  • Abstract: 3595
  • PDF: 5772
  • HTML: 3708
  • Cited by: 0
History
  • Received: July 19, 2018
  • Revised: September 20, 2018
  • Online: March 06, 2019