Distributed Gradient Boosting Decision Tree Algorithm for High-dimensional and Multi-classification Problems
Author: Jiang Jiawei, Fu Fangcheng, Shao Yingxia, Cui Bin
Affiliation:

Fund Project:

National Natural Science Foundation of China (61702423, 61532021, U1501252, 61402180); National Key Research and Development Program of China (2016YFB1000905)

    Abstract:

    Gradient boosting decision tree (GBDT) is widely used in various tasks, such as classification, regression, and ranking, owing to its high accuracy and strong interpretability. With the explosive growth of data volume, distributed GBDT algorithms have become an important research issue. Although a series of distributed GBDT implementations exists, they perform poorly on high-dimensional and multi-classification tasks. The data-parallel strategy they adopt requires the transmission of gradient histograms, and this communication overhead becomes the bottleneck in many high-dimensional and multi-classification tasks. This study targets this problem and seeks a parallel strategy that is more efficient for such workloads. Data-parallel and feature-parallel strategies are first compared based on a cost model, and it is proved theoretically that feature parallelism is more suitable for high-dimensional and multi-classification tasks. Based on this analysis, this paper proposes a feature-parallel distributed gradient boosting decision tree algorithm, named FP-GBDT. FP-GBDT designs an efficient distributed dataset transposition method to partition the training dataset by column. During the construction of gradient histograms, FP-GBDT uses a sparsity-aware method to accelerate histogram construction. When splitting tree nodes, FP-GBDT develops a bitmap compression method to transmit the placement of instances, thereby reducing the communication overhead. The performance of distributed GBDT under different parallel strategies is then compared through extensive experiments. First, the effectiveness of the proposed optimization methods in FP-GBDT is verified. Then, FP-GBDT is compared with XGBoost, a representative data-parallel implementation. On various datasets, FP-GBDT is shown to be more efficient on high-dimensional and multi-classification tasks, achieving up to a 6x performance improvement over data-parallel implementations.
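
    To make the optimizations described in the abstract more concrete, below is a minimal single-machine sketch in Python/NumPy of two of the ideas: sparsity-aware gradient histogram construction (visit only the non-zero entries of a feature column and fold the remaining instances into a default bin) and bitmap compression of instance placement (one bit per instance instead of a full instance identifier). It is illustrative only, not the FP-GBDT implementation; the function names and the binning scheme are assumptions. It also hints at the cost-model argument: data-parallel GBDT exchanges gradient histograms whose size grows roughly with #features x #bins x #classes, whereas feature-parallel GBDT mainly exchanges instance placement, whose bitmap size depends only on #instances.

```python
# Illustrative sketch only (not the authors' FP-GBDT code); all names are hypothetical.
import numpy as np

def build_histogram(col_indices, col_values, bin_edges, grad, hess, num_bins):
    """Sparsity-aware gradient histogram for one feature column.

    col_indices/col_values: CSC-style non-zero entries of the column.
    bin_edges: num_bins - 1 sorted candidate split points (e.g., from a quantile sketch).
    grad/hess: first- and second-order gradients of all instances.
    """
    hist_grad = np.zeros(num_bins)
    hist_hess = np.zeros(num_bins)
    # Visit non-zero entries only: cost is O(nnz), not O(num_instances).
    bins = np.searchsorted(bin_edges, col_values, side="right")
    np.add.at(hist_grad, bins, grad[col_indices])
    np.add.at(hist_hess, bins, hess[col_indices])
    # Fold every remaining (zero-valued) instance into the bin that contains zero.
    default_bin = np.searchsorted(bin_edges, 0.0, side="right")
    hist_grad[default_bin] += grad.sum() - grad[col_indices].sum()
    hist_hess[default_bin] += hess.sum() - hess[col_indices].sum()
    return hist_grad, hist_hess

def encode_placement(goes_left):
    """Compress instance placement (left/right child) into a bitmap: 1 bit per instance."""
    return np.packbits(goes_left.astype(np.uint8))

def decode_placement(bitmap, num_instances):
    """Recover the boolean placement vector from the bitmap."""
    return np.unpackbits(bitmap)[:num_instances].astype(bool)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    grad, hess = rng.normal(size=n), np.ones(n)
    nz = rng.choice(n, size=50, replace=False)      # a sparse feature: 5% non-zero
    vals = rng.random(50)
    edges = np.linspace(0.0, 1.0, num=31)           # 31 edges -> 32 bins
    hg, hh = build_histogram(nz, vals, edges, grad, hess, num_bins=32)
    assert abs(hh.sum() - n) < 1e-9                 # every instance lands in some bin
    bitmap = encode_placement(rng.random(n) < 0.5)
    print(bitmap.nbytes)                            # 1000 instances -> 125 bytes
```

    As a rough sense of scale, with 10 million training instances the placement bitmap is about 1.25 MB per split, versus roughly 40 MB if 32-bit instance identifiers were transmitted instead; this is the kind of communication reduction the abstract attributes to bitmap compression.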

    Reference
    [1] Friedman J, Hastie T, Tibshirani R, et al. Additive logistic regression:A statistical view of boosting. The Annals of Statistics, 2000, 28(2):337-407.
    [2] Friedman JH. Greedy function approximation:A gradient boosting machine. The Annals of Statistics, 2001,29(5):1189-1232.
    [3] Li P. Robust logitboost and adaptive base class (abc) logitboost. arXiv:1203.3491. arXiv Preprint, 2012.
    [4] Burges CJ. From RankNet to LambdaRank to LambdaMART:An overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
    [5] Tyree S, Weinberger KQ, Agrawal K, Paykin J. Parallel boosted regression trees for Web search ranking. In:Proc. of the 20th Int'l Conf. on World Wide Web. 2011. 387-396.
    [6] He X, Pan J, Jin O, Xu T, Liu B, Xu T, Shi Y, Atallah A, Herbrich R, Bowers S, et al. Practical lessons from predicting clicks on ads at facebook. In:Proc. of the 8th Int'l Workshop on Data Mining for Online Advertising. 2014. 1-9.
    [7] Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, et al. Mllib:Machine learning in apache spark. The Journal of Machine Learning Research, 2016,17(1):1235-1241.
    [8] Chen TQ, Guestrin C. Xgboost:A scalable tree boosting system. In:Proc. of the 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2016. 785-794.
    [9] Meng Q, Ke GL, Wang TF, Chen W, Ye QW, Ma ZM, Liu TY. A communication-efficient parallel algorithm for decision tree. In:Proc. of the Advances in Neural Information Processing Systems. 2016. 1279-1287.
    [10] Ke GL, Meng Q, Finley T, Wang TF, Chen W, Ma WD, Ye QW, Liu TY. Lightgbm:A highly efficient gradient boosting decision tree. In:Proc. of the Advances in Neural Information Processing Systems. 2017. 3149-3157.
    [11] Karnin Z, Lang K, Liberty E. Optimal quantile approximation in streams. In:Proc. of the 57th IEEE Annual Symp. on Foundations of Computer Science (FOCS). 2016. 71-78.
    [12] Greenwald M, Khanna S. Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 2001,30:58-66.
    [13] Yahoo. Data sketches. https://datasketches.github.io/
    [14] Jiang J, Jiang JW, Cui B, Zhang C. Tencentboost:A gradient boosting tree system with parameter server. In:Proc. of the 2017 IEEE 33rd Int'l Conf. on Data Engineering. 2017. 281-284.
    [15] Jiang JW, Cui B, Zheng C, Fu FC. DimBoost:Boosting gradient boosting decision tree to higher dimensions. In:Proc. of the 2018 Int'l Conf. on Management of Data. 2018. 1363-1376.
    [16] Abadi M, Barham P, Chen JM, Chen ZF, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. Tensorflow:A system for large-scale machine learning. OSDI, 2016,16:265-283.
    [17] Jiang JW, Cui B, Zhang C, Yu LL. Heterogeneity-aware distributed parameter servers. In:Proc. of the 2017 ACM Int'l Conf. on Management of Data. 2017. 463-478.
    [18] Li M, Andersen DG, Park JW, Smola AJ, Ahmed A, Josifovski V, Long J, Shekita EJ, Su BY. Scaling distributed machine learning with the parameter server. OSDI, 2014,14:583-598.
    [19] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing. In:Proc. of the 9th USENIX Conf. on Networked Systems Design and Implementation. 2012. 2.
    [20] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, et al. Apache hadoop yarn:Yet another resource negotiator. In:Proc. of the 4th Annual Symp. on Cloud Computing. 2013. 5.
    [21] RCV1 dataset. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary
    [22] RCV1-Multi dataset. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#rcv1.multiclass
    [23] Lewis DD, Yang YM, Rose TG, Li F. Rcv1:A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004,5:361-397.
Get Citation

Jiang JW, Fu FC, Shao YX, Cui B. Distributed gradient boosting decision tree algorithm for high-dimensional and multi-classification problems. Ruan Jian Xue Bao/Journal of Software, 2019,30(3):784-798 (in Chinese).

Article Metrics
  • Abstract: 3595
  • PDF: 5772
  • HTML: 3708
  • Cited by: 0
History
  • Received: July 19, 2018
  • Revised: September 20, 2018
  • Online: March 06, 2019