Adaptive NAG Method Based on AdaGrad and Its Optimal Individual Convergence
Author: LONG Sheng, TAO Wei, ZHANG Ze-Dong, TAO Qing
Affiliation:

    Abstract:

    Compared with gradient descent, the adaptive gradient method (AdaGrad) preserves the geometric information of historical data by using the arithmetic average of squared past gradients, and it attains tighter convergence bounds when coping with sparse data. On the other hand, by adding a momentum term to gradient descent, Nesterov’s accelerated gradient (NAG) not only achieves an order-of-magnitude acceleration for smooth convex optimization problems but also attains the optimal individual convergence rate for non-smooth convex problems. Recently, there have been studies on combining the adaptive strategy with NAG. However, AcceleGrad, a typical adaptive NAG algorithm, fails to reflect the distinctions between dimensions because it uses an adaptive setting different from that of AdaGrad, and it only obtains a weighted-averaging convergence rate; a theoretical analysis of its individual convergence rate is still lacking. In this study, an adaptive NAG method that inherits AdaGrad’s step-size setting is proposed, and it is proved that the proposed algorithm attains the optimal individual convergence rate when solving constrained non-smooth convex optimization problems. Experiments are conducted on typical optimization problems with an L1-norm constraint: the hinge loss function for classification and the L1 loss function for regression. The experimental results verify the correctness of the theoretical analysis and the superior performance of the proposed algorithm over AcceleGrad.
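    To make the combination concrete, below is a minimal NumPy sketch of an adaptive NAG loop of the kind described above: AdaGrad-style per-coordinate step sizes driven by the accumulated squared subgradients, a Nesterov-style extrapolation (momentum) step, and a Euclidean projection onto the L1-norm ball as the constraint set. The function names, the momentum weight (t-1)/(t+2), and the base step size eta are illustrative assumptions, not the exact parameterization analyzed in the paper.

import numpy as np

def project_l1_ball(v, radius=1.0):
    # Euclidean projection onto {x : ||x||_1 <= radius}, sort-based
    # (in the spirit of the efficient projections of reference [28]).
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, u.size + 1)
    k = np.nonzero(u * idx > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def adaptive_nag_sketch(subgradient, x0, steps=1000, eta=0.1, radius=1.0, eps=1e-8):
    # Illustrative adaptive NAG loop: AdaGrad-style per-coordinate step sizes
    # plus Nesterov extrapolation, projecting onto the L1 ball after each update.
    x = project_l1_ball(np.asarray(x0, dtype=float), radius)
    y = x.copy()
    grad_sq_sum = np.zeros_like(x)                 # accumulated squared subgradients
    for t in range(1, steps + 1):
        g = subgradient(y)                         # subgradient at the extrapolated point
        grad_sq_sum += g ** 2
        step = eta / (np.sqrt(grad_sq_sum) + eps)  # per-coordinate AdaGrad step size
        x_next = project_l1_ball(y - step * g, radius)
        mu = (t - 1.0) / (t + 2.0)                 # Nesterov-style momentum weight (assumed)
        y = x_next + mu * (x_next - x)             # extrapolation (momentum) step
        x = x_next
    return x

    For the hinge-loss experiments mentioned above, subgradient(y) would return a (possibly stochastic) subgradient of the average of max(0, 1 - label_i * <y, feature_i>) over the training samples; the projection keeps every iterate inside the L1-norm constraint set.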

    References
    [1] Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. In: Proc. of the 20th Int’l Conf. on Machine Learning (ICML 2003). New York: Association for Computing Machinery, 2003. 928-936.
    [2] Shalev-Shwartz S, Singer Y, Srebro N. Pegasos: Primal estimated sub-gradient solver for SVM. In: Proc. of the 24th Int’l Conf. on Machine Learning (ICML 2007). New York: Association for Computing Machinery, 2007. 807-814.
    [3] Shamir O, Zhang T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In: Proc. of the 30th Int’l Conf. on Machine Learning (ICML 2013). New York: Association for Computing Machinery, 2013. 71-79.
    [4] Ge R, Jain P, Kakade SM, et al. Open problem: Do good algorithms necessarily query bad points? In: Proc. of the 32nd Annual Conf. on Learning Theory (COLT 2019). 2019. 3190-3193.
    [5] Harvey NJA, Plan Y, Randhawa S. Tight analyses for non-smooth stochastic gradient descent. In: Proc. of the 32nd Annual Conf. on Learning Theory (COLT 2019). 2019. 1579-1613.
    [6] Jain P, Nagaraj D, Netrapalli P. Making the last iterate of SGD information theoretically optimal. In: Proc. of the 32nd Annual Conf. on Learning Theory (COLT 2019). 2019. 1752-1755.
    [7] Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
    [8] Zeiler MD. ADADELTA: An adaptive learning rate method. arXiv: 1212.5701, 2012.
    [9] Kingma DP, Ba JL. Adam: A method for stochastic optimization. In: Proc. of the 3rd Int’l Conf. on Learning Representations (ICLR 2015). 2015. 1-13.
    [10] Polyak BT. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964, 4(5): 1-17.
    [11] Nesterov Y. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983, 27(2): 372-376.
    [12] Nemirovsky AS, Yudin DB. Problem Complexity and Method Efficiency in Optimization. New York: Wiley-Interscience, 1983.
    [13] Tseng P. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 2010, 125(2): 263-295.
    [14] Lan G. An optimal method for stochastic composite optimization. Mathematical Programming, 2012, 133(1-2): 365-397.
    [15] Devolder O, Glineur F, Nesterov Y. First-Order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 2014, 146(1-2): 37-75.
    [16] Tao W, Pan Z, Wu G, et al. The strength of Nesterov’s extrapolation in the individual convergence of nonsmooth optimization. IEEE Trans. on Neural Networks and Learning Systems, 2020, 31(7): 2557-2568.
    [17] Liu YX, Cheng YJ, Tao Q. Individual convergence of NAG with biased gradient in nonsmooth cases. Ruan Jian Xue Bao/Journal of Software, 2020, 31(4): 1051-1062 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5926.htm [doi: 10.13328/j.cnki.jos.005926]
    [18] Dozat T. Incorporating Nesterov momentum into Adam. In: Proc. of the 4th Int’l Conf. on Learning Representations (ICLR 2016). 2016.
    [19] Levy KY, Yurtsever A, Cevher V. Online adaptive methods, universality and acceleration. In: Proc. of the 32nd Annual Conf. on Neural Information Processing Systems (NeurIPS 2018). Cambridge: MIT, 2018. 6501-6510.
    [20] Levy KY. Online to offline conversions, universality and adaptive minibatch sizes. In: Proc. of the 31st Annual Conf. on Neural Information Processing Systems (NeurIPS 2017). Cambridge: MIT, 2017. 1613-1622.
    [21] Allen-Zhu Z, Orecchia L. Linear coupling: An ultimate unification of gradient and mirror descent. In: Proc. of the 8th Innovations in Theoretical Computer Science Conf. (ITCS 2017). 2017. 3:1-3:22.
    [22] Kavis A, Levy KY, Bach F, et al. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In: Proc. of the 33rd Annual Conf. on Neural Information Processing Systems (NeurIPS 2019). Cambridge: MIT, 2019. 6257-6266.
    [23] Deng Q, Cheng Y, Lan G. Optimal adaptive and accelerated stochastic gradient descent. arXiv: 1810.00553, 2018.
    [24] Tao W, Long S, Wu G, et al. The role of momentum parameters in the optimal convergence of adaptive Polyak’s Heavy-Ball methods. In: Proc. of the 9th Int’l Conf. on Learning Representations (ICLR 2021). 2021.
    [25] Cheng YJ, Tao W, Liu YX, et al. Optimal individual convergence rate of the Heavy-ball-based momentum methods. Journal of Computer Research and Development, 2019, 56(8): 1686-1694 (in Chinese with English abstract).
    [26] Chen G, Teboulle M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 1993, 3(3): 538-543. [doi: 10.1137/0803026]
    [27] Jaggi M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In: Proc. of the 30th Int’l Conf. on Machine Learning (ICML 2013). New York: Association for Computing Machinery, 2013. 427-435.
    [28] Liu J, Ye J. Efficient Euclidean projections in linear time. In: Proc. of the 26th Int’l Conf. on Machine Learning (ICML 2009). New York: Association for Computing Machinery, 2009. 657-664.
    [29] Liu J, Ji S, Ye J. SLEP: Sparse learning with efficient projections. Arizona: Arizona State University, 2009.
    [30] Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proc. of the 29th Int’l Conf. on Machine Learning (ICML 2012). New York: Association for Computing Machinery, 2012. 449-456.
    [31] Chen X, Lin Q, Pena J. Optimal regularized dual averaging methods for stochastic optimization. In: Proc. of the 26th Annual Conf. on Neural Information Processing Systems (NIPS 2012). Cambridge: MIT, 2012. 404-412.
Get Citation

Long S, Tao W, Zhang ZD, Tao Q. Adaptive NAG method based on AdaGrad and its optimal individual convergence. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1231-1243 (in Chinese with English abstract).

History
  • Received: March 09, 2021
  • Revised: July 16, 2021
  • Online: October 26, 2021
  • Published: April 06, 2022