Adaptive NAG Method Based on AdaGrad and Its Optimal Individual Convergence
Author: LONG Sheng, TAO Wei, ZHANG Ze-Dong, TAO Qing
Affiliation:

    Abstract:

    Compared with gradient descent, the adaptive gradient method (AdaGrad) preserves the geometric information of historical data by using the arithmetic average of squared past gradients, and it attains tighter convergence bounds when coping with sparse data. On the other hand, by adding a momentum term to gradient descent, Nesterov’s accelerated gradient (NAG) not only achieves an order-of-magnitude acceleration for smooth convex optimization problems but also attains the optimal individual convergence rate for non-smooth convex problems. Recently, there have been studies on combining the adaptive strategy with NAG. However, AcceleGrad, a typical adaptive NAG algorithm, fails to reflect the distinctions between dimensions because it uses an adaptive setting different from that of AdaGrad, and it only obtains a weighted-averaging convergence rate; a theoretical analysis of its individual convergence rate is still lacking. In this study, an adaptive NAG method that inherits AdaGrad’s step-size setting is proposed, and it is proved that the proposed algorithm attains the optimal individual convergence rate when solving constrained non-smooth convex optimization problems. Experiments are conducted on typical optimization problems with an L1-norm constraint: the hinge loss function for classification and the L1 loss function for regression. The experimental results verify the correctness of the theoretical analysis and the superior performance of the proposed algorithm over AcceleGrad.
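    To make the combination concrete, below is a minimal NumPy sketch of an adaptive NAG loop of the kind described above: AdaGrad-style per-coordinate step sizes driven by the accumulated squared subgradients, a Nesterov-style extrapolation (momentum) step, and a Euclidean projection onto the L1-norm ball as the constraint set. The function names, the momentum weight (t-1)/(t+2), and the base step size eta are illustrative assumptions, not the exact parameterization analyzed in the paper.

import numpy as np

def project_l1_ball(v, radius=1.0):
    # Euclidean projection onto {x : ||x||_1 <= radius}, sort-based
    # (in the spirit of the efficient projections of reference [28]).
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, u.size + 1)
    k = np.nonzero(u * idx > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def adaptive_nag_sketch(subgradient, x0, steps=1000, eta=0.1, radius=1.0, eps=1e-8):
    # Illustrative adaptive NAG loop: AdaGrad-style per-coordinate step sizes
    # plus Nesterov extrapolation, projecting onto the L1 ball after each update.
    x = project_l1_ball(np.asarray(x0, dtype=float), radius)
    y = x.copy()
    grad_sq_sum = np.zeros_like(x)                 # accumulated squared subgradients
    for t in range(1, steps + 1):
        g = subgradient(y)                         # subgradient at the extrapolated point
        grad_sq_sum += g ** 2
        step = eta / (np.sqrt(grad_sq_sum) + eps)  # per-coordinate AdaGrad step size
        x_next = project_l1_ball(y - step * g, radius)
        mu = (t - 1.0) / (t + 2.0)                 # Nesterov-style momentum weight (assumed)
        y = x_next + mu * (x_next - x)             # extrapolation (momentum) step
        x = x_next
    return x

    For the hinge-loss experiments mentioned above, subgradient(y) would return a (possibly stochastic) subgradient of the average of max(0, 1 - label_i * <y, feature_i>) over the training samples; the projection keeps every iterate inside the L1-norm constraint set.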

    References
    [1] Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. In: Proc. of the 20th Int’l Conf. on Machine Learning (ICML 2003). New York: Association for Computing Machinery, 2003. 928-936.
    [2] Shalev-Shwartz S, Singer Y, Srebro N. Pegasos: Primal estimated sub-gradient solver for SVM. In: Proc. of the 24th Int’l Conf. on Machine Learning (ICML 2007). New York: Association for Computing Machinery, 2007. 807-814.
    [3] Shamir O, Zhang T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In: Proc. of the 30th Int’l Conf. on Machine Learning (ICML 2013). New York: Association for Computing Machinery, 2013. 71-79.
    [4] Ge R, Jain P, Kakade SM, et al. Open problem: Do good algorithms necessarily query bad points? In: Proc. of the 32nd Annual Conf. on Learning Theory (COLT 2019). 2019. 3190-3193.
    [5] Harvey NJA, Plan Y, Randhawa S. Tight analyses for non-smooth stochastic gradient descent. In: Proc. of the 32nd Annual Conf. on Learning Theory (COLT 2019). 2019. 1579-1613.
    [6] Jain P, Nagaraj D, Netrapalli P. Making the last iterate of SGD information theoretically optimal. In: Proc. of the 32nd Annual Conf. on Learning Theory (COLT 2019). 2019. 1752-1755.
    [7] Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
    [8] Zeiler MD. ADADELTA: An adaptive learning rate method. arXiv: 1212.5701, 2012.
    [9] Kingma DP, Ba JL. Adam: A method for stochastic optimization. In: Proc. of the 3rd Int’l Conf. on Learning Representations (ICLR 2015). 2015. 1-13.
    [10] Polyak BT. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964, 4(5): 1-17.
    [11] Nesterov Y. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983, 27(2): 372-376.
    [12] Nemirovsky AS, Yudin DB. Problem Complexity and Method Efficiency in Optimization. New York: Wiley-Interscience, 1983.
    [13] Tseng P. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 2010, 125(2): 263-295.
    [14] Lan G. An optimal method for stochastic composite optimization. Mathematical Programming, 2012, 133(1-2): 365-397.
    [15] Devolder O, Glineur F, Nesterov Y. First-Order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 2014, 146(1-2): 37-75.
    [16] Tao W, Pan Z, Wu G, et al. The strength of Nesterov’s extrapolation in the individual convergence of nonsmooth optimization. IEEE Trans. on Neural Networks and Learning Systems, 2020, 31(7): 2557-2568.
    [17] Liu YX, Cheng YJ, Tao Q. Individual convergence of NAG with biased gradient in nonsmooth cases. Ruan Jian Xue Bao/Journal of Software, 2020, 31(4): 1051-1062 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5926.htm [doi: 10.13328/j.cnki.jos.005926]
    [18] Dozat T. Incorporating Nesterov momentum into Adam. In: Proc. of the 4th Int’l Conf. on Learning Representations (ICLR 2016). 2016.
    [19] Levy KY, Yurtsever A, Cevher V. Online adaptive methods, universality and acceleration. In: Proc. of the 32nd Annual Conf. on Neural Information Processing Systems (NeurIPS 2018). Cambridge: MIT, 2018. 6501-6510.
    [20] Levy KY. Online to offline conversions, universality and adaptive minibatch sizes. In: Proc. of the 31st Annual Conf. on Neural Information Processing Systems (NeurIPS 2017). Cambridge: MIT, 2017. 1613-1622.
    [21] Allen-Zhu Z, Orecchia L. Linear coupling: An ultimate unification of gradient and mirror descent. In: Proc. of the 8th Innovations in Theoretical Computer Science Conf. (ITCS 2017). 2017. 3:1-3:22.
    [22] Kavis A, Levy KY, Bach F, et al. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In: Proc. of the 33rd Annual Conf. on Neural Information Processing Systems (NeurIPS 2019). Cambridge: MIT, 2019. 6257-6266.
    [23] Deng Q, Cheng Y, Lan G. Optimal adaptive and accelerated stochastic gradient descent. arXiv: 1810.00553, 2018.
    [24] Tao W, Long S, Wu G, et al. The role of momentum parameters in the optimal convergence of adaptive Polyak’s Heavy-Ball methods. In: Proc. of the 9th Int’l Conf. on Learning Representations (ICLR 2021). 2021.
    [25] Cheng YJ, Tao W, Liu YX, et al. Optimal individual convergence rate of the Heavy-ball-based momentum methods. Journal of Computer Research and Development, 2019, 56(8): 1686-1694 (in Chinese with English abstract).
    [26] Chen G, Teboulle M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 1993, 3(3): 538-543. [doi: 10.1137/0803026]
    [27] Jaggi M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In: Proc. of the 30th Int’l Conf. on Machine Learning (ICML 2013). New York: Association for Computing Machinery, 2013. 427-435.
    [28] Liu J, Ye J. Efficient Euclidean projections in linear time. In: Proc. of the 26th Int’l Conf. on Machine Learning (ICML 2009). New York: Association for Computing Machinery, 2009. 657-664.
    [29] Liu J, Ji S, Ye J. SLEP: Sparse learning with efficient projections. Arizona: Arizona State University, 2009.
    [30] Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proc. of the 29th Int’l Conf. on Machine Learning (ICML 2012). New York: Association for Computing Machinery, 2012. 449-456.
    [31] Chen X, Lin Q, Pena J. Optimal regularized dual averaging methods for stochastic optimization. In: Proc. of the 26th Annual Conf. on Neural Information Processing Systems (NIPS 2012). Cambridge: MIT, 2012. 404-412.
Get Citation

Long S, Tao W, Zhang ZD, Tao Q. Adaptive NAG method based on AdaGrad and its optimal individual convergence. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1231-1243 (in Chinese with English abstract).

History
  • Received: March 09, 2021
  • Revised: July 16, 2021
  • Online: October 26, 2021
  • Published: April 06, 2022