Stochastic Optimization Algorithm with Variance Reduction for Solving Non-Smooth Problems
Author: 朱小辉, 陶卿, 邵言剑, 储德军
Funding:

National Natural Science Foundation of China (61273296); Natural Science Foundation of Anhui Province (1308085QF121)


    Abstract:

    Stochastic optimization is one of the efficient methods for solving large-scale machine learning problems. In stochastic learning, the full gradient is replaced by the gradient of the loss function at a single, randomly selected sample, which greatly reduces the computational cost but usually introduces a large variance. Recent studies have shown that the convergence rate of stochastic gradient methods can be effectively improved by reducing this variance. This paper addresses the variance reduction issue in COMID (composite objective mirror descent) when solving non-smooth optimization problems. First, a proof is provided to show that COMID has a convergence rate of O(1/√T + σ²/√T) in terms of the variance, where T is the number of iterations and σ² is the variance. This convergence rate guarantees the effectiveness of variance reduction. Then, a stochastic optimization algorithm, α-MDVR (mirror descent with variance reduction), is obtained by incorporating a variance reduction strategy into COMID. Unlike Prox-SVRG (proximal stochastic variance reduced gradient), the convergence rate of α-MDVR does not depend on the number of samples, and only a small portion of samples is used to modify the gradient at each iteration. Comparative experiments demonstrate that α-MDVR not only reduces the variance but also saves computational time.
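
The abstract only sketches how the variance-reduction strategy interacts with the composite (non-smooth) objective, so a small illustration may help. Below is a minimal sketch, not the authors' exact α-MDVR: it assumes an L1-regularized squared loss and uses the Euclidean distance as the Bregman divergence, so the composite mirror-descent (COMID-style) step reduces to a soft-thresholding proximal step. The function names, the `anchor_size` parameter, and the periodic anchor refresh are illustrative assumptions standing in for the paper's idea of using only a small portion of samples to correct the single-sample gradient.

```python
# Illustrative sketch only -- NOT the paper's alpha-MDVR. It combines:
#  (1) a COMID-style composite step with the Euclidean Bregman divergence,
#      which reduces to a proximal-gradient / soft-thresholding update, and
#  (2) an SVRG-style variance-reduced gradient whose anchor gradient is
#      averaged over a random subset of samples rather than the full data set.
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (handles the non-smooth L1 term)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_single(w, x, y):
    """Gradient of the squared loss 0.5 * (x^T w - y)^2 at one sample."""
    return (x @ w - y) * x

def variance_reduced_comid(X, Y, lam=0.01, eta=0.1, T=1000, anchor_size=64, seed=0):
    """Run T variance-reduced composite mirror-descent steps (hypothetical demo)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_anchor = w.copy()
    idx = rng.choice(n, size=min(anchor_size, n), replace=False)
    g_anchor = np.mean([grad_single(w_anchor, X[j], Y[j]) for j in idx], axis=0)
    for t in range(1, T + 1):
        i = rng.integers(n)
        # Variance-reduced estimate: single-sample gradient, corrected by its
        # value at the anchor point plus the subset-averaged anchor gradient.
        g = grad_single(w, X[i], Y[i]) - grad_single(w_anchor, X[i], Y[i]) + g_anchor
        # Composite step: with the Euclidean divergence this is exactly a
        # proximal-gradient update with soft-thresholding for the L1 term.
        w = soft_threshold(w - eta * g, eta * lam)
        if t % 100 == 0:
            # Refresh the anchor on a new random subset (a stand-in for the
            # "partial samples" correction described in the abstract).
            w_anchor = w.copy()
            idx = rng.choice(n, size=min(anchor_size, n), replace=False)
            g_anchor = np.mean([grad_single(w_anchor, X[j], Y[j]) for j in idx], axis=0)
    return w
```

With the Euclidean divergence the mirror step has a closed-form proximal map; choosing a different Bregman divergence would only change the soft-thresholding line, while the variance-reduced gradient estimate stays the same.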

    References
    [1] Tseng P. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 2010,125(2):263-295.[doi:10.1007/s10107-010-0394-2]
    [2] Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009,19(4):1574-1609.[doi:10.1137/070704277]
    [3] Shalev-Shwartz S, Tewari A. Stochastic methods for L1 regularized loss minimization. In: Proc. of the 26th Annual Int'l Conf. on Machine Learning. 2009. 929-936.
    [4] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In: Proc. of the Advances in Neural Information Processing Systems 26. 2013. 315-323.
    [5] Shalev-Shwartz S, Zhang T. Stochastic dual coordinate ascent methods for regularized loss minimization. arXiv preprint arXiv:1209.1873, 2012.
    [6] Le Roux N, Schmidt M, Bach F. A stochastic gradient method with an exponential convergence rate for strongly convex optimization with finite training sets. arXiv preprint arXiv:1202.6258, 2012.
    [7] Xiao L. Dual averaging methods for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems. 2009. 2116-2124.
    [8] Xiao L, Zhang T. A proximal stochastic gradient method with progressive variance reduction. arXiv preprint arXiv:1403.4699v1, 2014.
    [9] Duchi J, Shalev-Shwartz S, Singer Y, Tewari A. Composite objective mirror descent. In: Proc. of the 23rd Annual Workshop on Computational Learning Theory. ACM Press, 2010. 116-128.
    [10] Duchi J, Shalev-Shwartz S, Singer Y. Efficient projections onto the L1-ball for learning in high dimensions. In: Proc. of the 25th Int'l Conf. on Machine Learning. 2008. 272-279.
    [11] Beck A, Teboulle M. A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009,2(1):183-202.[doi:10.1137/080716542]
    [12] Nesterov Y. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 1983,27(2):372-376.
    [13] Bottou L. Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade. Berlin, Heidelberg: Springer-Verlag, 2012. 421-436.[doi:10.1007/978-3-642-35289-8_25]
    [14] Beck A, Teboulle M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003,31(3):167-175.[doi:10.1016/S0167-6377(02)00231-6]
    [15] Bregman LM. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 1967,7(3):200-217.[doi:10.1016/0041-5553(67)90040-7]
    [16] Hazan E, Kale S. Beyond the regret minimization barrier:An optimal algorithm for stochastic strongly convex optimization. Journal of Machine Learning Research, 2014,15(1):2489-2512.
    [17] Lan G. An optimal method for stochastic composite optimization. Mathematical Programming, Series A, 2012,133(1-2):365-397.[doi:10.1007/s10107-010-0434-y]
    [18] Shalev-Shwartz S, Singer Y, Srebro N. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 2011,127(1):3-30.[doi:10.1007/s10107-010-0420-4]
    [19] Lin QH, Chen X, Peña J. A sparsity preserving stochastic gradient method for composite optimization. Manuscript, Carnegie Mellon University, 2011.
Cite this article

朱小辉, 陶卿, 邵言剑, 储德军. Stochastic Optimization Algorithm with Variance Reduction for Solving Non-Smooth Problems. Journal of Software, 2015,26(11):2752-2761

History
  • Received: 2015-02-26
  • Revised: 2015-08-26
  • Published online: 2015-11-04