An adaptive variance reduction method with negative momentum

doi:10.7523/j.ucas.2024.024

Abstract

Abstract: Stochastic variance reduction methods have been successful in solving large scale machine learning problems, and researchers cooperate them with adaptive stepsize schemes to further alleviate the burden of parameter-tuning. In this article, we propose that there exists a trade-off between progress and effectiveness of adaptive stepsize arising in the SVRG-BB algorithm. To enhance the practical performance of SVRG-BB, we introduce the Katyusha momentum to handle the aforementioned trade-off. The linear convergence rate of the resulting SVRG-BB-Katyusha algorithm is proven under strong convexity condition. Moreover, we propose SVRG-BB-Katyusha-SPARSE algorithm which uses Katyusha momentum sparsely in the inner iterations. Numerical experiments are given to illustrate that the proposed algorithms have promising advantages over SVRG-BB, in the sense that the optimality gaps of the proposed algorithms are smaller than the optimality gap of SVRG-BB by orders of magnitude.

Key words: adaptive stepsize scheme, stochastic variance reduction methods, Barzilai-Borwein method, Katyusha momentum

CLC Number:

O224

LIU Hai, GUO Tiande, HAN Congying. An adaptive variance reduction method with negative momentum[J]. Journal of University of Chinese Academy of Sciences, 2024, 41(5): 577-588.

References

[1] Robbins H, Monro S. A stochastic approximation method[J]. The Annals of Mathematical Statistics, 1951, 22(3):400-407. DOI:10.1214/aoms/1177729586.
[2] Bottou L. Stochastic gradient learning in neural networks[EB/OL].(2006-04-20)[2023-12-26]. http://leon.bottou.org/papers/bottou-91c.
[3] Nemirovski A, Juditsky A, Lan G, et al. Robust stochastic approximation approach to stochastic programming[J]. SIAM Journal on Optimization, 2009, 19(4):1574-1609. DOI:10.1137/070704277.
[4] Zhang T. Solving large scale linear prediction problems using stochastic gradient descent algorithms[C]//Proceedings of the 21st International Conference on Machine Learning. July 4-8, 2004, Banff, Alberta, Canada. ACM, 2004. DOI:10.1145/1015330.1015332.
[5] 赵文深,韩丛英,金玲子.求解带大量凸约束的随机优化问题的随机增广拉格朗日算法[J/OL].(2023-06-21)[2023-12-26].中国科学院大学学报. DOI:10.7523/j.ucas.2023.055.
[6] Bottou L, Curtis F E, Nocedal J. Optimization methods for large-scale machine learning[J]. SIAM Review, 2018, 60(2):223-311. DOI:10.1137/16M1080173.
[7] Le Roux N, Schmidt M, Bach F. A stochastic gradient method with an exponential convergence rate for finite training sets[EB/OL].(2012-02-28)[2023-12-26]. https://doi.org/10.48550/arXiv.1202.6258.
[8] Schmidt M, Le Roux N, Bach F. Minimizing finite sums with the stochastic average gradient[J]. Mathematical Programming, 2017, 162(1):83-112. DOI:10.1007/s10107-016-1030-6.
[9] Defazio A, Bach F, Lacoste-Julien S. SAGA:a fast incremental gradient method with support for non-strongly convex composite objectives[C]. Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1. December 8-13, 2014, Montreal, Canada. ACM, 2014:1646-1654. DOI:10.5555/2968826.2969010.
[10] Raj A, Stich S U. k-SVRG:variance reduction for large scale optimization[EB/OL]. 2018:arXiv:1805.00982.(2018-05-02)[2023-12-26]. https://arxiv.org/abs/1805.00982.
[11] Driggs D, Liang J W, Schönlieb C B. On biased stochastic gradient estimation[EB/OL]. 2019:arXiv:1906.01133.(2019-06-04)[2023-12-26]. http://arxiv.org/abs/1906.01133.
[12] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction[J]. Advances in Neural Information Processing Systems, 2013, 26. DOI:10.5555/2999611.2999647.
[13] Konečný J, Richtárik P. Semi-stochastic gradient descent methods[J]. Frontiers in Applied Mathematics and Statistics, 2017, 3:9. DOI:10.3389/fams.2017.00009.
[14] Kovalev D, Horváth S, Richtárik P. Don't jump through hoops and remove those loops:SVRG and Katyusha are better without the outer loop[EB/OL]. 2019:arXiv:1901.08689.(2019-06-05)[2023-12-26] http://arxiv.org/abs/1901.08689.
[15] Lei L H, Jordan M I. On the adaptivity of stochastic gradient-based optimization[J]. SIAM Journal on Optimization, 2020, 30(2):1473-1500. DOI:10.1137/19M1256919.
[16] Shang F H, Zhou K W, Liu H Y, et al. VR-SGD:a simple stochastic variance reduction method for machine learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 32(1):188-202. DOI:10.1109/TKDE.2018.2878765.
[17] Nguyen L M, Liu J, Scheinberg K, et al. SARAH:a novel method for machine learning problems using stochastic recursive gradient[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. August 6-11, 2017, Sydney, NSW, Australia. ACM, 2017:2613-2621. DOI:10.5555/3305890.3305951.
[18] Xiao L, Zhang T. A proximal stochastic gradient method with progressive variance reduction[J]. SIAM Journal on Optimization, 2014, 24(4):2057-2075. DOI:10.1137/140961791.
[19] Morin M, Giselsson P. Sampling and update frequencies in proximal variance-reduced stochastic gradient methods[EB/OL]. 2020:arXiv:2002.05545.(2020-02-13)[2023-12-26]. https://arxiv.org/abs/2002.05545.
[20] Ying B C, Yuan K, Sayed A H. Variance-reduced stochastic learning under random reshuffling[J]. IEEE Transactions on Signal Processing, 2020, 68:1390-1408. DOI:10.1109/TSP.2020.2968280.
[21] Allen-Zhu Z. Katyusha:The first direct acceleration of stochastic gradient methods[C]//Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. June 19-23, 2017, Montreal, Canada. ACM, 2017:1200-1205. DOI:10.1145/3055399.3055448.
[22] Zhou K W, Shang F H, Cheng J. A simple stochastic variance reduced algorithm with fast convergence rates[EB/OL]. 2018:arXiv:1806.11027.(2018-06-28)[2023-12-26]. http://arxiv.org/abs/1806.11027.
[23] Lan G H, Zhou Y. An optimal randomized incremental gradient method[J]. Mathematical Programming, 2018, 171(1):167-215. DOI:10.1007/s10107-017-1173-0.
[24] Tan C H, Ma S Q, Dai Y H, et al. Barzilai-Borwein step size for stochastic gradient descent[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. December 5-10, 2016, Barcelona, Spain. ACM, 2016:685-693. DOI:10.5555/3157096.3157173.
[25] Dubois-Taine B, Vaswani S, Babanezhad R, et al. SVRG meets AdaGrad:painless variance reduction[J]. Machine Learning, 2022, 111(12):4359-4409. DOI:10.1007/s10994-022-06265-x.
[26] Shi Z, Sadiev A, Loizou N, et al. AI-SARAH:adaptive and implicit stochastic recursive gradient methods[EB/OL]. 2021:arXiv:2102.09700.(2021-02-19)[2023-12-26]. https://arxiv.org/abs/2102.09700.
[27] Liu Y, Han C, Guo T. A class of stochastic variance reduced methods with an adaptive stepsize[EB/OL].(2019-04-07)[2023-12-26].http://www.optimization-online.org/DB_FILE/2019/04/7170.
[28] Barzilai J, Borwein J M. Two-point step size gradient methods[J]. IMA Journal of Numerical Analysis, 1988, 8(1):141-148. DOI:10.1093/imanum/8.1.141.
[29] Yang Z, Chen Z F, Wang C. Accelerating mini-batch SARAH by step size rules[J]. Information Sciences, 2021, 558:157-173. DOI:10.1016/j.ins.2020.12.075.
[30] Yu T T, Liu X W, Dai Y H, et al. A minibatch proximal stochastic recursive gradient algorithm using a trust-region-like scheme and Barzilai-Borwein stepsize[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(10):4627-4638. DOI:10.1109/TNNLS.2020.3025383.
[31] Yu T T, Liu X W, Dai Y H, et al. A mini-batch proximal stochastic recursive gradient algorithm with diagonal Barzilai-Borwein stepsize[J]. Journal of the Operations Research Society of China, 2023, 11(2):277-307. DOI:10.1007/s40305-022-00436-2.
[32] Xie B H, Jin C H, Zhou K W, et al. An adaptive incremental gradient method with support for non-Euclidean norms[EB/OL]. 2022:arXiv:2205.02273.(2022-06-28)[2023-12-26]. http://arxiv.org/abs/2205.02273.
[33] Kingma D P, Ba J. Adam:a method for stochastic optimization[EB/OL]. 2014:arXiv:1412.6980.(2014-12-22)[2023-12-26]. https://arxiv.org/abs/1412.6980.