Twin-delayed Deep Deterministic Policy Gradient Method Integrating Gravitational Search
Author biographies:

Xu Ping'an (1997-), male, master's student; main research interests: reinforcement learning, deep learning, deep reinforcement learning. Liu Quan (1969-), male, PhD, professor, doctoral supervisor, CCF senior member; main research interests: intelligent information processing, automated reasoning, machine learning. Hao Shaopu (1994-), male, master's student; main research interests: reinforcement learning, deep reinforcement learning, imitation learning. Zhang Lihua (1992-), male, PhD candidate, CCF student member; main research interests: reinforcement learning, deep reinforcement learning, imitation learning.

Corresponding author:

Liu Quan, quanliu@suda.edu.cn

CLC number:

TP18

Funding:

National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175); Priority Academic Program Development of Jiangsu Higher Education Institutions



    Abstract:

    In recent years, deep reinforcement learning has achieved impressive results in complex control tasks. However, its applicability to real-world problems is severely limited by the high sensitivity of hyperparameters and the difficulty of guaranteeing convergence. Metaheuristic algorithms, a class of black-box optimization methods that simulate the objective laws of nature, effectively avoid hyperparameter sensitivity, but they still struggle to scale to the huge number of parameters to be optimized and suffer from low sample efficiency. To address these problems, this study proposes the twin delayed deep deterministic policy gradient based on a gravitational search algorithm (GSA-TD3). The method combines the advantages of the two families of algorithms. Specifically, it updates the policy by gradient optimization for higher sample efficiency and faster learning, and it introduces a population update rule based on the law of universal gravitation into the policy search process to make the search more exploratory and more stable. GSA-TD3 is applied to a series of complex control tasks, and experiments show that it significantly outperforms state-of-the-art deep reinforcement learning methods of the same type.
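    As background, the gravity-based population update the abstract refers to follows the gravitational search algorithm (GSA) of Rashedi et al. (2009). The sketch below is a minimal, simplified illustration of one standard GSA step on a toy objective, not the paper's actual GSA-TD3 integration: the population stands in for candidate policy parameter vectors, and full GSA's shrinking attractor set (Kbest) is omitted for brevity.

```python
import numpy as np

def gsa_step(positions, velocities, fitness, G, rng):
    """One simplified gravitational search update (minimization).

    positions:  (N, D) candidate solutions (e.g. flattened policy parameters)
    velocities: (N, D) current velocities
    fitness:    (N,) objective values, lower is better
    G:          gravitational constant for this iteration
    """
    best, worst = fitness.min(), fitness.max()
    # Map fitness to masses in [0, 1]; better candidates get larger mass.
    m = (worst - fitness) / (worst - best + 1e-12)
    M = m / (m.sum() + 1e-12)

    N, _ = positions.shape
    acc = np.zeros_like(positions)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            diff = positions[j] - positions[i]
            dist = np.linalg.norm(diff)
            # Random weighting keeps the search stochastic, as in standard GSA.
            acc[i] += rng.random() * G * M[j] * diff / (dist + 1e-12)

    # Random inertia damps velocities, so the swarm settles as G decays.
    velocities = rng.random(positions.shape) * velocities + acc
    positions = positions + velocities
    return positions, velocities

# Toy usage: minimize the sphere function with a small population.
rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, size=(20, 4))
vel = np.zeros_like(pop)
for t in range(200):
    fit = np.sum(pop ** 2, axis=1)
    G = 5.0 * np.exp(-20 * t / 200)   # G decays over iterations
    pop, vel = gsa_step(pop, vel, fit, G, rng)

print(np.sum(pop ** 2, axis=1).min())  # best sphere value found
```

    In GSA-TD3 as the abstract describes it, such a population step operates on policy parameters alongside TD3's gradient updates, trading some per-step cost for broader exploration and robustness to hyperparameters.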

Cite this article:

Xu PA, Liu Q, Hao SP, Zhang LH. Twin-delayed deep deterministic policy gradient method integrating gravitational search. Journal of Software, 2023, 34(11): 5191-5204 (in Chinese).

History
  • Received: 2021-08-01
  • Revised: 2021-11-28
  • Published online: 2023-06-16
  • Published: 2023-11-06