Twin-delayed Deep Deterministic Policy Gradient Method Integrating Gravitational Search
Author biographies:

Xu Ping'an (1997-), male, master's student; main research interests: reinforcement learning, deep learning, deep reinforcement learning. Liu Quan (1969-), male, PhD, professor, doctoral supervisor, CCF senior member; main research interests: intelligent information processing, automated reasoning, machine learning. Hao Shaopu (1994-), male, master's student; main research interests: reinforcement learning, deep reinforcement learning, imitation learning. Zhang Lihua (1992-), male, PhD candidate, CCF student member; main research interests: reinforcement learning, deep reinforcement learning, imitation learning.

Corresponding author:

Liu Quan, quanliu@suda.edu.cn

CLC number:

TP18

Funding:

National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175); Priority Academic Program Development of Jiangsu Higher Education Institutions



    Abstract:

    In recent years, deep reinforcement learning has achieved impressive results in complex control tasks. However, its applicability to real-world problems is severely limited by the high sensitivity of hyperparameters and the difficulty of guaranteeing convergence. Metaheuristic algorithms, a class of black-box optimization methods that simulate the objective laws of nature, effectively avoid hyperparameter sensitivity, but they still struggle to scale to the huge number of parameters to be optimized and suffer from low sample efficiency. To address these problems, this study proposes the twin delayed deep deterministic policy gradient based on a gravitational search algorithm (GSA-TD3). The method combines the advantages of the two families of algorithms. Specifically, it updates the policy by gradient optimization for higher sample efficiency and faster learning, and it introduces a population update rule based on the law of universal gravitation into the policy search process to make the search more exploratory and more stable. GSA-TD3 is applied to a series of complex control tasks, and experiments show that it significantly outperforms state-of-the-art deep reinforcement learning methods of the same type.
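    As background, the gravity-based population update the abstract refers to follows the gravitational search algorithm (GSA) of Rashedi et al. (2009). The sketch below is a minimal, simplified illustration of one standard GSA step on a toy objective, not the paper's actual GSA-TD3 integration: the population stands in for candidate policy parameter vectors, and full GSA's shrinking attractor set (Kbest) is omitted for brevity.

```python
import numpy as np

def gsa_step(positions, velocities, fitness, G, rng):
    """One simplified gravitational search update (minimization).

    positions:  (N, D) candidate solutions (e.g. flattened policy parameters)
    velocities: (N, D) current velocities
    fitness:    (N,) objective values, lower is better
    G:          gravitational constant for this iteration
    """
    best, worst = fitness.min(), fitness.max()
    # Map fitness to masses in [0, 1]; better candidates get larger mass.
    m = (worst - fitness) / (worst - best + 1e-12)
    M = m / (m.sum() + 1e-12)

    N, _ = positions.shape
    acc = np.zeros_like(positions)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            diff = positions[j] - positions[i]
            dist = np.linalg.norm(diff)
            # Random weighting keeps the search stochastic, as in standard GSA.
            acc[i] += rng.random() * G * M[j] * diff / (dist + 1e-12)

    # Random inertia damps velocities, so the swarm settles as G decays.
    velocities = rng.random(positions.shape) * velocities + acc
    positions = positions + velocities
    return positions, velocities

# Toy usage: minimize the sphere function with a small population.
rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, size=(20, 4))
vel = np.zeros_like(pop)
for t in range(200):
    fit = np.sum(pop ** 2, axis=1)
    G = 5.0 * np.exp(-20 * t / 200)   # G decays over iterations
    pop, vel = gsa_step(pop, vel, fit, G, rng)

print(np.sum(pop ** 2, axis=1).min())  # best sphere value found
```

    In GSA-TD3 as the abstract describes it, such a population step operates on policy parameters alongside TD3's gradient updates, trading some per-step cost for broader exploration and robustness to hyperparameters.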

Cite this article:

Xu PA, Liu Q, Hao SP, Zhang LH. Twin-delayed deep deterministic policy gradient method integrating gravitational search. Journal of Software, 2023, 34(11): 5191-5204 (in Chinese).

History
  • Received: 2021-08-01
  • Revised: 2021-11-28
  • Published online: 2023-06-16
  • Published: 2023-11-06