国家自然科学基金(61772355, 61702055, 61876217, 62176175); 江苏高校优势学科建设工程
近年来, 深度强化学习在复杂控制任务中取得了令人瞩目的效果, 然而由于超参数的高敏感性和收敛性难以保证等原因, 严重影响了其对现实问题的适用性. 元启发式算法作为一类模拟自然界客观规律的黑盒优化方法, 虽然能够有效避免超参数的敏感性, 但仍存在无法适应待优化参数量规模巨大和样本使用效率低等问题. 针对以上问题, 提出融合引力搜索的双延迟深度确定策略梯度方法(twin delayed deep deterministic policy gradient based on gravitational search algorithm, GSA-TD3). 该方法融合两类算法的优势: 一是凭借梯度优化的方式更新策略, 获得更高的样本效率和更快的学习速度; 二是将基于万有引力定律的种群更新方法引入到策略搜索过程中, 使其具有更强的探索性和更好的稳定性. 将GSA-TD3应用于一系列复杂控制任务中, 实验表明, 与前沿的同类深度强化学习方法相比, GSA-TD3在性能上具有显著的优势.
In recent years, deep reinforcement learning has achieved impressive results in complex control tasks. However, its applicability to real-world problems has been seriously weakened by the high sensitivity of hyperparameters and the difficulty in guaranteeing convergence. Metaheuristic algorithms, as a class of black-box optimization methods simulating the objective laws of nature, can effectively avoid the sensitivity of hyperparameters. Nevertheless, they are still faced with various problems, such as the inability to adapt to a huge scale of parameters to be optimized and the low efficiency of sample usage. To address the above problems, this study proposes the twin delayed deep deterministic policy gradient based on a gravitational search algorithm (GSA-TD3). The method combines the advantages of the two types of algorithms. Specifically, it updates the policy by gradient optimization for higher sample efficiency and a faster learning speed. Moreover, it applies the population update method based on the law of gravity to the policy search process to make it more exploratory and stable. GSA-TD3 is further applied to a series of complex control tasks, and experiments show that it significantly out performs similar deep reinforcement learning methods at the forefront.