Offline Reinforcement Learning Method with Diffusion Model and Expectation Maximization
Author: Liu Quan (刘全), Yan Jie (颜洁), Wulan (乌兰)
Affiliation:

CLC Number: TP18

    Abstract:

Offline reinforcement learning has achieved remarkable results on tasks with continuous and dense rewards. However, because the training process does not interact with the environment, its generalization ability is reduced, and performance is difficult to guarantee in environments with discrete and sparse rewards. The diffusion model perturbs sample data with noise and thereby combines information from the neighborhood of the samples to generate actions close to the sample distribution, which strengthens the learning and generalization ability of the agent. To this end, an offline reinforcement learning method with diffusion model and expectation maximization (DMEM) is proposed. The method updates the objective function by maximizing the expectation of the log-likelihood, which makes the policy more generalizable. In addition, a diffusion model is introduced into the policy network so that its diffusion property enhances the policy's ability to learn from data samples. Meanwhile, expectile regression is employed to update the value function from the perspective of high-dimensional space, and a penalty term is introduced to make the value estimates more accurate. DMEM is applied to a series of tasks with discrete and sparse rewards, and experiments show that it has a clear performance advantage over other classical offline reinforcement learning methods.
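
    The abstract names two concrete building blocks: an expectile-regression update for the value function and a diffusion-based policy that denoises Gaussian noise into actions conditioned on the state. The sketch below is a minimal illustration of these two general techniques (in the spirit of implicit Q-learning and DDPM-style diffusion policies), not the authors' DMEM implementation: the network architecture, the expectile level tau, the number of diffusion steps T, the beta schedule, and all other hyperparameters are assumptions, and the EM-style policy objective and the value-function penalty term are omitted.

    ```python
    # Hedged sketch of an expectile-regression value loss and a DDPM-style
    # reverse process for sampling actions from a diffusion policy.
    # All names, sizes, and hyperparameters here are illustrative assumptions.
    import torch
    import torch.nn as nn


    def expectile_loss(value: torch.Tensor, target_q: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
        """Asymmetric L2 loss: errors where Q > V are weighted by tau, others by 1 - tau."""
        diff = target_q - value
        weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1.0 - tau))
        return (weight * diff.pow(2)).mean()


    class NoisePredictor(nn.Module):
        """epsilon_theta(a_t, s, t): predicts the noise added to an action at diffusion step t."""
        def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim + 1, hidden), nn.Mish(),
                nn.Linear(hidden, hidden), nn.Mish(),
                nn.Linear(hidden, action_dim),
            )

        def forward(self, noisy_action, state, t):
            # The diffusion step t is normalized to [0, 1] and appended as an extra feature.
            return self.net(torch.cat([noisy_action, state, t], dim=-1))


    @torch.no_grad()
    def sample_action(model: NoisePredictor, state: torch.Tensor, action_dim: int, T: int = 5):
        """DDPM reverse process: start from Gaussian noise and denoise T steps into an action."""
        betas = torch.linspace(1e-4, 2e-2, T)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        a = torch.randn(state.shape[0], action_dim)          # a_T ~ N(0, I)
        for t in reversed(range(T)):
            t_in = torch.full((state.shape[0], 1), t / T)
            eps = model(a, state, t_in)
            # Posterior mean of a_{t-1} given a_t (standard DDPM update rule).
            a = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:
                a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
        return a.clamp(-1.0, 1.0)
    ```

    In an actor-critic training loop, expectile_loss would fit the value network toward target Q-values computed from the offline dataset, while sample_action would draw candidate actions from the diffusion policy for evaluation; both function names are hypothetical and are only meant to make the abstract's description concrete.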

Get Citation

Liu Q, Yan J, Wulan. Offline reinforcement learning method with diffusion model and expectation maximization. Ruan Jian Xue Bao/Journal of Software, (): 1–15 (in Chinese with English abstract).

History
  • Received: May 06, 2024
  • Revised: July 18, 2024
  • Online: February 19, 2025