Learnable Weighting Mechanism in Model-based Reinforcement Learning
Author biographies:

HUANG Wen-Zhen (1992-), male, Ph.D., his research interests include reinforcement learning; YIN Qi-Yue (1990-), male, Ph.D., associate professor, CCF professional member, his research interests include machine learning, data mining, and artificial intelligence and games; ZHANG Jun-Ge (1986-), male, Ph.D., professor, his research interests include game decision-making, reinforcement learning, pattern recognition, and artificial intelligence; HUANG Kai-Qi (1977-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member, his research interests include computer vision, pattern recognition, human-computer gaming, and visual surveillance applications.

Corresponding author:

ZHANG Jun-Ge, jgzhang@nlpr.ia.ac.cn

CLC number:

TP181

Funds:

National Natural Science Foundation of China (61876181, 61673375); Beijing Municipal Science and Technology Innovation Plan (Z19110000119043); Youth Innovation Promotion Association of the Chinese Academy of Sciences; Project of the Chinese Academy of Sciences (QYZDB-SSW-JSC006)


Author:
  • HUANG Wen-Zhen
    School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Center for Research on Intelligent System and Engineering (CRISE), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • YIN Qi-Yue
    School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Center for Research on Intelligent System and Engineering (CRISE), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • ZHANG Jun-Ge
    School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Center for Research on Intelligent System and Engineering (CRISE), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • HUANG Kai-Qi
    School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Center for Research on Intelligent System and Engineering (CRISE), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, China
Abstract:

Model-based reinforcement learning methods use the collected samples to train a model that simulates the environment, and then use the imaginary samples generated by this model to optimize the policy, so they have the potential to improve sample efficiency. Nevertheless, owing to the shortage of training samples, the learned environment model is often inaccurate, and the imaginary samples it generates carry prediction errors that can harm the training process. To address this problem, a learnable weighting mechanism is proposed that reduces the negative effect of the generated samples on training by reweighting them. The effect of an imaginary sample on training is quantified as follows: the value and policy networks are first updated with the sample under evaluation, the loss on real samples is computed before and after this update, and the change in loss measures the sample's effect on the training process. Experimental results show that a reinforcement learning algorithm built on this weighting mechanism outperforms existing model-based and model-free algorithms on multiple tasks.
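To make the quantification step described in the abstract concrete, the sketch below scores a batch of model-generated ("imaginary") transitions by the change in a loss measured on real transitions before and after a trial update. It is a minimal illustration under assumed names (ValueNet, td_loss, score_imaginary_batch) and a simplified one-step TD objective, not the authors' implementation.

```python
# Minimal sketch of scoring an imaginary (model-generated) batch by the change
# it causes in the loss measured on real transitions. Names and the TD objective
# are illustrative assumptions, not the paper's exact formulation.
import copy
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Tiny state-value network standing in for the value/policy networks."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)

def td_loss(value_net, batch, gamma=0.99):
    """One-step TD regression loss on a batch of (s, r, s') transitions."""
    s, r, s_next = batch
    with torch.no_grad():
        target = r + gamma * value_net(s_next)
    return nn.functional.mse_loss(value_net(s), target)

def score_imaginary_batch(value_net, imaginary_batch, real_batch, lr=1e-3):
    """Return the decrease in real-data loss caused by one update on the
    imaginary batch; a positive score means the imaginary batch helped."""
    trial_net = copy.deepcopy(value_net)          # do not disturb the live network
    opt = torch.optim.SGD(trial_net.parameters(), lr=lr)

    loss_before = td_loss(trial_net, real_batch).item()
    opt.zero_grad()
    td_loss(trial_net, imaginary_batch).backward()  # trial update with imaginary data
    opt.step()
    loss_after = td_loss(trial_net, real_batch).item()
    return loss_before - loss_after

if __name__ == "__main__":
    torch.manual_seed(0)
    state_dim = 4
    net = ValueNet(state_dim)
    real = (torch.randn(32, state_dim), torch.randn(32), torch.randn(32, state_dim))
    fake = (torch.randn(32, state_dim), torch.randn(32), torch.randn(32, state_dim))
    print(f"effect of imaginary batch on real loss: {score_imaginary_batch(net, fake, real):+.6f}")
```

The sketch isolates only the before/after loss comparison; in the paper, this quantity is used to reweight the generated samples so that those which harm training contribute less.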

Cite this article:

HUANG Wen-Zhen, YIN Qi-Yue, ZHANG Jun-Ge, HUANG Kai-Qi. Learnable weighting mechanism in model-based reinforcement learning. Ruan Jian Xue Bao/Journal of Software, 2023, 34(6): 2765-2775 (in Chinese with English abstract).
History:
  • Received: 2021-04-14
  • Revised: 2021-06-07
  • Published online: 2022-10-14
  • Published: 2023-06-06