Model-free Safe Reinforcement Learning Method Based on Constrained Markov Decision Processes
Authors: ZHU Fei, GE Yang-Yang, LING Xing-Hong, LIU Quan
Affiliation:

About the authors:

ZHU Fei (1978-), male, Ph.D., associate professor, senior member of CCF. His research interests include safe reinforcement learning, deep reinforcement learning, and medical informatics.
LING Xing-Hong (1968-), male, Ph.D., associate professor, professional member of CCF. His research interests include machine learning, semantic Web, knowledge management, and enterprise informatization.
GE Yang-Yang (1995-), female, master's student. Her research interest is safe reinforcement learning.
LIU Quan (1969-), male, Ph.D., professor, doctoral supervisor, senior member of CCF. His research interests include deep learning, reinforcement learning, and statistical artificial intelligence.

Corresponding author:

ZHU Fei, E-mail: zhufei@suda.edu.cn

CLC number:

TP18

Fund projects:

National Natural Science Foundation of China (61303108, 61772355); Natural Science Research Program of Jiangsu Higher Education Institutions (17KJA520004); Suzhou Key Industry Technology Innovation Program, Prospective Application Research Project (SYG201804); Program of the Provincial Key Laboratory of Higher Education Institutions (Soochow University) (KJS1524); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)




    Abstract:

    Many reinforcement learning methods pay little attention to the safety of the decisions made by agents. Despite many successful applications in research and industry, it is still necessary to ensure that agents' decisions are safe. Traditional approaches to this safety problem, such as changing the objective function or modifying the agent's exploration process, neglect the grave consequences that unsafe decisions may cause and therefore cannot solve the problem effectively. To address the issue, a safe Sarsa(λ) method and a safe Sarsa method, both based on constrained Markov decision processes, are proposed by imposing safety constraints on the action space. During the solution process, the agent must not only seek the maximum state-action value but also satisfy the safety constraints, so as to obtain a safe optimal policy. Since standard reinforcement learning methods are no longer suitable for solving the constrained safe Sarsa(λ) and safe Sarsa models, a solution model for safe reinforcement learning is also introduced to obtain the globally optimal state-action value function under the constraints. The model is based on linearized multidimensional constraints and adopts the Lagrange multiplier method to transform the safe reinforcement learning model into a convex model, provided that the objective and constraint functions are differentiable. The proposed solution algorithm keeps the agent away from local optima and improves solution efficiency and precision. The feasibility of the algorithm is proved, and its effectiveness is verified by experiments.
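    To make the constrained formulation concrete: a constrained MDP augments the usual objective (maximize the expected discounted return) with the requirement that an expected discounted safety cost stay below a threshold d, and the Lagrange multiplier method instead optimizes the unconstrained surrogate L(π, λ) = J_r(π) − λ(J_c(π) − d) over the policy π and the multiplier λ ≥ 0. The sketch below is a minimal illustration of this idea in a tabular Sarsa-style learner; it is not the paper's exact algorithm, and the environment interface (a step() that returns a scalar cost alongside the reward), the single-constraint setup, and all names are assumptions made for the example.

    import numpy as np

    def safe_sarsa(env, n_states, n_actions, cost_limit,
                   episodes=500, alpha=0.1, beta=0.01,
                   gamma=0.99, epsilon=0.1):
        # Illustrative sketch only: `env` is a hypothetical environment
        # whose step(a) returns (next_state, reward, cost, done).
        Q = np.zeros((n_states, n_actions))   # reward critic
        C = np.zeros((n_states, n_actions))   # safety-cost critic
        lam = 0.0                             # Lagrange multiplier, kept >= 0

        def policy(s):
            # epsilon-greedy on the Lagrangian value Q - lam * C
            if np.random.rand() < epsilon:
                return np.random.randint(n_actions)
            return int(np.argmax(Q[s] - lam * C[s]))

        for _ in range(episodes):
            s = env.reset()
            a = policy(s)
            ep_cost, done = 0.0, False
            while not done:
                s2, r, c, done = env.step(a)
                a2 = policy(s2)
                # on-policy Sarsa targets for both the reward and cost critics
                Q[s, a] += alpha * (r + gamma * (0 if done else Q[s2, a2]) - Q[s, a])
                C[s, a] += alpha * (c + gamma * (0 if done else C[s2, a2]) - C[s, a])
                ep_cost += c
                s, a = s2, a2
            # dual ascent: raise lam when the episode overspends the cost budget
            lam = max(0.0, lam + beta * (ep_cost - cost_limit))
        return Q, C, lam

    With multiple cost signals, λ becomes a vector and the same dual-ascent update is applied per constraint, which is the multidimensional-constraint setting the abstract describes.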

Cite this article:

ZHU Fei, GE Yang-Yang, LING Xing-Hong, LIU Quan. Model-free safe reinforcement learning method based on constrained MDP. Ruan Jian Xue Bao/Journal of Software, 2022, 33(8): 3086-3102 (in Chinese).

History
  • Received: 2019-08-30
  • Revised: 2020-09-08
  • Accepted:
  • Online: 2022-08-13
  • Published: 2022-08-06