Safe Reinforcement Learning Algorithm and Its Application in Intelligent Control for CPS
Authors: Zhao Hengjun, Li Quanzhong, Zeng Xia, Liu Zhiming

About the authors:

Zhao Hengjun (1985-), male, Ph.D., lecturer, CCF professional member. His research interests include cyber-physical systems and formal methods.
Zeng Xia (1987-), female, Ph.D., lecturer. Her research interests include cyber-physical systems and numeric-symbolic computation.
Li Quanzhong (1995-), male, master's student. His research interests include reinforcement learning and intelligent control.
Liu Zhiming (1961-), male, Ph.D., professor, doctoral supervisor, CCF senior member. His research interests include software theory and methods.

Corresponding author:

Zeng Xia, E-mail: xzeng0712@swu.edu.cn

CLC number:

TP311

Funding:

National Natural Science Foundation of China (61902325, 62032019, 61972385, 61732019, 61702425); National Talent Development Project of Southwest University (SWU116007)


Abstract:

Safe controller design for cyber-physical systems (CPS) is an active research topic. Existing safe controller design approaches based on formal methods suffer from problems such as excessive reliance on system models and poor scalability. Intelligent control based on deep reinforcement learning can handle high-dimensional, nonlinear, complex systems as well as uncertain systems, and is becoming a very promising CPS control technology, but it lacks safety guarantees. This study addresses the safety deficiencies of reinforcement learning control by focusing on a typical industrial oil pump control system as a case study, and carries out research on a new safe reinforcement learning algorithm and its application to intelligent control. First, the safe reinforcement learning problem for the industrial oil pump is formalized and a simulation environment of the oil pump is built. Then, by designing the structure and activation function of the output layer, a neural network oil pump controller is constructed such that the linear inequality constraints on the pump switching times are satisfied. Finally, to better balance the safety and optimality control objectives, a new safe reinforcement learning algorithm is designed and implemented based on the augmented Lagrangian multiplier method. Comparative experiments on the industrial oil pump show that the controllers generated by the proposed algorithm surpass those of existing algorithms of the same category in both safety and optimality. In further evaluation, the generated neural network controllers pass rigorous formal verification with a probability of 90%; meanwhile, compared with the theoretically optimal controller, they incur a loss in the optimal objective value as low as 2%. The proposed method is expected to be extended to more application scenarios, and the case study scheme is expected to serve as a reference for other researchers in the fields of safe intelligent control and formal verification.
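
To make the two technical ingredients of the abstract concrete, the following PyTorch sketch illustrates (i) one possible output-layer parameterization that makes the predicted switching times satisfy linear inequality constraints of the kind described above (ordering, minimum separation, cycle-length bound), and (ii) an augmented-Lagrangian policy loss that trades off the optimality objective against a safety-cost constraint. This is a minimal sketch under assumed values for the cycle length, minimum switching gap, and number of switches; all identifiers (SwitchTimePolicy, augmented_lagrangian_loss, etc.) are hypothetical and are not taken from the paper.

# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn

CYCLE_LEN = 20.0   # assumed length of one control cycle (time units)
MIN_GAP = 2.0      # assumed minimum separation between consecutive switches
N_SWITCH = 4       # assumed number of switching instants per cycle


class SwitchTimePolicy(nn.Module):
    """Maps the observed state to switching times t_1 < t_2 < ... < t_n with
    t_{i+1} - t_i >= MIN_GAP and t_n <= CYCLE_LEN, enforced by the output layer."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, N_SWITCH)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        raw = self.head(self.body(state))           # unconstrained outputs
        slack = CYCLE_LEN - N_SWITCH * MIN_GAP      # time budget beyond the minimum gaps
        # Sigmoid keeps each extra gap in (0, slack / N_SWITCH), so the cumulative
        # sum of gaps is increasing, respects MIN_GAP, and never exceeds CYCLE_LEN.
        gaps = MIN_GAP + torch.sigmoid(raw) * (slack / N_SWITCH)
        return torch.cumsum(gaps, dim=-1)           # ordered switching times


def augmented_lagrangian_loss(reward: torch.Tensor,
                              safety_cost: torch.Tensor,
                              safety_limit: float,
                              lam: float,
                              rho: float) -> torch.Tensor:
    """Policy loss that trades off optimality against the safety constraint
    E[safety_cost] <= safety_limit, using one common augmented-Lagrangian form."""
    violation = torch.clamp(safety_cost.mean() - safety_limit, min=0.0)
    return -reward.mean() + lam * violation + 0.5 * rho * violation * violation


# After each training iteration the multiplier is updated by the usual
# projected dual-ascent rule: lam <- max(0.0, lam + rho * violation).

The sketch only shows the shape of the approach: the output layer guarantees the switching-time constraints by construction, while the augmented-Lagrangian term handles the remaining safety objective during training.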

Cite this article:

Zhao HJ, Li QZ, Zeng X, Liu ZM. Safe reinforcement learning algorithm and its application in intelligent control for CPS. Journal of Software, 2022, 33(7): 2538-2561 (in Chinese with English abstract).

History
  • Received: 2021-09-05
  • Revised: 2021-10-14
  • Published online: 2022-01-28
  • Published: 2022-07-06