Safe Reinforcement Learning Algorithm and Its Application in Intelligent Control for CPS
Authors: Zhao Hengjun, Li Quanzhong, Zeng Xia, Liu Zhiming

About the authors:

Zhao Hengjun (1985-), male, Ph.D., lecturer, CCF professional member. His research interests include cyber-physical systems and formal methods.
Zeng Xia (1987-), female, Ph.D., lecturer. Her research interests include cyber-physical systems and numeric-symbolic computation.
Li Quanzhong (1995-), male, master's student. His research interests include reinforcement learning and intelligent control.
Liu Zhiming (1961-), male, Ph.D., professor, doctoral supervisor, CCF senior member. His research interests include software theory and methods.

Corresponding author:

Zeng Xia, E-mail: xzeng0712@swu.edu.cn

CLC number:

TP311

Funding:

National Natural Science Foundation of China (61902325, 62032019, 61972385, 61732019, 61702425); National Talent Development Project of Southwest University (SWU116007)


Abstract:

Safe controller design for cyber-physical systems (CPS) is an active research topic. Existing safe controller design approaches based on formal methods suffer from problems such as excessive reliance on system models and poor scalability. Intelligent control based on deep reinforcement learning can handle high-dimensional, nonlinear, complex systems as well as uncertain systems, and is becoming a very promising CPS control technology, but it lacks safety guarantees. This study addresses the safety deficiencies of reinforcement learning control by focusing on a typical industrial oil pump control system as a case study, and carries out research on a new safe reinforcement learning algorithm and its application to intelligent control. First, the safe reinforcement learning problem for the industrial oil pump is formalized and a simulation environment of the oil pump is built. Then, by designing the structure and activation function of the output layer, a neural network oil pump controller is constructed such that the linear inequality constraints on the pump switching times are satisfied. Finally, to better balance the safety and optimality control objectives, a new safe reinforcement learning algorithm is designed and implemented based on the augmented Lagrangian multiplier method. Comparative experiments on the industrial oil pump show that the controllers generated by the proposed algorithm surpass those of existing algorithms of the same category in both safety and optimality. In further evaluation, the generated neural network controllers pass rigorous formal verification with a probability of 90%; meanwhile, compared with the theoretically optimal controller, they incur a loss in the optimal objective value as low as 2%. The proposed method is expected to be extended to more application scenarios, and the case study scheme is expected to serve as a reference for other researchers in the fields of safe intelligent control and formal verification.
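
To make the two technical ingredients of the abstract concrete, the following PyTorch sketch illustrates (i) one possible output-layer parameterization that makes the predicted switching times satisfy linear inequality constraints of the kind described above (ordering, minimum separation, cycle-length bound), and (ii) an augmented-Lagrangian policy loss that trades off the optimality objective against a safety-cost constraint. This is a minimal sketch under assumed values for the cycle length, minimum switching gap, and number of switches; all identifiers (SwitchTimePolicy, augmented_lagrangian_loss, etc.) are hypothetical and are not taken from the paper.

# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn

CYCLE_LEN = 20.0   # assumed length of one control cycle (time units)
MIN_GAP = 2.0      # assumed minimum separation between consecutive switches
N_SWITCH = 4       # assumed number of switching instants per cycle


class SwitchTimePolicy(nn.Module):
    """Maps the observed state to switching times t_1 < t_2 < ... < t_n with
    t_{i+1} - t_i >= MIN_GAP and t_n <= CYCLE_LEN, enforced by the output layer."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, N_SWITCH)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        raw = self.head(self.body(state))           # unconstrained outputs
        slack = CYCLE_LEN - N_SWITCH * MIN_GAP      # time budget beyond the minimum gaps
        # Sigmoid keeps each extra gap in (0, slack / N_SWITCH), so the cumulative
        # sum of gaps is increasing, respects MIN_GAP, and never exceeds CYCLE_LEN.
        gaps = MIN_GAP + torch.sigmoid(raw) * (slack / N_SWITCH)
        return torch.cumsum(gaps, dim=-1)           # ordered switching times


def augmented_lagrangian_loss(reward: torch.Tensor,
                              safety_cost: torch.Tensor,
                              safety_limit: float,
                              lam: float,
                              rho: float) -> torch.Tensor:
    """Policy loss that trades off optimality against the safety constraint
    E[safety_cost] <= safety_limit, using one common augmented-Lagrangian form."""
    violation = torch.clamp(safety_cost.mean() - safety_limit, min=0.0)
    return -reward.mean() + lam * violation + 0.5 * rho * violation * violation


# After each training iteration the multiplier is updated by the usual
# projected dual-ascent rule: lam <- max(0.0, lam + rho * violation).

The sketch only shows the shape of the approach: the output layer guarantees the switching-time constraints by construction, while the augmented-Lagrangian term handles the remaining safety objective during training.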

Cite this article:

Zhao HJ, Li QZ, Zeng X, Liu ZM. Safe reinforcement learning algorithm and its application in intelligent control for CPS. Journal of Software, 2022, 33(7): 2538-2561 (in Chinese with English abstract).

History
  • Received: 2021-09-05
  • Revised: 2021-10-14
  • Published online: 2022-01-28
  • Published: 2022-07-06