Causal-spatiotemporal-semantics-driven Abstraction Modeling Method for Deep Reinforcement Learning

Authors: 田丽丽, 杜德慧, 聂基辉, 陈逸康, 李荥达
Corresponding author: DU Dehui (杜德慧), E-mail: dhdu@sei.ecnu.edu.cn

CLC number: TP18
    Abstract:

    With the rapid advancement of intelligent cyber-physical systems (ICPSs), intelligent technologies are increasingly applied to components such as perception, decision-making, and control. Among them, deep reinforcement learning (DRL) is widely used in ICPS control components because of its effectiveness in handling complex, dynamic environments. However, the openness of the operating environment and the inherent complexity of ICPSs force the learning process to explore a large and highly dynamic state space, which easily leads to inefficient decision generation and poor generalization. A common remedy is to abstract a large-scale, fine-grained Markov decision process (MDP) into a small-scale, coarse-grained MDP, thereby reducing computational complexity and improving solution efficiency. However, existing methods do not yet address how to preserve the spatiotemporal semantics of the original states or how to guarantee semantic consistency between the clustered abstract system space and the real system space. To address these problems, this study proposes a causal-spatiotemporal-semantics-driven abstraction modeling method for deep reinforcement learning. First, causal spatiotemporal semantics that reflect the distribution of value changes over time and space are introduced, and on this basis a two-stage semantic abstraction of the states is performed to construct an abstract MDP model of the DRL process. Second, abstraction optimization techniques are applied to fine-tune the abstract model and reduce the semantic error between abstract states and their corresponding concrete states. Finally, extensive experiments are conducted on scenarios such as lane keeping, adaptive cruise control, and intersection crossing, and the resulting models are evaluated and analyzed with the PRISM verifier. The results show that the proposed abstraction modeling technique performs well in terms of abstraction expressiveness, accuracy, and semantic equivalence.
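    For readers who want a concrete picture of the state-aggregation idea the abstract builds on, the following Python sketch is a minimal illustration under simplifying assumptions, not the authors' implementation: a generic abstraction function phi clusters concrete states, concrete transition probabilities are aggregated into a coarse-grained MDP, and the value spread inside each abstract state serves as a crude proxy for the kind of semantic error the optimization stage is meant to reduce. All names (abstract_mdp, value_spread, phi) are hypothetical.

```python
# Minimal illustrative sketch (hypothetical names), not the authors' implementation.
from collections import defaultdict

def abstract_mdp(transitions, phi):
    """Aggregate concrete transitions (s, a, s', p) into a coarse-grained MDP.

    transitions: iterable of (state, action, next_state, probability)
    phi: abstraction function mapping a concrete state to its abstract state
    Returns {(abstract_state, action): {abstract_next_state: probability}}.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s, a, s_next, p in transitions:
        counts[(phi(s), a)][phi(s_next)] += p
    # Normalize so the outgoing probabilities of each (abstract state, action) sum to 1.
    return {
        key: {z: p / sum(succ.values()) for z, p in succ.items()}
        for key, succ in counts.items()
    }

def value_spread(values, phi):
    """Largest value difference inside each abstract state.

    values: dict {concrete_state: estimated value V(s)}
    A large spread indicates that semantically different states were merged,
    i.e. the abstraction should be refined (split) there.
    """
    groups = defaultdict(list)
    for s, v in values.items():
        groups[phi(s)].append(v)
    return {z: max(vs) - min(vs) for z, vs in groups.items()}

# Toy usage: concrete states 0 and 1 are merged into abstract state "A", state 2 into "B".
phi = lambda s: "A" if s in (0, 1) else "B"
transitions = [(0, "keep", 1, 1.0), (1, "keep", 2, 0.6), (1, "keep", 0, 0.4)]
print(abstract_mdp(transitions, phi))                # {('A', 'keep'): {'A': 0.7, 'B': 0.3}}
print(value_spread({0: 1.0, 1: 0.75, 2: 0.5}, phi))  # {'A': 0.25, 'B': 0.0}
```

    In the paper, the abstract model obtained this way is further evaluated with the PRISM verifier on the lane-keeping, adaptive cruise control, and intersection scenarios; that verification step is outside the scope of this sketch.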

Cite this article:

田丽丽, 杜德慧, 聂基辉, 陈逸康, 李荥达. Causal-spatiotemporal-semantics-driven abstraction modeling method for deep reinforcement learning. Journal of Software, 2025, 36(8): 1-18 (in Chinese).
History:
  • Received: 2024-08-26
  • Revised: 2024-10-14
  • Published online: 2024-12-10