Factored Multi-agent Centralised Policy Gradient with Parameterized Action Space and Its Application
Author: Tian Shucong, Xie Yu, Zhang Yuanlong, Zhou Zhengchun, Gao Yang
Affiliation:

CLC Number: TP18

    Abstract:

    In recent years, multi-agent reinforcement learning methods have demonstrated excellent decision-making capabilities and broad application prospects in successful cases such as AlphaStar, AlphaDogFight, and AlphaMosaic. In multi-agent decision-making systems deployed in real-world environments, the action space of the task is often a parameterized action space containing both discrete and continuous action variables. The complex structure of this type of action space renders traditional multi-agent reinforcement learning algorithms inapplicable, so research on methods for parameterized action spaces is of great significance for real-world applications. This study proposes a factored multi-agent centralised policy gradient algorithm for parameterized action spaces in multi-agent settings. The factored centralised policy gradient algorithm ensures effective coordination among the agents; the dual-headed policy output of the parameterized deep deterministic policy gradient algorithm is then employed to couple the discrete and continuous components of the parameterized action space effectively. Experimental results under different parameter settings in the hybrid predator-prey scenario show that the algorithm performs well on classic multi-agent collaboration tasks with parameterized action spaces. In addition, the algorithm's effectiveness and feasibility are validated in a multi-cruise-missile collaborative penetration task with complex, highly dynamic properties.
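    To make the two ingredients named in the abstract concrete, below is a minimal PyTorch sketch of (i) a dual-headed per-agent policy that outputs discrete-action logits together with the continuous parameters attached to each discrete action, and (ii) a QMIX-style monotonic mixing network of the kind used to factor a centralised critic in FACMAC. All class names, layer sizes, and the tanh squashing of the continuous parameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualHeadedActor(nn.Module):
    """Per-agent policy: one head for discrete-action logits, one head for the
    continuous parameters of every discrete action (hypothetical sizes)."""

    def __init__(self, obs_dim: int, n_discrete: int, param_dim: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.discrete_head = nn.Linear(hidden, n_discrete)            # logits over discrete actions
        self.param_head = nn.Linear(hidden, n_discrete * param_dim)   # parameters for each discrete action
        self.n_discrete, self.param_dim = n_discrete, param_dim

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        logits = self.discrete_head(h)
        # tanh keeps parameters in [-1, 1]; scaling to task-specific bounds is assumed external
        params = torch.tanh(self.param_head(h)).view(-1, self.n_discrete, self.param_dim)
        return logits, params


class MonotonicMixer(nn.Module):
    """QMIX-style mixer: combines per-agent utilities into a joint Q-value using
    state-conditioned non-negative weights, which preserves monotonicity."""

    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed = n_agents, embed

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(-1, self.embed, 1)
        return torch.bmm(hidden, w2).squeeze(-1) + self.b2(state)    # joint Q, shape (batch, 1)


# Illustrative forward pass for a batch of 4 observations of one agent.
actor = DualHeadedActor(obs_dim=10, n_discrete=3, param_dim=2)
logits, params = actor(torch.randn(4, 10))
a = logits.argmax(dim=-1)                    # chosen discrete actions, shape (4,)
chosen_params = params[torch.arange(4), a]   # their continuous parameters, shape (4, 2)
```

    In a FACMAC-style update, each actor would be trained by ascending the gradient of the mixed joint Q-value with respect to all agents' hybrid actions (with a Gumbel-Softmax or similar relaxation for the discrete head); that training loop is omitted from this sketch.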

Get Citation

Tian SC, Xie Y, Zhang YL, Zhou ZC, Gao Y. Factored multi-agent centralised policy gradient with parameterized action space and its application. Ruan Jian Xue Bao/Journal of Software, 2025, 36(2): 590–607 (in Chinese with English abstract).
History
  • Received: July 11, 2023
  • Revised: October 9, 2023
  • Online: July 17, 2024