Abstract: A mixed cooperative-competitive multi-agent system consists of controlled target agents and uncontrolled external agents. The target agents cooperate with one another and compete with the external agents in order to cope with dynamic changes in the environment and in the external agents, and to complete their tasks. To train the target agents to learn the optimal policy for completing these tasks, existing work proposes two kinds of solutions: (1) approaches that focus on the cooperation between target agents, treat the external agents as part of the environment, and use multi-agent reinforcement learning to train the target agents; these approaches cannot handle uncertainty in, or dynamic changes to, the external agents' policies; (2) approaches that focus on the competition between target agents and external agents, model the competition as a two-player game, and use self-play to train the target agents; these approaches suit only the case of a single target agent and a single external agent, and are difficult to extend to systems with multiple target agents and external agents. This study combines the two kinds of solutions and proposes a counterfactual regret advantage-based self-play approach. Specifically, first, building on counterfactual regret minimization and the counterfactual multi-agent policy gradient, the study designs a counterfactual regret advantage-based policy gradient that allows each target agent to update its policy more accurately. Second, to cope with dynamic changes in the external agents' policies during self-play, the study uses imitation learning, which takes the external agents' historical decision-making trajectories as training data and imitates their policies, thereby explicitly modeling the external agents' behaviors. Third, based on the counterfactual regret advantage-based policy gradient and the model of the external agents' behaviors, the study designs a self-play training approach that obtains the optimal joint policy for multiple target agents when the external agents' policies are uncertain or change dynamically. The study also conducts experiments on cooperative electromagnetic countermeasures, covering three typical mixed cooperative-competitive tasks. The experimental results show that, compared with other approaches, the proposed approach improves self-play performance by at least 78%.
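
To make the first step concrete: the abstract combines counterfactual regret minimization (CFR) with a COMA-style counterfactual baseline. The sketch below shows one plausible way these two ingredients fit together, assuming a centralized critic that can evaluate every alternative action of a single agent while the other agents' actions are held fixed; the function names and the exact combination are illustrative, not the paper's definition.

```python
import numpy as np

def counterfactual_advantage(q_values, policy, action):
    """COMA-style counterfactual advantage for one target agent.

    q_values : shape (n_actions,), centralized critic estimates
               Q(s, (a_i, a_-i)) with the other agents' actions held fixed.
    policy   : shape (n_actions,), the agent's current policy pi_i(. | s).
    action   : int, the action the agent actually executed.
    """
    baseline = np.dot(policy, q_values)   # counterfactual baseline E_{a'_i}[Q]
    return q_values[action] - baseline

def regret_matching_policy(cumulative_regret):
    """CFR-style regret matching: act in proportion to positive regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # no positive regret accumulated yet: fall back to a uniform policy
    return np.full_like(cumulative_regret, 1.0 / cumulative_regret.size)
```

The per-action counterfactual regret `q_values - np.dot(policy, q_values)` can be accumulated across training and fed to `regret_matching_policy`, which is one way to turn a counterfactual advantage into a regret-based policy update; the paper's actual policy-gradient formulation may differ.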
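The second step, modeling the external agents with imitation learning, is commonly implemented as behavior cloning: fitting a policy to the opponents' observed (observation, action) pairs by maximum likelihood. A minimal sketch, assuming discrete actions and a linear softmax policy (all names here are hypothetical):

```python
import numpy as np

def behavior_clone(observations, actions, n_actions, lr=0.1, epochs=200):
    """Fit a softmax policy to the external agents' historical trajectories.

    observations : shape (n_samples, obs_dim)
    actions      : shape (n_samples,), integer action labels
    Returns a weight matrix W such that pi_ext(a | o) = softmax(o @ W).
    """
    n, d = observations.shape
    W = np.zeros((d, n_actions))
    onehot = np.eye(n_actions)[actions]
    for _ in range(epochs):
        logits = observations @ W
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        W -= lr * observations.T @ (probs - onehot) / n   # NLL gradient step
    return W
```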
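Finally, the third step alternates between refitting the opponent model and training the target agents against it. A schematic loop under the same assumptions (the three callables are placeholders for an environment rollout, the behavior-cloning step, and the regret-based policy update above):

```python
def self_play_loop(collect_trajectories, fit_opponent_model,
                   train_target_agents, n_iterations=50):
    """Alternate between (a) refitting the external-agent model on fresh
    trajectories and (b) training the target agents against that model."""
    opponent_model = None
    for _ in range(n_iterations):
        # roll out episodes against the current opponent model
        # (None means the external agents' initial, unmodeled policy)
        trajectories = collect_trajectories(opponent_model)
        # explicitly model the external agents from their trajectories
        opponent_model = fit_opponent_model(trajectories)
        # update the target agents' joint policy with the regret-based gradient
        train_target_agents(trajectories, opponent_model)
    return opponent_model
```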