Counterfactual Regret Advantage-based Self-Play Approach for Mixed Cooperative-Competitive Multiagent Systems

Author:

Affiliation:

Peking University

Funding:

National Natural Science Foundation of China (General Program, Key Program, and Major Program)

    Abstract:

    A mixed cooperative-competitive multi-agent system consists of controllable target agents and uncontrollable external agents. The target agents cooperate with each other and compete against the external agents, coping with the dynamics of the environment and of the external agents in order to complete the given tasks. To train the target agents so that they learn the optimal policy for completing the tasks, existing work offers two kinds of solutions: (1) approaches that focus only on the cooperation among target agents, treat the external agents as part of the environment, and use multi-agent reinforcement learning to train the target agents; these approaches cannot handle external agents whose policies are unknown or change dynamically; (2) approaches that focus only on the competition between target agents and external agents, model the competition as a two-player game, and use self-play to train the target agents; these approaches are designed for a single target agent facing a single external agent and are difficult to extend to systems consisting of multiple target agents and multiple external agents. This paper combines the two lines of work and proposes a counterfactual regret advantage-based self-play approach. Specifically, first, based on counterfactual regret minimization and the counterfactual multi-agent policy gradient, we design a counterfactual regret advantage policy gradient that lets the target agents update their policies more accurately. Second, to cope with the dynamics of the external agents' policies during self-play, we introduce imitation learning, which takes the external agents' historical decision trajectories as demonstration data and imitates their policies, so as to explicitly model the external agents' behaviors. Third, based on the counterfactual regret advantage policy gradient and the behavior modeling of external agents, we design a self-play training approach that learns an optimal joint policy for multiple target agents even when the external agents' policies are unknown or change dynamically. Taking joint electromagnetic countermeasures as the study case, we design three typical tasks with mixed cooperative-competitive characteristics. The experimental results show that, compared with other baseline approaches, the proposed approach improves self-play performance by at least 78%.
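    The abstract names two building blocks of the policy-update step: a COMA-style counterfactual baseline (the counterfactual multi-agent policy gradient) and counterfactual regret minimization. Below is a minimal Python sketch of how these two pieces can be combined for a single agent; all function names, array shapes, and numbers are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    # Hypothetical sketch: a COMA-style counterfactual advantage for one agent,
    #   A_i(s, u) = Q(s, u) - sum_a pi_i(a|s) * Q(s, (u_-i, a)),
    # combined with CFR-style regret matching over the same alternative-action values.
    # Names, shapes, and numbers are assumptions for illustration only.

    def counterfactual_advantage(q_values, pi_i, action_i):
        """COMA-style counterfactual baseline for agent i.

        q_values: Q(s, (u_-i, a)) for every alternative action a of agent i,
                  with the other agents' actions held fixed (shape [n_actions]).
        pi_i:     agent i's current policy over its own actions (shape [n_actions]).
        action_i: the action agent i actually took.
        """
        baseline = np.dot(pi_i, q_values)      # expected value under pi_i
        return q_values[action_i] - baseline   # advantage of the chosen action

    def regret_matching_policy(cumulative_regret):
        """CFR-style regret matching: act in proportion to positive regret."""
        positive = np.maximum(cumulative_regret, 0.0)
        total = positive.sum()
        if total > 0.0:
            return positive / total
        return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

    # Toy usage with made-up numbers (3 candidate actions for agent i).
    q = np.array([1.0, 0.2, -0.5])             # Q(s, (u_-i, a)) for each a
    pi = np.array([0.5, 0.3, 0.2])             # current policy of agent i
    adv = counterfactual_advantage(q, pi, action_i=0)

    # Per-action regrets r(a) = Q(s, (u_-i, a)) - E_{a'~pi_i}[Q(s, (u_-i, a'))],
    # which regret matching turns into the next policy iterate.
    regrets = q - np.dot(pi, q)
    print(adv, regret_matching_policy(regrets))

    In the paper's setting, such a per-agent advantage would feed the policy-gradient update inside the self-play loop, with the external agents replaced by the imitation-learned model of their behavior.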

History
  • Received: 2022-06-19
  • Revised: 2022-09-21
  • Accepted: 2022-11-15