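Sample Adaptive Policy Planning Based on Predictive Coding

Author biographies:

Liang Xingxing (梁星星, 1992-), male, Ph.D. candidate; research interests: intelligent planning, military intelligent game confrontation.
Zhang Longfei (张龙飞, 1988-), male, Ph.D. candidate; research interests: machine learning, deep reinforcement learning.
Ma Yang (马扬, 1993-), male, Ph.D. candidate; research interests: network embedding, link prediction, graph neural networks.
Liao Shijiang (廖世江, 1989-), male; research interests: military intelligent game confrontation.
Feng Yanghe (冯旸赫, 1985-), male, Ph.D., associate professor; research interests: causal discovery and inference, active learning, reinforcement learning.
Liu Zhong (刘忠, 1968-), male, Ph.D., professor, doctoral supervisor; research interests: multi-agent systems.
Zhang Yulong (张驭龙, 1988-), male, Ph.D. candidate; research interests: information systems, reinforcement learning, intelligent gaming.

Corresponding authors:

Feng Yanghe, E-mail: fengyanghe@nudt.edu.cn; Liu Zhong, E-mail: liuzhong@nudt.edu.cn

Funding:

National Natural Science Foundation of China (71701205)
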
    Abstract:

    In highly adversarial scenarios such as military operations and counter-terrorism assaults, the fragmentation and uncertainty of real-time information place higher demands on formulating flexible courses of action with a game advantage, and intelligent policy planning methods with self-learning ability have become a core problem for formation-level adversarial tasks. To address the difficulty of state representation and the low data efficiency of policy planning in complex scenarios, a sample adaptive policy planning method based on predictive coding is proposed. An auto-encoder compresses the original task state space. Using state transition samples from the task environment, an autoregressive model with a mixture density network (MDN) then learns the dynamics of the environment in the low-dimensional state space, which yields a predictive coding that represents those dynamics and improves the capacity of the task state representation. Policy planning is built on this predictive coding: a sample adaptive method that is sensitive to the temporal-difference (TD) error is used to predict the state value function, which improves data efficiency and accelerates convergence. To verify the effectiveness of the algorithm, a typical air combat scenario is constructed from the machine-versus-machine challenge of the National Wargame Competition, with five rule-based agents that incorporate the operating strategies of prize-winning contestants. Ablation experiments examine how different combinations of factors, such as the coding method and the sampling strategy, affect the algorithm, and the Elo scoring mechanism is used to rank the agents. The experimental results show that MDN-AF, the sample adaptive algorithm based on predictive coding, ranks highest, with an average winning rate of 71%, of which 67.6% are wins by a large margin. It has also learned four interpretable long-term strategies: autonomous wave division, supplementary reconnaissance, "snake" strikes, and bomber raids from the rear. The algorithm framework was applied to agent development for the 2020 National Wargame Competition, where it won the national first prize.
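The abstract states that the agents are ranked with the Elo scoring mechanism but gives no parameters. As a minimal sketch of how such a ranking can be computed, the Python below applies the standard Elo update to a list of pairwise game results. The starting rating of 1500, the K-factor of 32, the function names, and the agent names are all assumptions made for illustration; they are not taken from the paper, and the paper's actual settings may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k (the K-factor) is an assumed value, not one reported by the paper.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


def rank_agents(results, initial: float = 1500.0, k: float = 32.0):
    """Process (agent_a, agent_b, score_a) tuples in order.

    Returns a list of (agent, rating) pairs sorted from highest to lowest rating.
    """
    ratings = {}
    for agent_a, agent_b, score_a in results:
        ra = ratings.setdefault(agent_a, initial)
        rb = ratings.setdefault(agent_b, initial)
        ratings[agent_a], ratings[agent_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up results between hypothetical agents, purely for illustration.
    games = [("agent_x", "agent_y", 1.0), ("agent_y", "agent_z", 1.0)]
    for name, rating in rank_agents(games):
        print(f"{name}: {rating:.1f}")
```

Because each update moves the two ratings by equal and opposite amounts, the ordering depends only on who beat whom and on how surprising each result was, which is what makes Elo convenient for comparing many agents from pairwise matches.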

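Cite this article

Liang Xingxing, Ma Yang, Feng Yanghe, Zhang Yulong, Zhang Longfei, Liao Shijiang, Liu Zhong. Sample Adaptive Policy Planning Based on Predictive Coding. Journal of Software (软件学报), 2022, 33(4): 1477-1500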

History
  • Received: 2021-05-23
  • Last revised: 2021-07-16
  • Accepted:
  • Published online: 2021-10-26
  • Publication date: 2022-04-06