Sample Adaptive Policy Planning Based on Predictive Coding
About the authors:

Liang Xingxing (1992-), male, Ph.D. candidate; research interests: intelligent planning, military intelligent game confrontation;
Zhang Longfei (1988-), male, Ph.D. candidate; research interests: machine learning, deep reinforcement learning;
Ma Yang (1993-), male, Ph.D. candidate; research interests: network embedding, link prediction, graph neural networks;
Liao Shijiang (1989-), male; research interests: military intelligent game confrontation;
Feng Yanghe (1985-), male, Ph.D., associate professor; research interests: causal discovery and inference, active learning, reinforcement learning;
Liu Zhong (1968-), male, Ph.D., professor, doctoral supervisor; research interests: multi-agent systems;
Zhang Yulong (1988-), male, Ph.D. candidate; research interests: information systems, reinforcement learning, intelligent wargaming.

Corresponding authors:

Feng Yanghe, E-mail: fengyanghe@nudt.edu.cn; Liu Zhong, E-mail: liuzhong@nudt.edu.cn

Funding:

National Natural Science Foundation of China (71701205)


Abstract:

In strongly adversarial scenarios such as military operations and counter-terrorism assaults, the fragmentation and uncertainty of real-time information place higher demands on generating resilient courses of action that preserve a game advantage, and intelligent policy-planning methods with self-learning ability have become a core problem for formation-level adversarial tasks. To address the difficulty of state representation and the low data efficiency of policy planning in complex scenarios, a sample adaptive policy-planning method based on predictive coding is proposed. An auto-encoder compresses the original task state space; using state-transition samples from the task environment, a mixture density network learns the environment's dynamics model in the low-dimensional latent space, yielding a predictive coding that characterizes the environment's dynamics. Policy planning is then conducted on top of this predictive coding, and a temporal-difference-sensitive sample adaptive method is used to estimate the state value function, which improves data efficiency and accelerates convergence. To verify the effectiveness of the algorithm, a scenario is built from the machine-vs-machine challenge of the National Wargame Competition, together with five rule-based agents that encode the operating strategies of award-winning contestants. Ablation experiments examine how different combinations of factors, such as the coding scheme and the sampling strategy, affect the algorithm, and the Elo scoring mechanism is used to rank the agents. The results show that MDN-AF, the proposed sample adaptive algorithm based on predictive coding, obtains the highest ranking, with an average winning rate of 71%, of which 67.6% are wins by a large margin; it also learns four interpretable long-horizon strategies: autonomous wave division, supplementary reconnaissance, the "snake" strike, and holding bombers to the rear for late surprise strikes. An agent built on this algorithm framework competed in the 2020 National Wargame Competition and won the national first prize.
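The abstract outlines a three-stage pipeline: an auto-encoder compresses the raw state into a low-dimensional latent, a mixture density network summarizes one-step dynamics in that latent space (its mixture parameters acting as the predictive code), and value-function updates are weighted by temporal-difference error. The sketch below only illustrates this flow; it is not the paper's MDN-AF implementation, and every name, dimension, and the toy tabular value table are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's MDN-AF): a linear stand-in for a
# trained auto-encoder, a mixture-density head over the next latent state, and
# TD-error-proportional sampling of transitions for value updates.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, LATENT_DIM, N_MIX = 32, 8, 3                        # hypothetical sizes
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, STATE_DIM))    # stand-in for a trained encoder

def encode(state):
    """Compress a raw state vector into a low-dimensional latent representation."""
    return np.tanh(W_enc @ state)

# Mixture-density head: maps (latent, action) to weights, means and log-stddevs of a
# Gaussian mixture over the next latent; these parameters act as the predictive code.
W_mdn = rng.normal(scale=0.1, size=(N_MIX * (1 + 2 * LATENT_DIM), LATENT_DIM + 1))

def predictive_code(z, action):
    """Return mixture parameters (pi, mu, log_sigma) for the next-latent distribution."""
    h = W_mdn @ np.concatenate([z, [action]])
    logits, mu, log_sigma = np.split(h, [N_MIX, N_MIX + N_MIX * LATENT_DIM])
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                              # softmax mixture weights
    return pi, mu.reshape(N_MIX, LATENT_DIM), log_sigma.reshape(N_MIX, LATENT_DIM)

# --- Temporal-difference-sensitive sample adaptation --------------------------------
V = {}                                  # toy tabular value function keyed by a rounded latent
GAMMA, ALPHA, EPS = 0.99, 0.1, 1e-3

def key(z):
    return tuple(np.round(z, 1))

def td_error(s, r, s_next):
    z, z_next = key(encode(s)), key(encode(s_next))
    return r + GAMMA * V.get(z_next, 0.0) - V.get(z, 0.0)

def update_values(buffer, n_updates=64):
    """Draw transitions with probability proportional to |TD error| and update V."""
    prios = np.array([abs(td_error(*tr)) + EPS for tr in buffer])
    probs = prios / prios.sum()
    for i in rng.choice(len(buffer), size=n_updates, p=probs):
        s, r, s_next = buffer[i]
        V[key(encode(s))] = V.get(key(encode(s)), 0.0) + ALPHA * td_error(s, r, s_next)

# Toy usage with random transitions, just to exercise the pipeline end to end.
buffer = [(rng.normal(size=STATE_DIM), rng.normal(), rng.normal(size=STATE_DIM))
          for _ in range(256)]
update_values(buffer)
pi, mu, log_sigma = predictive_code(encode(buffer[0][0]), action=1.0)
print("mixture weights:", np.round(pi, 3), "| states with learned values:", len(V))
```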
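The Elo mechanism mentioned for ranking the agents is the standard pairwise rating update; a minimal version, with an assumed K-factor of 32 and illustrative ratings (the competition's exact configuration is not stated in the abstract), looks like this:

```python
# Standard Elo update: each game between two agents shifts their ratings toward
# the observed result; K and the starting ratings below are illustrative.
def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one game; score_a is 1, 0.5 or 0 for agent A."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

# Example: a 1500-rated agent beats a 1600-rated one and gains roughly 20 points.
print(elo_update(1500.0, 1600.0, 1.0))
```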

Cite this article:

Liang XX, Ma Y, Feng YH, Zhang YL, Zhang LF, Liao SJ, Liu Z. Sample adaptive policy planning based on predictive coding. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1477-1500 (in Chinese with English abstract).
History:
  • Received: 2021-05-23
  • Revised: 2021-07-16
  • Available online: 2021-10-26
  • Published: 2022-04-06