Sample Adaptive Policy Planning Based on Predictive Coding
Author: Liang XX, Ma Y, Feng YH, Zhang YL, Zhang LF, Liao SJ, Liu Z
Affiliation:

    Abstract:

    With the development of intelligent warfare, the fragmented and uncertain real-time information in highly adversarial scenarios such as military operations and anti-terrorism assaults places higher demands on flexible policy making that secures a game advantage, and intelligent policy learning methods with self-learning ability have become a core issue in formation-level tasks. To address the difficulty of state representation and the low efficiency of data utilization, a sample adaptive policy learning method based on predictive coding is proposed. An auto-encoder model compresses the original task state space, and a predictive coding of the dynamic environment is learned from the environment's state-transition samples by an autoregressive model whose output layer is a mixture density network (MDN), which improves the capacity of the task state representation. The predictive-coding-based sample adaptive method then uses the temporal-difference (TD) error of the value function prediction to adapt its sampling, which improves data efficiency and accelerates the convergence of the algorithm. To verify its effectiveness, a typical air combat scenario is constructed on the platform of previous national wargame competitions, including five rule-based agents specially designed by contestants. Ablation experiments examine the influence of different coding strategies and sampling policies, and the Elo scoring mechanism is adopted to rank the agents. Experimental results confirm that MDN-AF, the proposed sample adaptive algorithm based on predictive coding, reaches the highest score with an average winning rate of 71%, of which 67.6% are easy wins. Moreover, it learns four kinds of interpretable long-term strategies: autonomous wave division, supplementary reconnaissance, the "snake" strike, and the bomber-in-the-rear formation. In addition, the agent built on this algorithm framework won the national first prize at the 2020 National Wargame Competition.
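    The predictive-coding component described above pairs an auto-encoder with an autoregressive model whose output layer is a mixture density network, in the spirit of references [28] and [30]. The following is a minimal PyTorch sketch of such a model over latent state transitions; the LSTM backbone and all sizes (LATENT, ACT, HID, K) are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: autoregressive model with an MDN output layer that predicts a
# Gaussian mixture over the next latent state z_{t+1} given (z_t, a_t).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, ACT, HID, K = 32, 8, 256, 5   # latent dim, action dim, hidden, mixtures

class MDNRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(LATENT + ACT, HID, batch_first=True)
        # per step: K mixture logits + K*LATENT means + K*LATENT log-stddevs
        self.head = nn.Linear(HID, K + 2 * K * LATENT)

    def forward(self, z, a, h=None):            # z: (B,T,LATENT), a: (B,T,ACT)
        out, h = self.rnn(torch.cat([z, a], dim=-1), h)
        logit, mu, logsig = torch.split(
            self.head(out), [K, K * LATENT, K * LATENT], dim=-1)
        shape = (*out.shape[:-1], K, LATENT)
        return logit, mu.reshape(shape), logsig.reshape(shape), h

def mdn_nll(logit, mu, logsig, z_next):
    """Negative log-likelihood of the observed next latent under the mixture;
    minimizing it fits the predictive-coding model to transition samples."""
    z = z_next.unsqueeze(-2)                    # broadcast against K components
    log_comp = -0.5 * (((z - mu) / logsig.exp()) ** 2
                       + 2 * logsig + math.log(2 * math.pi)).sum(-1)
    return -torch.logsumexp(F.log_softmax(logit, -1) + log_comp, -1).mean()
```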
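    The sample adaptive step, which weights replay data by TD error, follows the prioritized experience replay line of work [32]. Below is a minimal sketch assuming a proportional-prioritization scheme; the alpha/beta hyperparameters and buffer layout are hypothetical, since the abstract does not give the paper's exact weighting.

```python
import numpy as np

class TDAdaptiveBuffer:
    """Replay buffer that samples transitions in proportion to |TD error|,
    in the spirit of prioritized experience replay [32]; a sketch only."""
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.prio = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:     # drop the oldest transition
            self.data.pop(0)
            self.prio.pop(0)
        self.data.append(transition)
        self.prio.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.prio)
        p = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # importance weights correct the bias introduced by non-uniform draws
        w = (len(self.data) * p[idx]) ** (-beta)
        return [self.data[i] for i in idx], idx, w / w.max()

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):        # refresh after a learning step
            self.prio[i] = (abs(e) + self.eps) ** self.alpha
```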
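    The Elo scoring mechanism [22] used to rank the agents reduces to a simple pairwise rating update; the K-factor below is a common default and not a value reported by the paper.

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update after a match between agents a and b.
    score_a is 1.0 for a win by a, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)          # zero-sum rating transfer
    return r_a + delta, r_b - delta

# e.g. a 1500-rated agent beating a 1600-rated rule-based agent:
# elo_update(1500, 1600, 1.0) ~= (1520.5, 1579.5)
```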

    References
    [1] McDermott DV, Hendler JA. Planning: What it is, what it could be, an introduction to the special issue on planning and scheduling. Artificial Intelligence, 1995, 76(1-2): 1-16.
    [2] Fikes RE, Nilsson NJ. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 1971, 2(3-4): 189-208.
    [3] Fang C, Franck M, Min X, et al. An integrated framework for risk response planning under resource constraints in large engineering projects. IEEE Trans. on Engineering Management, 2013, 60(3): 627-639.
    [4] Feng Y, Cai ZY, Wang XH, et al. A plan recognizing algorithm based on fuzzy cognitive plan map. Int’l Journal of Performability Engineering, 2017, 13(7): 1094-1100.
    [5] Wilkins DE. Planning and reacting in uncertain and dynamic environments. Journal of Experimental & Theoretical Artificial Intelligence, 1995, 7(1): 121-152.
    [6] Currie K, Tate A. O-Plan: The open planning architecture. Artificial Intelligence, 1991, 52(1): 49-86.
    [7] Myers K. PASSAT: A user-centric planning framework. In: Proc. of the 3rd Int’l NASA Workshop on Planning and Scheduling. Washington: NASA, 2003.
    [8] Erol K, Hendler J, Nau DS. Semantics for hierarchical task-network planning. Maryland: University of Maryland at College Park, 1994.
    [9] Shao TH, Zhang HJ, Cheng K, et al. Review of replanning in hierarchical task network. Systems Engineering and Electronics, 2020, 12(42): 2833-2846 (in Chinese with English abstract).
    [10] Muñoz-Avila H, Aha DW, Breslow L, et al. HICAP: An interactive case-based planning architecture and its application to noncombatant evacuation operations. In: Proc. of the 16th National Conf. on Artificial Intelligence & the 11th Innovative Applications of Artificial Intelligence Conf. Orlando: AAAI, 1999.
    [11] Mulvehill A, Caroli J. JADE: A tool for rapid crisis action planning. In: Proc. of the 5th Int’l Command and Control Research and Technology Symp. Stockholm: MIT, 2000.
    [12] Yu WH, Han JS. Research on maritime search and rescue case-based system. Microcomputer Applications, 2011, 27(4): 13-15, 4 (in Chinese with English abstract).
    [13] Zhang XD, Wang T, Zhang L. Demand analysis of oil support in military operations with case-based reasoning method. Journal of Military Transportation University, 2018, 20(6): 14-17 (in Chinese with English abstract).
    [14] Rao AS, Georgeff MP. BDI agents: From theory to practice. In: Proc. of the 1st Int’l Conf. on Multiagent Systems. California: AAAI, 1995.
    [15] Holliday P. SWARMM—A mobility modelling tool for tactical military networks. In: Proc. of the 2008 IEEE Military Communications Conf. California: IEEE, 2008.
    [16] Fugere J, LaBoissonniere F, Liang Y. An approach to design autonomous agents within ModSAF. In: Proc. of the 1999 IEEE Int’l Conf. on Systems, Man, and Cybernetics. Tokyo: IEEE, 1999.
    [17] Wooldridge M. An Introduction to MultiAgent Systems. New Jersey: Wiley Publishing, 2009.
    [18] Li H, Chang GC, Sun P. Operational plan making based on procedure reasoning system. Electronics Optics & Control, 2008(10): 51-54 (in Chinese with English abstract).
    [19] Liang XX, Feng YH, Ma Y, et al. Deep multi-agent reinforcement learning: A survey. Acta Automatica Sinica, 2020, 46(12): 2537-2557 (in Chinese with English abstract).
    [20] Liang XX, Feng YH, Huang JC, et al. Novel deep reinforcement learning algorithm based on attention-based value function and autoregressive environment model. Ruan Jian Xue Bao/Journal of Software, 2020, 31(4): 948-966 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5930.htm [doi: 10.13328/j.cnki.jos.005930]
    [21] Chinese Institute of Command and Control. The 4th National Wargame Competition in 2020. 2020 (in Chinese). http://www.ciccwargame.com/
    [22] Hand DJ. Who’s #1? The science of rating and ranking. Journal of Applied Statistics, 2012, 39(10): 81-83.
    [23] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. 2nd ed., Cambridge: MIT Press, 2017.
    [24] Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization. In: Proc. of the 32nd Int’l Conf. on Machine Learning. Lille: JMLR.org, 2015. 1889-1897.
    [25] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms. arXiv: 1707.06347, 2017.
    [26] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.
    [27] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proc. of the 35th Int’l Conf. on Machine Learning. Stockholm: PMLR 80, 2018.
    [28] Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv: 1312.6114v10, 2013.
    [29] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks, 1994, 5(2): 157-166.
    [30] Ha D, Eck D. A neural representation of sketch drawings. arXiv: 1704.03477, 2017.
    [31] Wang ZY, Bapst V, Heess N, et al. Sample efficient actor-critic with experience replay. arXiv: 1611.01224, 2016.
    [32] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay. arXiv: 1511.05952, 2015.
    附中文参考文献 (Chinese references):
    [9] 邵天浩, 张宏军, 程恺, 戴成友, 余晓晗, 张可. 层次任务网络中的重新规划研究综述. 系统工程与电子技术, 2020, 42(12): 2833-2846.
    [12] 于卫红, 韩俊松. 海上搜救案例库系统的研究. 微型电脑应用, 2011, 27(4): 13-15, 4.
    [13] 张晓东, 汪涛, 张磊. 基于案例推理的军事行动油料保障需求分析. 军事交通学院学报, 2018, 20(6): 14-17.
    [18] 李皓, 常国岑, 孙鹏. 采用过程推理系统的作战方案生成研究. 电光与控制, 2008(10): 51-54.
    [19] 梁星星, 冯旸赫, 马扬, 程光权, 黄金才, 王琦, 周玉珍, 刘忠. 多Agent深度强化学习综述. 自动化学报, 2020, 46(12): 2537-2557.
    [20] 梁星星, 冯旸赫, 黄金才, 王琦, 马扬, 刘忠. 基于自回归预测模型的深度注意力强化学习方法. 软件学报, 2020, 31(4): 948-966. http://www.jos.org.cn/1000-9825/5930.htm [doi: 10.13328/j.cnki.jos.005930]
    [21] 中国指挥与控制协会. 2020第四届全国兵棋推演大赛. 2020. http://www.ciccwargame.com/
Citation:

Liang XX, Ma Y, Feng YH, Zhang YL, Zhang LF, Liao SJ, Liu Z. Sample adaptive policy planning based on predictive coding. Ruan Jian Xue Bao/Journal of Software, 2022, 33(4): 1477-1500 (in Chinese with English abstract).
History
  • Received: May 23, 2021
  • Revised: July 16, 2021
  • Online: October 26, 2021
  • Published: April 06, 2022