Research and Development on Deep Hierarchical Reinforcement Learning
Author: Huang ZG, Liu Q, Zhang LH, Cao JQ, Zhu F

    Abstract:

    Deep hierarchical reinforcement learning (DHRL) is an important research field within deep reinforcement learning (DRL). It targets the sparse reward, sequential decision-making, and weak transferability problems that classic DRL finds difficult to solve. Following the idea of hierarchical decomposition, DHRL breaks a complex problem into sub-problems and organizes DRL policies into a multi-level structure; through temporal abstraction, it composes low-level actions into semantically meaningful high-level actions. In recent years, DHRL has achieved breakthroughs in many domains and demonstrated strong performance, with real-world applications in visual navigation, natural language processing, recommendation systems, and video description generation. This survey first introduces the theoretical basis of hierarchical reinforcement learning (HRL). Second, it describes the key technologies of DHRL, including hierarchical abstraction techniques and common experimental environments. Third, taking the option-based deep hierarchical reinforcement learning framework (O-DHRL) and the subgoal-based deep hierarchical reinforcement learning framework (G-DHRL) as the main research objects, it analyzes and compares the research status and development trends of the corresponding algorithms in detail. In addition, a number of real-world DHRL applications are discussed. Finally, prospects for DHRL are given and the survey is summarized.
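
    To make the temporal abstraction described above concrete, the sketch below illustrates, under stated assumptions, the option structure that O-DHRL builds on: a high-level decision selects a temporally extended option, and the option's low-level policy issues primitive actions until its termination condition fires (in the spirit of the options framework of Sutton, Precup, and Singh). The Option fields, the toy chain environment, and every name in this code are illustrative assumptions for exposition, not code from any surveyed algorithm.

```python
"""Minimal sketch (illustrative only) of option-based temporal abstraction:
a high-level choice picks an option; the option's low-level policy runs
until its termination condition fires. All names and the toy chain MDP
are assumptions made for exposition."""

import random
from dataclasses import dataclass
from typing import Callable

State = int
Action = int  # 0 = move left, 1 = move right on a small chain

@dataclass
class Option:
    """An option o = (I_o, pi_o, beta_o): initiation set, intra-option policy, termination."""
    can_start: Callable[[State], bool]      # I_o: states where the option may be invoked
    policy: Callable[[State], Action]       # pi_o: primitive action chosen in a state
    should_stop: Callable[[State], float]   # beta_o: probability of terminating in a state

def chain_step(s: State, a: Action, n: int = 10) -> tuple[State, float, bool]:
    """Toy chain MDP on states 0..n with a sparse reward only at the right end."""
    s2 = max(0, min(n, s + (1 if a == 1 else -1)))
    done = s2 == n
    return s2, (1.0 if done else 0.0), done

def run_option(s: State, o: Option, max_len: int = 20) -> tuple[State, float, bool, int]:
    """Execute one option to termination; return the SMDP-style cumulative reward and duration."""
    total, k, done = 0.0, 0, False
    while k < max_len and not done:
        s, r, done = chain_step(s, o.policy(s))
        total += r
        k += 1
        if random.random() < o.should_stop(s):
            break
    return s, total, done, k

# Two hand-coded options: "go left until the origin" and "go right until the goal".
go_left = Option(lambda s: s > 0, lambda s: 0, lambda s: 1.0 if s == 0 else 0.0)
go_right = Option(lambda s: True, lambda s: 1, lambda s: 0.0)

if __name__ == "__main__":
    s, done, t = 3, False, 0
    while not done and t < 5:
        # A learned high-level policy would choose here; this demo always picks go_right.
        option = go_right if go_right.can_start(s) else go_left
        s, r, done, k = run_option(s, option)
        t += 1
        print(f"decision {t}: reached state {s} after {k} primitive steps, reward {r}")
```

    In a full O-DHRL agent, both the high-level choice among options and each intra-option policy would be learned by deep RL rather than hand-coded as in this sketch.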

Citation: Huang ZG, Liu Q, Zhang LH, Cao JQ, Zhu F. Research and development on deep hierarchical reinforcement learning. Ruan Jian Xue Bao/Journal of Software, 2023, 34(2): 733-760 (in Chinese with English abstract).
History
  • Received: August 02, 2021
  • Revised: March 30, 2022
  • Online: July 22, 2022
  • Published: February 06, 2023