Survey on Inverse Reinforcement Learning
Authors: Zhang Lihua, Liu Quan, Huang Zhigang, Zhu Fei
Affiliation:

Abstract:

Inverse reinforcement learning (IRL), also known as inverse optimal control (IOC), is an important research topic in reinforcement learning and imitation learning. IRL recovers a reward function from expert demonstrations and then solves for the optimal policy so as to imitate the expert policy. In recent years, IRL has achieved fruitful results in imitation learning and has been widely applied to vehicle navigation, path recommendation, and robotic optimal control. This study first presents the theoretical foundations of IRL. Then, from the perspective of how the reward function is constructed, IRL algorithms based on linear and nonlinear reward functions are analyzed, including maximum margin IRL, maximum entropy IRL, maximum entropy deep IRL, and generative adversarial imitation learning. In addition, frontier research directions of IRL are reviewed, and representative algorithms are compared and analyzed, covering IRL with incomplete expert demonstrations, multi-agent IRL, IRL with sub-optimal expert demonstrations, and guided IRL. Finally, the main challenges of IRL and its future development in both theory and application are summarized.
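To make the reward-learning loop described above concrete, the following is a minimal, illustrative sketch of tabular maximum entropy IRL, one of the algorithm families reviewed in this survey: a backward soft value iteration yields a stochastic policy under the current reward, a forward pass computes the expected state visitation frequencies of that policy, and the reward weights are updated toward the expert's feature expectations. The function name maxent_irl, the transition tensor P, the feature matrix F, and the hyperparameters are assumptions made for illustration, not an implementation taken from the surveyed papers.

import numpy as np

def maxent_irl(P, F, expert_trajs, horizon, lr=0.1, n_iters=100):
    """Tabular maximum entropy IRL (illustrative sketch).

    P:  transition tensor of shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    F:  state feature matrix of shape (S, d)
    expert_trajs: list of expert state sequences (lists of state indices)
    Returns reward weights theta such that r(s) = F[s] @ theta.
    """
    S, A, _ = P.shape
    theta = np.zeros(F.shape[1])

    # Empirical expert feature expectation: average of summed features per demonstration.
    f_expert = np.mean([F[traj].sum(axis=0) for traj in expert_trajs], axis=0)
    # Empirical initial-state distribution taken from the demonstrations.
    p0 = np.bincount([traj[0] for traj in expert_trajs], minlength=S).astype(float)
    p0 /= p0.sum()

    for _ in range(n_iters):
        r = F @ theta  # state rewards under the current weights

        # Backward pass: finite-horizon soft value iteration gives a stochastic policy.
        V = np.zeros(S)
        policies = []
        for _ in range(horizon):
            Q = r[:, None] + P @ V                   # shape (S, A)
            Qmax = Q.max(axis=1)
            V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))
            policies.append(np.exp(Q - V[:, None]))  # softmax policy over actions
        policies.reverse()  # policies[t] is the policy applied at time step t

        # Forward pass: expected state visitation frequencies under that policy.
        D = np.zeros(S)
        d_t = p0.copy()
        for t in range(horizon):
            D += d_t
            d_t = np.einsum('s,sa,saz->z', d_t, policies[t], P)

        # Maximum entropy IRL gradient: expert feature counts minus expected feature counts.
        theta += lr * (f_expert - D @ F)

    return theta

The update theta += lr * (f_expert - D @ F) is the gradient of the maximum entropy log-likelihood; when it vanishes, the learner's expected feature counts match the expert's, which is the feature-matching condition underlying this class of algorithms.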

Citation: Zhang LH, Liu Q, Huang ZG, Zhu F. Survey on inverse reinforcement learning. Ruan Jian Xue Bao/Journal of Software, 2023, 34(10): 4772-4803 (in Chinese with English abstract).
History
  • Received: November 05, 2021
  • Revised: December 15, 2021
  • Online: May 24, 2022
  • Published: October 06, 2023