Abstract: Deep reinforcement learning combines the representation ability of deep learning with the decision-making ability of reinforcement learning and has attracted great research interest owing to its remarkable performance in complex control tasks. This study classifies model-free deep reinforcement learning methods into Q-value function methods and policy gradient methods according to whether the Bellman equation is used, and introduces the two classes of methods in terms of model structure, optimization process, and evaluation. To address the problem of low sample efficiency in deep reinforcement learning, this study shows, from the perspective of model structure, that the overestimation problem in Q-value function methods and the unbiased sampling constraint in policy gradient methods are the main factors limiting sample efficiency. Then, from the perspectives of enhancing exploration efficiency and improving sample exploitation, this study summarizes various feasible optimization methods in light of recent research hotspots and trends, analyzes the advantages and remaining problems of these methods, and compares them in terms of applicability and optimization effect. Finally, this study identifies enhancing the generality of optimization methods, exploring the transfer of optimization mechanisms between the two classes of methods, and improving theoretical completeness as future research directions.