Abstract: Reinforcement learning (RL) addresses long-term reward maximization by learning correct short-term decisions from previous experience. Reward shaping, which replaces the actual environmental reward with a simpler and easier-to-learn reward function, has been shown to be an effective way to guide and accelerate reinforcement learning. However, building a shaping reward requires either domain knowledge or demonstrations from an optimal policy, both of which involve costly human expertise. This work investigates whether a better shaping reward can be learned automatically alongside the RL process. RL algorithms sample many trajectories throughout the learning process. These passive samples, though containing many failed attempts, may still provide useful information for building a shaping reward function. A policy-invariance condition for reward shaping is introduced to handle noisy examples more effectively, followed by the RFPotential approach, which learns a shaping reward efficiently from massive examples. Empirical studies on various RL algorithms and domains show that RFPotential can accelerate the RL process.
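As background for the policy-invariance condition mentioned above, the classical potential-based shaping result can be sketched in standard MDP notation (the symbols below are the usual textbook ones, not taken from this abstract, and this is not the paper's specific condition):

$$ \tilde{R}(s, a, s') \;=\; R(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s), $$

where $R$ is the environmental reward, $\Phi$ is a potential function over states, and $\gamma$ is the discount factor. Because the added term telescopes along any trajectory, shaping of this form leaves the optimal policy unchanged, so even an imperfect $\Phi$, for instance one estimated from noisy sampled trajectories, cannot change which policy is optimal; how this work's condition and RFPotential extend or depart from this classical form is not specified in the abstract itself.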