Abstract: Given a natural language sentence as the query, the task of video moment retrieval aims to localize the most relevant moment in a long untrimmed video. Because videos contain rich visual, textual, and audio information, the crucial challenges of video moment retrieval are how to fully understand the visual information provided in the video, how to exploit the textual information provided by the query sentence to improve the generalization and robustness of the model, and how to align and let cross-modal information interact. This study systematically surveys work in the field of video moment retrieval and divides it into ranking-based methods and localization-based methods. Among them, ranking-based methods can be further divided into methods that preset candidate clips and methods that generate candidate clips under guidance; localization-based methods can be divided into one-shot localization methods and iterative ones. The datasets and evaluation metrics of this field are also summarized, and the latest advances are reviewed. Finally, a related extension task, moment localization from a video corpus, is introduced, and the survey concludes with a discussion of promising trends.