Abstract: Image captioning is of great theoretical and practical value and has attracted wide attention in computer vision and natural language processing. Existing attention-based image captioning methods integrate the current word and the visual cues of the same moment to generate the target word, but they neglect visual relevance and contextual information, which causes the generated captions to deviate from the ground truth. To address this problem, this paper presents the visual relevance and context dual attention (VRCDA) method. The visual relevance attention incorporates the attention vector of the previous moment into the traditional visual attention to ensure visual relevance, while the context attention gathers more complete semantic information from the global context so that contextual cues are better exploited. In this way, the final caption is generated from both visual relevance and context information. Experiments on the MSCOCO and Flickr30k benchmark datasets demonstrate that VRCDA effectively describes image semantics and outperforms several state-of-the-art image captioning methods on all evaluation metrics.
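As a rough illustration of the dual-attention idea summarized above, the sketch below shows one possible way a previous-step attention vector could bias the current visual attention, and how a separate attention over global context vectors could pool sentence-level semantics. This is a minimal assumption-based sketch, not the authors' implementation: the function names, weight matrices, and the mixing weight `lam` are all illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_relevance_attention(V, h_t, alpha_prev, W_v, W_h, w_a, lam=0.5):
    """Attend over region features V (k x d_v); alpha_prev (k,) is the
    previous step's attention, mixed in to encourage visual relevance."""
    scores = np.tanh(V @ W_v + h_t @ W_h) @ w_a   # (k,) raw attention scores
    alpha_t = softmax(scores + lam * alpha_prev)  # bias toward previous attention
    v_hat = alpha_t @ V                           # attended visual feature (d_v,)
    return v_hat, alpha_t

def context_attention(H_ctx, h_t, W_c, W_q, w_b):
    """Attend over global context vectors H_ctx (n x d_h), e.g. past decoder
    states, to gather broader semantic information."""
    scores = np.tanh(H_ctx @ W_c + h_t @ W_q) @ w_b
    beta_t = softmax(scores)
    c_hat = beta_t @ H_ctx                        # pooled context vector (d_h,)
    return c_hat, beta_t

# Toy usage with random tensors (dimensions are arbitrary):
k, n, d_v, d_h, d_a = 4, 6, 8, 8, 5
rng = np.random.default_rng(0)
V, H_ctx, h_t = rng.normal(size=(k, d_v)), rng.normal(size=(n, d_h)), rng.normal(size=d_h)
alpha_prev = np.full(k, 1.0 / k)
v_hat, alpha_t = visual_relevance_attention(
    V, h_t, alpha_prev,
    rng.normal(size=(d_v, d_a)), rng.normal(size=(d_h, d_a)), rng.normal(size=d_a))
c_hat, beta_t = context_attention(
    H_ctx, h_t,
    rng.normal(size=(d_h, d_a)), rng.normal(size=(d_h, d_a)), rng.normal(size=d_a))
# v_hat and c_hat would then jointly condition the word predictor at this step.
```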