Abstract: Fine-grained video categorization is a highly challenging task that aims to discriminate among similar subcategories belonging to the same basic-level category. Given the significant advances in fine-grained image categorization and the high cost of labeling video data, it is natural to adapt knowledge learned from images to videos in an unsupervised manner. However, a clear gap exists when models learned from images are directly applied to recognize fine-grained instances in videos, owing to the domain distinction and modality distinction between images and videos. Therefore, this study proposes the unsupervised discriminative adaptation network (UDAN), which transfers the ability of discriminative localization from images to videos. A progressive pseudo-labeling strategy is adopted to iteratively guide UDAN to approximate the distribution of the target video data. To verify the effectiveness of the proposed UDAN approach, adaptation tasks between images and videos are performed, transferring the knowledge learned from the CUB-200-2011/Cars-196 datasets (images) to the YouTube Birds/YouTube Cars datasets (videos). Experimental results illustrate the advantage of the proposed UDAN approach for unsupervised fine-grained video categorization.
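To make the progressive pseudo-labeling idea mentioned above concrete, the following is a minimal PyTorch-style sketch of one common realization: a source-trained model labels unlabeled target samples, only confident predictions are kept, and the confidence threshold is relaxed over rounds so more target data is included. The model, threshold schedule, and loss here are illustrative assumptions, not the paper's exact UDAN formulation.

```python
# Illustrative sketch of progressive pseudo-labeling for unsupervised adaptation.
# NOTE: the network, confidence schedule, and training loop are assumptions for
# demonstration only; they are not the exact procedure described in the paper.
import torch
import torch.nn.functional as F


def progressive_pseudo_label_adapt(model, source_loader, target_loader,
                                   num_rounds=5, start_conf=0.9, end_conf=0.6,
                                   epochs_per_round=1, lr=1e-4, device="cpu"):
    """Iteratively pseudo-label unlabeled target (video) samples with the
    source-trained (image) model and fine-tune on them, lowering the
    confidence threshold each round to cover more of the target data."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for r in range(num_rounds):
        # Linearly relax the confidence threshold across rounds.
        conf_thresh = start_conf - (start_conf - end_conf) * r / max(num_rounds - 1, 1)

        # Step 1: assign pseudo labels to confident target samples.
        pseudo_inputs, pseudo_labels = [], []
        model.eval()
        with torch.no_grad():
            for x, _ in target_loader:              # target labels are ignored
                probs = F.softmax(model(x.to(device)), dim=1)
                conf, pred = probs.max(dim=1)
                keep = conf >= conf_thresh           # keep confident predictions only
                if keep.any():
                    pseudo_inputs.append(x[keep.cpu()])
                    pseudo_labels.append(pred[keep].cpu())
        if not pseudo_inputs:
            continue
        pseudo_set = torch.utils.data.TensorDataset(
            torch.cat(pseudo_inputs), torch.cat(pseudo_labels))
        pseudo_loader = torch.utils.data.DataLoader(
            pseudo_set, batch_size=32, shuffle=True)

        # Step 2: fine-tune on labeled source data plus pseudo-labeled target data.
        model.train()
        for _ in range(epochs_per_round):
            for loader in (source_loader, pseudo_loader):
                for x, y in loader:
                    optimizer.zero_grad()
                    loss = F.cross_entropy(model(x.to(device)), y.to(device))
                    loss.backward()
                    optimizer.step()
    return model
```

In this sketch, each round's pseudo-labeled set is rebuilt from scratch with the current model, so early mistakes are not frozen in; the gradually lowered threshold is one simple way to let the model approximate the target video distribution step by step.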