Abstract: Deep learning models have achieved impressive results on many tasks. However, this success hinges on the availability of a large number of labeled samples for model training, and deep learning models tend to perform poorly when labeled samples are scarce. In recent years, few-shot learning (FSL) has been proposed to study how models can learn quickly from a small number of samples, and it has achieved good performance mainly through meta-learning. Nevertheless, two issues remain: 1) existing FSL methods usually recognize novel classes solely from the visual features of samples, without integrating information from other modalities; 2) under the meta-learning paradigm, a model aims to learn generic, transferable knowledge from a large number of similar few-shot tasks, which inevitably yields an over-generalized feature space and insufficient, inaccurate representations of sample features. To tackle these two issues, this study introduces pre-training and multimodal learning into the FSL process and proposes a new multimodal-guided local feature selection strategy for few-shot learning. Specifically, the model is first pre-trained on known classes with abundant samples to substantially improve its feature representation ability. In the subsequent meta-learning stage, the pre-trained model is further optimized to improve its transferability, i.e., its adaptability to the few-shot setting. Meanwhile, local feature selection is carried out on the basis of both the visual and textual features of samples to strengthen the representation of sample features and avoid a sharp degradation of the model's representation ability. Finally, the resulting sample features are used for FSL. Experiments on three benchmark datasets, namely MiniImageNet, CIFAR-FS, and FC-100, demonstrate that the proposed FSL method achieves better results than existing methods.
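To make the multimodal-guided selection step concrete, the following is a minimal sketch rather than the authors' implementation: it assumes that local visual features of an image are scored against a class text embedding by cosine similarity and that the top-k most text-relevant local features are retained before classification. The tensor shapes, the cosine scoring, and the top-k heuristic are all illustrative assumptions.

```python
# Minimal sketch of multimodal-guided local feature selection (assumed
# mechanism, not the paper's code): keep the local visual features that
# align best with the textual embedding of the class.
import torch
import torch.nn.functional as F

def select_local_features(local_feats: torch.Tensor,
                          text_embed: torch.Tensor,
                          k: int = 16) -> torch.Tensor:
    """local_feats: (N, D) local (e.g., spatial grid) features of one image.
    text_embed: (D,) textual embedding of the class name or description.
    Returns the k local features most similar to the text embedding."""
    # Cosine similarity between each local feature and the text embedding.
    scores = F.cosine_similarity(local_feats, text_embed.unsqueeze(0), dim=-1)  # (N,)
    # Keep the k most text-relevant local features (k is a hypothetical choice).
    topk = scores.topk(k=min(k, local_feats.size(0))).indices
    return local_feats[topk]  # (k, D)

# Usage example with assumed dimensions: 49 spatial features (a 7x7 map)
# of dimension 640; the 16 most text-relevant ones would then be pooled
# into class prototypes for few-shot classification.
local_feats = torch.randn(49, 640)
text_embed = torch.randn(640)
selected = select_local_features(local_feats, text_embed)
print(selected.shape)  # torch.Size([16, 640])
```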