Abstract: As a crucial subtask of natural language processing (NLP), named entity recognition (NER) aims to extract important information from text. This helps many downstream tasks, such as machine translation, text generation, knowledge graph construction, and multimodal data fusion, to understand the complex semantics of text and complete their objectives effectively. In practice, because of time and labor costs, NER often suffers from a scarcity of annotated data, a setting known as few-shot NER. Although text-based few-shot NER methods have achieved sound generalization performance, the semantic information that a model can extract from so few samples remains limited, which leads to poor prediction performance. To this end, this study proposes a few-shot NER model based on multimodal data fusion, which for the first time uses multimodal data to provide additional semantic information that assists model prediction and further improves the effectiveness of multimodal data fusion and modeling. The method converts image information into text that serves as auxiliary modality information, which effectively addresses the poor modality alignment caused by the inconsistent granularity of the semantic information in text and images. To account for label dependencies in few-shot NER, this study adopts the CRF framework and introduces state-of-the-art meta-learning methods as the emission module and the transition module, respectively. To alleviate the negative impact of noisy samples in the auxiliary modality, this study proposes a general denoising network based on the idea of meta-learning. The denoising network measures the variability of the samples and evaluates how beneficial each sample is to the model. Finally, this study conducts extensive experiments on real unimodal and multimodal datasets.
The experimental results demonstrate the strong generalization performance of the proposed method, which outperforms state-of-the-art methods by 10 F1 points in the 1-shot setting.
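
To illustrate how a CRF can combine emission and transition scores so that label dependencies are respected, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF. It is not the implementation used in this study: the abstract does not specify the decoding procedure, so standard Viterbi decoding is assumed, and the function name `viterbi_decode`, the three-label scheme (O, B, I), and all scores are hypothetical values chosen only for illustration. In the proposed model the two score tables would come from the meta-learned emission and transition modules.

```python
from typing import List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> Tuple[List[int], float]:
    """Return the highest-scoring label path and its total score.

    emissions[t][k]   : score of label k at token t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return [], 0.0
    num_labels = len(emissions[0])
    # best[k] is the best score of any path that ends in label k
    # at the current token.
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(num_labels):
            # Pick the previous label that maximizes the path score into j.
            prev = max(range(num_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(range(num_labels), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical label ids: 0 = O, 1 = B, 2 = I.
example_emissions = [
    [2.0, 1.0, 0.0],  # token 0: O looks best in isolation
    [0.0, 1.0, 1.5],  # token 1: I looks best in isolation
    [1.0, 0.0, 0.0],  # token 2: O looks best in isolation
]
example_transitions = [
    [0.0, 0.0, -10.0],  # from O: moving to I is heavily penalized
    [0.0, -1.0, 1.0],   # from B: moving to I is encouraged
    [0.0, -1.0, 0.5],   # from I
]

path, score = viterbi_decode(example_emissions, example_transitions)
print(path, score)  # [1, 2, 0] 4.5, i.e. B, I, O
```

In this toy example, choosing the best label for each token independently would give O, I, O, which places an I tag directly after an O tag. Because the transition table penalizes that move, joint decoding returns B, I, O instead, which is the kind of label dependency the CRF framework is meant to capture.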