Abstract: With the development of generative adversarial networks, synthesizing images from textual descriptions has become an active research area. However, the text descriptions used for image generation are mostly in English, and the generated objects are largely faces, flowers, birds, and similar categories. Few studies have addressed generating Chinese paintings from Chinese descriptions. Moreover, text-to-image generation typically requires a large number of labeled image-text pairs, which are expensive and laborious to collect. With the advances in multimodal pre-training, the image generation process can be guided through optimization, which significantly reduces the demand for annotated datasets and computational resources. In this paper, we propose a multi-domain VQGAN model to generate Chinese paintings in multiple domains. Furthermore, the multimodal pre-training model WenLan is used to compute a distance loss between the generated images and the text descriptions, and semantic consistency between image and text is achieved by optimizing the latent variables that serve as the input to the multi-domain VQGAN. An ablation study compares different variants of our multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study further demonstrates the effectiveness of the proposed model. Extensive results show that our multi-domain VQGAN outperforms all competitors in terms of image quality and text-image semantic consistency.
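To make the described optimization scheme concrete, the following is a minimal illustrative sketch, not the authors' implementation: toy modules stand in for the pretrained multi-domain VQGAN decoder and the WenLan image/text encoders (whose actual APIs are not assumed here), and a latent code is optimized by gradient descent to reduce a cosine distance loss between the decoded image and a text embedding in a shared embedding space.

```python
# Illustrative sketch (placeholder modules, not the paper's code):
# optimize a latent code z so that the image decoded from it moves closer
# to a text embedding under a WenLan/CLIP-style joint embedding model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVQGANDecoder(nn.Module):
    """Stand-in for a pretrained (multi-domain) VQGAN decoder: latent grid -> RGB image."""
    def __init__(self, z_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)


class ToyImageEncoder(nn.Module):
    """Stand-in for the image tower of a WenLan-like multimodal model."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, dim)
        )

    def forward(self, img):
        return F.normalize(self.net(img), dim=-1)


def optimize_latent(decoder, image_encoder, text_embedding, steps=200, lr=0.1):
    """Gradient-descend the latent input so the decoded image matches the text embedding."""
    z = torch.randn(1, 256, 16, 16, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = decoder(z)
        image_embedding = image_encoder(image)
        # Distance loss in the joint embedding space: 1 - cosine similarity.
        loss = 1.0 - (image_embedding * text_embedding).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()


if __name__ == "__main__":
    decoder, image_encoder = ToyVQGANDecoder(), ToyImageEncoder()
    # In practice this embedding would come from the WenLan text encoder for a Chinese prompt.
    text_embedding = F.normalize(torch.randn(1, 128), dim=-1)
    z_star = optimize_latent(decoder, image_encoder, text_embedding, steps=10)
    print(z_star.shape)
```

In this sketch only the latent code receives gradients; the decoder and encoders stay frozen, which is what allows the approach to avoid large paired training sets.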