Abstract: Text-to-image generation achieves excellent visual results but suffers from insufficient detail representation. This study proposes the conditional semantic augmentation generative adversarial network (CSA-GAN). The model first encodes the text and processes it with conditional semantic augmentation. It then extracts the intermediate features of the generator, up-samples them, and generates an image mask through a two-layer convolutional neural network (CNN). Finally, the text encoding is passed through two perceptrons and fused with the mask, fully integrating image spatial features and text semantic features to improve detail representation. To verify the quality of the images generated by the model, quantitative and qualitative analyses are conducted on different datasets. The quantitative evaluation employs the inception score (IS) and Fréchet inception distance (FID) metrics to assess the clarity, diversity, and natural realism of the generated images. The qualitative analyses include visualization of the generated images and ablation experiments on specific modules. The results show that the proposed model outperforms state-of-the-art works from recent years, verifying that the proposed method achieves better performance and optimizes the expression of main feature details during image generation.
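Since the abstract compresses the full pipeline into a few sentences, the following PyTorch sketch may help fix ideas. It is an illustration under stated assumptions, not the authors' implementation: the embedding dimensions, the nearest-neighbour up-sampling, the sigmoid mask, and the affine (scale/shift) fusion of the two perceptron outputs with the mask are all hypothetical choices consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalSemanticAugmentation(nn.Module):
    """Re-parameterized sampling around the sentence embedding
    (conditional-augmentation style; dimensions are assumptions)."""
    def __init__(self, text_dim=256, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mean and log-variance

    def forward(self, text_emb):
        mu, logvar = self.fc(text_emb).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std), mu, logvar

class MaskFusionBlock(nn.Module):
    """Up-samples intermediate generator features, predicts a spatial mask
    with a two-layer CNN, and fuses text semantics via two perceptrons."""
    def __init__(self, in_ch=64, cond_dim=128):
        super().__init__()
        self.mask_cnn = nn.Sequential(               # two-layer CNN -> 1-channel mask
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 1, 3, padding=1),
        )
        self.gamma_mlp = nn.Linear(cond_dim, in_ch)  # perceptron 1: channel-wise scale
        self.beta_mlp = nn.Linear(cond_dim, in_ch)   # perceptron 2: channel-wise shift

    def forward(self, feat, cond):
        feat = F.interpolate(feat, scale_factor=2, mode="nearest")  # up-sample features
        mask = torch.sigmoid(self.mask_cnn(feat))                   # B x 1 x H x W
        gamma = self.gamma_mlp(cond).unsqueeze(-1).unsqueeze(-1)    # B x C x 1 x 1
        beta = self.beta_mlp(cond).unsqueeze(-1).unsqueeze(-1)
        # Apply text-conditioned modulation only where the mask is active,
        # blending spatial image features with text semantics.
        return feat * (1 + mask * gamma) + mask * beta

# Usage sketch (shapes are illustrative only):
if __name__ == "__main__":
    text_emb = torch.randn(4, 256)          # batch of sentence embeddings
    feat = torch.randn(4, 64, 16, 16)       # intermediate generator features
    cond, mu, logvar = ConditionalSemanticAugmentation()(text_emb)
    out = MaskFusionBlock()(feat, cond)     # -> 4 x 64 x 32 x 32
    print(out.shape)
```

The affine modulation here is one plausible reading of "fusing" the perceptron outputs with the mask; the paper body, not the abstract, specifies the actual fusion operator.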