[Keywords]
[Abstract]
Text-to-image generation achieves excellent visual results but suffers from insufficient detail representation. This study therefore proposes the conditional semantic augmentation generative adversarial network (CSA-GAN). The model first encodes the text and processes the encoding with conditional semantic augmentation. It then extracts the generator's intermediate features, upsamples them, and generates an image mask through a two-layer CNN. Finally, the text encoding is fed to two perceptrons and fused with the mask, fully integrating image spatial features with text semantics to improve detail representation. To verify the quality of the images generated by the proposed model, quantitative and qualitative analyses are conducted on different datasets. The inception score (IS) and Fréchet inception distance (FID) metrics quantitatively evaluate image clarity, diversity, and natural realism. The qualitative analyses include visualizing the generated images and ablation studies of the individual modules. The results show that the proposed model outperforms recent state-of-the-art work of the same kind, which fully verifies that the proposed method achieves better performance and optimizes the expression of subject feature details during image generation.
[Keywords]
[Abstract]
Text-to-image generation achieves excellent visual results but suffers from insufficient detail representation. This study proposes the conditional semantic augmentation generative adversarial network (CSA-GAN). The model first encodes the text and processes the encoding with conditional semantic augmentation. It then extracts the intermediate features of the generator, upsamples them, and generates an image mask through a two-layer convolutional neural network (CNN). Finally, the text encoding is fed to two perceptrons and then fused with the mask, so as to fully integrate image spatial features with text semantics and improve detail representation. To verify the quality of the images generated by this model, quantitative and qualitative analyses are conducted on different datasets. This study employs the inception score (IS) and Fréchet inception distance (FID) metrics to quantitatively evaluate image clarity, diversity, and natural realism. The qualitative analyses include the visualization of generated images and ablation analysis of specific modules. The results show that the proposed model is superior to recent state-of-the-art works. This fully verifies that the proposed method achieves better performance and can optimize the expression of main feature details in the image generation process.
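The pipeline described above (upsample intermediate features, predict a mask with a two-layer CNN, map the text encoding through two perceptrons, then fuse) can be sketched in NumPy. This is only a minimal illustration of one plausible fusion rule; the tensor sizes, the 1x1 convolutions, and the scale-and-shift fusion (feature * (1 + mask * gamma) + mask * beta) are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(f):
    # Nearest-neighbour upsampling of a (C, H, W) feature map.
    return f.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(f, w, b):
    # 1x1 convolution: apply (Cout, Cin) weights at every pixel.
    return np.einsum('oc,chw->ohw', w, f) + b[:, None, None]

# Hypothetical sizes: 4-channel 8x8 generator features, 16-dim text encoding.
C, H, W, D = 4, 8, 8, 16
feat = rng.standard_normal((C, H, W))
text = rng.standard_normal(D)

# Two-layer CNN on the upsampled features predicts a spatial mask in (0, 1).
up = upsample2x(feat)                                                 # (C, 16, 16)
hidden = np.maximum(conv1x1(up, rng.standard_normal((8, C)), np.zeros(8)), 0)
mask = sigmoid(conv1x1(hidden, rng.standard_normal((1, 8)), np.zeros(1)))

# Two perceptrons map the text encoding to per-channel scale and shift.
gamma = rng.standard_normal((C, D)) @ text                            # (C,)
beta = rng.standard_normal((C, D)) @ text                             # (C,)

# Fuse: the mask restricts the text-conditioned modulation to subject regions.
fused = up * (1 + mask * gamma[:, None, None]) + mask * beta[:, None, None]
print(fused.shape)  # (4, 16, 16)
```

The sigmoid keeps the mask soft, so the text-derived scale and shift act strongly on regions the mask highlights and leave the rest of the feature map nearly unchanged.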
[CLC number]
[Funding]
National Natural Science Foundation of China (62102070, U20B2063, 62220106008); Sichuan Science and Technology Program (2023NSFSC1392)