Research on a text-to-Chinese-painting generation method based on multi-domain VQGAN
Author:
Affiliation:

Author biography:

Corresponding author:

CLC number:

Funding:

National Natural Science Foundation of China (61976220, 61832017); Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098)


Text-to-Chinese-painting method based on multi-domain VQGAN
Author:
Affiliation:

Fund Project:

摘要 (Abstract):

With the emergence of generative adversarial networks, synthesizing images from text descriptions has recently become an active research area. However, existing text descriptions are mostly in English, and the generated objects are mostly faces, flowers, and birds; research specifically targeting Chinese text and traditional Chinese painting remains scarce. Moreover, text-to-image generation usually requires a large number of annotated image-text pairs, making dataset construction expensive. With the emergence and advance of multimodal pre-training, the generation process of a generative adversarial network can be guided in an optimization-based manner, greatly reducing the demand for datasets and computational resources. This paper proposes a multi-domain VQGAN model that simultaneously generates Chinese paintings in multiple domains, and uses the multimodal pre-training model WenLan to compute a distance loss between the generated image and the text description; semantic consistency between image and text is achieved by optimizing the latent variables fed into the multi-domain VQGAN. Ablation experiments compare the FID and R-precision metrics of different multi-domain VQGAN variants in detail, and a user study is also conducted. The results show that the full multi-domain VQGAN model surpasses the original VQGAN in both image quality and text-image semantic consistency.
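The optimization-based guidance described in the abstract, where an embedding distance from a multimodal model steers a frozen VQGAN decoder, can be sketched as below. This is a minimal illustration, not the authors' implementation: `vqgan`, `txt_encoder`, `img_encoder`, the latent shape, and all hyperparameters are hypothetical stand-ins for the paper's multi-domain VQGAN and WenLan components.

```python
# Hedged sketch of text-guided latent optimization: a pretrained decoder
# `vqgan.decode` maps a latent z to an image, and a WenLan-style encoder
# pair embeds images and text into a shared space. Only z is trained.
import torch

def optimize_latent(vqgan, txt_encoder, img_encoder, text, steps=200, lr=0.1):
    """Optimize a latent z so the decoded image matches the text embedding."""
    text_emb = txt_encoder(text).detach()                 # fixed target embedding
    z = torch.randn(1, 256, 16, 16, requires_grad=True)   # assumed latent shape
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = vqgan.decode(z)                             # generate image from latent
        img_emb = img_encoder(img)
        # distance loss: 1 - cosine similarity between image and text embeddings
        loss = 1 - torch.cosine_similarity(img_emb, text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()                                   # gradients flow only into z
        opt.step()
    return vqgan.decode(z).detach()
```

Because the decoder and encoders stay frozen, only the small latent tensor is updated, which is what keeps the data and compute requirements low.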

    Abstract:

    With the development of generative adversarial networks, synthesizing images from textual descriptions has become an active research area. However, the text descriptions used for image generation are mostly in English, and the generated objects are mostly faces, flowers, or birds; few studies target generating Chinese paintings from Chinese descriptions. Moreover, the text-to-image generation task often requires a large number of labeled image-text pairs, which are expensive and tedious to collect. With the advance of multimodal pre-training, we can instead guide the image generation process in an optimization-based way, which significantly reduces the demand for annotated datasets and computational resources. In this paper, we propose a multi-domain VQGAN model to generate Chinese paintings in multiple domains. Further, the multimodal pre-training model WenLan is used to calculate a distance loss between the generated images and the text descriptions; semantic consistency between image and text is achieved by optimizing the latent variables fed into the multi-domain VQGAN. An ablation study compares different variants of our multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study further shows the effectiveness of the proposed model. The extensive results demonstrate that our multi-domain VQGAN model outperforms all competitors in terms of image quality and text-image semantic consistency.
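For reference, the R-precision metric cited above is commonly computed as top-1 retrieval accuracy: each generated image's embedding should rank its own ground-truth caption highest among a pool of candidate captions. A minimal sketch follows; the function and argument names are our own illustration, not from the paper.

```python
# Hedged sketch of R-precision as top-1 text retrieval accuracy.
# img_embs[i] and txt_embs[i] are the paired image/caption embeddings.
import numpy as np

def r_precision(img_embs, txt_embs):
    """Fraction of images whose own caption is the nearest caption by cosine."""
    # normalize rows so plain dot products become cosine similarities
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sims = img @ txt.T                       # (N, N) image-to-caption similarities
    best = sims.argmax(axis=1)               # index of best-matching caption per image
    return float((best == np.arange(len(img))).mean())
```

With perfectly aligned embeddings the score is 1.0; random embeddings score roughly 1/N for a pool of N captions.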

Cite this article:

Sun Zelong, Yang Guoxing, Wen Jingyuan, Fei Nanyi, Lu Zhiwu, Wen Jirong. Research on a text-to-Chinese-painting generation method based on multi-domain VQGAN. Journal of Software (软件学报), 2023, 34(5)

History
  • Received: 2022-04-16
  • Revised: 2022-05-29
  • Accepted:
  • Published online: 2022-09-20
  • Published:
Copyright: Institute of Software, Chinese Academy of Sciences