Text-to-Chinese-painting Method Based on Multi-domain VQGAN
Author: Sun ZL, Yang GX, Wen JY, Fei NY, Lu ZW, Wen JR
Affiliation:

Abstract:

With the development of generative adversarial networks (GANs), synthesizing images from textual descriptions has become an active research area. However, the textual descriptions used for image generation are mostly in English, and the generated objects are mostly faces, flowers, birds, etc.; few studies have addressed the generation of Chinese paintings from Chinese descriptions. Moreover, text-to-image generation typically requires an enormous number of labeled image-text pairs, which makes dataset construction expensive. With the advance of multimodal pre-training, the GAN generation process can instead be guided through optimization, which significantly reduces the demand for datasets and computational resources. In this study, a multi-domain vector quantization generative adversarial network (VQGAN) model is proposed to generate Chinese paintings in multiple domains simultaneously. Furthermore, the multimodal pre-trained model WenLan is used to compute a distance loss between the generated images and the textual descriptions, and semantic consistency between images and texts is achieved by optimizing the latent-space variables fed into the multi-domain VQGAN. Finally, an ablation experiment is conducted to compare different variants of the multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study is carried out. The results demonstrate that the complete multi-domain VQGAN model outperforms the original VQGAN model in both image quality and text-image semantic consistency.
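The guided-generation procedure summarized in the abstract can be illustrated with a short sketch: a frozen VQGAN decoder maps latent codes to an image, a multimodal encoder (WenLan in the paper; any CLIP-style image/text encoder pair plays the same role) embeds the decoded image, and the latent codes are updated by gradient descent to reduce their distance to the text embedding. The PyTorch snippet below is a minimal, hypothetical illustration of such a latent-optimization loop, not the authors' released implementation; `vqgan_decoder`, `image_encoder`, and `text_embedding` are assumed stand-ins for the pre-trained models.

```python
import torch
import torch.nn.functional as F


def optimize_latents(vqgan_decoder: torch.nn.Module,
                     image_encoder: torch.nn.Module,
                     text_embedding: torch.Tensor,
                     latent_shape=(1, 256, 16, 16),
                     steps: int = 300,
                     lr: float = 0.1) -> torch.Tensor:
    """Optimize the latent codes fed into a frozen VQGAN decoder so that the
    embedding of the decoded image moves closer to a fixed text embedding.

    All three model arguments are hypothetical stand-ins; the paper uses a
    multi-domain VQGAN decoder and the WenLan image/text encoders.
    """
    # Hidden-space variables to be optimized; the decoder itself stays frozen.
    z = torch.randn(latent_shape, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)  # optimizer choice is illustrative only

    for _ in range(steps):
        optimizer.zero_grad()
        image = vqgan_decoder(z)            # latents -> image
        img_emb = image_encoder(image)      # image -> multimodal embedding
        # Distance loss between image and text embeddings (cosine distance here).
        loss = 1.0 - F.cosine_similarity(img_emb, text_embedding, dim=-1).mean()
        loss.backward()
        optimizer.step()

    return z.detach()
```

Because only the low-dimensional latent codes are trained while both the decoder and the multimodal encoders stay fixed, this style of optimization-based guidance avoids training a text-to-image generator from scratch, which is what reduces the demand for labeled image-text pairs and compute.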

Get Citation

Sun ZL, Yang GX, Wen JY, Fei NY, Lu ZW, Wen JR. Text-to-Chinese-painting method based on multi-domain VQGAN. Journal of Software, 2023, 34(5): 2116–2133 (in Chinese).

History
  • Received: April 16, 2022
  • Revised: May 29, 2022
  • Online: September 20, 2022
  • Published: May 6, 2023