Configurable Text-based Image Editing by Autoencoder-based Generative Adversarial Networks
Authors: Wu Fuxiang, Cheng Jun
Author biographies:

Wu Fuxiang (1984-), male, PhD, assistant researcher, CCF professional member; his main research interests include multimodal deep learning, text-to-image synthesis, and natural language processing. Cheng Jun (1977-), male, PhD, researcher and doctoral supervisor; his main research interests include machine vision, robotics, machine intelligence, and control.

Corresponding author:

Cheng Jun, E-mail: jun.cheng@siat.ac.cn

CLC number:

TP391

Funding:

National Natural Science Foundation of China (U21A20487); Shenzhen Fundamental Research Program (JCYJ20200109113416531, JCYJ20180507182610734); Key Technology Talent Program of the Chinese Academy of Sciences


Abstract:

Text-based image editing, which modifies a source image according to a given text, is a popular research topic in multimedia and has great application value; it is also a challenging task because of the large cross-modal gap between text and images. Existing methods struggle to provide effective direct control over and correction of the editing process, yet image editing is driven by user preference: with improved controllability, certain editing modules can be bypassed or strengthened to obtain the results a user prefers. To address this problem, this study proposes an autoencoder-based model for editing images according to text descriptions. To provide convenient and direct interfaces for interactive configuration and editing, the model introduces an autoencoder into a stacked generative adversarial network; the autoencoder unifies the high-dimensional feature spaces between stages into a color space, so that intermediate editing results can be corrected directly in that space. Furthermore, to enhance the details of edited images and improve controllability, a symmetrical detail correction module is constructed; it takes the source image and the edited image as symmetric, exchangeable inputs and fuses text features to correct the edited image supplied as its first input. Experiments on the MS-COCO and CUB200 datasets demonstrate that the model can effectively and automatically edit images on the basis of linguistic descriptions while allowing convenient and user-friendly correction of the editing results.
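To make the configurability mechanism concrete, here is a minimal sketch of the idea of bridging the inter-stage feature space and color space with an autoencoder. It is written in PyTorch under assumed names and layer sizes; FeatureColorAutoencoder, feat_dim, and the user_edit hook are illustrative inventions, not the authors' implementation:

    # Hedged sketch, not the paper's code: an autoencoder whose decoder maps
    # the hidden features passed between generator stages into RGB color
    # space, and whose encoder maps a (possibly user-corrected) color image
    # back into feature space for the next stage.
    import torch
    import torch.nn as nn

    class FeatureColorAutoencoder(nn.Module):
        def __init__(self, feat_dim=64):
            super().__init__()
            # Decoder: stage features -> 3-channel image in [-1, 1].
            self.to_rgb = nn.Sequential(
                nn.Conv2d(feat_dim, 3, kernel_size=3, padding=1), nn.Tanh())
            # Encoder: color image -> stage features.
            self.from_rgb = nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
                nn.BatchNorm2d(feat_dim), nn.ReLU(inplace=True))

        def forward(self, feat, user_edit=None):
            rgb = self.to_rgb(feat)       # editable intermediate result
            if user_edit is not None:
                rgb = user_edit(rgb)      # direct correction in color space
            return self.from_rgb(rgb)     # features for the next stage

    # Usage: expose the stage-1 result, apply a correction, feed stage 2.
    ae = FeatureColorAutoencoder(feat_dim=64)
    stage1_feat = torch.randn(1, 64, 64, 64)
    next_feat = ae(stage1_feat, user_edit=lambda img: img.clamp(-1.0, 1.0))

Because every stage is exposed through the same color space, a correction made between stages remains interpretable to the user, which is what allows individual editing modules to be bypassed or strengthened.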

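The symmetrical detail correction module can be sketched in the same spirit. The snippet below only illustrates the stated interface (two exchangeable image inputs plus fused text features); SymmetricDetailCorrection, the layer sizes, and the sentence-embedding dimension are assumptions rather than the paper's architecture:

    # Hedged sketch: both images pass through a shared encoder, so the two
    # inputs are symmetric and exchangeable; the text feature is broadcast
    # over spatial positions and fused in to correct the first input.
    import torch
    import torch.nn as nn

    class SymmetricDetailCorrection(nn.Module):
        def __init__(self, img_dim=64, txt_dim=256):
            super().__init__()
            self.encode = nn.Conv2d(3, img_dim, 3, padding=1)  # shared weights
            self.fuse = nn.Conv2d(img_dim * 2 + txt_dim, img_dim, 3, padding=1)
            self.to_rgb = nn.Sequential(
                nn.Conv2d(img_dim, 3, 3, padding=1), nn.Tanh())

        def forward(self, target, reference, txt):
            a, b = self.encode(target), self.encode(reference)
            t = txt[:, :, None, None].expand(-1, -1, a.size(2), a.size(3))
            return self.to_rgb(self.fuse(torch.cat([a, b, t], dim=1)))

    # Usage: refine the edited image against the source, or vice versa.
    m = SymmetricDetailCorrection()
    src, edited = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
    txt = torch.randn(1, 256)              # sentence embedding (assumed)
    refined_edit = m(edited, src, txt)     # correct the edit
    refined_src = m(src, edited, txt)      # symmetric: correct the source

Exchangeability here means the module needs to be trained only once to correct in either direction, which is one plausible reading of the abstract's "symmetrical exchangeable input".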
Cite this article:

Wu FX, Cheng J. Configurable text-based image editing by autoencoder-based generative adversarial networks. Ruan Jian Xue Bao/Journal of Software, 2022, 33(9): 3139–3151 (in Chinese with English abstract).

History:
  • Received: 2021-06-30
  • Revised: 2021-08-15
  • Published online: 2022-02-22
  • Published: 2022-09-06