Abstract:Text-based image editing is popular in multimedia and is of great application value, which is also a challenging task as the source image is edited on the basis of a given text, and there is a large cross-modal difference between the image and text. The existing methods can hardly achieve effective direct control and correction of the editing process, but image editing is user preference-oriented, and some editing modules can be bypassed or enhanced by controllability improvement to obtain the results of user preference. Therefore, this study proposes a novel autoencoder-based image editing model according to text descriptions. In this model, an autoencoder is first introduced in stacked generative adversarial networks (SGANs) to provide convenient and direct interactive configuration and editing interfaces. The autoencoder can transform high-dimension feature space between multiple layers into color space and directly correct the intermediate editing results under the color space. Then, a symmetrical detail correction module is constructed to enhance the detail of the edited image and improve controllability, which takes the source image and the edited image as symmetrical exchangeable input to correct the previously input edited image by the fusion of text features. Experiments on the MS-COCO and CUB200 datasets demonstrate that the proposed model can effectively and automatically edit images on the basis of linguistic descriptions while providing user-friendly and convenient corrections to the editing.