Abstract: In recent years, deep learning has attracted increasing attention in image description. Existing deep learning methods typically use CNNs to extract image features and RNNs to decode them into a sentence. Nevertheless, feature extraction becomes inaccurate when dealing with complex images, and the fixed structure of the sentence generation model leads to inconsistent sentences. To address these problems, this study proposes a method that combines a channel-wise attention model with GANs, named CACNN-GAN. A channel-wise attention mechanism is added to each conv-layer to extract features, which are compared against the COCO dataset so that only the top features are selected for sentence generation. Sentences are then produced by a GAN, in which the generator and discriminator are trained through an adversarial game; the resulting sentence generator yields varied syntax, fluent sentences, and a rich vocabulary. Experiments on real-world datasets show that CACNN-GAN describes images effectively and achieves higher accuracy than state-of-the-art methods.
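The abstract names the channel-wise attention attached to each conv-layer but does not specify its exact form. As a rough illustration only, the sketch below assumes a squeeze-and-excitation style channel gating; the class name `ChannelAttention`, the reduction ratio, and the global-average-pool squeeze are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-wise attention gate for a conv feature map.

    Squeeze-and-excitation style sketch (hypothetical layout; the
    paper's exact attention form is not given in the abstract):
    global-average-pool each channel, pass the result through a small
    bottleneck MLP, and rescale channels by the sigmoid weights.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Squeeze: one scalar per channel summarizing its global response.
        w = x.mean(dim=(2, 3))            # (b, c)
        # Excite: learn per-channel importance weights in [0, 1].
        w = self.fc(w).view(b, c, 1, 1)   # (b, c, 1, 1)
        # Reweight the feature map channel by channel.
        return x * w


if __name__ == "__main__":
    feats = torch.randn(2, 256, 14, 14)   # toy conv-layer output
    attended = ChannelAttention(256)(feats)
    print(attended.shape)                 # torch.Size([2, 256, 14, 14])
```

In a CACNN-GAN-like pipeline, such a gate would sit after each convolutional block so that the most informative channels dominate the features passed to the sentence generator.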