Abstract:This study proposes a convolutional neural network (CNN) based Transformer to solve the panoptic segmentation task. The method draws on the inherent advantages of the CNN in image feature learning and avoids increase in the amount of calculation when the Transformer is transplanted into the vision task. The CNN-based Transformer is attributed to the two basic structures of the projector performing the feature domain transformation and the extractor responsible for the feature extraction. The effective combination of the projector and the extractor forms the framework of the CNN-based Transformer. Specifically, the projector is implemented by a lattice convolution that models the spatial relationship of the image by designing and optimizing the convolution filter configuration. The extractor is performed by a chain network that improves feature extraction capabilities by chain block stacking. Considering the framework and the substantial function of panoptic segmentation, the CNN-based Transformer is successfully applied to solve the panoptic segmentation task. The experimental results on the MS COCO and Cityscapes datasets demonstrate that the proposed method has excellent performance.