In recent years, single-stage instance segmentation methods have begun to see use in real-world applications thanks to their high efficiency, but they still have two drawbacks compared with their two-stage counterparts. (1) Low accuracy: single-stage methods lack multiple rounds of refinement, so their accuracy still falls short of what real-world applications require. (2) Low flexibility: most existing single-stage methods are purpose-built models that are not compatible with object detectors. This study presents an accurate and flexible framework for single-stage instance segmentation built on two key designs. (1) To improve accuracy, a grid dividing binarization algorithm is proposed, in which the bounding box region is first divided into several grid cells and instance segmentation is then performed on each grid cell. In this way, the original full-object segmentation task is decomposed into simpler sub-tasks over grid cells, which significantly reduces the complexity of feature representation and in turn improves instance segmentation accuracy. (2) For compatibility with object detectors, a plug-and-play module is designed that can be seamlessly plugged into most existing object detection methods, enabling them to perform instance segmentation. The proposed method achieves excellent performance on public datasets such as MS COCO, outperforming most existing single-stage methods and even some two-stage methods.
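To make the grid dividing idea concrete, the following is a minimal sketch of partitioning a bounding box into grid cells and binarizing each cell independently. It is an illustration only, not the paper's algorithm: the helper names `split_box_into_cells` and `binarize_by_cells`, the 2×2 grid, the fixed 0.5 threshold, and the use of a precomputed probability map are all assumptions introduced here, since the abstract does not specify these details.

```python
from typing import List, Sequence, Tuple

# Box convention assumed here: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def split_box_into_cells(box: Box, rows: int, cols: int) -> List[Box]:
    """Divide a bounding box into rows x cols grid cells (hypothetical helper)."""
    x0, y0, x1, y1 = box
    # Integer cell edges that tile the box exactly, even if it does not divide evenly.
    xs = [x0 + (x1 - x0) * c // cols for c in range(cols + 1)]
    ys = [y0 + (y1 - y0) * r // rows for r in range(rows + 1)]
    return [
        (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]


def binarize_by_cells(
    prob: Sequence[Sequence[float]],
    box: Box,
    rows: int = 2,
    cols: int = 2,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Binarize a foreground-probability map cell by cell inside the box.

    Assumption: each cell is thresholded independently with a fixed value.
    A real model would instead predict the segmentation of each cell.
    """
    height, width = len(prob), len(prob[0])
    mask = [[0] * width for _ in range(height)]
    for cx0, cy0, cx1, cy1 in split_box_into_cells(box, rows, cols):
        # Each grid cell is a small, independent sub-task.
        for y in range(cy0, cy1):
            for x in range(cx0, cx1):
                mask[y][x] = 1 if prob[y][x] >= threshold else 0
    return mask


if __name__ == "__main__":
    toy_prob = [
        [0.9, 0.8, 0.1, 0.0],
        [0.7, 0.6, 0.2, 0.1],
        [0.2, 0.4, 0.9, 0.8],
        [0.1, 0.3, 0.7, 0.6],
    ]
    for row in binarize_by_cells(toy_prob, (0, 0, 4, 4)):
        print(row)
```

The point of the sketch is the decomposition: every cell covers only a small part of the object, so whatever predicts the cell's mask has a simpler pattern to represent than the full object outline.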