GT-4S: Graph Transformer for Scene Sketch Semantic Segmentation

Authors: 张拯明 (Zhang ZM), 郭燕 (Guo Y), 马翠霞 (Ma CX), 邓小明 (Deng XM), 王宏安 (Wang HA)

CLC number: TP391

Funding: National Natural Science Foundation of China (62272447); Beijing Natural Science Foundation - Haidian Original Innovation Joint Fund (L222008)

Abstract:

A scene sketch comprises multiple foreground and background objects and can express complex semantic information intuitively and concisely. Scene sketches have a wide range of practical applications and have gradually become a research hotspot in computer vision and human-computer interaction. Although scene sketch semantic segmentation is a fundamental task in the semantic understanding of scene sketches, it has received relatively little study, and most existing methods are adapted from semantic segmentation methods for natural images, which cannot overcome the sparsity and abstraction inherent in sketches. To address these problems, this study starts directly from sketch strokes and proposes a graph Transformer model that combines the temporal and spatial information of strokes to tackle the semantic segmentation of free-hand scene sketches. First, the vector scene sketch is constructed as a graph, with strokes as its nodes and the temporal and spatial relations between strokes as its edges. Then, an edge-enhanced Transformer module captures the temporal-spatial global context of the strokes. Finally, the encoded temporal-spatial features are optimized with a multi-class classification objective. Experimental results on the SFSD scene sketch dataset show that the proposed method can exploit stroke temporal-spatial information to segment scene sketches effectively and achieves excellent performance.
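The abstract describes a three-step pipeline: build a stroke graph, encode it with edge-enhanced self-attention, and classify each stroke. The PyTorch sketch below illustrates one plausible reading of that pipeline; it is a minimal illustration under stated assumptions, not the authors' implementation. The graph-construction heuristics (temporal adjacency of consecutively drawn strokes, centroid-distance spatial proximity), the names `build_stroke_graph`, `EdgeEnhancedAttention`, and `StrokeSegmenter`, and all hyperparameters (feature width, head count, distance threshold, class count) are assumptions.

```python
# Minimal sketch of a stroke-graph Transformer segmenter (illustrative only;
# module names, heuristics, and hyperparameters are assumptions, not the
# paper's released code).
import torch
import torch.nn as nn


def build_stroke_graph(strokes: list[torch.Tensor], dist_thresh: float = 0.1):
    """Build edge features for a sketch given per-stroke point sequences.

    Each stroke is an (Ni, 2) tensor of normalized xy points. Returns an
    (S, S, 2) edge-feature tensor: channel 0 marks temporally adjacent
    strokes (consecutive drawing order); channel 1 marks spatially close
    strokes (centroid distance below dist_thresh, an assumed heuristic).
    """
    S = len(strokes)
    centroids = torch.stack([s.mean(dim=0) for s in strokes])  # (S, 2)
    edges = torch.zeros(S, S, 2)
    for i in range(S):
        for j in range(S):
            if abs(i - j) == 1:                                       # temporal adjacency
                edges[i, j, 0] = 1.0
            if torch.dist(centroids[i], centroids[j]) < dist_thresh:  # spatial proximity
                edges[i, j, 1] = 1.0
    return edges


class EdgeEnhancedAttention(nn.Module):
    """Self-attention whose logits are biased by learned edge embeddings."""

    def __init__(self, dim: int, num_heads: int = 4, edge_dim: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.edge_bias = nn.Linear(edge_dim, num_heads)  # per-head additive bias
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # x: (S, dim) stroke features, unbatched for clarity; edges: (S, S, edge_dim).
        S, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(S, self.num_heads, self.head_dim).transpose(0, 1)  # (heads, S, hd)
        k = k.view(S, self.num_heads, self.head_dim).transpose(0, 1)
        v = v.view(S, self.num_heads, self.head_dim).transpose(0, 1)
        attn = q @ k.transpose(-2, -1) / self.head_dim ** 0.5         # (heads, S, S)
        attn = attn + self.edge_bias(edges).permute(2, 0, 1)          # inject edge info
        out = attn.softmax(dim=-1) @ v                                # (heads, S, hd)
        return self.proj(out.transpose(0, 1).reshape(S, -1))


class StrokeSegmenter(nn.Module):
    """Stroke features -> edge-enhanced encoder layer -> per-stroke logits."""

    def __init__(self, dim: int = 128, num_classes: int = 40):  # class count is a placeholder
        super().__init__()
        self.attn = EdgeEnhancedAttention(dim)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, stroke_feats: torch.Tensor, edges: torch.Tensor):
        # stroke_feats: (S, dim), e.g., from any per-stroke point encoder.
        h = self.norm(stroke_feats + self.attn(stroke_feats, edges))
        return self.classifier(h)  # (S, num_classes), trained with cross-entropy
```

The design choice this sketch highlights is that adding an edge-derived bias to the attention logits lets every stroke attend to every other stroke (global context), while the temporal and spatial graph structure modulates which pairs attend strongly; this is analogous to the edge encodings used in Graphormer-style graph Transformers [55].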

Cite this article:

Zhang ZM, Guo Y, Ma CX, Deng XM, Wang HA. GT-4S: Graph Transformer for scene sketch semantic segmentation. Ruan Jian Xue Bao/Journal of Software, 2025, 36(3): 1375–1389 (in Chinese with English abstract).
History
  • Received: 2023-08-11
  • Revised: 2023-10-21
  • Published online: 2024-05-08