Long-tailed Temporal Action Detection Based on Semi-supervised Learning

Authors: Wang Yuhong, Wu Gangshan, Wang Limin

CLC number: TP18

Funding: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0160900); National Natural Science Foundation of China (62076119, 61921006)

    Abstract:

    The label distribution of real-world data often exhibits a long-tail effect, in which a small number of categories account for the vast majority of samples, and temporal action detection is no exception. Existing temporal action detection methods tend to model the head categories with abundant samples thoroughly while neglecting the tail categories with few samples. This study gives a systematic definition of the long-tailed temporal action detection problem and proposes a weighted class-rebalancing self-training method (WCReST) based on a semi-supervised learning framework. WCReST makes full use of the large-scale unlabeled data that exists in the real world to rebalance the label distribution of the training samples, thereby improving the model's fit on tail categories. In addition, a pseudo-label loss weighting method is proposed for the temporal action detection task to stabilize model training. Experiments are conducted on the THUMOS14 and HACS Segments datasets, with video samples drawn from the THUMOS15 and ActivityNet1.3 datasets to form the corresponding unlabeled datasets. Furthermore, to meet the application requirements of video review, the Dance dataset is collected; it contains 35 action categories, 6632 labeled videos, and 13264 unlabeled videos, and preserves the pronounced long-tail effect of the data distribution. A variety of baseline models are used in experiments on the THUMOS14, HACS Segments, and Dance datasets. The results demonstrate that the proposed WCReST improves detection performance on tail action categories and can be applied to different baseline temporal action detection models to boost their performance.
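
    The abstract describes WCReST only at a high level. To make the class-rebalancing self-training idea concrete, below is a minimal sketch in the style of CReST (Wei et al., CVPR 2021), the framework the method's name points to; the exact per-class weighting used in WCReST may differ. The function name and the exponent alpha are illustrative assumptions, not the paper's notation.

    ```python
    import numpy as np

    def class_rebalanced_sampling_rates(class_counts, alpha=3.0):
        """Per-class inclusion rates for pseudo-labeled samples (CReST-style sketch).

        With classes sorted by labeled-sample count N_1 >= N_2 >= ... >= N_K,
        the class at rank k keeps a fraction (N_{K-k+1} / N_1) ** alpha of its
        pseudo-labeled samples, so tail classes absorb proportionally more
        unlabeled data in each self-training generation.
        """
        counts = np.asarray(class_counts, dtype=float)
        order = np.argsort(-counts)        # class indices from most to least frequent
        sorted_counts = counts[order]      # N_1 >= N_2 >= ... >= N_K
        K = len(counts)
        rates = np.empty_like(counts)
        for rank, cls in enumerate(order):  # rank 0 = most frequent class
            rates[cls] = (sorted_counts[K - 1 - rank] / sorted_counts[0]) ** alpha
        return rates

    # Example: the head class keeps almost no pseudo-labels, the tail class keeps all.
    print(class_rebalanced_sampling_rates([1000, 100, 10]))  # -> [1e-06, 0.001, 1.0]
    ```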
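
    The abstract also mentions a pseudo-label loss weighting method that stabilizes training. The sketch below shows one plausible form, confidence-based masking and weighting of the per-proposal classification loss; the function name, the threshold tau, and the exact weighting rule are assumptions for illustration, not the paper's definition.

    ```python
    import torch
    import torch.nn.functional as F

    def weighted_pseudo_label_loss(logits, pseudo_labels, confidences, tau=0.7):
        """Confidence-weighted classification loss on pseudo-labeled proposals.

        Pseudo-labels below the confidence threshold tau are masked out, and
        the remaining ones are weighted by their confidence, so noisy
        pseudo-labels perturb training less. (Illustrative sketch only.)
        """
        per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
        weights = torch.where(confidences >= tau, confidences,
                              torch.zeros_like(confidences))
        return (weights * per_sample).sum() / weights.sum().clamp(min=1e-6)

    # Example: 4 pseudo-labeled proposals over 3 action classes.
    logits = torch.randn(4, 3)
    labels = torch.tensor([0, 2, 1, 2])
    conf = torch.tensor([0.95, 0.40, 0.80, 0.99])  # the 0.40 proposal is masked out
    loss = weighted_pseudo_label_loss(logits, labels, conf)
    ```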

Cite this article:

Wang YH, Wu GS, Wang LM. Long-tailed temporal action detection based on semi-supervised learning. Ruan Jian Xue Bao/Journal of Software, 2025, 36(2): 625–643 (in Chinese).

History
  • Received: 2023-08-11
  • Revised: 2023-09-07
  • Published online: 2024-07-17