Video Summarization Based on Spacial-temporal Transform Network

doi:10.13328/j.cnki.jos.006621


Volume 33, Issue 9, 2022, pp. 3195-3209


Author:
LI Qun, XIAO Fu, ZHANG Zi-Yi, ZHANG Feng, LI Yan-Chao

Affiliation:
School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
CLC Number: TP391


Abstract:

Video summarization is an important task in computer vision whose goal is to generate a concise yet complete summary by selecting the most informative parts of a video. The generated summary is either a set of representative frames (such as keyframes) or a short video formed by stitching key segments together in temporal order. Although research on video summarization has made considerable progress, existing methods suffer from insufficient temporal information and incomplete feature representation, which can compromise the correctness and completeness of the summary. To address these problems, this study proposes a model based on a spatiotemporal transform network that comprises three modules: an embedding layer, a feature transformation and fusion layer, and an output layer. The embedding layer embeds spatial and temporal features simultaneously, the feature transformation and fusion layer transforms and fuses the multi-modal features, and the output layer generates the video summary through segment prediction and key shot selection. Embedding the spatial and temporal features separately addresses the lack of temporal information in existing models, while transforming and fusing the multi-modal features addresses the incomplete feature representation. Extensive experiments and analyses on two benchmark datasets verify the effectiveness of the proposed model.

Key words: video summarization; spacial-temporal transform network; ViLBERT; feature fusion; multi-modal
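
The abstract says the output layer builds the summary through segment prediction followed by key shot selection, but it does not say how shots are chosen. The sketch below illustrates one common way to do this step, a 0/1 knapsack over shot scores under a length budget. This is an assumption made for illustration, not the authors' confirmed procedure; the function name `select_key_shots` and the parameters `scores`, `lengths`, and `budget` are hypothetical.

```python
from typing import List, Sequence


def select_key_shots(scores: Sequence[float],
                     lengths: Sequence[int],
                     budget: int) -> List[int]:
    """Pick shot indices that maximize total score within a length budget.

    Illustrative 0/1 knapsack (an assumption, not taken from the paper).
    scores[i]  -- predicted importance of shot i
    lengths[i] -- length of shot i in frames
    budget     -- maximum total length of the summary in frames
    """
    n = len(scores)
    # best[i][c] = best total score using only the first i shots
    # with a total length of at most c frames.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if length <= c:
                candidate = best[i - 1][c - length] + score
                if candidate > best[i][c]:  # option 2: take shot i-1
                    best[i][c] = candidate

    # Walk the table backwards to recover which shots were taken.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    # Return indices in temporal order so the shots can be stitched in sequence.
    return sorted(chosen)


if __name__ == "__main__":
    # Four hypothetical shots and a 60-frame budget.
    print(select_key_shots([0.9, 0.2, 0.7, 0.4], [40, 30, 50, 20], 60))
```

The indices come back sorted so that the selected shots can be concatenated in temporal order, matching the abstract's description of a summary stitched from key segments in time sequence.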

Get Citation

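LI Qun, XIAO Fu, ZHANG Zi-Yi, ZHANG Feng, LI Yan-Chao. Video Summarization Based on Spacial-temporal Transform Network. Journal of Software, 2022, 33(9): 3195-3209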


History

Received: June 29, 2021
Revised: August 15, 2021
Online: February 22, 2022
Published: September 06, 2022

Copyright: Institute of Software, Chinese Academy of Sciences