Abstract: Video summarization is a critical task in computer vision whose goal is to generate a concise and complete video summary by selecting the most informative parts of a video. A generated summary is either a set of representative video frames (keyframes) or a short video formed by stitching key video segments together in temporal order. Although research on video summarization has made considerable progress, existing methods suffer from deficient temporal information and incomplete feature representation, which can compromise the correctness and completeness of a summary. To address these problems, this study proposes a model based on a spatiotemporal transform network that consists of three modules: an embedding layer, a feature transformation and fusion layer, and an output layer. Specifically, the embedding layer embeds spatial and temporal features simultaneously, the feature transformation and fusion layer transforms and fuses multi-modal features, and the output layer generates the video summary through segment prediction and key-shot selection. Embedding the spatial and temporal features separately remedies the deficient temporal information of existing models, while the transformation and fusion of multi-modal features resolves the incomplete feature representation. Extensive experiments and analyses on two benchmark datasets verify the effectiveness of the proposed model.
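As an illustration of the three-module organization described above, the sketch below outlines one possible PyTorch arrangement: separate spatial and temporal embeddings, a fusion stage over the combined multi-modal features, and an output head that scores frames for segment prediction and key-shot selection. The module names, dimensions, and layer choices are assumptions for exposition only, not the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and layer choices are assumptions
# based on the abstract, not the authors' actual architecture.
import torch
import torch.nn as nn


class SpatioTemporalSummarizer(nn.Module):
    def __init__(self, spatial_dim=1024, temporal_dim=1024, d_model=256, num_heads=4):
        super().__init__()
        # Embedding layer: separate projections for spatial and temporal features.
        self.spatial_embed = nn.Linear(spatial_dim, d_model)
        self.temporal_embed = nn.Linear(temporal_dim, d_model)
        # Feature transformation and fusion layer: self-attention over the
        # concatenated multi-modal representation.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=num_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Output layer: per-frame importance scores used for segment prediction
        # and key-shot selection.
        self.score_head = nn.Linear(2 * d_model, 1)

    def forward(self, spatial_feats, temporal_feats):
        # spatial_feats, temporal_feats: (batch, num_frames, feature_dim)
        s = self.spatial_embed(spatial_feats)
        t = self.temporal_embed(temporal_feats)
        fused = self.fusion(torch.cat([s, t], dim=-1))
        return torch.sigmoid(self.score_head(fused)).squeeze(-1)  # (batch, num_frames)


# Usage example: score 120 frames and keep the top 15% as key shots.
model = SpatioTemporalSummarizer()
spatial = torch.randn(1, 120, 1024)
temporal = torch.randn(1, 120, 1024)
scores = model(spatial, temporal)
keyshot_idx = scores[0].topk(int(0.15 * scores.shape[1])).indices
```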