Transductive Multi-Modality Video Semantic Concept Detection with Tensor Representation
Authors: Wu Fei, Liu Yanan, Zhuang Yueting
Funding:

Supported by the National Natural Science Foundation of China under Grant Nos. 60603096, 60533090; the National High-Tech Research and Development Plan of China (863 Program) under Grant No. 2006AA010107; the National Key Technology R&D Program of China under Grant No. 2007BAH11B01; the Program for Changjiang Scholars and Innovative Research Team in University of China (PCSIRT) under Grant No. IRT0652
    Abstract:

    This paper proposes a framework for video semantic analysis and understanding based on higher-order tensor representation. In this framework, each video shot is first represented as a 3rd-order tensor built from the multimodal data it contains: text, visual (image frames), and audio. Second, based on this 3rd-order tensor representation and on the temporally associated co-occurrence of the multimodal media data in video, a subspace embedding and dimensionality reduction method, called TensorShot, is designed; it explicitly considers the manifold structure of the tensor space. Because transductive learning starts from known samples to learn and recognize specific unknown samples, using a large amount of unlabeled data together with the labeled data, a transductive support tensor machine algorithm based on TensorShots is finally proposed to train an effective classifier. This algorithm preserves the intrinsic structure of the submanifold from which the TensorShots are sampled, maps out-of-sample data points directly into the manifold subspace, and exploits unlabeled samples to improve the learning performance of the classifier. Experimental results show that the method detects semantic concepts in video shots effectively.
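    The 3rd-order tensor representation and the subspace projection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how tensor entries are computed or how the projection matrices are learned, so the outer-product construction in `shot_tensor`, the feature sizes, and the random projection matrices are all assumptions made for illustration. Only mode-n unfolding and the mode-n product are standard multilinear operations.

```python
import numpy as np


def shot_tensor(visual, audio, text):
    """Combine three per-shot feature vectors into a 3rd-order tensor.

    ASSUMPTION: an outer product is used purely for illustration; the
    abstract does not say how the tensor entries are computed.
    """
    return np.einsum("i,j,k->ijk", visual, audio, text)


def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Mode-n product: apply `matrix` (J x I_mode) along axis `mode`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # new axis comes first
    return np.moveaxis(out, 0, mode)


def project(tensor, matrices):
    """Multilinear projection: one mode-n product per mode."""
    for mode, matrix in enumerate(matrices):
        tensor = mode_product(tensor, matrix, mode)
    return tensor


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical feature sizes for the visual, audio and text modalities.
    shot = shot_tensor(rng.random(8), rng.random(6), rng.random(5))
    # Random placeholders standing in for learned projection matrices.
    mats = [rng.random((3, 8)), rng.random((2, 6)), rng.random((2, 5))]
    print(shot.shape, "->", project(shot, mats).shape)
```

    Projecting each mode separately keeps the three modalities as distinct axes of the reduced tensor instead of concatenating them into one long vector, which is the usual motivation for tensor-based dimensionality reduction.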

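Cite this article

Wu F, Liu YN, Zhuang YT. Transductive multi-modality video semantic concept detection with tensor representation. Journal of Software, 2008,19(11):2853-2868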

History
  • Received: 2008-03-01
  • Last revised: 2008-08-26