Survey on Visual Question Answering

doi:10.13328/j.cnki.jos.006215

Journal of Software, 2021, Vol. 32, No. 8, pp. 2522-2544

Authors: BAO Xi-Gang (包希港), ZHOU Chun-Lai (周春来), XIAO Ke-Jing (肖克晶), QIN Biao (覃飙)

Affiliation: School of Information, Renmin University of China, Beijing 100872, China

About the authors:
BAO Xi-Gang (born 1997), male, Ph.D. candidate. Research interests: visual question answering, knowledge base question answering.
XIAO Ke-Jing (born 1991), female, Ph.D. candidate. Research interests: natural language processing, deep learning, data mining.
ZHOU Chun-Lai (born 1976), male, Ph.D., associate professor, CCF professional member. Research interests: uncertainty in artificial intelligence.
QIN Biao (born 1972), male, Ph.D., associate professor, doctoral supervisor, CCF professional member. Research interests: artificial intelligence, causal analysis, uncertain databases.

Corresponding author: QIN Biao, E-mail: qinbiao@ruc.edu.cn

Fund project: National Natural Science Foundation of China (61772534, 61732006)

Abstract:

Visual question answering (VQA) lies at the intersection of computer vision and natural language processing and has received extensive attention in recent years. In a VQA task, an algorithm must answer questions about a given image (or video). Since the first VQA dataset was released in 2014, several large-scale datasets have been released in the past five years, and a large number of algorithms have been proposed on top of them. Existing surveys have focused on summarizing the development of the VQA task. In recent years, however, studies have found that VQA models rely heavily on language bias and on the distribution of the datasets; in particular, since the VQA-CP dataset was released, the accuracy of many models has dropped sharply. This paper reviews in detail the algorithms proposed and the datasets released in recent years, with particular attention to research on strengthening the robustness of algorithms. It classifies and summarizes VQA algorithms and describes their motivation, details, and limitations. Finally, it discusses the challenges and prospects of VQA.

Key words: visual question answering; interdisciplinary direction; language bias; distribution of datasets; robustness

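Cite this article:

BAO Xi-Gang, ZHOU Chun-Lai, XIAO Ke-Jing, QIN Biao. Survey on visual question answering. Journal of Software, 2021, 32(8): 2522-2544 (in Chinese).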

History

Received: 2020-07-09
Last revised: 2020-10-02
Published online: 2021-01-15
Published: 2021-08-06
