Survey on Joint Modeling Algorithms for Spoken Language Understanding Based on Deep Learning

Authors:

WEI Pengfei (1991-), male, assistant experimentalist; his research interests include natural language understanding, task-oriented dialogue systems, and reinforcement learning. WANG Minghui (1974-), female, Ph.D., lecturer; her research interest is artificial intelligence. ZENG Bi (1963-), female, Ph.D., professor, doctoral supervisor, and CCF senior member; her research interests include artificial intelligence, intelligent human-computer interaction, and intelligent robots. ZENG An (1978-), female, Ph.D., professor, and CCF senior member; her research interests include artificial intelligence and machine learning.

Corresponding author:

ZENG Bi, E-mail: zb9215@gdut.edu.cn

Funding:

National Natural Science Foundation of China (61772143); Natural Science Foundation of Guangdong Province (2018A030313868); Major Special Project of Industry-University-Research Collaboration of Guangdong Province (2016B010108004)

    Abstract:

    Spoken language understanding (SLU) is one of the hot research topics in the field of natural language processing. It is applied in many fields such as personal assistants, intelligent customer service, human-computer dialogue, and healthcare. Spoken language understanding refers to converting the natural language input that a machine receives from a user into a semantic representation, and it mainly comprises two sub-tasks: intent detection and slot filling. At present, joint modeling of the intent detection and slot filling tasks with deep learning has become the mainstream approach and has achieved good results. Therefore, summarizing and analyzing deep-learning-based joint modeling algorithms for spoken language understanding is of great significance. This survey first introduces related work on applying deep learning to spoken language understanding, then analyzes existing studies in terms of the relationship between intent detection and slot filling, compares and summarizes the experimental results of different models, and finally presents future research directions and prospects.
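    To make the two sub-tasks concrete, the sketch below (not from the paper) shows the input/output format of SLU on an ATIS-style utterance: intent detection assigns one label to the whole utterance, while slot filling assigns one BIO tag per token. A trained joint model would share an encoder (e.g., a BiLSTM or BERT) between the two outputs; here a hypothetical keyword lexicon stands in for the learned parameters, purely to illustrate the task definition.

```python
# Toy SLU example: the lexicon below is a hypothetical stand-in for a
# trained model, used only to show what intent detection and slot
# filling produce for one utterance.
SLOT_LEXICON = {
    "boston": "B-fromloc.city_name",
    "denver": "B-toloc.city_name",
    "monday": "B-depart_date.day_name",
}

def toy_slu(utterance):
    """Return (intent label, BIO slot tags) for a whitespace-tokenized utterance."""
    tokens = utterance.lower().split()
    # Slot filling: a sequence-labeling task, one BIO tag per token.
    slots = [SLOT_LEXICON.get(tok, "O") for tok in tokens]
    # Intent detection: a classification task, one label per utterance.
    intent = "atis_flight" if "flight" in tokens else "atis_other"
    return intent, slots

intent, slots = toy_slu("show me a flight from Boston to Denver on Monday")
# intent: "atis_flight"
# slots:  O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name
```

    A joint model exploits the evident correlation between the two outputs: knowing the intent is `atis_flight` makes flight-related slots more likely, and vice versa, which is why joint modeling outperforms training the two tasks separately.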

Cite this article:

WEI Pengfei, ZENG Bi, WANG Minghui, ZENG An. Survey on joint modeling algorithms for spoken language understanding based on deep learning. Journal of Software, 2022, 33(11): 4192-4216 (in Chinese with English abstract).
History:
  • Received: 2020-08-04
  • Revised: 2021-04-16
  • Published online: 2021-08-02
  • Published in print: 2022-11-06