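Distant Speech Recognition Based on Knowledge Distillation and Generative Adversarial Network

Author biographies:

WU Long (1991-), male, born in Xinyang, Henan, bachelor's degree; main research area: far-field speech recognition. LI Ta (1982-), male, Ph.D., research professor; main research areas: speech recognition, speech signal processing, human-computer interaction, and sea-cloud computing. WANG Li (1985-), female, associate research professor; main research area: acoustic modeling for speech recognition. YAN Yonghong (1967-), male, Ph.D., research professor, doctoral supervisor, CCF professional member; main research areas: speech signal processing, auditory perception, human-computer interaction, and sea-cloud computing.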

Corresponding author:

LI Ta, E-mail: lita@hccl.ioa.ac.cn

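Funding:

National Natural Science Foundation of China (11590774, 11590770); Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1); IACAS Young Elite Researcher Project (QNYC201602)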


    Abstract:

    To make further use of near-field speech data for improving far-field speech recognition, this paper proposes an approach that combines knowledge distillation with a generative adversarial network. A multi-task learning framework is first introduced to train the acoustic model jointly with feature enhancement of the far-field speech. To strengthen acoustic modeling, the acoustic model trained on far-field data (the student model) is guided by an acoustic model trained on near-field data (the teacher model): the student is trained to mimic the teacher by minimizing the Kullback-Leibler divergence between their posterior distributions. To improve feature enhancement, a discriminator network is added to distinguish the enhanced features from real near-field (clean) features, and this adversarial multi-task training pushes the distribution of the enhanced features toward that of the clean features. On AMI single distant microphone data, the method reduces the word error rate (WER) relative to the baseline by 5.6% without speaker overlap and by 4.7% with speaker overlap. On AMI multi-channel distant microphone data, the relative WER reductions are 6.2% without overlap and 4.1% with overlap. On TIMIT, the method achieves a 7.2% relative WER reduction. To better illustrate the effect of the generative adversarial network on speech enhancement, the enhanced features are visualized, which further confirms the effectiveness of the method.
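
    The teacher-student objective described in the abstract can be illustrated with a minimal sketch. It assumes frame-parallel data (the teacher sees a near-field frame and the student sees the matching far-field frame), senone logits given as plain lists, and the KL(teacher || student) direction; the abstract does not state these details, and the function names and toy values below are hypothetical.

```python
import math

def softmax(logits):
    """Convert one frame of logits into a posterior distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits):
    """Average per-frame KL(teacher || student) over parallel frames.

    Assumption: frame i of the teacher (near-field) input is time-aligned
    with frame i of the student (far-field) input.
    """
    assert len(teacher_logits) == len(student_logits)
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t_frame), softmax(s_frame))
    return total / len(teacher_logits)

# Toy example: two frames, three output classes (illustrative values only).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_far = [[0.2, 0.1, 0.0], [0.0, 0.2, 0.1]]      # flat, unlike the teacher
student_close = [[1.9, 0.6, -0.9], [0.1, 1.4, 0.3]]   # close to the teacher

loss_far = distillation_loss(teacher, student_far)
loss_close = distillation_loss(teacher, student_close)
print(f"far student loss:   {loss_far:.4f}")
print(f"close student loss: {loss_close:.4f}")
```

    In this sketch the loss falls as the student posteriors approach the teacher posteriors and is zero when they coincide. In the multi-task setup the abstract describes, a term of this kind would be combined with the feature-enhancement and adversarial objectives; the weighting between them is not given here.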

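Cite this article

WU Long, LI Ta, WANG Li, YAN Yonghong. Distant Speech Recognition Based on Knowledge Distillation and Generative Adversarial Network. Journal of Software (软件学报), 2019, 30(S2): 25-34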

History
  • Received: 2019-07-15
  • Published online: 2020-01-02