End-to-end Speaker Diarization System Based on Audio-language Model

CLC number: TP37

Fund project: National Key Research and Development Program of China (2023YFF1204100)


    Abstract:

    The demand for multi-speaker speech transcription and speaker attribution in applications such as meeting minutes and customer service quality inspection is increasing. Recent advances in multimodal large language models have given rise to audio-language models (ALMs) that can interpret audio signals and natural-language prompts simultaneously within a unified autoregressive decoding framework, making them a natural fit for the speaker diarization task and offering a fresh approach to end-to-end multi-speaker audio transcription. This study proposes an end-to-end speaker diarization system based on an ALM and achieves synergistic optimization of speech-recognition and speaker-attribution capabilities via a two-stage training strategy, thus generalizing the capability of ALMs to specific downstream tasks. In the first stage, supervised fine-tuning (SFT) introduces a "speaker loss" into the standard cross-entropy objective, up-weighting the learning signal for sparse speaker-label tokens. In the second stage, a reinforcement-learning scheme based on group relative policy optimization (GRPO) is employed, with a reward function that jointly considers cpCER and SA-CER to break through the performance bottleneck of supervised learning. Experiments in a two-speaker setting compare the system with the open-source 3D-Speaker toolkit and the Diar Sortformer model, as well as the proprietary speaker diarization APIs from AssemblyAI and Microsoft Azure. Ablation studies further validate the training methodology, and the experiments are subsequently extended to a four-speaker scenario. Results demonstrate that the two-stage approach significantly improves both ASR and speaker-attribution performance in the two-speaker environment, whereas in the four-speaker setting, conventional SFT already yields substantial gains. Challenges such as resource consumption, input-length limitations, and cross-domain adaptation are also discussed, and future enhancements are proposed, including streaming audio encoders, curriculum learning, and rejection-sampling strategies. This study shows that ALMs hold great promise for multi-speaker diarization tasks but require additional technical advances to handle more complex acoustic scenarios.
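The first-stage "speaker loss" described above can be illustrated as a re-weighted token-level cross-entropy, in which negative log-likelihood terms for speaker-label tokens receive a larger weight than ordinary text tokens. The sketch below is a minimal NumPy illustration of this idea, not the authors' implementation; the function name, the default weight value, and the set of speaker-label token ids are all assumptions.

```python
import numpy as np

def speaker_weighted_ce(logits, targets, speaker_ids, speaker_weight=3.0):
    """Cross-entropy in which speaker-label tokens get extra weight.

    logits:  (T, V) unnormalized scores for T positions over a V-token vocab
    targets: (T,) ground-truth token ids
    speaker_ids: ids of the (sparse) speaker-label tokens to up-weight
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Per-token negative log-likelihood of the target token.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Up-weight positions whose target is a speaker-label token.
    weights = np.where(np.isin(targets, speaker_ids), speaker_weight, 1.0)
    return float((weights * nll).sum() / weights.sum())
```

With `speaker_weight=1.0` this reduces to the ordinary mean cross-entropy, so the weight directly controls how strongly the sparse speaker labels dominate the SFT gradient.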
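The second-stage reward couples cpCER and SA-CER, but the abstract does not give the exact formula, so the following is only a hedged sketch. It computes cpCER in the standard way — the minimum character error rate over all permutations of the per-speaker concatenated transcripts — and takes the SA-CER term as a precomputed input; the `alpha` mixing weight, the `1 - …` shaping, and the assumption of equal speaker counts in reference and hypothesis are all illustrative choices.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    # Standard character-level Levenshtein distance with a rolling row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cp_cer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation CER over speaker mappings.

    Assumes the reference and hypothesis have the same number of speakers.
    """
    spks = list(ref_by_spk)
    total_ref = sum(len(t) for t in ref_by_spk.values())
    best = min(
        sum(edit_distance(ref_by_spk[s], hyp_by_spk.get(m, ""))
            for s, m in zip(spks, perm))
        for perm in permutations(hyp_by_spk)
    )
    return best / max(total_ref, 1)

def reward(ref_by_spk, hyp_by_spk, sa_cer, alpha=0.5):
    # Hypothetical GRPO reward: higher is better, penalizing both metrics.
    return 1.0 - (alpha * cp_cer(ref_by_spk, hyp_by_spk) + (1 - alpha) * sa_cer)
```

Because cpCER minimizes over speaker permutations, the reward does not punish a rollout that merely swaps speaker labels consistently, while the SA-CER term still penalizes transcripts attributed to the wrong turns.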

Cite this article:

韦舒羽, 丘德来, 刘升平, 桑基韬. End-to-end speaker diarization system based on audio-language model. Journal of Software (软件学报), 2026, 37(5): 1903-1918.
History
  • Received: 2025-05-25
  • Revised: 2025-07-11
  • Online: 2025-09-23
  • Published: 2026-05-06