High-precision Symbolic Music Understanding Incorporating Structured Representation of Music Knowledge

CLC number: TP37

Fund project: National Natural Science Foundation of China (62201524)




    Abstract:

    Symbolic music understanding (SMU) is a crucial task in multimedia content understanding, aiming to extract multi-dimensional musical attributes such as melody, dynamics, composer style, emotion, and genre from symbolic music representations. Although existing approaches have substantially advanced dependency modeling for musical sequences, two critical challenges remain: (1) over-simplified representation: current methods typically flatten complex musical structures into linear symbolic sequences, discarding the multi-dimensional hierarchical information inherent in music; (2) lack of music-theory integration: purely data-driven sequence models struggle to incorporate systematic music-theory knowledge, limiting deep semantic understanding of music. To address these issues, this study proposes CNN-Midiformer, a high-precision symbolic music understanding model that incorporates structured representations of musical knowledge. First, the model constructs structured representations of musical knowledge and musical sequences based on music theory. Second, a complementary music-feature extraction module employs a convolutional neural network (CNN) to capture deep local features from the structured knowledge representations, while a Transformer encoder with self-attention captures deep semantic features from the musical sequences. Finally, a knowledge-adaptive feature-fusion module dynamically integrates the CNN-extracted knowledge features with the Transformer encoder's semantic features through an efficient cross-attention mechanism, enabling context-aware enhancement of the sequence representation. Comparative experiments on six public symbolic music datasets (Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL) demonstrate that CNN-Midiformer outperforms state-of-the-art methods on five benchmark downstream SMU tasks, namely melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, improving accuracy by 0.21–7.14 percentage points over baseline models.
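To make the pipeline described above concrete, the following sketch shows one plausible way to wire the three components together in PyTorch: a CNN branch over a per-step knowledge feature matrix, a Transformer encoder over the event-token sequence, and a cross-attention layer in which sequence features query the knowledge features. The class name CNNMidiformerSketch, the knowledge-matrix input format, and all hyperparameters are assumptions made for illustration; this is a minimal sketch, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of the two-branch design described in the
# abstract. All names, shapes, and hyperparameters are illustrative
# assumptions, NOT the paper's actual configuration.
import torch
import torch.nn as nn

class CNNMidiformerSketch(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, n_heads=4,
                 n_layers=4, knowledge_dim=32, n_classes=8):
        super().__init__()
        # Transformer branch: deep semantic features of the event sequence.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # CNN branch: deep local features of the structured knowledge
        # representation, assumed here to be a per-step feature matrix
        # (e.g. chord/scale/meter descriptors) of shape [B, T, knowledge_dim].
        self.cnn = nn.Sequential(
            nn.Conv1d(knowledge_dim, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU())
        # Fusion: sequence features attend to the knowledge features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens, knowledge):
        # tokens: [B, T] event ids; knowledge: [B, T, knowledge_dim]
        seq = self.encoder(self.token_emb(tokens))                 # [B, T, d]
        kn = self.cnn(knowledge.transpose(1, 2)).transpose(1, 2)   # [B, T, d]
        fused, _ = self.cross_attn(query=seq, key=kn, value=kn)    # [B, T, d]
        fused = self.norm(seq + fused)   # residual "adaptive enhancement"
        return self.head(fused.mean(dim=1))  # sequence-level logits

# Shape check with random inputs.
model = CNNMidiformerSketch()
logits = model(torch.randint(0, 512, (2, 128)), torch.randn(2, 128, 32))
print(logits.shape)  # torch.Size([2, 8])
```

For note-level tasks such as melody recognition or dynamics prediction, the mean-pooled classification head would presumably be replaced by a per-step head applied to the fused sequence directly; the cross-attention fusion itself would be unchanged.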

Cite this article:

黄恒焱, 邹逸, 时乐轩, 程皓楠, 叶龙. High-precision symbolic music understanding incorporating structured representation of music knowledge. 软件学报 (Journal of Software), 2026, 37(5): 1887–1902 (in Chinese).

History:
  • Received: 2025-05-26
  • Revised: 2025-07-11
  • Available online: 2025-09-23
  • Published: 2026-05-06