Abstract: Symbolic music understanding (SMU) is a crucial task in multimedia content understanding, aiming to extract multi-dimensional musical attributes such as melody, dynamics, compositional style, emotion, and genre from symbolic representations. Although existing approaches have substantially advanced dependency modeling in musical sequences, two critical challenges remain: (1) Simplified representation: current methods typically flatten complex musical structures into linear symbolic sequences, overlooking their inherent multi-dimensional hierarchical information; (2) Lack of music-theory integration: purely data-driven sequence models struggle to incorporate structured music-theory knowledge, limiting deep semantic understanding of music. To address these issues, this study proposes CNN-Midiformer, a high-precision symbolic music understanding model that integrates structured representations of musical knowledge. First, the model constructs structured representations of music theory and musical sequences based on domain knowledge. Second, a complementary music-feature extraction module employs a convolutional neural network (CNN) to capture deep local features from the structured musical-knowledge representations, while a Transformer encoder with self-attention captures deep semantic features from the musical sequences. Finally, a music-knowledge adaptive-enhancement feature-fusion module dynamically integrates the deep musical-knowledge features extracted by the CNN with the deep semantic features of the Transformer via an efficient cross-attention mechanism, thereby enhancing contextual sequence understanding and representation learning.
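The fusion step described above can be sketched in miniature. The snippet below is an illustrative NumPy toy, not the paper's implementation: it assumes single-head scaled dot-product cross-attention in which Transformer sequence features act as queries and CNN knowledge features act as keys/values, followed by a residual addition; all shapes and function names (`cross_attention_fuse`, `softmax`) are hypothetical choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(seq_feats, know_feats):
    """Fuse sequence features (queries) with knowledge features
    (keys/values) via scaled dot-product cross-attention,
    then add a residual connection (illustrative sketch)."""
    d_k = seq_feats.shape[-1]
    scores = seq_feats @ know_feats.T / np.sqrt(d_k)  # (T_seq, T_know)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    attended = weights @ know_feats                   # (T_seq, d)
    return seq_feats + attended                       # residual fusion

# toy shapes: 8 sequence tokens, 4 knowledge tokens, feature dim 16
rng = np.random.default_rng(0)
seq = rng.standard_normal((8, 16))    # stand-in for Transformer output
know = rng.standard_normal((4, 16))   # stand-in for CNN knowledge features
fused = cross_attention_fuse(seq, know)
```

The fused output keeps the sequence-token layout `(8, 16)`, so downstream task heads (e.g. classification) can consume it exactly as they would the plain Transformer features.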
Comparative experiments on six public symbolic-music datasets (Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL) demonstrate that CNN-Midiformer surpasses state-of-the-art methods across five benchmark downstream tasks: melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, achieving precision gains of 0.21–7.14 percentage points over baseline models.