[关键词]
[摘要]
在机器学习和模式识别任务中,选择一种合适的距离度量方法是至关重要的.度量学习主要利用判别性信息学习一个马氏距离或相似性度量.然而,大多数现有的度量学习方法都是针对数值型数据的,对于一些有结构的数据(比如符号型数据),用传统的距离度量来度量两个对象之间的相似性是不合理的;其次,大多数度量学习方法会受到维度的困扰,高维度使得训练时间长,模型的可扩展性差.提出了一种基于几何平均的混杂数据度量学习方法.采用不同的核函数将数值型数据和符号型数据分别映射到可再生核希尔伯特空间,从而避免了特征的高维度带来的负面影响.同时,提出了一个基于几何平均的多核度量学习模型,将混杂数据的度量学习问题转化为求黎曼流形上两个点的中心点问题.在UCI数据集上的实验结果表明,针对混杂数据的多核度量学习方法与现有的度量学习方法相比,在准确性方面展现出更优异的性能.
[Key word]
[Abstract]
How to choose a proper distance metric is vital to many machine learning and pattern recognition tasks. Metric learning mainly uses discriminant information to learn a Mahalanobis distance or similarity metric. However, most existing metric learning methods are for numerical data, and it is unreasonable to calculate the similarity between two heterogeneous objects (e.g., categorical data) using traditional distance metrics. Besides, they suffer from curse of dimensionality, resulting in poor efficiency and scalability when the feature dimension is very high. In this paper, a geometric mean metric learning method is proposed for heterogeneous data. The numerical data and categorical data are mapped to a reproducing kernel Hilbert space by using different kernel functions, thus avoiding the negative influence of the high dimensionality of the feature. At the same time, a multiple kernel metric learning model based on geometric mean is introduced to transform the metric learning problem of heterogeneous data into solving the midpoint between two points on the Riemannian manifold. Experiments on benchmark UCI datasets show that the presented method shows promising performances in terms of accuracy in comparison with the state-of-the-art metric learning methods.
[中图分类号]
[基金项目]
国家自然科学基金(61502332,61732011)