Abstract:Semi-Supervised clustering aims to partition the data points into different clusters based on the user-specified must-link and cannot-link constraints. The current semi-supervised clustering algorithms either modify the clustering methods or combine the metric learning approaches to adapt the clustering result as consistent with the pairwise constraints as possible. However, few of them try to explicitly compute the degrees of influence that each pairwise constraint exerts on the unconstrained data points. This paper proposes a semi-supervised clustering algorithm via a two-level random walk, which is composed of a lower-level random walk on vertices and a higher-level random walk on components. The lower-level random walk is responsible for computing the influence range of every vertex constrained by a pairwise constraint. This information is encapsulated in an intermediate structure called "component". The higher-level random walk further propagates the pairwise constraints on the components with adaptive strength, followed by the integration of all the constraint influence into a cluster indicating matrix. The experiments on UCI database and large real-world data sets demonstrate that, compared with other semi-supervised clustering algorithms, the proposed method not only produces more satisfactory clustering results but also exhibits good efficiency.