Kernel Subspace Clustering Algorithm for Categorical Data
Author: Xu Kunpeng, Chen Lifei, Sun Haojun, Wang Beizhan

Fund Project:

National Natural Science Foundation of China (U1805263, 61672157); Project of Science and Technology Bureau, Fujian Province (JK2017007); Program of Innovative Research Team of Fujian Normal University (IRTL1704)

    Abstract:

    Mainstream subspace clustering methods for categorical data currently rely on linear similarity measures and overlook the relationships between attributes. This study proposes an approach to clustering categorical data based on a novel kernel soft feature-selection scheme. First, categorical data are projected into a high-dimensional kernel space by means of a kernel function, and a similarity measure for categorical data in the kernel subspace is defined. Based on this measure, a kernel subspace clustering objective function is derived, and an optimization method is proposed to solve it. Finally, a kernel subspace clustering algorithm for categorical data is proposed. The algorithm takes the relationships between attributes into account and assigns each attribute a weight measuring its degree of relevance to the clusters, which enables automatic feature selection during clustering. A cluster validity index is also defined to evaluate categorical clusters. Experimental results on synthetic and real-world datasets demonstrate that the proposed method effectively uncovers the nonlinear relationships among attributes and improves both the performance and the efficiency of clustering.
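
    The idea of an attribute-weighted similarity measured in a kernel space can be illustrated with a minimal sketch. It is not the measure derived in the paper, whose formulas do not appear on this page. The function name `weighted_kernel_similarity`, the choice of a Gaussian kernel, the bandwidth `sigma`, and the fixed example weights are all assumptions made for illustration. The sketch relies on one known fact: a weighted count of attribute mismatches equals a squared Euclidean distance between suitably scaled one-hot encodings, so passing it through a Gaussian gives a valid kernel when the weights are non-negative.

```python
import math


def weighted_kernel_similarity(x, y, weights, sigma=1.0):
    """Gaussian-kernel similarity between two categorical objects.

    Hypothetical illustration, not the paper's formula. Each attribute
    contributes a simple-matching mismatch (0 or 1) scaled by its weight;
    the weighted mismatch total d is mapped to exp(-d / (2 * sigma**2)).
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Weighted simple-matching distance over the attributes.
    d = sum(w * (a != b) for a, b, w in zip(x, y, weights))
    return math.exp(-d / (2.0 * sigma ** 2))


if __name__ == "__main__":
    a = ("red", "small", "round")
    b = ("red", "large", "round")
    w = [0.5, 0.3, 0.2]  # assumed attribute weights, here fixed by hand
    print(weighted_kernel_similarity(a, a, w))  # identical objects give 1.0
    print(weighted_kernel_similarity(a, b, w))  # one mismatch lowers the value
```

    In the approach described by the abstract, the attribute weights are learned per cluster during optimization rather than fixed by hand as they are here.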

Get Citation

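Xu KP, Chen LF, Sun HJ, Wang BZ. Kernel subspace clustering algorithm for categorical data. Ruan Jian Xue Bao/Journal of Software, 2020,31(11):3492-3505 (in Chinese with English abstract).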

History
  • Received: January 10, 2018
  • Revised: May 16, 2018
  • Online: November 07, 2020
  • Published: November 06, 2020