Kernel Subspace Clustering Algorithm for Categorical Data

doi:10.13328/j.cnki.jos.005819


Volume 31, Issue 11, 2020, pp. 3492-3505

Author:
XU Kun-Peng
College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, China; Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring (Fujian Normal University), Fuzhou 350117, China
CHEN Li-Fei
College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, China; Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring (Fujian Normal University), Fuzhou 350117, China
SUN Hao-Jun
College of Engineering, Shantou University, Shantou 515063, China
WANG Bei-Zhan
College of Software, Xiamen University, Xiamen 361005, China
Fund Project: National Natural Science Foundation of China (U1805263, 61672157); Project of Science and Technology Bureau, Fujian Province (JK2017007); Program of Innovative Research Team of Fujian Normal University (IRTL1704)


Abstract:

Mainstream subspace clustering methods for categorical data currently depend on linear similarity measures and overlook the relationships between attributes. This study proposes an approach for clustering categorical data based on a novel kernel soft feature-selection scheme. First, the categorical data are projected into a high-dimensional kernel space by introducing a kernel function, and a similarity measure for categorical data in the kernel subspace is defined. Based on this measure, a kernel subspace clustering objective function is derived, and an optimization method is proposed to solve it. Finally, a kernel subspace clustering algorithm for categorical data is proposed. The algorithm takes the relationships between attributes into account and assigns each attribute a weight measuring its degree of relevance to the clusters, which enables automatic feature selection during clustering. A cluster validity index is also defined to evaluate categorical clusters. Experimental results on synthetic and real-world datasets demonstrate that the proposed method effectively uncovers nonlinear relationships among attributes and improves clustering performance and efficiency.
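As a rough illustration of what an attribute-weighted, kernel-induced similarity for categorical objects can look like, the sketch below may help. It is not the measure derived in the paper, whose formulas are not given in the abstract. The Gaussian kernel on one-hot indicator vectors, the function names `value_kernel` and `weighted_kernel_similarity`, the bandwidth `sigma`, and the weight values are all assumptions chosen for illustration.

```python
import math


def value_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two category symbols (illustrative assumption).

    Each symbol is treated as a one-hot indicator vector, so the squared
    Euclidean distance is 0 for equal symbols and 2 for different ones.
    """
    sq_dist = 0.0 if a == b else 2.0
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def weighted_kernel_similarity(x, y, weights, sigma=1.0):
    """Attribute-weighted similarity: sum over attributes d of w_d * k(x_d, y_d)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w * value_kernel(a, b, sigma) for a, b, w in zip(x, y, weights))


# Two toy objects that agree on the first two attributes only.
x = ("red", "round", "small")
y = ("red", "round", "large")

uniform = [1 / 3, 1 / 3, 1 / 3]   # every attribute equally relevant
focused = [0.45, 0.45, 0.10]      # third attribute treated as nearly irrelevant

sim_uniform = weighted_kernel_similarity(x, y, uniform)
sim_focused = weighted_kernel_similarity(x, y, focused)
print(sim_uniform, sim_focused)
```

Down-weighting the attribute on which the two objects disagree raises their similarity. That is the intuition behind soft feature selection: attributes with low relevance to a cluster contribute little to the measure.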

Key words: clustering; categorical data; kernel method; nonlinear measure; subspace


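Citation:

XU Kun-Peng, CHEN Li-Fei, SUN Hao-Jun, WANG Bei-Zhan. Kernel subspace clustering algorithm for categorical data. Ruan Jian Xue Bao/Journal of Software, 2020,31(11):3492-3505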


History

Received: January 10, 2018
Revised: May 16, 2018
Online: November 07, 2020
Published: November 06, 2020
