Spectral Clustering Based Unsupervised Feature Selection Algorithms

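About the authors:

Xie Juanying (谢娟英, 1971-), female, born in Xi'an, Shaanxi, Ph.D., professor, doctoral supervisor, CCF senior member; her main research areas are machine learning, data mining, and biomedical data analysis. Wang Mingzhao (王明钊, 1990-), male, Ph.D. candidate; his main research areas are data mining and bioinformatics. Ding Lijuan (丁丽娟, 1994-), female, master's student; her main research areas are machine learning and data mining.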

Corresponding author:

xiejuany@snnu.edu.cn

CLC number (Chinese Library Classification):

TP181




Fund Project:

National Natural Science Foundation of China (61673251); Key Projects of Science and Technology Research in Shaanxi Province (2018ZDXMSF-079); National Key Research and Development Program of China (2016YFC0901900); Scientific and Technological Achievements Transformation and Cultivation Funds of Shaanxi Normal University (GK201806013); Fundamental Research Funds for the Central Universities (GK201701006); Innovation Funds of Graduate Programs at Shaanxi Normal University (2015CXS028, 2016CSY009, 2018TS078)

Abstract:

Gene expression data typically comprise a small number of samples described by tens of thousands of genes, many of which are unrelated to the disease under study. Feature selection is therefore the first step in analyzing such data. Common feature selection methods require labeled data, but sample labels are often difficult to obtain. To address feature selection for gene expression data, this paper proposes an unsupervised feature selection approach named FSSC (feature selection by spectral clustering). FSSC applies spectral clustering to all features so that highly similar features fall into the same cluster. It defines the discernibility and the independence of a feature, measures feature importance as the product of the two, and selects representative features from each feature cluster to construct the feature subset. Depending on the spectral clustering algorithm used, three unsupervised feature selection algorithms are obtained: FSSC-SD (FSSC based on standard deviation), FSSC-MD (FSSC based on mean distance), and FSSC-ST (FSSC based on self-tuning). SVM (support vector machine) and KNN (K-nearest neighbors) classifiers are used to evaluate the selected feature subsets on 10 gene expression datasets. The experimental results show that FSSC-SD, FSSC-MD, and FSSC-ST can all select feature subsets with strong classification ability.
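The selection step described in the abstract (score each feature by discernibility times independence, then keep one representative per feature cluster) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration rather than the paper's definitions: discernibility is taken as the population standard deviation of a feature across samples, independence as one minus the mean absolute Pearson correlation with all other features, and the cluster labels are assumed to come from some earlier spectral clustering step. The function names `pearson` and `select_representatives` are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_representatives(features, labels):
    """Return one feature index per cluster: the one with the largest importance.

    features: list of feature vectors (one list of sample values per feature).
    labels:   cluster label of each feature, assumed to come from a prior
              spectral clustering of the features.
    """
    n = len(features)
    # ASSUMPTION: discernibility = spread of the feature across samples.
    discernibility = [pstdev(f) for f in features]
    # ASSUMPTION: independence = 1 - mean |correlation| with every other feature.
    independence = []
    for i in range(n):
        others = [abs(pearson(features[i], features[j])) for j in range(n) if j != i]
        independence.append(1.0 - mean(others) if others else 1.0)
    # Importance is the product of the two quantities, as stated in the abstract.
    importance = [d * r for d, r in zip(discernibility, independence)]

    best = {}  # cluster label -> index of the best feature seen so far
    for idx, lab in enumerate(labels):
        if lab not in best or importance[idx] > importance[best[lab]]:
            best[lab] = idx
    return sorted(best.values())


# Toy data: 4 features measured on 5 samples, grouped into 2 feature clusters.
features = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],           # same pattern as feature 0, larger spread
    [5, 1, 4, 2, 3],
    [2.5, 0.5, 2.0, 1.0, 1.5],  # same pattern as feature 2, smaller spread
]
labels = [0, 0, 1, 1]
print(select_representatives(features, labels))  # prints [1, 2]
```

In this toy case the two features in each cluster are perfectly correlated, so they share the same assumed independence and the one with the larger spread is kept. Swapping in other definitions of discernibility or independence only requires changing the two lines marked as assumptions.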

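Cite this article:

Xie Juanying, Ding Lijuan, Wang Mingzhao. Spectral clustering based unsupervised feature selection algorithms. Journal of Software (软件学报), 2020, 31(4): 1009-1024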

History
  • Received: 2019-05-31
  • Last revised: 2019-07-29
  • Published online: 2020-01-14
  • Published: 2020-04-06