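Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms
Authors: Xie Juanying, Gao Hongchao
Funding: National Natural Science Foundation of China (31372250); Fundamental Research Funds for the Central Universities (GK201102007); Science and Technology Key Project of Shaanxi Province (2013K12-03-24)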
    Abstract:

    Selecting a small subset of distinguishable genes from cancer gene expression data is difficult because such datasets contain few samples but tens of thousands of genes. This paper proposes novel hybrid gene selection algorithms based on statistical correlation and K-means. First, the Pearson correlation coefficient and the Wilcoxon rank-sum test are each used to measure the correlation between every gene and the class label; the least relevant genes are filtered out, and about 10 percent of the genes, those most correlated with the class label, are kept as the pre-selected gene subset. Next, K-means groups the highly correlated genes of the pre-selected subset into the same cluster, an SVM classifier is trained, and the weight of each gene is computed from the coefficients of the SVM. One representative gene is then chosen from each cluster, either the gene with the largest weight or, under a roulette wheel strategy, the gene with the most votes; the representatives of all clusters form the distinguishable gene subset. To verify the effectiveness of the proposed algorithms, a baseline named Random, which picks the representative gene of each cluster at random, is also evaluated. The proposed algorithms are compared with Random, with Guyon's classic gene selection algorithm SVM-RFE, and with SVM-SFS, a gene selection algorithm that uses a sequential forward search strategy. Results averaged over 200 runs on several classic gene expression datasets show that the proposed algorithms select gene subsets with very good discriminative power, and that classifiers built on these subsets achieve very high classification accuracy.
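    The pipeline described in the abstract (filter by correlation with the class label, cluster the surviving genes, keep one representative per cluster) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, only the Pearson filter is shown (the rank-sum variant is omitted), the per-gene weights are passed in by the caller rather than derived from a trained SVM, the z-scoring and farthest-first seeding of K-means are choices made here for determinism, and the roulette wheel step is one plausible reading of "the gene with the most votes".

```python
import math
import random
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def prefilter(genes, labels, keep):
    """Stage 1: indices of the `keep` genes with the largest |r| to the labels."""
    ranked = sorted(range(len(genes)), key=lambda g: -abs(pearson(genes[g], labels)))
    return ranked[:keep]

def zscore(x):
    """Standardize a profile so Euclidean distance reflects correlation."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / s if s > 0 else 0.0 for a in x]

def kmeans(points, k, iters=50):
    """Stage 2: Lloyd's K-means with deterministic farthest-first seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def pick_max_weight(cluster, weights):
    """Stage 3a: the gene with the largest weight represents the cluster."""
    return max(cluster, key=lambda g: weights[g])

def pick_by_roulette(cluster, weights, rounds=100, seed=0):
    """Stage 3b (assumed reading): weight-proportional draws, most votes wins."""
    rng = random.Random(seed)
    draws = rng.choices(cluster, weights=[weights[g] for g in cluster], k=rounds)
    return Counter(draws).most_common(1)[0][0]

def select_genes(genes, labels, weights, keep, k, picker=pick_max_weight):
    """Run the three stages and return the sorted representative gene indices."""
    kept = prefilter(genes, labels, keep)
    assign = kmeans([zscore(genes[g]) for g in kept], k)
    clusters = [[g for g, a in zip(kept, assign) if a == c] for c in range(k)]
    return sorted(picker(c, weights) for c in clusters if c)

# Toy data: 6 genes (rows) measured on 6 samples, 3 per class.
labels = [0, 0, 0, 1, 1, 1]
genes = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # up in class 1
    [1.1, 1.3, 1.0, 3.2, 3.0, 3.1],  # up in class 1, redundant with gene 0
    [3.0, 2.9, 3.1, 1.0, 1.1, 0.9],  # down in class 1
    [2.9, 3.1, 3.0, 1.2, 1.0, 1.1],  # down in class 1, redundant with gene 2
    [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],  # uncorrelated with the label
    [1.5, 2.5, 2.0, 2.0, 1.5, 2.5],  # uncorrelated with the label
]
# Hypothetical non-negative gene weights; in the paper they come from an SVM.
weights = [0.9, 0.4, 0.3, 0.8, 0.1, 0.2]

selected = select_genes(genes, labels, weights, keep=4, k=2)
roulette = select_genes(genes, labels, weights, keep=4, k=2, picker=pick_by_roulette)
print(selected, roulette)
```

    On the toy data the filter drops the two uncorrelated genes, K-means separates the up-regulated pair from the down-regulated pair, and each pair contributes a single representative, so redundant genes do not both enter the final subset. The roulette variant requires non-negative weights with a positive sum inside every cluster.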

    References
    [1] Li ST, Wu XX, Hu XY. Gene selection using genetic algorithm and support vectors machines. Soft Computing, 2008,12(7): 693~698. [doi: 10.1007/s00500-007-0251-2]
    [2] Guyon I, Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research, 2003,3: 1157~1182.
    [3] Zhang JY, Wang YJ, Khan J, Clarke R. Gene selection in class space. Science in China (Series E), 2003,33(12):1125~1137 (in Chinese).
    [4] Li YX, Li JG, Ruan XG. Study of informative gene selection for tissue classification based on tumor gene expression profiles. Chinese Journal of Computers, 2006,29(2):324~330 (in Chinese with English abstract).
    [5] Xie JY, Xie WX. Several feature selection algorithms based on the discernibility of a feature subset and support vector machines. Chinese Journal of Computers, 2014,37(8):1704~1718 (in Chinese with English abstract).
    [6] Ding C, Peng HC. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 2005,3(2):185~205. [doi: 10.1142/S0219720005001004]
    [7] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning, 2002,46(1-3):389~422. [doi: 10.1023/A:1012487302797]
    [8] Niijima S, Kuhara S. Recursive gene selection based on maximum margin criterion: A comparison with SVM-RFE. BMC Bioinformatics, 2006,7(1):543. [doi: 10.1186/1471-2105-7-543]
    [9] Wang YH, Makedon FS, Ford JC, Pearlman J. HykGene: A hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics, 2005,21(8):1530~1537. [doi: 10.1093/bioinformatics/bti192]
    [10] Han JW, Kamber M. Data Mining: Concepts and Techniques. 2nd ed., San Francisco: Morgan Kaufmann Publishers, 2006. 383~386.
    [11] Deng L, Pei J, Ma JW, Lee DL. A rank sum test method for informative gene discovery. In: Elder J, ed. Proc. of the 10th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining. Seattle: ACM Press, 2004. 410~419. [doi: 10.1145/1014052.1014099]
    [12] Weston J, Elisseeff A, Schölkopf B, Tipping M. Use of the zero norm with linear models and kernel methods. The Journal of Machine Learning Research, 2003,3:1439~1461.
    [13] Song QB, Ni JJ, Wang GT. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans. on Knowledge and Data Engineering, 2013,25(1):1~14. [doi: 10.1109/TKDE.2011.181]
    [14] Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 2004, 5:1205~1224.
    [15] Loscalzo S, Yu L, Ding C. Consensus group stable feature selection. In: Elder J, ed. Proc. of the 15th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining. Paris: ACM Press, 2009. 567~576. [doi: 10.1145/1557019.1557084]
    [16] MacQueen JB. Some methods for classification and analysis of multivariate observations. In: LeCam LM, Neyman J, eds. Proc. of the 5th Berkeley Symp. on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967. 281~297.
    [17] Huang ZX. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998,2(3):283~304. [doi: 10.1023/A:1009769707641]
    [18] Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Haussler D, ed. Proc. of the 5th Annual Workshop on Computational Learning Theory. New York: ACM Press, 1992. 144~152. [doi: 10.1145/130385.130401]
    [19] Vapnik VN. Statistical Learning Theory. New York: John Wiley & Sons, 1998.
    [20] Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. United Kingdom: Cambridge University Press, 2000.
    [21] Huang JZ, Ng MK, Rong HQ, Li ZC. Automated variable weighting in k-means type clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2005,27(5):657~668. [doi: 10.1109/TPAMI.2005.95]
    [22] Witten IH, Frank E, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed., United States: Morgan Kaufmann Publishers, 2010. 147~187.
    [23] Tang YC, Zhang YQ, Huang Z. Development of two-stage SVM-RFE gene selection strategy for microarray expression data analysis. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 2007,4(3):365~381. [doi: 10.1109/TCBB.2007.70224]
    [24] Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. 3rd ed., Cambridge: MIT Press, 2009. 1~14.
    [25] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286(5439):531~537. [doi: 10.1126/science.286.5439.531]
    [26] Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. of the National Academy of Sciences, 1999, 96(12):6745~6750. [doi: 10.1073/pnas.96.12.6745]
    [27] Notterman DA, Alon U, Sierk AJ, Levine AJ. Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research, 2001,61(7):3124~3130.
    [28] Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Trans. on Intelligent Systems and Technology (TIST), 2011,2(3):27.
    [29] Liu J, Iba H, Ishizuka M. Selecting informative genes with parallel genetic algorithms in tissue classification. In: Proc. of the Genome Informatics Series. 2001. 14~23.
    [30] Dettling M, Bühlmann P. Boosting for tumor classification with gene expression data. Bioinformatics, 2003,19(9):1061~1069. [doi: 10.1093/bioinformatics/btf867]
    [31] Krishnapuram B, Carin L, Hartemink A. Gene expression analysis: Joint feature selection and classifier design. In: Proc. of the Kernel Methods in Computational Biology. 2004. 299~317.
    [32] Chao S, Lihui C. Feature dimension reduction for microarray data analysis using locally linear embedding. In: Bajic VB, ed. Proc. of the 3rd Asia-Pacific Bioinformatics Conf. 2005. 211~218.
    [33] Model F, Adorjan P, Olek A, Piepenbrock C. Feature selection for DNA methylation based cancer classification. Bioinformatics, 2001,17(Suppl. 1):S157~S164. [doi: 10.1093/bioinformatics/17.1.1]
    [34] Yu L, Han Y, Berens ME. Stable gene selection from microarray data via sample weighting. IEEE/ACM Trans. on Computational Biology and Bioinformatics (TCBB), 2012,9(1):262~272. [doi: 10.1109/TCBB.2011.47]
    [35] Han Y, Yu L. A variance reduction framework for stable feature selection. In: Kotagiri R, ed. Proc. of the IEEE Int’l Conf. on Data Mining (ICDM 2010). Sydney: IEEE,2010. 206~215. [doi: 10.1109/ICDM.2010.144]
Cite this article

Xie JY, Gao HC. Statistical correlation and K-means based distinguishable gene subset selection algorithms. Journal of Software, 2014,25(9):2050-2075
History
  • Received: 2014-04-08
  • Last revised: 2014-05-14
  • Published online: 2014-09-09