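[Keywords]
[Abstract]
To address the need for human intervention when selecting the feature subset in FSIP (feature selection based on information gain and Pearson correlation coefficient), a fully adaptive 2D feature selection algorithm based on the discernibility matrix, DFSIP (discernibility based FSIP), is proposed. DFSIP discovers the feature subset fully adaptively: at each step it selects the most important of the current features, uses that feature to reduce the discernibility matrix and eliminate redundant features, and in the end obtains the optimal feature subset adaptively. A K-ELM classifier is built on the optimal feature subset to evaluate its class discrimination ability. Experiments on gene datasets, together with comparisons against FSIP, mRMR, LLE Score, DRJMIM, AVC, and AMID and statistical significance tests, show that DFSIP automatically selects feature subsets with stronger discriminative ability, and that classifiers built on these subsets achieve good classification performance.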
[Keywords]
[Abstract]
The FSIP (feature selection based on information gain and Pearson correlation coefficient) algorithm requires a human to determine the borderline used to detect the feature subset. To overcome this limitation, this study proposes a fully adaptive 2D feature selection algorithm based on the discernibility matrix, referred to as DFSIP (discernibility based FSIP). DFSIP introduces the discernibility matrix into the feature selection process of FSIP. It first initializes the candidate feature set with all features and constructs the initial discernibility matrix. It then detects the most significant feature in the current candidate feature set, adds it to the feature subset, and uses it to reduce the discernibility matrix. Next, the candidate feature set is updated to the union of the cells of the reduced discernibility matrix. These detection, reduction, and update steps repeat until no feature is left in the candidate feature set. DFSIP is tested on well-known gene expression datasets and compared with the popular feature selection algorithms FSIP, mRMR, LLE Score, DRJMIM, AVC, and AMID. The comparison is made on the performance of the K-ELM classifiers built from the feature subsets that each algorithm detects. In addition, statistical significance tests are carried out to verify whether DFSIP differs significantly from FSIP and from the other compared algorithms. The experimental results demonstrate that DFSIP is superior to the compared algorithms and differs significantly from LLE Score, DRJMIM, and AMID. Although the difference between DFSIP and FSIP is not statistically significant, DFSIP outperforms FSIP. It can be concluded that DFSIP fully adaptively detects a feature subset with sound classification capability.
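The selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define how feature significance is computed or how discernibility matrix cells are built for gene expression data, so the sketch makes two assumptions: the per-feature significance is supplied as a precomputed list `score` (hypothetical), and features are discrete, with each cell holding the indices of the features on which two samples of different classes differ. The function names are also hypothetical.

```python
from itertools import combinations


def build_discernibility_cells(X, y):
    """Return one cell per pair of samples with different labels.

    Assumption: a cell is the set of feature indices on which the two
    samples take different (discrete) values. Empty cells are dropped.
    """
    cells = []
    for i, j in combinations(range(len(X)), 2):
        if y[i] != y[j]:
            cell = frozenset(
                k for k, (a, b) in enumerate(zip(X[i], X[j])) if a != b
            )
            if cell:
                cells.append(cell)
    return cells


def select_features(X, y, score):
    """Greedy selection driven by a discernibility matrix.

    score[k] is an assumed, precomputed significance of feature k.
    """
    cells = build_discernibility_cells(X, y)
    candidates = set(range(len(X[0])))  # start from all features
    selected = []
    while candidates:
        # Most significant candidate; ties go to the lowest index.
        best = max(sorted(candidates), key=lambda k: score[k])
        selected.append(best)
        # Reduce the matrix: drop every cell the chosen feature covers.
        cells = [c for c in cells if best not in c]
        # New candidates are the union of the remaining cells.
        candidates = set().union(*cells)
    return selected


# Toy data: feature 0 separates the two classes on its own.
X1 = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
y1 = [0, 0, 1, 1]
print(select_features(X1, y1, [0.9, 0.1, 0.5]))  # [0]

# Toy data: both features are needed to separate sample 0 from the rest.
X2 = [[0, 0], [0, 1], [1, 0]]
y2 = [0, 1, 1]
print(select_features(X2, y2, [0.9, 0.1]))  # [0, 1]
```

The loop stops without any user-set threshold: once every cell has been covered by a selected feature, the union of the remaining cells is empty and so is the candidate set.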
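[CLC number]
[Funding]
National Natural Science Foundation of China (62076159, 61673251, 12031010); National Key Research and Development Program of China (2016YFC0901900); Fundamental Research Funds for the Central Universities (GK202105003); Innovation Funds for Graduate Training (2016CSY009, 2018TS078)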