Hybrid Feature Selection Algorithm Combining Information Gain Ratio and Genetic Algorithm
Author: Xu Zhaozhao, Shen Derong, Nie Tiezheng, Kou Yue
Affiliation:

Abstract:

    In recent years, information technology and electronic medical records have been adopted ever more widely in medical institutions, producing large volumes of medical data in hospital databases. Decision trees are widely used in medical data analysis because of their high classification precision, fast computation, and simple, easily understood classification rules. However, the inherently high-dimensional feature space and high feature redundancy of medical data lower the classification precision of traditional decision trees. To address this, this paper proposes a hybrid feature selection algorithm (GRRGA) that combines information gain ratio ranking and grouping with a group evolution genetic algorithm. First, a filter algorithm based on the information gain ratio sorts the original feature set; then, the ranked features are grouped according to an equal-density partition principle; finally, a group evolution genetic algorithm searches over the ranked feature groups. The group evolution genetic algorithm uses two kinds of evolution, in-population and out-population, each controlled by a different fitness function. Experimental results show that the average precision of the GRRGA algorithm on six UCI datasets is 87.13%, significantly better than that of traditional feature selection algorithms. In addition, compared with two other classification algorithms, the feature selection performance of the proposed GRRGA algorithm is the best. More importantly, the precision of the bagging method on the arrhythmia and cancer medical datasets is 84.7% and 78.7%, respectively, which demonstrates the practical value of the proposed algorithm.
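    The filter stage of GRRGA (ranking features by information gain ratio and then partitioning the ranking into groups) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration assuming discrete-valued features and NumPy; the names gain_ratio and rank_and_group are not from the paper, the equal-density grouping is approximated by equal-sized groups over the ranking, and the group evolution genetic algorithm with its in-population and out-population fitness functions is not reproduced here.

    # Hypothetical sketch of the filter stage of a GRRGA-style pipeline:
    # rank features by information gain ratio, then split the ranking into groups.
    import numpy as np

    def entropy(labels):
        """Shannon entropy of a discrete label vector."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def gain_ratio(feature, labels):
        """Information gain ratio of one discrete feature w.r.t. the class labels."""
        h_y = entropy(labels)
        values, counts = np.unique(feature, return_counts=True)
        weights = counts / counts.sum()
        # Conditional entropy H(Y | X).
        h_y_given_x = sum(w * entropy(labels[feature == v])
                          for v, w in zip(values, weights))
        split_info = entropy(feature)   # intrinsic information of the split
        if split_info == 0:             # feature takes a single value
            return 0.0
        return (h_y - h_y_given_x) / split_info

    def rank_and_group(X, y, n_groups):
        """Sort feature indices by gain ratio (descending) and split the ranking
        into n_groups groups; the genetic search would then act on these groups."""
        scores = np.array([gain_ratio(X[:, j], y) for j in range(X.shape[1])])
        ranked = np.argsort(-scores)
        return np.array_split(ranked, n_groups)

    # Toy usage on a small synthetic discrete dataset.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.integers(0, 3, size=(100, 8))      # 8 discrete features
        y = (X[:, 0] + X[:, 3] > 2).astype(int)    # class depends on features 0 and 3
        for i, group in enumerate(rank_and_group(X, y, n_groups=4)):
            print(f"group {i}: features {group.tolist()}")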

Get Citation

Xu ZZ, Shen DR, Nie TZ, Kou Y. Hybrid feature selection algorithm combining information gain ratio and genetic algorithm. Ruan Jian Xue Bao/Journal of Software, 2022, 33(3): 1128-1140 (in Chinese with English abstract).

History
  • Received: January 23, 2020
  • Revised: March 09, 2020
  • Online: March 11, 2022
  • Published: March 06, 2022