Abstract:
To analyze gene expression datasets that have tens of thousands of features but very few samples, this paper proposes FSIP (feature selection based on information gain and Pearson correlation coefficient), a new adaptive feature selection algorithm based on a 2D space. FSIP uses the information gain of a feature to measure its discernibility and Pearson correlation coefficients to evaluate its independence. To detect features that are both highly discernible and highly independent, and to balance the two properties, the importance of a feature is defined as the product of its discernibility and its independence. Features whose importance is much higher than that of the remaining features are selected to form the feature subset. A kernel extreme learning machine (K-ELM) is adopted as the classifier to measure the classification capability of the selected features. FSIP was tested on several popular gene expression datasets and compared with several well-known feature selection algorithms, including SVM-RFE, DRJMIM, mRMR, LLE Score, AMID and AVC, and statistical significance tests were conducted as well. The experimental results show that FSIP detects feature subsets with stronger classification capability than the compared algorithms, and that the K-ELM classifier built on the selected feature subset achieves very good classification performance.
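The importance score described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so two choices below are assumptions made only for illustration: each feature is discretized by a median split before its information gain is computed, and the independence of a feature is taken as one minus its largest absolute Pearson correlation with any other feature. The function names are hypothetical, and the adaptive selection step in the 2D space is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Discernibility: information gain of one feature.

    Assumption: the feature is binarized by a median split.
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    gain = entropy(labels)
    for side in (False, True):
        subset = [lab for v, lab in zip(values, labels) if (v > median) == side]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b) if std_a and std_b else 0.0


def importance_scores(features, labels):
    """Return (discernibility, independence, importance) lists.

    `features` is a list of feature columns, each as long as `labels`.
    Assumption: independence = 1 - max |r| against all other features.
    """
    disc = [information_gain(f, labels) for f in features]
    indep = []
    for i, fi in enumerate(features):
        strongest = max(
            (abs(pearson(fi, fj)) for j, fj in enumerate(features) if j != i),
            default=0.0,
        )
        indep.append(1.0 - min(strongest, 1.0))  # clamp floating-point overshoot
    importance = [d * s for d, s in zip(disc, indep)]
    return disc, indep, importance
```

Under this stand-in definition, two perfectly correlated features both receive an independence close to zero, so neither is ranked highly however discernible they are. A definition that compares a feature only against more discernible features would keep one of the two.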