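Journal of Shaanxi Normal University (Natural Science Edition)

Special Topic: Data Mining

An adaptive 2D feature selection algorithm based on information gain and Pearson correlation coefficient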

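XIE Juanying*, WU Zhaozhong, ZHENG Qingquan

(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)

XIE Juanying, female, professor and doctoral supervisor. Her research interests include machine learning, data mining, and biomedical data analysis. E-mail: xiejuany@snnu.edu.cn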

Abstract:

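To address the high dimensionality and small sample size of gene expression data, this paper proposes FSIP (feature selection based on information gain and Pearson correlation coefficient), a 2D adaptive feature selection algorithm. The information gain of a feature measures the amount of information the feature carries, feature discernibility is defined to measure a feature's discriminative power, and feature independence is defined using the Pearson correlation coefficient. To select features that are both highly discernible and highly independent, and to balance the contributions of discernibility and independence to classification, feature importance is defined as the product of the two, and the features whose importance is far higher than that of the remaining features are adaptively selected to form the feature subset. The kernel extreme learning machine (K-ELM) is used as the classifier to evaluate the classification performance of the selected feature subset. Experiments on gene datasets, comparisons with the classical feature selection algorithms SVM-RFE, DRJMIM, mRMR, LLE Score, AMID, and AVC, and statistical significance tests show that the proposed FSIP algorithm selects feature subsets with strong classification ability, and that K-ELM built on the selected subsets achieves good classification performance.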
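The scoring idea in the abstract (importance as the product of discernibility and independence) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so several choices below are assumptions rather than the authors' definitions: discernibility is taken to be the information gain after equal-width binning, independence is taken to be the smallest value of 1 - |r| against features of higher discernibility (the largest such value for the top feature), and "importance far higher than the rest" is approximated by cutting at the largest gap in the sorted importance values. The names `information_gain`, `fsip_like_scores`, and `select_by_largest_gap` are hypothetical, and at least two features are assumed.

```python
import numpy as np

def information_gain(x, y, bins=3):
    """Information gain of one feature after equal-width binning (binning is an assumption)."""
    edges = np.histogram_bin_edges(x, bins=bins)
    xb = np.digitize(x, edges[1:-1])  # bin index in 0..bins-1

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    h_cond = 0.0
    for b in np.unique(xb):
        mask = xb == b
        h_cond += mask.mean() * entropy(y[mask])  # weighted entropy of each bin
    return entropy(y) - h_cond

def fsip_like_scores(X, y, bins=3):
    """Return (discernibility, independence, importance) per feature.

    The independence definition is an assumption, not taken from the paper.
    """
    n_features = X.shape[1]
    disc = np.array([information_gain(X[:, j], y, bins) for j in range(n_features)])
    # Absolute Pearson correlation between features; constant columns yield NaN, set to 0.
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    dissim = 1.0 - corr
    order = np.argsort(-disc, kind="stable")  # most discernible feature first
    indep = np.empty(n_features)
    for rank, i in enumerate(order):
        if rank == 0:
            indep[i] = np.delete(dissim[i], i).max()
        else:
            indep[i] = dissim[i, order[:rank]].min()
    return disc, indep, disc * indep

def select_by_largest_gap(importance):
    """Keep the features ranked above the largest drop in sorted importance (a heuristic)."""
    order = np.argsort(-importance, kind="stable")
    sorted_imp = importance[order]
    gaps = sorted_imp[:-1] - sorted_imp[1:]
    return order[: int(np.argmax(gaps)) + 1]

# Toy data: f0 separates the classes, f1 is a rescaled copy of f0, f2 is uninformative.
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
f0 = np.array([0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3])
X_demo = np.column_stack([f0, 2 * f0 + 1, np.tile([0.0, 1.0], 4)])
disc, indep, importance = fsip_like_scores(X_demo, y_demo)
selected = select_by_largest_gap(importance)
print(selected)
```

In this toy case the redundant copy f1 has the same information gain as f0 but almost no independence, so the product suppresses it; this is the balancing effect that the abstract attributes to the importance score.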

Keywords:

information gain; Pearson correlation coefficient; feature selection; extreme learning machine; feature correlation

Received:

2020-06-01

CLC number:

TP181

Document code:

Article ID:

1672-4291(2020)06-0069-13

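Funding:

National Natural Science Foundation of China (61673251, 62076159, 12031010); National Key Research and Development Program of China (2016YFC0901900); Scientific and Technological Achievements Transformation and Cultivation Project (GK201806013); Innovation Funds for Graduate Programs (2015CXS028, 2016CSY009)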

Doi:

An adaptive 2D feature selection algorithm based on information gain and Pearson correlation coefficient

XIE Juanying*, WU Zhaozhong, ZHENG Qingquan

(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)

Abstract:

To analyze gene expression datasets with tens of thousands of features and very few samples, this paper proposes a new adaptive feature selection algorithm based on a 2D space, named FSIP (feature selection based on information gain and Pearson correlation coefficient). It uses the information gain of a feature to evaluate the feature's discernibility, and Pearson correlation coefficients to evaluate its independence. To detect features with both high discernibility and high independence, and to balance the two, the importance of a feature is defined as the product of its discernibility and its independence. The features whose importance is much higher than that of the remaining features are selected to form the feature subset. The kernel extreme learning machine (K-ELM) is adopted as the classifier to measure the classification capability of the selected features. FSIP was tested on several popular gene expression datasets and compared with several well-known feature selection algorithms, including SVM-RFE, DRJMIM, mRMR, LLE Score, AMID, and AVC; statistical significance tests were conducted as well. The experimental results show that FSIP detects feature subsets with strong classification capability, and that the K-ELM classifier built on the selected feature subsets achieves very good classification performance.
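For readers unfamiliar with the classifier used for evaluation, the following is a minimal sketch of a kernel extreme learning machine in its commonly stated closed form, where the output weights solve (K + I/C) beta = T and predictions are K(x, X_train) beta. The abstract does not state the kernel or hyperparameters used, so the RBF kernel, the values of `C` and `gamma`, and the class name `KernelELM` are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

class KernelELM:
    """Sketch of a K-ELM classifier; C and gamma are hypothetical defaults."""

    def __init__(self, C=1.0, gamma=1.0):
        self.C = C
        self.gamma = gamma

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One column per class: +1 for the true class, -1 otherwise.
        T = np.where(y[:, None] == self.classes_[None, :], 1.0, -1.0)
        K = rbf_kernel(X, X, self.gamma)
        self.X_train_ = X
        # Regularized closed-form solution for the output weights.
        self.beta_ = np.linalg.solve(K + np.eye(len(X)) / self.C, T)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = rbf_kernel(X, self.X_train_, self.gamma) @ self.beta_
        return self.classes_[scores.argmax(axis=1)]

# Toy usage: two well-separated clusters restricted to two (already selected) features.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
model = KernelELM(C=10.0, gamma=0.5).fit(X_train, y_train)
preds = model.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(preds)
```

In an evaluation like the one described, the classifier would be trained only on the columns of the selected feature subset, so that its accuracy reflects the quality of the selection.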

Keywords:

information gain; Pearson correlation coefficient; feature selection; extreme learning machine; feature correlation