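An ensemble feature selection algorithm based on F-score and kernel extreme learning machine - Journal of Shaanxi Normal University editorial office website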

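Journal of Shaanxi Normal University (Natural Science Edition)

Special Topic on Artificial Intelligence

An ensemble feature selection algorithm based on F-score and kernel extreme learning machine

XIE Juanying*, ZHENG Qingquan, JI Xinyuan

(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)

XIE Juanying, female, professor and doctoral supervisor. Her research interests include machine learning, data mining, and biomedical data analysis. E-mail: xiejuany@snnu.edu.cn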

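Abstract:

Feature selection is the first and a critical step in analyzing high-dimensional, small-sample cancer gene data, but the feature subsets chosen by existing feature selection algorithms depend on the training samples and change as those samples change. To address this instability, an ensemble feature selection method based on the kernel extreme learning machine is proposed. The original data are partitioned by 5-fold cross validation; each training set is partitioned again by 5-fold cross validation and feature selection is performed on each partition; the union of the resulting five feature subsets is taken as the feature subset of that training set. A kernel extreme learning machine is built to evaluate the classification performance of this feature subset, and the stability of the selected feature subsets is evaluated by the mean Jaccard coefficient of the feature subsets obtained from the 5-fold cross validation on the original dataset. Experiments on five gene datasets, together with comparisons against the classical feature selection algorithms SVM-RFE, LLE Score, ARCO, DRJMIM, Random Forest, and mRMR, show that the proposed algorithm not only selects stable feature subsets, but that the selected subsets also generalize well.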
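The nested partitioning and the stability measure described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-class F-score formula (the commonly used Chen and Lin form), the unshuffled interleaved fold split, the fixed number top_m of features kept per inner split, and the pairwise averaging of Jaccard coefficients are all choices made here for illustration, since the abstract does not specify them. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance


def f_score(column, labels):
    """Two-class F-score of one feature (assumed form: between-class over within-class spread)."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y != 1]
    overall = mean(column)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances; each class needs >= 2 samples
    return between / within if within > 0 else 0.0


def k_folds(n, k):
    """Split positions 0..n-1 into k interleaved folds (assumption: no shuffling or stratification)."""
    return [list(range(i, n, k)) for i in range(k)]


def select_top(X, y, rows, top_m):
    """Rank features by F-score computed on the given rows; keep the top_m feature indices."""
    labels = [y[r] for r in rows]
    scored = [(f_score([X[r][j] for r in rows], labels), j) for j in range(len(X[0]))]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by index
    return {j for _, j in scored[:top_m]}


def ensemble_select(X, y, train_rows, k=5, top_m=2):
    """Inner k-fold loop: union of the subsets selected on the k inner training splits."""
    chosen = set()
    for fold in k_folds(len(train_rows), k):
        held_out = set(fold)
        inner = [r for pos, r in enumerate(train_rows) if pos not in held_out]
        chosen |= select_top(X, y, inner, top_m)
    return chosen


def outer_subsets(X, y, k=5, top_m=2):
    """Outer k-fold loop: one ensemble feature subset per outer training set."""
    n = len(X)
    subsets = []
    for fold in k_folds(n, k):
        test = set(fold)
        train_rows = [r for r in range(n) if r not in test]
        subsets.append(ensemble_select(X, y, train_rows, k, top_m))
    return subsets


def mean_jaccard(subsets):
    """Average pairwise Jaccard coefficient |A & B| / |A | B| (needs at least two subsets)."""
    pairs = list(combinations(subsets, 2))
    total = sum(len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs)
    return total / len(pairs)
```

Calling outer_subsets(X, y) on a list-of-rows matrix X with 0/1 labels y returns five feature-index sets, and mean_jaccard applied to them gives a stability value between 0 and 1, where 1 means every outer training set produced the same subset.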

Keywords:

F-score; feature selection; extreme learning machine; ensemble feature selection

Received:

2020-01-19

CLC number:

TP181

Document code:

Article ID:

1672-4291(2020)02-0001-08

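Funding:

National Natural Science Foundation of China (61673251); National Key Research and Development Program of China (2016YFC0901900); Scientific and Technological Achievements Transformation and Cultivation Project (GK201806013); Innovation Funds of Graduate Programs (2015CXS028, 2016CSY009)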

Doi:

An ensemble feature selection algorithm based on F-score and kernel extreme learning machine

XIE Juanying*, ZHENG Qingquan, JI Xinyuan

(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)

Abstract:

Feature selection is an essential step in analyzing gene expression datasets with very high dimensionality and a small number of samples. However, existing feature subset selection algorithms share a common deficiency: the selected feature subset depends on the training subset and varies with different training samples. To solve this problem, a new ensemble feature selection algorithm based on the kernel extreme learning machine is put forward. The original dataset is partitioned by 5-fold cross validation. Each training subset is partitioned again by 5-fold cross validation, feature selection is performed on each sub-training subset, and the union of the five selected feature subsets forms the feature subset corresponding to that training subset. The classification power of the feature subset is evaluated by the performance of the kernel extreme learning machine built on it. The stability of the feature subsets detected by a feature selection algorithm is evaluated by the mean Jaccard coefficient of the five feature subsets obtained by 5-fold cross validation on the original data. The proposed ensemble feature selection algorithm is tested on five gene expression datasets and compared with existing algorithms, including SVM-RFE, LLE Score, ARCO, DRJMIM, Random Forest, and mRMR. The experimental results show that the proposed algorithm not only detects stable feature subsets but also selects feature subsets with high predictive power.
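The kernel extreme learning machine used to score a feature subset can be illustrated with a small binary classifier. The sketch below follows the standard closed-form solution f(x) = k(x)^T (I/C + Omega)^(-1) t, where Omega is the kernel matrix of the training samples and t holds the targets. The RBF kernel, the values of C and gamma, the restriction to two classes, and all names are assumptions made for illustration; the abstract does not state which kernel or parameters the authors used.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


class KernelELM:
    """Binary kernel ELM sketch; labels are 0/1 and are mapped to targets -1/+1."""

    def __init__(self, C=10.0, gamma=1.0):
        self.C = C          # regularization parameter (hypothetical default)
        self.gamma = gamma  # RBF width parameter (hypothetical default)

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        n = len(self.X)
        # Regularized kernel matrix I/C + Omega, which is positive definite for an RBF kernel.
        omega = [
            [rbf_kernel(self.X[i], self.X[j], self.gamma) + (1.0 / self.C if i == j else 0.0)
             for j in range(n)]
            for i in range(n)
        ]
        targets = [1.0 if label == 1 else -1.0 for label in y]
        self.alpha = solve(omega, targets)  # output weights
        return self

    def decision(self, x):
        """Kernel expansion sum_i alpha_i * K(x, x_i)."""
        return sum(a * rbf_kernel(x, xi, self.gamma) for a, xi in zip(self.alpha, self.X))

    def predict(self, X):
        return [1 if self.decision(x) >= 0 else 0 for x in X]
```

To evaluate a candidate feature subset, one would keep only the selected columns of the training and test matrices, fit KernelELM on the reduced training data, and compare predict on the reduced test data with the true labels.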

Keywords:

F-score; feature selection; extreme learning machine; ensemble feature selection algorithm