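Journal of Shaanxi Normal University (Natural Science Edition)
Mathematics and Computer Science
Differentially expressed gene selection algorithms for unbalanced gene datasets by maximizing the area under the ROC curve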
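XIE Juanying*, WANG Mingzhao, HU Qiufeng
(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)
XIE Juanying, female, associate professor. E-mail: xiejuany@snnu.edu.cn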
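Abstract:
When selecting features for two-class problems, the ARCO (AUC and rank correlation coefficient optimization) algorithm measures the redundancy of the selected feature subset with Spearman's rank correlation coefficient, which loses information, and its measures of feature relevance and feature redundancy take values in different ranges. To address these defects, an improved Pearson correlation coefficient is proposed to measure feature redundancy, and the ranges of the relevance and redundancy measures are normalized, yielding the APCO (AUC and improved Pearson correlation coefficient optimization) algorithm. In addition, the multi-class feature selection algorithms MAUCD (using MAUC as the relevance metric to rank features directly) and MDFS (MAUC decomposition based feature selection method) do not consider feature redundancy, and MDFS tends to select a locally optimal feature subset. To address these problems, an improved Pearson correlation coefficient suited to multi-class problems is proposed to measure feature redundancy, yielding the MAUCP and MDFSP algorithms, both built on the mRMR (maximal relevance-minimal redundancy) framework. With SVM, NB and KNN as classification tools, classifiers are built on the selected feature subsets, and their AUC (MAUC) values are used to measure the performance of those subsets. Experimental results on 7 two-class and 3 multi-class unbalanced gene datasets show that the proposed APCO, MAUCP and MDFSP algorithms outperform ARCO, MAUCD and MDFS, respectively, and also outperform other classic gene selection algorithms.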
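Keywords:
gene selection; differentially expressed genes; AUC; mRMR; unbalanced data
Received:
2016-04-26
CLC number:
TP181.1
Document code:
A
Article ID:
1672-4291(2017)01-0013-10
DOI:
10.15983/j.cnki.jsnu.2017.01.113
Funding:
Shaanxi Province Science and Technology Research Project (2013K12-03-24); National Natural Science Foundation of China (61673251); Fundamental Research Funds for the Central Universities (GK201503067)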
Differentially expressed gene selection algorithms for unbalanced gene datasets by maximizing the area under the ROC curve
XIE Juanying*, WANG Mingzhao, HU Qiufeng
(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)
Abstract:
The ARCO (AUC and rank correlation coefficient optimization) algorithm may lose information because it measures the redundancy of the selected features with Spearman's rank correlation coefficient, and its measures of the relevance of features to the class and of the redundancy between features take values in different ranges. To overcome these shortcomings, a revised Pearson correlation coefficient is proposed to measure the redundancy between features, and the ranges of the relevance and redundancy measures are unified, which gives the APCO (AUC and improved Pearson correlation coefficient optimization) algorithm. For multi-class feature selection, neither MAUCD (using MAUC as the relevance metric to rank features directly) nor MDFS (MAUC decomposition based feature selection method) considers the redundancy between features, and MDFS tends to converge to a locally optimal subset of differentially expressed genes. To avoid these deficiencies, a revised Pearson correlation coefficient for multi-class problems is proposed to measure feature redundancy, which gives the MAUCP and MDFSP algorithms, both based on the mRMR (maximal relevance-minimal redundancy) framework. SVM, NB and KNN are adopted as the classification tools, and AUC (or MAUC for multi-class problems) is used to assess the performance of the classifiers built on the selected feature subsets. Experimental results on seven two-class and three multi-class unbalanced gene datasets demonstrate that the proposed APCO, MAUCP and MDFSP algorithms are superior to the original ARCO, MAUCD and MDFS algorithms, respectively, and also outperform other classic gene selection algorithms.
Keywords:
gene selection; differentially expressed genes; AUC; mRMR; unbalanced datasets
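
The abstract describes a two-class selection scheme that combines AUC as the relevance measure with a Pearson-based redundancy measure inside an mRMR framework. A minimal sketch of that general idea follows. It is not the authors' APCO implementation: the abstract does not define the revised Pearson coefficient, so the plain absolute Pearson correlation is used in its place, and both the rescaling of AUC to [0, 1] as 2|AUC - 0.5| and the "relevance minus mean redundancy" criterion are assumptions made here for illustration. All function names are hypothetical.

```python
from math import sqrt


def auc(scores, labels):
    """AUC of one feature used as a score: P(pos > neg) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relevance(scores, labels):
    """Rescale AUC to [0, 1]; 0 means uninformative (assumed normalization)."""
    return 2.0 * abs(auc(scores, labels) - 0.5)


def abs_pearson(x, y):
    """Absolute Pearson correlation in [0, 1]; 0 if either feature is constant.

    Stand-in for the paper's revised coefficient, which the abstract
    does not specify.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def select_features(columns, labels, k):
    """Greedy mRMR-style selection: relevance minus mean redundancy (assumed)."""
    rel = [relevance(col, labels) for col in columns]
    # Start from the single most relevant feature.
    selected = [max(range(len(columns)), key=lambda j: rel[j])]
    while len(selected) < min(k, len(columns)):
        def score(j):
            red = sum(abs_pearson(columns[j], columns[s]) for s in selected)
            return rel[j] - red / len(selected)
        candidates = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy unbalanced data: 3 features (one list per feature) over 6 samples.
labels = [0, 0, 0, 0, 1, 1]
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],     # separates the classes perfectly
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],   # same information as feature 0
    [3.0, 1.0, 4.0, 2.0, 3.5, 5.0],     # weaker, but less redundant
]
print(select_features(columns, labels, 2))  # prints [0, 2]
```

In this toy run, feature 1 is as relevant as feature 0 but fully correlated with it, so the redundancy term steers the second pick to feature 2. Because both terms lie in [0, 1], neither dominates the difference, which is the motivation the abstract gives for normalizing the two ranges.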