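Journal of Shaanxi Normal University (Natural Science Edition)
Special Topic: Artificial Intelligence
A label noise cleaning method based on active learning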
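MENG Xiaochao1, JIANG Gaoxia1, WANG Wenjian2*
(1 School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China; 2 Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education (Shanxi University), Taiyuan 030006, Shanxi, China)
WANG Wenjian, female, professor, doctoral supervisor. Research interests: machine learning, computational intelligence, and data mining. E-mail: wjwang@sxu.edu.cn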
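Abstract:
In supervised classification learning, label noise has a significant impact on the model. Existing label noise filtering methods generally detect and remove noisy samples based on the model's predictions; when there are many noisy samples, removing them damages the completeness of the original sample set and loses sample information. To address this problem, a label noise cleaning method based on active learning is proposed (active label noise cleaning based on classification with Gaussian process, GP_ALNC). The method combines a Gaussian process model with active learning: it selects the most uncertain samples from the existing labeled sample set and hands them to human experts for verification. This iterative procedure cleans most of the noisy data while preserving the completeness of the original data. For label noise in binary classification tasks, the method is compared experimentally with the existing methods ALNR (active label noise removal) and ICCN_SMO (iterative correction of class noise based on SMO) on the MNIST data set and UCI data sets, and achieves good performance.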
Keywords:
label noise; noise cleaning; Gaussian process; active learning
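Received:
2019-07-01
CLC number:
TP391
Document code:
A
Article ID:
1672-4291(2020)02-0009-08
Funding:
National Natural Science Foundation of China (61673249, U1805263); Research Fund for Returned Overseas Scholars of Shanxi Province (2016004)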
Doi:
A method of label noise cleaning based on active learning
MENG Xiaochao1, JIANG Gaoxia1, WANG Wenjian2*
(1 School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China; 2 Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education (Shanxi University), Taiyuan 030006, Shanxi, China)
Abstract:
In supervised classification learning, label noise has a significant impact on the model. Existing label noise filtering methods generally detect and remove noisy samples based on the model's predictions. When the number of noisy samples is large, removing them compromises the integrity of the original sample set and loses sample information. To address this problem, a label noise cleaning method based on active learning is proposed, namely GP_ALNC (active label noise cleaning based on classification with Gaussian process). The method combines a Gaussian process model with active learning: it selects the most uncertain samples from the existing labeled sample set and passes them to human experts for verification. This iterative procedure cleans most of the noisy data while maintaining the integrity of the original data. For label noise in binary classification tasks, the proposed method is compared with the existing methods ALNR (active label noise removal) and ICCN_SMO (iterative correction of class noise based on SMO) on the MNIST and UCI data sets. The experimental results show that GP_ALNC achieves good performance.
Keywords:
label noise; noise cleaning; Gaussian process; active learning
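
The iterative cleaning loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' GP_ALNC implementation, and it rests on several assumptions: Gaussian process regression on -1/+1 labels with a probit squashing stands in for a full Gaussian process classifier, uncertainty is taken as closeness of the predicted probability to 0.5, and the human expert is simulated by an `oracle` function that returns the true labels. The names `rbf_kernel`, `gp_positive_probability`, `active_clean`, and all parameter values (`length_scale`, `noise`, `rounds`, `batch`) are hypothetical.

```python
import math
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """RBF (squared-exponential) kernel matrix between the rows of A and B."""
    sq_dist = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale ** 2)


def gp_positive_probability(X, y, length_scale=1.0, noise=0.5):
    """Approximate P(label = +1) at every training point.

    Assumption: GP regression on -1/+1 labels stands in for a GP classifier;
    the latent mean and variance are squashed through a probit link.
    """
    K = rbf_kernel(X, X, length_scale)
    K_noisy = K + noise * np.eye(len(X))
    mean = K @ np.linalg.solve(K_noisy, y)  # posterior latent mean
    # Diagonal of the posterior covariance K - K (K + noise*I)^-1 K.
    var = np.diag(K) - np.sum(K * np.linalg.solve(K_noisy, K).T, axis=1)
    z = mean / np.sqrt(1.0 + np.maximum(var, 0.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))


def active_clean(X, y_noisy, oracle, rounds=5, batch=10):
    """Iteratively send the most uncertain samples to the oracle for checking.

    No sample is removed: labels are only verified or corrected in place.
    Returns the cleaned labels and a boolean mask of the checked samples.
    """
    y = y_noisy.astype(float).copy()
    checked = np.zeros(len(y), dtype=bool)
    for _ in range(rounds):
        p = gp_positive_probability(X, y)
        uncertainty = -np.abs(p - 0.5)       # largest when p is near 0.5
        uncertainty[checked] = -np.inf       # never re-query a verified sample
        query = np.argsort(uncertainty)[-batch:]
        y[query] = oracle(query)             # simulated expert returns true labels
        checked[query] = True
    return y, checked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two Gaussian blobs as a toy binary task (synthetic data, for illustration only).
    X = np.vstack([rng.normal(-1.5, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
    y_true = np.concatenate([-np.ones(100), np.ones(100)])
    y_noisy = y_true.copy()
    flipped = rng.choice(200, size=40, replace=False)
    y_noisy[flipped] *= -1                   # inject 20% label noise
    y_clean, checked = active_clean(X, y_noisy, lambda idx: y_true[idx])
    print("noisy labels before:", int((y_noisy != y_true).sum()))
    print("noisy labels after: ", int((y_clean != y_true).sum()))
```

The sketch keeps every sample in the data set and only changes labels the oracle has inspected, which mirrors the abstract's point about preserving data integrity. How many noisy labels it actually finds depends on the uncertainty criterion and the kernel settings, both of which are placeholders here.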