Journal of Shaanxi Normal University (Natural Science Edition)
Mathematics and Computer Science
A new clustering evaluation criterion
XIE Juanying*, ZHOU Ying
(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)
XIE Juanying, female, associate professor. Her research interests include machine learning, data mining, and big data analysis. E-mail: xiejuany@snnu.edu.cn
Abstract:
Commonly used internal evaluation indices for determining the number of clusters k of a dataset, such as DB (Davies-Bouldin) and BWP (Between-Within Proportion), require a search upper bound kmax to be fixed in advance so that the number of clusters satisfies k ≤ kmax, yet there is still no theoretical guidance on how to choose kmax. To address this problem, a new F-statistic Fr is proposed and used as a new clustering validity criterion to judge whether a clustering algorithm has converged, so that the number of clusters of a dataset can be determined adaptively. Fr is then applied to the fast K-medoids algorithm as its convergence criterion, and the geodesic distance based on the minimum spanning tree (MST), that is, the length of the path between a pair of samples on the MST, is used instead of the direct Euclidean distance to measure the similarity between samples. The result is an adaptive fast K-medoids clustering algorithm that overcomes the problems that K-medoids requires the number of clusters to be specified manually and cannot detect clusters of arbitrary shape. Experiments on datasets from the UCI machine learning repository and on synthetic datasets show that the proposed Fr index is an effective criterion for evaluating clustering algorithms, and that the K-medoids algorithm based on this index and the geodesic distance can not only discover clusters of arbitrary shape but also determine the number of clusters of a dataset adaptively, while being robust to noisy data.
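To make the geodesic distance described above concrete, here is a minimal Python sketch (not the authors' implementation; the function name mst_geodesic_distances and the use of SciPy are assumptions of this illustration). It builds a minimum spanning tree from the pairwise Euclidean distances and returns, for every pair of samples, the length of the path connecting them on that tree.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

def mst_geodesic_distances(X):
    """Pairwise geodesic distances: path lengths between samples on the MST."""
    euclid = squareform(pdist(X))          # direct Euclidean distance matrix
    mst = minimum_spanning_tree(euclid)    # sparse MST over the complete graph
    # Treat the tree as undirected; path lengths along it are the geodesic distances.
    return shortest_path(mst, directed=False)

# Example: D = mst_geodesic_distances(np.random.rand(10, 2)) yields a 10 x 10 matrix.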
Keywords:
F-statistic; internal evaluation criterion; number of clusters; K-medoids clustering algorithm; minimum spanning tree
Received:
2015-05-22
CLC number:
TP181.1
Document code:
A
Article ID:
1672-4291(2015)06-0001-08
Foundation items:
National Natural Science Foundation of China (31372250); Science and Technology Research Project of Shaanxi Province (2013K12-03-24); Fundamental Research Funds for the Central Universities (GK201503067)
DOI:
10.15983/j.cnki.jsnu.2015.06.161
A new criterion for clustering algorithms
XIE Juanying*, ZHOU Ying
(School of Computer Science, Shaanxi Normal University, Xi'an 710119, Shaanxi, China)
Abstract:
Internal evaluation criteria such as DB (Davies-Bouldin) and BWP (Between-Within Proportion) require an upper bound kmax to be given in advance when they are used to determine the number of clusters of a dataset, so that the number of clusters k satisfies k ≤ kmax. However, there is so far no theory on how to find this upper bound kmax. To address this problem, a new F-statistic Fr is proposed in this paper. Fr can be used as a new criterion to judge whether a clustering algorithm has converged, so that the number of clusters of a dataset can be found automatically without kmax being given in advance. In addition, Fr is applied to the fast K-medoids clustering algorithm as the criterion for deciding whether its iterations should continue, and the geodesic distance between samples on the minimum spanning tree (MST), that is, the length of the path between samples on the MST, is introduced in place of the direct Euclidean distance to measure their similarity, so that an adaptive fast K-medoids algorithm is obtained. The adaptive K-medoids algorithm avoids the deficiencies of the available K-medoids algorithms, which require the number of clusters to be given in advance and cannot detect clusters of arbitrary shape. The power of the proposed criterion Fr and of the adaptive fast K-medoids algorithm is tested on real-world datasets from the UCI machine learning repository and on synthetic datasets. All the experimental results demonstrate that the proposed criterion Fr is a valid clustering criterion, and that the adaptive fast K-medoids algorithm based on Fr and the geodesic distance can recognize clusters of arbitrary shape, can detect the number of clusters automatically, and is robust to noise as well.
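The abstract does not give the definition of the proposed Fr statistic (it appears in the full text). As background on the general form that F-statistic-style internal indices take, the Python sketch below computes the classical pseudo-F (Calinski-Harabasz) index, a ratio of between-cluster to within-cluster dispersion; it illustrates that form only and is not the paper's Fr.

import numpy as np

def pseudo_f(X, labels):
    """Classical Calinski-Harabasz pseudo-F: between/within dispersion ratio."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, k = X.shape[0], np.unique(labels).size
    grand_mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - grand_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Larger values indicate more compact, better separated clusters; in an adaptive
# procedure such an index can be tracked across iterations or candidate values of k.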
Key words:
F-statistic; internal evaluation criterion; number of clusters; K-medoids clustering algorithm; minimum spanning tree