
Journal of Shaanxi Normal University (Natural Science Edition)

Mathematics and Computer Science

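A new algorithm to determine the optimal number of clusters

XIE Juan-ying1,2, MA Qing1, XIE Wei-xin2,3

(1 College of Computer Science, Shaanxi Normal University, Xi'an 710062, Shaanxi, China; 2 College of Electronic Engineering, Xidian University, Xi'an 710071, Shaanxi, China; 3 College of Information Engineering, Shenzhen University, Shenzhen 518001, Guangdong, China)

XIE Juan-ying, female, associate professor; research interests: intelligent information processing and pattern recognition. E-mail: xiejuany@snnu.edu.cn.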

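Abstract:

To address the problem that the K-means clustering algorithm requires the number of clusters K to be fixed in advance, granular computing is introduced into the sample similarity function, a new sample similarity is defined, and fuzzy equivalence clustering is used to determine the largest possible number of clusters, Kmax, of a dataset. With Kmax as the upper bound of the search, a new method for determining the optimal number of clusters is proposed that uses the improved global K-means clustering algorithm with BWP (Between-Within Proportion) as the cluster validity index. Experiments on datasets from the UCI Machine Learning Repository and on randomly generated synthetic datasets show that the algorithm not only determines the optimal number of clusters of a dataset effectively but is also suitable for large-scale datasets, although it is affected by noise points.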
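The fuzzy equivalence clustering step can be illustrated with a minimal sketch. It assumes a fuzzy similarity matrix is already available; the granularity-based similarity function and the rule for deriving Kmax from the resulting classes are not given in this abstract, so the matrix `R`, the threshold `lam`, and all function names below are hypothetical. The sketch only shows the standard technique: take the max-min transitive closure of a reflexive, symmetric fuzzy relation, then cut it at a threshold to obtain equivalence classes.

```python
def maxmin_compose(r, s):
    """Max-min composition of two square fuzzy relations."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Square a reflexive, symmetric fuzzy relation until it stops changing."""
    while True:
        squared = maxmin_compose(r, r)
        if squared == r:
            return r
        r = squared


def lambda_cut_classes(closure, lam):
    """Label samples so that two samples share a label iff closure >= lam."""
    n = len(closure)
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] == -1:
            for j in range(n):
                if closure[i][j] >= lam:
                    labels[j] = next_label
            next_label += 1
    return labels


# Hypothetical similarity matrix for four samples (illustration only).
R = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]
closure = transitive_closure(R)
labels = lambda_cut_classes(closure, 0.5)  # [0, 0, 1, 1]
n_classes = len(set(labels))               # 2
```

Raising the threshold splits the data into more classes and lowering it merges them, so the number of classes at a chosen threshold is one plausible way to bound the number of clusters from above.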

Keywords:

information granularity; K-means; global K-means; fuzzy similarity; clustering criterion BWP

Received:

2011-04-05

CLC number:

TP181.1

Document code:

A

Article ID:

1672-4291(2012)01-0013-06

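Funding:

Natural Science Foundation of Shaanxi Province (2010JM3004); Key Project of the Fundamental Research Funds for the Central Universities (GK200901006, GK201001003); Innovation Fund for Graduate Education of Shaanxi Normal University (2011CX029).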

Doi:

A new algorithm to determine the optimal number of clusters

XIE Juan-ying1,2, MA Qing1, XIE Wei-xin2,3

(1 College of Computer Science, Shaanxi Normal University, Xi'an 710062, Shaanxi, China; 2 College of Electronic Engineering, Xidian University, Xi'an 710071, Shaanxi, China; 3 College of Information Engineering, Shenzhen University, Shenzhen 518001, Guangdong, China)

Abstract:

To determine the optimal number of clusters for K-means clustering, a new algorithm is proposed based on granular computing and the improved global K-means clustering algorithm. The algorithm introduces granular computing into the similarity function that measures the similarity between two samples, and uses this new similarity function together with a fuzzy equivalence relation to determine the largest possible number of clusters, Kmax. The improved global K-means algorithm is then combined with the BWP (Between-Within Proportion) criterion, which evaluates a clustering result: with Kmax as the upper bound of the search, the optimal number of clusters is determined according to the BWP scores of the different clustering results. The new algorithm is tested on UCI datasets and on synthetic datasets with noisy data, and compared with existing approaches to choosing the number of clusters for K-means. The experimental results show that the algorithm determines the optimal number of clusters effectively, especially for large datasets. Its disadvantage is that it is sensitive to noisy data.
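The search over candidate cluster numbers can be sketched as follows. Two assumptions are made loudly here. First, the abstract does not define BWP, so the sketch uses the commonly cited form: for each sample, b is the smallest average distance to the samples of any other cluster, w is the average distance to the other samples of its own cluster, and the score is (b - w) / (b + w), averaged over all samples. Second, plain Lloyd-style K-means with a deterministic farthest-first initialisation stands in for the improved global K-means algorithm, which is not described here. All function names and the toy data are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Lloyd's K-means with farthest-first initialisation (a stand-in,
    not the improved global K-means named in the abstract)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def avg_bwp(points, labels, k):
    """Average of (b - w) / (b + w) over all samples (assumed BWP form)."""
    total = 0.0
    for i, p in enumerate(points):
        sums, counts = [0.0] * k, [0] * k
        for q_idx, q in enumerate(points):
            if q_idx != i:
                sums[labels[q_idx]] += dist(p, q)
                counts[labels[q_idx]] += 1
        own = labels[i]
        w = sums[own] / counts[own] if counts[own] else 0.0
        b = min((sums[j] / counts[j] for j in range(k) if j != own and counts[j]),
                default=0.0)
        total += (b - w) / (b + w) if b + w > 0 else 0.0
    return total / len(points)


def best_k(points, k_max):
    """Score every K in 2..k_max and return the K with the highest average BWP."""
    scores = {k: avg_bwp(points, kmeans(points, k), k) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


# Toy data: three well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]
k_opt, scores = best_k(points, k_max=5)  # k_opt is 3 for this toy data
```

Here `k_max` plays the role of Kmax: it limits how many candidate values of K are clustered and scored, which is what keeps the search affordable on larger datasets.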

Keywords:

information granularity; K-means; global K-means; fuzzy similarity; clustering criterion BWP