XIE Juan-ying1,2, MA Qing1, XIE Wei-xin2,3
(1 College of Computer Science, Shaanxi Normal University, Xi′an 710062, Shaanxi, China; 2 College of Electronic Engineering, Xidian University, Xi′an 710071, Shaanxi, China; 3 College of Information Engineering, Shenzhen University, Shenzhen 518001, Guangdong, China)
Abstract:
To determine the optimal number of clusters for K-means clustering, a new algorithm is proposed based on granular computing and the improved global K-means clustering algorithm. The algorithm introduces granular computing into the similarity function that measures the similarity between two samples, so that the largest potential number of clusters, Kmax, is determined by the new similarity function together with a fuzzy equivalence relation. The improved global K-means clustering algorithm is then combined with the BWP (Between-Within Proportion) criterion, a measure for evaluating a clustering result, to determine the optimal number of clusters of a dataset: the number of clusters is chosen according to the BWP scores of the different clustering results, with Kmax serving as the upper bound of the search. The new algorithm is tested on UCI datasets and on synthetic datasets with noisy data, and compared with existing methods for choosing the number of clusters for K-means clustering. All experimental results demonstrate that the new algorithm is effective in determining the optimal number of clusters, especially for large datasets. Its disadvantage is that it is sensitive to noisy data.
Keywords:
information granularity; K-means; global K-means; fuzzy similarity; clustering criterion BWP
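The search summarized in the abstract (cluster the data for each candidate number of clusters up to Kmax, score each result, keep the best-scoring number) can be illustrated with a minimal Python sketch. Several parts are assumptions, not the authors' method: the abstract does not define BWP, so the score below assumes that for each sample b is the smallest mean distance to the samples of any other cluster, w is the mean distance to the other samples of its own cluster, and the sample's score is (b - w) / (b + w), averaged over all samples. Plain K-means with a deterministic farthest-first initialization stands in for the improved global K-means, and `k_max` is passed in directly instead of being derived from the granular similarity function and fuzzy equivalence relation. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centers(points, k):
    """Deterministic initialization (a stand-in, not the improved global K-means):
    start from the first point, then repeatedly add the point farthest from
    the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def between_within_score(points, labels):
    """Average of (b - w) / (b + w) over all samples (assumed form of BWP)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return float("-inf")  # the score needs at least two clusters
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        # Within distance: mean distance to the other members (0 for a singleton).
        w = sum(dist(p, q) for q in own) / (len(own) - 1) if len(own) > 1 else 0.0
        # Between distance: smallest mean distance to any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - w) / (b + w) if b + w > 0 else 0.0
    return total / len(points)


def best_k(points, k_max):
    """Score every k in 2..k_max and return the best k with all scores."""
    scores = {k: between_within_score(points, kmeans(points, k))
              for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Three small, well-separated square blobs as toy data.
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    data = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)]
            for dx, dy in corners]
    k, all_scores = best_k(data, k_max=6)
    print(k, all_scores)
```

Bounding the loop by `k_max` mirrors the role the abstract gives Kmax: it limits how many candidate clusterings have to be computed and scored. Note that in this sketch a singleton cluster receives the maximum per-sample score, so the assumed score can favor fragmented solutions on very small datasets.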