Application of R Software in Hierarchical Cluster Analysis

Abstract: Cluster analysis, a multivariate statistical method, has been widely used in the natural and social sciences. In practice, however, processing multivariate data for cluster analysis cannot be done without the support of statistical software. Because R is free and open source and offers powerful statistical analysis and graphics capabilities, it has attracted growing attention and increasingly wide use. This paper uses an example to describe the application of R to hierarchical cluster analysis.

Keywords: R software, cluster analysis, multivariate statistics


Multivariate statistical analysis, also known as multivariable statistical analysis, is an important branch of statistics. In real life, phenomena that are shaped jointly by many indicators abound, and multivariate statistical analysis is the discipline that studies the interdependence among multiple random variables and the statistical regularities underlying them. Cluster analysis is among its most commonly used methods. Like other multivariate methods, it involves fairly complex mathematical theory and generally cannot be carried out by hand; a computer and statistical software are required.

Commonly used statistical software includes SPSS, SAS, STAT, R, and S-PLUS. R is free, open source software with powerful statistical analysis functions and excellent graphics capabilities, and it has become a favorite data analysis tool of many statisticians. This paper uses an example to illustrate the application of R to clustering in multivariate statistics.

First, cluster analysis

Cluster analysis, also known as group analysis, is a multivariate statistical method for classifying the objects under study (samples or indicators). A class, in layman's terms, is a collection of similar elements. The socio-economic field contains a large number of classification problems. For example, when examining the price indexes of some large cities, there are many indexes to consider: the agricultural production price index, the services price index, the food consumer price index, the building materials retail price index, and so on. Because there are so many indexes to study, they are usually classified first. In short, many problems require classification, so cluster analysis, as a useful tool, is attracting more and more attention and has been widely applied in many fields.

Cluster analysis covers a rich set of methods: hierarchical clustering, ordered-sample clustering, dynamic clustering, fuzzy clustering, graph-theoretic clustering, cluster-based forecasting, and others. The most commonly used and most successful of these is hierarchical clustering. Its basic idea is as follows. First, treat each of the n samples as a class of its own, and define both the distance between samples and the distance between classes. Then merge the two nearest classes into a new class and compute the distance between the new class and every other current class. Repeat, merging the two nearest classes each time so that the number of classes drops by one per merge, until all samples belong to a single class.
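To make the notion of distance between classes concrete, three common definitions can be written out as below. The notation is introduced here for illustration and is not taken from the original text: G_p and G_q are two classes containing n_p and n_q samples, and d_ij is the distance between samples i and j. The average form shown is one common variant; some textbooks define it with squared distances instead.

```latex
\begin{aligned}
D_{\text{shortest}}(G_p, G_q) &= \min_{i \in G_p,\; j \in G_q} d_{ij}, \\
D_{\text{longest}}(G_p, G_q)  &= \max_{i \in G_p,\; j \in G_q} d_{ij}, \\
D_{\text{average}}(G_p, G_q)  &= \frac{1}{n_p\, n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}.
\end{aligned}
```

Whichever definition is chosen, the merging rule stays the same: at each step the two classes with the smallest D are combined.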

Basic steps of hierarchical clustering:

1. Calculate the pairwise distances between the n samples.
2. Construct n classes, each containing exactly one sample.
3. Merge the two nearest classes into a new class.
4. Calculate the distance between the new class and each of the current classes.
5. Repeat steps 3 and 4, merging the two nearest classes each time, until all classes have been merged into one.
6. Draw the clustering dendrogram.
7. Determine the number of classes and the members of each class.

Hierarchical clustering methods: 1. shortest distance method, 2. longest distance method, 3. middle distance (median) method, 4. center of gravity (centroid) method, 5. class average method, 6. sum of squared deviations method (Ward method).
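The steps above can be sketched directly in R. The following is a deliberately naive illustration rather than how R's own routine is implemented; the function name `naive_hclust`, its arguments, and the five one-dimensional sample values are all hypothetical. It supports only the shortest and longest distance methods and records the merge distance at each step.

```r
# Naive agglomerative clustering (illustrative sketch only).
# `naive_hclust` is a hypothetical helper, not a function provided by R.
naive_hclust <- function(x, linkage = c("single", "complete")) {
  linkage <- match.arg(linkage)
  agg <- if (linkage == "single") min else max   # shortest vs. longest distance
  dm <- as.matrix(dist(x))                       # step 1: pairwise distances
  clusters <- as.list(seq_len(nrow(dm)))         # step 2: one class per sample
  merges <- list()
  while (length(clusters) > 1) {
    best <- c(NA, NA)
    best_d <- Inf
    # step 3: search for the nearest pair of current classes
    for (i in seq_len(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        dij <- agg(dm[clusters[[i]], clusters[[j]]])
        if (dij < best_d) {
          best_d <- dij
          best <- c(i, j)
        }
      }
    }
    merges[[length(merges) + 1]] <- list(
      members = sort(c(clusters[[best[1]]], clusters[[best[2]]])),
      height = best_d
    )
    # steps 4-5: merge the pair; class distances are recomputed on the next pass
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  merges
}

x <- matrix(c(1, 2, 6, 7, 15), ncol = 1)   # five made-up samples
res <- naive_hclust(x, "single")
heights <- sapply(res, function(m) m$height)
print(heights)
print(hclust(dist(x), "single")$height)    # built-in result for comparison
```

On this small example the merge distances from the sketch should agree with those reported by `hclust`. The sketch recomputes every class distance from the original distance matrix on each pass, which is easy to follow but slow for large n.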

Second, the cluster analysis procedure in R

R packages provide a variety of clustering methods, mainly hierarchical clustering, fast clustering (such as k-means), and fuzzy clustering. Hierarchical clustering is the most commonly used.

The hierarchical clustering function in R is called as follows:

hclust(d, method = "complete", members = NULL)

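Here, d is a dissimilarity structure produced by dist. Distance measures used in cluster analysis include the absolute distance, Euclidean distance, Chebyshev distance, Mahalanobis distance, and Lance distance. dist uses Euclidean distance by default and provides the absolute, Chebyshev, and Lance distances under the names "manhattan", "maximum", and "canberra"; Mahalanobis distance is not one of its built-in options. The method argument can be "average" (class average method), "centroid" (center of gravity method), "median" (middle distance method), "complete" (longest distance method), "single" (shortest distance method), "ward" (sum of squared deviations method), and so on. The default is the longest distance method, "complete".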
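As a small illustration of these arguments, the sketch below computes several distance structures and passes one of them to `hclust`. The data frame and its values are made up for the example, and the variable names are arbitrary.

```r
# Made-up data: 5 observations on 2 variables (hypothetical values).
X <- data.frame(a = c(1, 2, 6, 7, 15), b = c(10, 12, 11, 30, 31))
Z <- scale(X)                            # standardize each column

d_euc <- dist(Z)                         # Euclidean distance (the default)
d_abs <- dist(Z, method = "manhattan")   # absolute distance
d_che <- dist(Z, method = "maximum")     # Chebyshev distance
d_lan <- dist(X, method = "canberra")    # Lance distance, on the raw positive data

hc_default <- hclust(d_euc)                      # longest distance method by default
hc_average <- hclust(d_euc, method = "average")  # class average method

print(hc_default$method)
print(attr(d_abs, "method"))
```

Standardizing with `scale` before calling `dist` keeps a variable with large units from dominating the distances; whether that is appropriate depends on the data.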

Third, an application example

Hierarchical clustering is applied to the 2008 average annual per capita household consumption expenditure of the cities of Shandong Province in order to classify the cities (Table 1).

The R program is as follows:

> X <- read.delim("clipboard", header = TRUE)  # read the copied table (Windows clipboard)
> row.names(X) <- c("Jinan", "Qingdao", "Zibo", "Zaozhuang", "Dongying", "Yantai", "Weifang", "Jining", "Tai'an", "Weihai", "Rizhao", "Laiwu", "Linyi", "Dezhou", "Liaocheng", "Binzhou", "Heze")
> d <- dist(scale(X))  # Euclidean distances on the standardized data
> hc1 <- hclust(d, "single")  # shortest distance method
> hc2 <- hclust(d, "complete")  # longest distance method
> hc3 <- hclust(d, "median")  # middle distance method
> hc4 <- hclust(d, "ward.D")  # Ward method (named "ward" in older versions of R)
> opar <- par(mfrow = c(2, 2))  # arrange the four dendrograms in a 2 x 2 grid
> plot(hc1, hang = -1); plot(hc2, hang = -1)
> plot(hc3, hang = -1); plot(hc4, hang = -1)
> par(opar)  # restore the previous graphics settings

The output is shown in Figure 1.

Figure 1 shows that the different methods give broadly similar classifications. In light of the actual circumstances of Shandong Province, the classification produced by the longest distance method is the better one.
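Beyond reading the dendrograms by eye, the class memberships from two methods can be compared numerically. The sketch below is one possible way to do this with `cutree`. It uses made-up stand-in data, not the Shandong figures, and the choice of two classes is an assumption made only for the example.

```r
# Made-up stand-in data (NOT the Shandong figures): 6 units, 2 indicators.
X <- data.frame(
  food    = c(10, 11, 10.5, 30, 31, 29),
  housing = c(5, 6, 5.5, 20, 21, 19),
  row.names = c("A", "B", "C", "D", "E", "F")
)
d <- dist(scale(X))
hc_single   <- hclust(d, "single")     # shortest distance method
hc_complete <- hclust(d, "complete")   # longest distance method

k <- 2                                 # number of classes, chosen by the analyst
g_single   <- cutree(hc_single, k = k)     # class label of each unit
g_complete <- cutree(hc_complete, k = k)

# Cross-tabulate the two classifications; counts off the diagonal
# would indicate units that the two methods place differently.
print(table(single = g_single, complete = g_complete))
```

With real data the methods may disagree on some units, and the cross-tabulation shows how many. Which classification is preferable still has to be judged against subject knowledge, as the text does for Shandong Province.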

For cluster analysis, R is convenient, simple, and easy to learn. Programs written by others can be modified to suit the situation at hand, which makes the work easier still, and data with many variables can be handled. Using R for cluster analysis therefore has great advantages.

