Application of Cluster Analysis Using Agglomerative Method

Abstract


INTRODUCTION
Improving the quality of human resources is the main supporting factor in increasing national productivity across fields and development sectors. Education development in Indonesia is prioritized as a productive government investment that can spur the nation's competitiveness in the global era. The government sets education development policy in three policy pillars, which are outlined in the mission of education. These policies focus on increasing the availability of education services, expanding their affordability, improving their quality, realizing equality in education services, and ensuring certainty of obtaining educational services [1]. Education development policy serves as a measure of success based on a strategic education plan with an educational mission. However, since the introduction of education autonomy, efforts to develop education in Indonesia in the 21st century have faced challenges in preparing quality human resources, because not all districts or cities can provide valid and reliable data to the central government.
Another challenge is understanding education indicators and the relationship with accountability for the success of education development programs. Only a few education managers who are in the ranks of the Ministry of Education and Culture or education managers in the Provincial Education Office and District or City Education Office understand these two things [1]. Therefore, the preparation and study of education indicators for the education development not a factor of socio-economic conditions and accessibility but a motivational factor [4]. The factors that affect the GER in Higher Education for each Province in Indonesia consist of central government expenditure in higher education to GRDP; student lecturer ratio; and population [5].
The educational indicators used in this study to cluster provinces in Indonesia are based on Statistics Indonesia's 2018 data, and the clustering uses the Average Linkage and Ward methods. The education indicators consist of the Literacy Rate of the Population aged 15-24 years; the Literacy Rate of the Population aged ≥15 years; the Gross Participation Rate of Children in Early Childhood Education; the Higher Education Gross Enrollment Rate; the Net Enrollment Rate of the Population in the lowest 40% expenditure group at the primary school, junior high school, and senior high school levels; the Number of Villages with School Facilities at the Elementary School, Junior High School, Senior High School, and College Levels; the GER Ratio at the Higher Education Level; and the Average Length of Schooling for the Population Aged ≥15 Years.
Cluster analysis was first used by Tryon in 1939. The purpose of cluster analysis is to classify mutually independent individuals into groups whose members have the same or similar characteristics. Cluster analysis uses a measure of similarity or closeness to reduce complex data to a simple group structure. This measure is a distance or similarity measure [6]; a commonly used distance measure is the Euclidean distance [6].
Cluster analysis methods are either hierarchical or non-hierarchical. In the hierarchical method, the number of groups is not known in advance, whereas the non-hierarchical method assumes k groups from the start. The hierarchical method consists of the agglomerative and divisive methods. The agglomerative method comprises the Single Linkage, Complete Linkage, Average Linkage, Ward's, Centroid, and Median methods [7]. Non-hierarchical methods include the K-means method and the fuzzy method. This study uses a hierarchical method consisting of the Average Linkage and Ward methods.
Cluster analysis has been widely used in various scientific fields such as economics, geography, health, and the social sciences. Sub-districts, districts or cities, and provinces in Indonesia have been grouped using cluster analysis based on indexes in the economic, geographic, health, and social fields [8], [9], [10], [11], [12], [13], [14]. These studies group regions based on health indicators, people's welfare indicators, village potential, macroeconomic indicators, human development indexes, and HIV/AIDS indicators.
Relevant research was conducted by [14], who compared cluster analysis using the Average Linkage and Ward methods in a case study of the Human Development Index in South Sulawesi Province. The results showed that grouping with the Average Linkage method produced the best Dunn index, 0.55, compared with 0.43 for the Ward method, and formed 8 clusters. Then, [8] applied hierarchical cluster analysis to group districts or cities in East Java based on health indicators. The hierarchical methods used were Single Linkage, Complete Linkage, Average Linkage, Ward's, and Centroid, compared using the RMSSTD (Root Mean Square Standard Deviation) validity index. The results showed that the Ward method was the best of these hierarchical methods, with the smallest RMSSTD value of 13.947, forming 5 clusters.
Further relevant research was conducted by [10], who analyzed sub-district clusters in Semarang district based on village potential using the Ward and Single Linkage methods. The results showed that the Single Linkage method had a smaller R-squared value than the Ward method, which indicates that the Single Linkage method produces more heterogeneous clusters than the Ward method. Subsequently, [9] conducted a cluster analysis using the Average Linkage method to group districts or cities in Central Java Province based on People's Welfare Indicators. The results showed that the 35 districts or cities in Central Java Province could be grouped into three groups, A, B, and C, consisting of 28, 2, and 5 districts or cities, respectively.
Another study used cluster analysis with the K-Means method to group districts or cities in Maluku Province based on the 2014 human development index indicators, namely life expectancy, literacy rate, average years of schooling, and per capita expenditure [12]. The results showed three clusters: cluster 1, consisting of Ambon City, with the highest values compared to the other two clusters; cluster 2, consisting of MTB, Aru Islands, SBB, SBT, MBD, and Bursel; and cluster 3, consisting of Malra, Malteng, Buru, and Tual. A cluster analysis of data with outliers, using Centroid Linkage and K-Means clustering to group HIV/AIDS indicators in Indonesia, showed that the Centroid Linkage method yields more homogeneous clusters than the K-Means method [13]. The two methods were compared using the ratio of SW to SB. Furthermore, cluster analysis with the Average Linkage and Ward methods has been applied to Unit Link life insurance customer data [15]. The results showed that the Average Linkage method performed better than the Ward method, with SW to SB ratio values of 0.486 and 0.710, respectively.
Based on the indicator study and the literature above, this study conducts and compares cluster analyses using two agglomerative methods, the Average Linkage method and the Ward method, to identify regional clusters in Indonesia based on 14 educational indicators. The best grouping is determined from the ratio of the average standard deviation within clusters to the standard deviation between clusters.

Data sources and Research Variables
The data used for grouping with cluster analysis are secondary provincial data for 2018, based on education indicators for all provinces of Indonesia and obtained from Statistics Indonesia. The variables in this study are the literacy rate of the population aged 15-24 years (X1); the literacy rate of the population aged ≥15 years (X2); the gross enrollment rate of children attending early childhood education (X3); the Higher Education Gross Enrollment Rate (X4); the Net Enrollment Rate (NER) of the population in the lowest 40% expenditure group at the elementary school level (X5); the NER of the population in the lowest 40% expenditure group at the junior high school level (X6); the NER of the population in the lowest 40% expenditure group at the senior high school level (X7); the NER of the population in the lowest 40% expenditure group at the Vocational High School level (X8); the number of villages with primary school facilities (X9); the number of villages with junior high school facilities (X10); the number of villages with senior high school facilities (X11); the number of villages with higher education facilities (X12); the GER ratio at the higher education level (X13); and the average years of schooling for the population aged ≥15 years (X14).

Research Stages
The stages of this study consisted of data standardization, multicollinearity testing, construction of the dendrogram for each hierarchical cluster analysis method, and determination of the best method based on the ratio of the average standard deviation within clusters to the standard deviation between clusters.

Standardization of data
The standardization process was carried out because the study variables differ considerably in their units of measurement. Large differences in units can invalidate the calculations in cluster analysis. Therefore, the original data need to be transformed before further analysis. The z-score transformation converts the variables $X_1, X_2, \cdots, X_p$ into new variables $Z_1, Z_2, \cdots, Z_p$, which are uncorrelated, using the formula $Z_i = a_i^{\top}[X - \bar{X}]$, where $a_i$ is the $i$th eigenvector obtained from the principal component analysis [16].
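As an illustration of the standardization step, the following is a minimal sketch of column-wise z-score standardization. It assumes the sample standard deviation (n - 1 denominator) and uses hypothetical values; it does not implement the eigenvector-based transformation of [16], only the rescaling of each variable to mean 0 and standard deviation 1.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (value - mean) / sample standard deviation for one variable."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - m) / s for v in values]


def standardize_columns(table):
    """Standardize every column of a row-major table (rows = provinces)."""
    columns = list(zip(*table))
    z_columns = [z_scores(col) for col in columns]
    return [list(row) for row in zip(*z_columns)]


# Hypothetical values: three provinces, two indicators on very different scales.
data = [[99.1, 1200.0], [97.5, 300.0], [98.3, 4500.0]]
z = standardize_columns(data)
```

After this transformation, every indicator contributes on a comparable scale to the Euclidean distances used in the clustering steps.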

Multicollinearity testing
The variables used in cluster analysis should not be correlated, that is, there should be no multicollinearity. In cluster analysis, each variable is given the same weight in the calculation of the distance. If some of the variables are correlated, the weighting becomes unbalanced, which affects the grouping of objects. According to [17], a very high correlation between independent variables results in a regression model estimator that is biased, unstable, and possibly far from its predictive value.
The presence or absence of multicollinearity for each research variable is identified using the variance inflation factor (VIF). If VIF ≤ 10 and tolerance ≥ 0.10, the regression is free from multicollinearity [18], [19], [20]. According to [21], a VIF value greater than 10 indicates a severe multicollinearity problem. According to [22], when the VIF value exceeds 10, high multicollinearity occurs, and the variable should theoretically not be used in the OLS (Ordinary Least Squares) regression model because it functions as a non-significant variable. Multicollinearity can also be assessed from the correlation matrix, with correlations between the independent variables of less than 0.5 indicating its absence [23]. In this study, multicollinearity testing used the VIF.
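The VIF screening described above can be sketched as follows. This is a hedged illustration, not the computation used in the study: it relies on the standard identity that the VIF of variable j, 1 / (1 - R_j^2), equals the j-th diagonal entry of the inverse correlation matrix, and the data and helper names (`correlation_matrix`, `invert`, `vif`) are hypothetical.

```python
from statistics import mean, stdev


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    z = []
    for c in columns:
        m, s = mean(c), stdev(c)
        z.append([(v - m) / s for v in c])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical data: x2 is almost exactly 2 * x1, x3 is unrelated to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
values = vif([x1, x2, x3])
keep = [j for j, v in enumerate(values) if v <= 10]  # rule of thumb: VIF <= 10
```

In this toy example the two nearly collinear variables exceed the threshold of 10 and only the third variable is kept, which mirrors the elimination rule stated in the text.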

Hierarchical cluster analysis method
The grouping analysis in this study used a hierarchical method consisting of the Average Linkage and Ward methods. In general, according to [24], hierarchical cluster analysis groups $N$ objects with the following procedure. First, start with $N$ clusters, each containing a single object, and an $N \times N$ symmetric matrix of Euclidean distances $D = \{d_{ik}\}$, with $i = 1, 2, \cdots, N$ and $k = 1, 2, \cdots, N$. Second, search the distance matrix for the closest pair of clusters, where $d_{UV}$ denotes the distance between the closest clusters $U$ and $V$. Third, merge clusters $U$ and $V$, label the newly formed cluster $(UV)$, and recalculate the distance matrix. Fourth, repeat the second and third steps $N-1$ times so that all objects end up in a single cluster.
The Average Linkage method is a clustering method based on the average distance principle: the distance between two clusters is the average distance between all pairs of observations, one from each cluster. According to [25], this method begins by searching the distance matrix $D = \{d_{ik}\}$ for the nearest objects, for example $U$ and $V$, and merging them into the cluster $(UV)$. Then, the distance between $(UV)$ and another cluster $W$ is given by $d_{(UV)W} = \frac{\sum_{i} \sum_{k} d_{ik}}{N_{(UV)} N_{W}}$, where $d_{ik}$ is the distance between object $i$ in cluster $(UV)$ and object $k$ in cluster $W$, and $N_{(UV)}$ and $N_{W}$ are the numbers of objects in clusters $(UV)$ and $W$. This formula gives the distance between the two regions merged into one group and another region. Next, the most similar clusters are combined to form the second cluster, and the formula for $d_{(UV)W}$ is applied again to build the new distance matrix. The second and third steps are repeated $N-1$ times, where $N$ is the number of provinces.
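The Average Linkage procedure can be sketched in a few lines. The following is a naive illustration under stated assumptions: the points are hypothetical standardized scores, the function name `average_linkage` is illustrative, and the merging stops at a requested number of clusters instead of building the full dendrogram.

```python
from itertools import combinations
from math import dist


def average_linkage(points, n_clusters):
    """Repeatedly merge the two clusters with the smallest average pairwise
    Euclidean distance until n_clusters remain. Returns lists of indices."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean of all distances between a member of a and a member of b.
        total = sum(dist(points[i], points[k]) for i in a for k in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        u, v = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_dist(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[u] + clusters[v])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (u, v)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical standardized scores for five objects and two variables.
points = [(0.0, 0.0), (0.1, 0.2), (4.0, 4.0), (4.2, 3.9), (9.0, 0.0)]
groups = average_linkage(points, 3)
```

Recomputing every average distance at each step is inefficient but keeps the correspondence with the formula explicit; a library implementation would update the distance matrix instead.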
The next cluster analysis method is the Ward method. This clustering method uses the complete set of calculations and maximizes homogeneity within each group. In this method, the distance between two clusters is the sum of squares between the two clusters over all variables [26]. This method tends to combine groups with small numbers of members. The formula used is the error sum of squares, $ESS = \sum_{j=1}^{k} \left( \sum_{i=1}^{n_j} X_{ij}^{2} - \frac{1}{n_j} \left( \sum_{i=1}^{n_j} X_{ij} \right)^{2} \right)$, where $X_{ij}$ denotes the value of the $i$-th object, $i = 1, 2, 3, \cdots, n_j$, in the $j$-th group; $k$ is the number of groups at each step; and $n_j$ is the number of objects in group $j$. The stages of cluster analysis using the Ward method are as follows. First, start with $N$ clusters with one province per cluster (every province is its own cluster), so that the ESS is zero. Second, form the first cluster by selecting the two of the $N$ clusters whose merger gives the smallest ESS value. Third, re-examine the remaining $N-1$ clusters to find the two whose merger minimizes heterogeneity, so that the number of clusters is systematically reduced by one at each step. Fourth, repeat the second and third steps until all provinces merge into one cluster.
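The Ward stages can be illustrated with the sketch below. It is a simplified example under assumptions: the data points are hypothetical, the function names are illustrative, and the ESS is computed in its equivalent deviation form (the sum of squared deviations from the cluster mean over all variables), which equals the sum of squares minus the squared sum divided by the cluster size.

```python
from itertools import combinations


def ess(cluster):
    """Error sum of squares of one cluster, summed over all variables."""
    n = len(cluster)
    total = 0.0
    for column in zip(*cluster):
        m = sum(column) / n
        total += sum((x - m) ** 2 for x in column)
    return total


def ward_merge_cost(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)


def ward(points, n_clusters):
    """Start with one object per cluster (ESS = 0) and repeatedly merge the
    pair of clusters whose merger increases the ESS the least."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        u, v = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[u] + clusters[v]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (u, v)]
        clusters.append(merged)
    return clusters


# Hypothetical standardized scores for four objects and two variables.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
result = ward(pts, 2)
```

Because every merge is chosen to add as little within-cluster variation as possible, the resulting groups tend to be compact, which is the homogeneity property the text attributes to the Ward method.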

Determination of the best method
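This step determines the best cluster analysis method by grouping the data based on distance measurements and then comparing the groupings. The method is selected using the ratio of the average standard deviation within clusters to the standard deviation between clusters, which reflects grouping quality. The average standard deviation within clusters, written as $S_W$, is given by $S_W = \frac{1}{K} \sum_{k=1}^{K} S_k$, where $K$ is the number of clusters and $S_k$ is the standard deviation of the $k$-th cluster. The standard deviation between clusters ($S_B$) is formulated as $S_B = \left[ \frac{1}{K-1} \sum_{k=1}^{K} \left( \bar{X}_k - \bar{X} \right)^{2} \right]^{1/2}$, where $\bar{X}_k$ is the mean of the $k$-th cluster and $\bar{X}$ is the overall mean of the clusters. A smaller ratio of $S_W$ to $S_B$ indicates higher homogeneity [27]. The smaller the $S_W$ value and the greater the $S_B$ value, the better the accuracy of the method.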
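The ratio criterion can be sketched for a single variable as follows. This is an illustrative example under several assumptions that the text does not specify: each cluster is a list of values of one variable, a cluster with a single member is given a standard deviation of 0, and the overall mean is taken as the mean of the cluster means. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def within_sd(clusters):
    """Average of the cluster standard deviations (S_W)."""
    # Assumption: a single-member cluster contributes a standard deviation of 0.
    sds = [stdev(c) if len(c) > 1 else 0.0 for c in clusters]
    return mean(sds)


def between_sd(clusters):
    """Standard deviation of the cluster means around their overall mean (S_B)."""
    centers = [mean(c) for c in clusters]
    grand = mean(centers)  # assumption: overall mean = mean of cluster means
    k = len(clusters)
    return sqrt(sum((m - grand) ** 2 for m in centers) / (k - 1))


def sw_sb_ratio(clusters):
    """Ratio S_W / S_B; smaller values indicate a better grouping."""
    return within_sd(clusters) / between_sd(clusters)


# Hypothetical groupings of one variable: compact clusters versus overlapping ones.
tight = [[1.0, 1.2, 0.8], [9.0, 9.1, 8.9]]
loose = [[1.0, 4.0, 7.0], [3.0, 6.0, 9.0]]
```

Compact, well-separated clusters such as `tight` give a much smaller ratio than overlapping clusters such as `loose`, which is the behaviour the selection criterion relies on.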

Descriptive Data Analysis
Data grouping in this study is based on education indicators for the provinces of Indonesia. The 2018 secondary data comprise fourteen indicators, with the provinces as the research sample. A descriptive analysis was carried out to summarize each education indicator, and the results are given in Table 1. Table 1 reports the minimum, maximum, mean, median, and standard deviation of each research variable. The median and mean values of each variable are relatively similar, except for the variables X8, X9, X10, X11, and X12, which indicates that the distributions of the remaining variables are almost symmetrical. Meanwhile, the minimum and maximum values of each variable are quite far apart. The variables therefore differ considerably in scale, so the initial data need to be transformed into z-scores.

Multicollinearity testing
Distance-based cluster analysis gives the same weight to each variable in the study, so mutually correlated variables cause unbalanced weighting, which in turn affects the grouping of objects. Therefore, collinearity testing is carried out to identify the presence or absence of collinearity between variables. The VIF values calculated for each research variable are given in Table 2.
Based on the VIF values in Table 2, three research variables have a VIF value greater than 10, namely X8, X9, and X10, which indicates multicollinearity for these variables [21], [22]. Furthermore, according to [22], a variable with a VIF value greater than 10 is theoretically non-significant, so it is not used in the subsequent analysis. Eliminating the variables with a VIF value greater than 10 leaves 11 research variables for grouping the provinces of Indonesia based on education indicators. The eliminated variables are the net enrollment rate (NER) of the population in the lowest 40% expenditure group at the Vocational High School level (X8), the number of villages with primary school facilities (X9), and the number of villages with junior high school facilities (X10).

Dendrogram of hierarchical cluster analysis method
According to Mayr et al. (1953) in [28], a dendrogram is a diagram that illustrates relationships in terms of the level of similarity. In this representation, two data items are combined based on their similarities [23], and merging continues with other data that are similar. According to [29], this merging forms a tree-like structure, and the procedure is called the agglomerative method. The agglomerative method is a classification method that starts from individual objects and then successively combines the resulting groups with other objects or groups into a cluster.

Interpretation of cluster characteristics
The dendrogram presents the cluster analysis results by identifying the closest distance between objects, which indicates which objects have similar characteristics. Two objects with similar characteristics appear as two points in close relative position: the closer two objects are, the more similar they are, and the further apart they are, the more they differ. The cluster analysis dendrogram for the hierarchical methods, based on the education indicators and consisting of four clusters, is given in Figure 1.
The next stage is the interpretation of cluster characteristics using each cluster's average for each variable (centroid). The cluster centroids [30] are used to interpret the clusters obtained with the Average Linkage and Ward methods. The centroid values for each variable in the first and second clusters are given in Table 3. Table 3 reports centroid values for only two clusters because the third and fourth clusters each consist of only one object. For the variables X2, X4, X7, and X13, the first cluster has a higher centroid value than the second cluster. This shows that, in the provinces of the first cluster compared with those of the second cluster, the majority of people aged ≥15 years are literate. Participation among those in the lowest 40% expenditure group at the senior high school level is still low. However, community participation in continuing studies at higher education institutions is greater than in the second cluster.
For the variables X1, X3, X5, X6, and X12, the first cluster has a lower centroid value than the second cluster. This shows that people in the first cluster participate less in enrolling their children in Early Childhood Education. However, the communities in the lowest 40% expenditure group at the primary and junior secondary school levels still participate more intensely than communities in the second cluster. In addition, the number of villages with school facilities at the tertiary level is greater than in the second cluster.
The centroid values for each variable in the first, second, and third clusters are given in Table 4. Table 4 reports centroid values for only three clusters because the fourth cluster consists of only one object. Table 4 shows that, for the variables X1, X3, and X5, the first cluster has the lowest value among the clusters. It indicates that the community's reading frequency in the first cluster is higher than in the other two clusters. However, for the literate population, the frequency was the highest compared to the other two clusters. Community participation in Early Childhood Education is still very low compared to the communities in the other two clusters, as is participation among the lowest 40% expenditure group at the primary school level.
Meanwhile, for the variables X2, X4, X7, and X14, the first cluster has the highest value among the three clusters. This shows that the community in the first cluster has higher participation in continuing their studies than the communities in the other two clusters. In addition, for the average length of schooling, residents aged 15 years and over in this cluster spend a longer period completing their studies.
Then, for the second cluster, the variables X6, X11, and X12 each show the lowest mean among the three clusters. This shows that the community in the lowest 40% expenditure group at the junior high school level participates less, including in continuing education at tertiary institutions. Also, school facilities at the college level are still minimal, while people aged over 15 years in the second cluster are more literate.
Furthermore, for the third cluster, the variables X1, X3, X5, X6, X11, and X12 each have the highest value among the three clusters. This suggests that the frequency of literate people is higher than in the other clusters, while participation of the lowest 40% expenditure groups at the primary and junior high school levels is lacking. On the other hand, this third cluster has a large number of villages with higher education facilities and community participation in continuing studies at tertiary institutions.
Next, the variables X2, X4, X7, X13, and X14 have the lowest average among the three clusters. This indicates that the literacy of people aged 15 and over is lower than in the other two clusters, as is the participation of the lowest 40% expenditure group at the senior high school level. In this cluster, community participation in continuing their studies is also still lacking, and the average years of schooling of people aged 15 years and over is low.
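The centroid-based interpretation used in this section amounts to averaging each variable within each cluster and comparing the averages across clusters. A minimal sketch, with hypothetical data and cluster labels and illustrative function names, is:

```python
def centroid(rows):
    """Mean of each variable over the rows (objects) of one cluster."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def centroids_by_cluster(data, labels):
    """Group rows by cluster label and return one centroid per cluster."""
    groups = {}
    for row, label in zip(data, labels):
        groups.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in groups.items()}


# Hypothetical standardized values for three objects and two variables.
data = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
labels = [1, 1, 2]
centers = centroids_by_cluster(data, labels)

# For each variable, the cluster whose centroid is highest.
highest = [max(centers, key=lambda c: centers[c][j]) for j in range(len(data[0]))]
```

Reading off which cluster has the highest or lowest centroid for each variable gives the kind of comparison reported for Tables 3 and 4.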

Determination of the best method
Determining the number of clusters and cluster members with the two hierarchical methods above provides information on the provincial clusters based on education indicators. Table 5 shows the standard deviation values for the first and second clusters for each research variable. The deviation value of each of the two clusters is then calculated as the square root of the sum of the differences between the standard deviation values and the mean standard deviation over the research variables. The average standard deviation within each cluster is obtained by dividing the total standard deviation value by the number of research variables. The mean over the clusters is obtained by dividing the sum of the average within-cluster standard deviations by the number of clusters.
Then, the standard deviation values for each cluster using the Ward method are given in Table 6. Table 6 shows the standard deviation values of the three clusters for each research variable. The deviation value of each of the three clusters is calculated as the square root of the sum of the differences between the standard deviation values and the mean standard deviation over the eleven research variables. Then, the average standard deviation within each cluster is obtained by dividing the total standard deviation value by the number of research variables. Meanwhile, the mean over the clusters is obtained by dividing the sum of the average within-cluster standard deviations by the number of clusters.
Furthermore, the ratios calculated for the Average Linkage and Ward methods are given in Table 7. Based on Table 7, the SW value, obtained by dividing the total within-cluster standard deviation by the number of clusters, is 67.21 for the Average Linkage method and 141.66 for the Ward method. The SB value, obtained by dividing the sum of the squared differences between the mean deviation within each cluster and the mean over the clusters by the number of clusters minus one, is 63.63 for the Average Linkage method and 10020.48 for the Ward method.
Furthermore, the ratio of SW to SB for the Average Linkage method is 1.06. Meanwhile, the ratio of SW to SB for the Ward method is 0.01, smaller than that of the Average Linkage method. In this case, the Ward method produces more homogeneous groups, so its ratio value is smaller. This means that the Ward method has better grouping accuracy than the Average Linkage method. These results agree with [31], which states that the Ward method is the most optimal method for similarity analysis.
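The two ratios can be checked directly from the SW and SB values reported from Table 7. The short sketch below only reproduces that arithmetic; the dictionary layout and variable names are illustrative.

```python
# SW and SB values reported from Table 7 for each method.
sw = {"Average Linkage": 67.21, "Ward": 141.66}
sb = {"Average Linkage": 63.63, "Ward": 10020.48}

# Ratio of SW to SB, rounded to two decimals as in the text.
ratios = {method: round(sw[method] / sb[method], 2) for method in sw}

# The method with the smallest ratio gives the most homogeneous grouping.
best = min(ratios, key=ratios.get)
```

The Average Linkage ratio is 67.21 / 63.63 ≈ 1.06 and the Ward ratio is 141.66 / 10020.48 ≈ 0.01, so the smallest ratio belongs to the Ward method.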
This study analyzes provincial clusters based on education indicators using the Average Linkage and Ward methods. The results show that the Ward method has better classification accuracy than the Average Linkage method. In contrast, a cluster analysis using the Average Linkage and Ward methods on respondent data for Unit Link life insurance customers [15] found that the Average Linkage method performed better than the Ward method, with SW to SB ratio values of 0.486 and 0.710, respectively. Similarly, the cluster analysis using the Average Linkage and Ward methods in the case study of the Human Development Index in South Sulawesi Province by [14], evaluated with the Dunn index, found that the Average Linkage method produced the best Dunn index of 0.55, compared with 0.43 for the Ward method.
This shows that determining the best method depends on the research variable indicators used and on the procedures or stages of the study. The research by [15] had stages consisting of standardizing data, selecting distance measures, and implementing the steps of the hierarchical method. Meanwhile, the research by [14] had stages consisting of data standardization, determining the measure of similarity or dissimilarity between data, clustering with the distance matrix, determining the number of clusters and their members, and examining the characteristics of the resulting clusters. In this research, the stages were data standardization, multicollinearity testing, construction of the dendrogram for the hierarchical cluster analysis methods, interpretation of cluster characteristics, and determination of the best method. With these stages, the results showed that the Ward method has better classification accuracy than the Average Linkage method.

CONCLUSIONS AND SUGGESTIONS
The provinces of Indonesia were grouped based on educational indicators using the Average Linkage and Ward methods. Based on the VIF values, three research variables were eliminated from further analysis in the grouping of provinces. The ratio of SW to SB for the Ward method is 0.01, which is smaller than the 1.06 obtained for the Average Linkage method. This shows that the Ward method produces more homogeneous groups with a smaller ratio value. Thus, the Ward method has better grouping accuracy than the Average Linkage method. It is suggested that future research take into account that the best agglomerative method depends on the research variable indicators used and on the procedures or stages of the study. In this research, the stages were data standardization, multicollinearity testing, construction of the dendrogram for the hierarchical cluster analysis methods, interpretation of cluster characteristics, and determination of the best method.