A COMPARISON OF CLUSTERING BY IMPUTATION AND SPECIAL CLUSTERING ALGORITHMS ON THE REAL INCOMPLETE DATA

The existence of missing values severely inhibits the process of clustering. To overcome this, scientists have proposed several solutions, two of which are imputation and special clustering algorithms. This paper compares the results of clustering incomplete data using both approaches. The k-means algorithm was applied to the imputed data. The algorithms used were distribution free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EMSVD), biplot imputation (BI), four modified fuzzy c-means (FCM) algorithms, k-means soft constraints (KSC), distance estimation strategy fuzzy c-means (DESFCM), and k-means soft constraints imputed-observed (KSC-IO). The data used were the 2018 environmental performance index (EPI) data and simulation data. The optimal clustering of the 2018 EPI data was chosen based on the Silhouette index, whose capability had previously been tested on the simulation datasets. The results show that the Silhouette index is well able to validate clustering results on incomplete datasets, and that the optimal clustering of the 2018 EPI dataset was obtained by k-means with BI, with a Silhouette index of 0.613 and a time complexity of 0.063, respectively. Based on these results, k-means with BI is suggested for clustering analysis of the 2018 EPI dataset.


Introduction
In big data, the values of all observed objects are often not obtained completely, which is frequently caused by missing values. A missing value is the lack of information on an object in some of the measured indicators [1]. Missing values are divided into three categories, namely missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) [2]. MCAR occurs when there is no relation among the missing values, whether in the same variable or in different variables. In the second category, MAR, the probability of a value being missing depends on the available values, but there is no relationship among the missing values themselves. In the last category, NMAR occurs when there is no information about the existence of the missing values or their cause.
Missing values severely inhibit many analyses applied to a dataset, one of which is clustering analysis. To overcome this, scientists have proposed several solutions, namely marginalisation, imputation, and special clustering algorithms. Marginalisation simply deletes the objects or variables that contain missing values [3]. Consequently, the data size decreases significantly if the missing values are spread over many objects or variables. The second solution is imputation, a method that completes missing values with certain values. Finally, special clustering algorithms are clustering algorithms designed to handle missing values in the dataset directly.
With the imputation approach, clustering analysis is carried out after the dataset has been made complete. Imputation can be performed by filling missing values with the respective variable means, with zeros, or with values obtained from imputation algorithms. Imputation algorithms fall into two categories, deterministic and stochastic; unlike a stochastic process, a deterministic process always produces the same result. Some deterministic imputation algorithms that have been proposed are distribution free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EMSVD), and biplot imputation (BI). These algorithms have been investigated in simulation studies. Regardless of their capability, imputed values will certainly introduce bias because they are estimates of values that are not known exactly. To assess the quality of imputed values, Ananda et al. proposed a method to measure the goodness-of-fit of imputed data [4].
As for special clustering algorithms, several have been proposed for clustering incomplete datasets. Four modified fuzzy c-means (FCM) algorithms were proposed by Hathaway and Bezdek, namely the whole-data strategy (WDS), partial distance strategy (PDS), optimal completion strategy (OCS), and nearest prototype strategy (NPS) [5]. K-means soft constraints (KSC) was proposed by Wagstaff [6]. Distance estimation strategy fuzzy c-means (DESFCM) was proposed by Himmelspach and Conrad [7]. Recently, k-means soft constraints imputed-observed (KSC-IO) was proposed by Mesquita et al. [8].
Assessing the quality of clustering requires clustering validity measures, which are categorized into internal and external clustering validity. External clustering validity uses a reference clustering to evaluate the clusters obtained, whereas internal clustering validity measures the quality of the obtained clustering based only on a dissimilarity measure among objects, since no reference clustering is available [9]. In general, the dissimilarity measure used in internal clustering validity is the Euclidean distance [10]. Previous research mostly used simulation data that contained a reference clustering, so external clustering validity was used in those cases. The problem that has not been addressed in previous works is measuring the quality of clustering on a real incomplete dataset that has no reference clustering. Therefore, this paper demonstrates the use of internal clustering validity on real incomplete data, after first testing its capability on simulation data. The datasets used in this paper are the 2018 environmental performance index (EPI) data and the simulation datasets Iris, Wine, and Seeds. Finally, the clustering results on the real incomplete data and the simulation data were compared to obtain the optimal clustering. This paper is arranged as follows. Section II describes the material and methods used in this research. Section III presents the results and discussion. Conclusions and suggestions are given in the last section.

Datasets
The datasets used in this paper are the 2018 EPI data and the simulation data. The 2018 EPI is a project led by Yale University, Columbia University, the Samuel Family Foundation, the McCall MacBain Foundation, and the World Economic Forum. The data ranks the performance of countries on high-priority environmental issues in two areas, protection of human health (HLT) and protection of ecosystems (ECO). The 2018 EPI dataset is quantitative data with values from 0 to 100. The data is represented as a matrix of order 180×24. It has 237 (5.49%) missing values spread over 89 (49.44%) objects and 7 (29.17%) variables. The type of the missing values in the 2018 EPI data is NMAR because there is no information about their existence or cause. Table 1 shows the list of variables in the 2018 EPI dataset, whereas the list of objects is shown in Table 2.

Table 1. Variables of the 2018 EPI dataset.

Issue  Symbol  Description
HLT    HAD     Measures the actual outcomes from exposure to indoor air pollution from household use of solid fuels.
       PME     Measures the average annual concentration of PM 2.5 to which the typical citizen of each country is exposed.
       PMW     Measures the weighted percentage of a country's population exposed to annual concentrations of PM 2.5.
       UWD     Measures the actual outcomes from lack of access to or use of improved sources of drinking water.
       USD     Measures the actual outcomes from lack of access to or use of improved sanitation facilities.
       PBD     Measures the actual outcomes from lead exposure.
ECO    MPA     Measures the percent of a country's Exclusive Economic Zone (EEZ) set aside as a marine protected area (MPA).
       TBN     Measures the percent of a country's biomes in terrestrial protected areas (TPAs), weighted by the prevalence of different biome types within that country.
       TBG     Measures the percent of a country's biomes in terrestrial protected areas (TPAs), weighted by the prevalence of different biome types around the world.
       SPI     Measures the average area of species distributions in a country under protection, weighted by the country's stewardship for each species.
       PAR     Measures the extent to which a country's protected areas are ecologically representative.
       SHI     Measures the average loss in suitable habitat for species in a country, weighted by the country's stewardship for those species.
       TCL     Measures the five-year moving average of the percent of forested land lost. Forested land is defined as having ≥ 30% canopy cover.
       FSS     Measures the percentage of a country's total catch that comes from taxa classified as either over-exploited or collapsed.
       MTR     Measures the trends in the Regional Marine Trophic Indices of a country, or the mean trophic level of the fish catch in each region of the Exclusive Economic Zone.
Jurnal Ilmu Komputer dan Informasi (Journal of Computer Science and Information), volume 13, issue 2, June 2020

The simulation datasets used in this paper are the Iris, Wine, and Seeds datasets. The Iris dataset is widely known and consists of 150 objects of Iris plants, divided into three classes and measured on four variables. The Wine dataset consists of 178 objects, also divided into three classes and measured on thirteen variables. The Seeds dataset consists of 210 objects, divided into three classes and measured on seven variables. The incomplete datasets were created from these datasets with missing-value ratios of 5, 6, 7, 8, 9, and 10% of all data. The missing values in the Iris, Wine, and Seeds datasets were randomly spread over 1 (25.00%), 4 (30.77%), and 2 (28.57%) variables respectively, matching the spread of missing values in the 2018 EPI data.
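The construction of these incomplete simulation datasets can be sketched as follows. The function below is our illustration of injecting a given ratio of missing values uniformly at random into a chosen subset of variables; the names and the random scheme are assumptions, not the paper's code.

```python
import numpy as np

def inject_missing(data, ratio, variables, seed=None):
    """Replace `ratio` of all entries of `data` with NaN, with the
    missing cells restricted to the column indices in `variables`."""
    rng = np.random.default_rng(seed)
    out = data.astype(float).copy()
    n_missing = int(round(ratio * data.size))
    # Candidate cells: every entry in the allowed columns.
    rows = np.repeat(np.arange(data.shape[0]), len(variables))
    cols = np.tile(np.asarray(variables), data.shape[0])
    pick = rng.choice(rows.size, size=n_missing, replace=False)
    out[rows[pick], cols[pick]] = np.nan
    return out

# Example: a 150 x 4 stand-in for the Iris measurements with 5% of all
# cells missing, confined to a single variable as in the paper's setup.
X = np.random.default_rng(0).normal(size=(150, 4))
X_miss = inject_missing(X, ratio=0.05, variables=[2], seed=0)
```

The original array is left untouched; only the copy receives NaN entries.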

Imputation algorithms
The imputation algorithms used in this paper were distribution free multiple imputation (DFMI) [11], Gabriel eigen (GE) [12], expectation maximization-singular value decomposition (EMSVD) [13], and biplot imputation (BI) [14]. These algorithms are based on singular value decomposition (SVD) and use a multiple regression model. From the imputed data, the proximity matrix and the covariance matrix are computed in order to measure its goodness-of-fit. Suppose that X is the proximity matrix of the imputed data and Y is the proximity matrix of the initial data. The goodness-of-fit of the proximity matrix is obtained by using Equation 1.
where r and σ ii (i = 1, 2, · · · , r) are the rank and the singular values, respectively, of X T Y T or Y T X T.
X T is the X matrix after the translation-normalization procedure. The measure GoF p (X, Y) belongs to the interval [0, 1]; if GoF p (X, Y) ≈ 1, the imputed data gives a good approximation of the dissimilarity measures among objects in the initial data, whereas GoF p (X, Y) ≈ 0 indicates a bad approximation [4]. In this paper, the average of the goodness-of-fit of the proximity matrix and of the covariance matrix is taken as the goodness-of-fit of the imputed data. Only the results of imputation algorithms whose goodness-of-fit exceeds 0.900 are processed in the clustering analysis. The clustering algorithm used on the imputed data is the k-means algorithm.

K-means clustering algorithm
Clustering algorithms are algorithms that group objects based on a similarity measure: objects in the same group resemble each other more than objects in different groups. One of the more popular clustering algorithms is k-means, an algorithm that assigns each object to the cluster with the nearest prototype. Recent studies using k-means have been done in areas such as recommendation systems for the selection of specialization courses [15], the measurement of mangrove areas [16], analysis of education quality in senior high schools [17], and mapping the quality of education based on the results of the 2019 national exam in Banyumas Regency [18]. The k-means process consists of three steps. First, partition the objects into k initial clusters arbitrarily or by using a certain analysis. Second, assign each object to the cluster whose prototype is nearest, then recalculate the prototypes of the cluster receiving the object and of the cluster losing it. Finally, repeat the second step until no more reassignments take place.
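The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper; the initial prototypes are chosen here by a simple farthest-first analysis, which is one possible "certain analysis" for step one.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # Step 1: initial prototypes via a simple farthest-first choice.
    prototypes = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - p, axis=1) for p in prototypes], axis=0)
        prototypes.append(X[np.argmax(d)])
    prototypes = np.array(prototypes, dtype=float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign every object to the cluster with the nearest prototype.
        dist = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        # Step 3: stop when no reassignment takes place.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recalculate the prototype of every non-empty cluster.
        for j in range(k):
            if np.any(labels == j):
                prototypes[j] = X[labels == j].mean(axis=0)
    return labels, prototypes

# Two well-separated groups are recovered exactly.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
labels, prototypes = kmeans(X, k=2)
```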

Special clustering algorithms for incomplete data
In this paper, we present seven clustering algorithms for incomplete data. Let n X p be any data matrix with n objects and p variables that has missing values in some variables. Suppose that c 1 , c 2 , c 3 , · · · , c k are the prototypes of the obtained clusters.
2.4.1. Whole-data strategy. The whole-data strategy (WDS) classifies the objects that have complete data by using the FCM algorithm. Then, each object with missing values is assigned to the cluster with the nearest prototype. To find the nearest prototype for the ith object containing missing values, we use Equation 2.
where x i is the data of the ith object, c j is the jth prototype, x ik is the value of the kth variable of the ith object, and c jk is the value of the kth variable of the jth prototype. The weight w ijk is 0 if x ik is missing and 1 otherwise.
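This assignment rule can be sketched as follows, assuming Equation 2 is the squared Euclidean distance restricted to the observed coordinates (the function name is ours):

```python
import numpy as np

def nearest_prototype(x, prototypes):
    """Index of the prototype nearest to x, skipping missing (NaN)
    entries: the weight w_k is 0 where x_k is missing, 1 otherwise."""
    w = ~np.isnan(x)
    d = [np.sum(w * (np.nan_to_num(x) - c) ** 2) for c in prototypes]
    return int(np.argmin(d))

# An object missing its second coordinate is assigned
# using the remaining observed coordinates only.
protos = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
x = np.array([4.8, np.nan, 5.1])
```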
2.4.2. Partial distance strategy. The partial distance strategy (PDS) classifies objects by using the FCM algorithm with a modified prototype computation. The prototypes are computed by using Equation 3.
where c kj is the value of the kth prototype in the jth variable, u ik is the membership value of the ith object in the kth cluster, x ij is the value of the ith object in the jth variable, and the weight w ij is 0 if x ij is missing and 1 otherwise.
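A sketch of this prototype update, assuming Equation 3 follows the standard FCM form in which the memberships are raised to a fuzzifier m (the fuzzifier and the function name are our assumptions):

```python
import numpy as np

def pds_prototypes(X, U, m=2.0):
    """Prototype update over observed values only.
    X: (n, p) data with NaN for missing entries; U: (n, k) memberships.
    The fuzzifier m is the usual FCM exponent (an assumption here)."""
    W = (~np.isnan(X)).astype(float)   # w_ij: 0 if x_ij is missing
    Xf = np.nan_to_num(X)              # missing terms contribute 0
    Um = U ** m
    num = Um.T @ (W * Xf)              # (k, p) weighted sums
    den = Um.T @ W                     # (k, p) totals of the weights
    return num / den

# Two objects, one missing value, equal memberships in both clusters.
X = np.array([[1.0, 2.0], [3.0, np.nan]])
U = np.full((2, 2), 0.5)
C = pds_prototypes(X, U)
```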

2.4.3. Optimal completion strategy. The optimal completion strategy (OCS) estimates the missing values and classifies objects simultaneously by optimizing its objective function. Basically, the OCS algorithm adopts the FCM algorithm in its process. The imputation step in the OCS algorithm is performed after the prototype computation, by using Equation 4.
where x * ij is the missing value of the ith object in the jth variable, u il is the membership value of the ith object in the lth cluster, and c lj is the lth prototype in the jth variable.
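Assuming Equation 4 is the membership-weighted average of the prototypes with the usual FCM fuzzifier, the imputation step can be sketched as:

```python
import numpy as np

def ocs_impute(X, U, C, m=2.0):
    """Fill each missing x_ij with the membership-weighted average of
    the prototypes (the fuzzifier m is the usual FCM exponent and an
    assumption here)."""
    Um = U ** m
    est = (Um @ C) / Um.sum(axis=1, keepdims=True)  # (n, p) estimates
    return np.where(np.isnan(X), est, X)            # keep observed values

# One object that leans toward the second cluster: its missing value
# is pulled toward that cluster's prototype.
C = np.array([[0.0, 0.0], [4.0, 4.0]])
U = np.array([[0.25, 0.75]])
X = np.array([[3.9, np.nan]])
X_hat = ocs_impute(X, U, C)
```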

2.4.4. Nearest prototype strategy. The nearest prototype strategy (NPS) is similar to OCS in all steps, except that every missing value x * ij of the ith object in the jth variable is substituted with the corresponding value of the nearest prototype. To find the nearest prototype, we use Equation 2.
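A sketch of the NPS substitution, with nearness measured over the observed entries only as in Equation 2 (the function name is ours):

```python
import numpy as np

def nps_impute(X, C):
    """Replace every missing value of an object with the corresponding
    value of its nearest prototype, where nearness is computed over the
    observed coordinates only."""
    out = X.copy()
    for i, x in enumerate(X):
        w = ~np.isnan(x)
        d = np.sum(w * (np.nan_to_num(x) - C) ** 2, axis=1)
        out[i, ~w] = C[np.argmin(d), ~w]
    return out

# Each object inherits the missing coordinate of its nearest prototype.
C = np.array([[0.0, 0.0], [5.0, 5.0]])
X = np.array([[4.9, np.nan], [0.2, np.nan]])
X_hat = nps_impute(X, C)
```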
2.4.5. K-means soft constraints. K-means soft constraints (KSC) is based on the idea of defining soft constraints on the variables with missing values and using them as additional information. Suppose that n X p is incomplete data; its variables are divided into a dataset of completely observed variables, n X q, and a dataset of variables with missing values, n X r, where p = q + r. A soft constraint s ij between x i and x j in n X r is computed by using Equation 5 if i, j ∈ {1, 2, · · · , n} and ∀k ∈ {1, 2, · · · , r}, x ik ∪ x jk ≠ ∅,
and s ij is zero otherwise. Henceforth, the objects in the n X q dataset are classified by using the k-means algorithm, where the distance between the ith object and the kth prototype is computed by using Equation 6, with w ∈ [0, 1] and δ ij a binary variable that is 1 if x i and x j are assigned to the same cluster and 0 otherwise.
2.4.6. K-means soft constraints imputed-observed. K-means soft constraints imputed-observed (KSC-IO) was presented by Mesquita et al. to use information from the partially complete objects in the n X r dataset within the KSC algorithm [8]. The method develops the KSC algorithm by adding new soft constraints on the partially complete objects that were previously ignored, as well as on the imputed values. From the n X r dataset we obtain the imputed data n X * r = (x * ik ). A soft constraint s * ij between x i and x j in n X r is computed by using Equation 7 if i, j ∈ {1, 2, · · · , n} and ∃k ∈ {1, 2, · · · , r}, x ik ∪ x jk ≠ ∅,
and s * ij is zero otherwise. Henceforth, the objects in the n X q dataset are classified by using the k-means algorithm, where the distance between the ith object and the kth prototype is computed by using Equation 8, with w 1 + w 2 + w 3 = 1, w i ∈ [0, 1] for all i, and δ ij a binary variable that is 1 if x i and x j are assigned to the same cluster and 0 otherwise.

2.4.7. Distance estimation strategy fuzzy c-means. Distance estimation strategy fuzzy c-means (DESFCM) was developed from FCM by using another variant of FCM as its basis. The variant is based on the membership degrees of the data items to the prototypes. In the first step, n X p is divided into a dataset of completely observed objects, n1 X p = (x ij ), and a dataset of objects containing missing values, n2 X p. The initial membership matrix n1 U k = (u ij ) is initialised randomly. In the second step, the cluster prototypes are calculated by using Equation 9.
where c kj is the kth prototype in the jth variable, u ik is the membership degree of the ith object from n1 X p in the kth cluster, and x ij is the value of the ith object from n1 X p in the jth variable. Then the dataset n D k = (d ik ), where d ik is the distance between the ith object and the kth prototype, is calculated by using Equation 10
for all i and j, where x ij is the value of the ith object in the jth variable of the dataset, c kj is the kth prototype in the jth variable, u lk is the membership degree of the lth object from n1 X p in the kth cluster, and x lj is the value of the lth object in the jth variable of n1 X p. Then the new membership matrix n1 U k is calculated by using Equation 11.
The process is iterated from the second step until the residual sum of squares (RSS) between the n D k matrices of consecutive iterations no longer changes significantly.

Clustering validity
The missing values in the 2018 EPI dataset cannot be ignored because they are categorized as NMAR [2]. For that reason, marginalization is not used in this paper. Furthermore, the clustering validity used on this data must be an internal clustering validity because there is no reference clustering. The internal clustering validity used is the Silhouette index, because it performs better than the other validities [19], [20]. The Silhouette index is obtained as the average of the Silhouette value of each object, given by Equation 12,
where a(i) is the average distance between the ith object and the other objects in the same cluster, and b(i) is the average distance between the ith object and the objects in the nearest other cluster. The Silhouette value s(i) belongs to the interval [−1, 1]. If s(i) ≈ 1, the ith object is well matched to its cluster; conversely, if s(i) ≈ −1, the ith object is not well matched to its cluster [21]. Peladeau et al. stated that the clustering process must be shaped by the dissimilarity measure of the initial data [22]. That statement reinforces the motivation to use the dissimilarity measure among objects as the foundation of the internal clustering validity. However, the existence of missing values in the data inhibits the computation of the dissimilarity measure among objects. The problem is solved if the dissimilarity measure is computed with the weighted Euclidean distance proposed by Gower [23], formulated in Equation 13,
where d (x i , x j ) is the weighted Euclidean distance between the ith and jth objects, p is the total number of variables in the data, x is is the value of the ith object in the sth variable, and the weight w ijs is 0 if x is or x js is missing and 1 otherwise. This computation implicitly uses marginalization: for every pair of objects whose weighted Euclidean distance is computed, it removes exactly those variables in which at least one object of the pair has a missing value. As a result, the distance of an object pair becomes imprecise if many variables are removed because of missing values, which in turn reduces the capability of an internal clustering validity built on that dissimilarity measure. Therefore, a method to assess the capability of the internal clustering validity is very much needed.
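The two ingredients above can be sketched together. This assumes Equation 13 takes a common form of Gower's weighted distance, rescaling the sum of squared differences over the jointly observed variables by p divided by their number, and that Equation 12 is the usual s(i) = (b(i) − a(i)) / max{a(i), b(i)}; the function names are ours.

```python
import numpy as np

def weighted_euclidean(xi, xj):
    """Distance over jointly observed variables, rescaled toward the full
    dimension p; w_ijs is 0 if either value is missing, 1 otherwise.
    Assumes at least one variable is jointly observed."""
    w = ~(np.isnan(xi) | np.isnan(xj))
    return np.sqrt(np.sum((xi[w] - xj[w]) ** 2) * len(xi) / w.sum())

def silhouette_index(X, labels):
    """Average Silhouette value over all objects, with the pairwise
    distances computed by weighted_euclidean."""
    n = len(X)
    D = np.array([[weighted_euclidean(X[i], X[j]) for j in range(n)]
                  for i in range(n)])
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0
        # b(i): mean distance to the closest of the other clusters.
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Two tight, well-separated clusters give an index close to 1,
# even in the presence of a missing entry.
X = np.array([[0.0, 0.0], [0.1, np.nan], [10.0, 10.0], [10.1, 10.0]])
labels = np.array([0, 0, 1, 1])
sil = silhouette_index(X, labels)
```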
In this paper, we use the simulation data, which have a reference clustering, to measure the capability of the Silhouette index obtained using the weighted Euclidean distance. We validate the clustering results using both the external clustering validity and the Silhouette index, and then compute their correlation. If the correlation is close to 1, the Silhouette index agrees well with the external clustering validity, so it is a reasonably faithful validity to use. If the correlation is close to −1, the Silhouette index has the converse relationship to the external clustering validity: where the external clustering validity shows the highest value for a good clustering, the Silhouette index will show the lowest. This second state may still yield a reasonably faithful validity, but the value of the Silhouette index must be interpreted with care. The unpleasant condition occurs when the correlation is about 0, which shows that there is no relationship between the Silhouette index and the external clustering validity; consequently, the Silhouette index cannot be considered a faithful validity and is not suggested in that condition. The correlation is computed using Equation 14,
where x and y are the vectors of the Silhouette index and the external clustering validity results respectively, each element being the validity measure of the clustering result of one of the algorithms used, in the same order. x i is the ith element of the vector of Silhouette index results and x̄ is its average; y i is the ith element of the vector of external clustering validity results and ȳ is its average. m is the total number of elements of the vectors.
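Equation 14 is the ordinary Pearson correlation between the two vectors of validity results and can be sketched as:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between the Silhouette results x and the
    external-validity results y."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd @ yd) / np.sqrt((xd @ xd) * (yd @ yd))

# Validities that rank the algorithms identically (up to a shift)
# give a correlation of 1; an exactly reversed ranking gives -1.
x = np.array([0.2, 0.5, 0.9])
y = np.array([0.4, 0.7, 1.1])
```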

Research flow
The research steps used in this paper consist of several stages. First, the clustering results obtained with the special clustering algorithms and the imputed data obtained with the imputation algorithms are computed from the 2018 EPI dataset simultaneously. Second, the goodness-of-fit of the imputed data is inspected to determine which imputed datasets are faithful enough for the clustering process with the k-means algorithm. Next, the clustering results are validated using the Silhouette index, whose capability has been examined beforehand. Then the optimal clustering is determined based on the value of the Silhouette index. In the last stage, the result of the optimal clustering is interpreted.

The missing value distribution on the 2018 EPI dataset
We have seen that the 2018 EPI dataset has 237 (5.49%) missing values spread over 89 (49.44%) objects and 7 (29.17%) variables. Table 3 shows the distribution of missing values over objects and Table 4 shows the distribution over variables. From Table 3 we know that 2 countries have missing values in 5 (20.83%) of the 24 measured variables; the most common count is 3 missing values per country, and 25 countries have the fewest missing values. Furthermore, Table 4 shows that the variables with quite a lot of missing values, i.e. at least 30 each, are DPT, MTR, FSS, MPA, and TCL.
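Per-object and per-variable counts of this kind can be tabulated with a few lines of NumPy (an illustration; `missing_profile` is our name, not the paper's):

```python
import numpy as np

def missing_profile(X):
    """Summarise missing values the way Tables 3 and 4 do: counts per
    object (row) and per variable (column), plus the overall ratio."""
    mask = np.isnan(X)
    return {
        "per_object": mask.sum(axis=1),
        "per_variable": mask.sum(axis=0),
        "ratio": mask.mean(),
    }

# A tiny 2 x 3 example with three missing entries.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 6.0]])
prof = missing_profile(X)
```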

Capability of Silhouette index
This paper uses the correlation between the internal and external clustering validity to measure the capability of the internal clustering validity. The internal clustering validity used is the Silhouette index, and the external clustering validities used are the Rand index, Jaccard index, F-measure, and Purity. Figure 2 shows the average correlation of the Silhouette index with the external clustering validities on the simulation data. From the figure, we know that the correlation of the Silhouette index on the datasets used is more than 0.600 for every missing-value ratio; the overall average correlation is about 0.836. This means that the Silhouette index agrees well with the external clustering validity on those simulation datasets, so it is a reasonably faithful validity for incomplete datasets with the same characteristics. Because the spread of missing values in those simulation datasets was matched with the spread of missing values in the 2018 EPI data, the Silhouette index is also reasonably used on the 2018 EPI dataset.

The goodness-of-fit of imputation data
The imputation algorithms applied to the 2018 EPI dataset in this paper are DFMI, GE, EMSVD, and BI. Let the goodness-of-fit of the imputed data be symbolized as GoF p, the goodness-of-fit of the proximity matrix as D, and the goodness-of-fit of the covariance matrix as Σ; then Table 5 shows the goodness-of-fit of the imputed data for each imputation algorithm used. From the table, we know that the goodness-of-fit of every imputed dataset is more than 0.900, which means that all of the imputed data can be used in the clustering process with the k-means algorithm.

Comparison of the clustering results
The clustering results of the algorithms used are compared to find the optimal clustering. On the simulation data, we use the external clustering validity and the Silhouette index to validate the clusterings obtained; on the 2018 EPI dataset, we use only the Silhouette index because the dataset has no reference clustering. Figure 3 shows the average validation of the clustering results of each algorithm on the Iris, Wine, and Seeds datasets respectively. From those figures, we know that WDS generally produces the lowest clustering quality, whereas most of the other clustering results are of quite similar quality. For the 2018 EPI dataset, Table 6 shows the average clustering quality obtained from 100 repetitions of each algorithm. The table shows that the optimal clustering is obtained by the k-means algorithm on the data imputed with biplot imputation (BI), whereas the lowest clustering quality is again obtained by WDS, as on the simulation datasets. On the simulation data, k-means on the data imputed with BI also shows good quality. Table 7 shows the average time complexity, also obtained from 100 repetitions. From the table, we know that DESFCM has the highest time complexity, with DFMI in second place, whereas the lowest time complexities are obtained by EMSVD and BI, whose values are quite similar.
Based on the clustering results obtained, choosing k-means with the BI algorithm for its optimal clustering result on the 2018 EPI dataset and its low time complexity, we obtain the visualization of the clustering results in Figure 4. Furthermore, three clusters are obtained: the first cluster consists of 58 (32.2%) countries, the second of 67 (37.2%) countries, and the third of 55 (30.6%) countries.

Conclusion
In this paper, we have compared the clustering results obtained using imputed data and special clustering algorithms on incomplete datasets, namely the 2018 EPI dataset and the simulation datasets. The results show that the Silhouette index has a good ability to validate clustering results on a real incomplete dataset, based on its correlation with the external clustering validity on the simulation datasets. On the simulation datasets, most clustering results are of quite similar quality. The optimal clustering of the 2018 EPI dataset is obtained by k-means with the BI algorithm, whose time complexity is also quite small. Based on these results, k-means with BI is suggested for clustering analysis of the 2018 EPI dataset.