A Bonferroni Mean Based Fuzzy K-Nearest Centroid Neighbor Classifier

K-nearest neighbor (KNN) is an effective nonparametric classifier that determines the neighbors of a point based only on distance proximity. The classification performance of KNN is disadvantaged by the presence of outliers in small sample size datasets, and it deteriorates on datasets with class imbalance. We propose a local Bonferroni Mean based Fuzzy K-Nearest Centroid Neighbor (BM-FKNCN) classifier that assigns the class label of an unclassified sample based on the nearest local mean vector, obtained using the Nearest Centroid Neighborhood (NCN) concept, to better represent the underlying statistics of the dataset. The proposed classifier is robust towards outliers because the NCN concept also considers the spatial distribution and geometrical placement of the neighbors. It can also overcome class domination of its neighbors in datasets with class imbalance because it averages the centroid vectors of each class to adequately capture the distribution of the classes. The BM-FKNCN classifier is tested on datasets taken from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and benchmarked against classification results from the KNN, Fuzzy-KNN (FKNN), BM-FKNN and FKNCN classifiers. The experimental results show that the BM-FKNCN achieves the highest overall average classification accuracy of 89.86% among the five classifiers.


Introduction
Classification is the process of assigning a collection of things into categories or classes. The KNN algorithm, first proposed in [1], is a nonparametric classification method that is recognized as a top 10 algorithm in data mining [2]. It has been used in fields such as pattern recognition [3][4][5], regression [5][6], feature selection [7] and outlier detection [8][9][10], among other data mining areas. KNN decides the class label of an unclassified sample by obtaining its k-nearest neighbors based on distance proximity to the training samples, and the unclassified sample is given the class label held by the majority of its neighbors.
The KNN algorithm has a few drawbacks that need to be addressed. The first is its concept of neighborhood, which is based only on distance proximity. In some practical cases, the geometrical placement of points around an unclassified sample can be a more important determiner than their distance proximity when forming the neighborhood used to classify the sample [11]. The NCN concept introduced by Chaudhuri [12] forms the neighborhood of an unclassified sample by using both the closeness of a point and a centroid criterion; the centroid criterion accommodates the importance of the geometrical placement of points around the unclassified sample. An extension of the NCN concept called the KNCN algorithm was proposed by Sánchez et al. [11] and was shown to surpass the classification performance of KNN. Even though there is no theoretical justification for forming the neighborhood of an unclassified sample by considering both the distance proximity and the geometrical placement of points, this can improve classification accuracy in special situations, such as finite samples, where the already categorized samples do not fully represent the underlying statistics of the dataset [11]. Other methods for selecting the nearest neighbors have been proposed in which the neighborhood information of the training instances is taken into consideration. Liu et al. [13] developed the Mutual K-Nearest Neighbor (MKNN) algorithm, where a training instance is selected as a k-nearest neighbor only if the unclassified sample is also a k-nearest neighbor of that training instance.
The second matter is that in the classic KNN algorithm, samples that have already been classified are considered equally important in the process of classifying a new sample [14]. When the neighborhood of an unclassified sample has more than one neighbor, a tie of classes between the neighbors may occur, in which case the assignment is arbitrary [15]. In cases where the class distribution is imbalanced, this arbitrary assignment can deteriorate the classification performance [16]. To solve this, Keller et al. [15] introduced the theory of fuzzy sets into the KNN algorithm and developed the FKNN, a fuzzy version of KNN. The FKNN technique executes the class label assignment process by giving the unclassified sample a degree of membership in each class [17]. This was further developed by Sarkar [18], who correlated the class assignment process with the uncertainty caused by overlapping classes and insufficient features. This is a sensitive issue whose extent is often worsened by the simple majority vote among the k-nearest neighbors; in general, each sample should not be considered equally important in the classification process [19]. Dudani [20] developed the Distance-Weighted K-Nearest Neighbor (WKNN) algorithm, in which a point in the group of k-nearest neighbors is given a larger weight if it is closer to the unclassified sample. Many studies have focused on methods to calculate the weights in the voting procedure of the class assignment, such as [21][22][23]. Rosdi et al. [24] combined the advantages of the KNCN and FKNN algorithms to develop the FKNCN algorithm, which uses the NCN concept to determine the nearest neighbors and the theory of fuzzy sets to assign membership degrees for each class.
The third matter is that the classification performance of the KNN algorithm is adversely affected by the presence of outliers, especially in datasets with a small training sample size [25]. To remedy this, Mitani and Hamamoto [26] explored classifying samples based on local mean vectors and proposed the local mean based KNN algorithm (LMKNN) [27]. Local mean vectors for each class are calculated from the k-nearest neighbors, and an unclassified sample is grouped into the class with the closest mean vector. By using local mean vectors in the assignment process, KNN becomes more robust to outliers during classification [27]. Zeng et al. [28] proposed a Pseudo Nearest Neighbor (PNN) classifier, in which the distances from the k-nearest neighbors to the unclassified sample are weighted and the weighted distances from each class are summed to obtain the pseudo nearest neighbors; the class label of the pseudo nearest neighbor closest to the unclassified sample is assigned to it. Kumbure et al. [17] developed the BM-FKNN by using the Bonferroni mean aggregator to calculate the local mean vectors, since its parameters can be adjusted depending on the problem at hand.
The fourth matter is the majority vote principle when applied to datasets with class imbalance. When classifying an unclassified sample, the class with a larger number of training samples has a greater chance of being assigned to the unclassified sample because its samples are abundant among the k-nearest neighbors [29]. The use of local mean vectors can help to overcome this problem: assigning the class of an unclassified sample based on the closest local mean vector, instead of the original majority vote principle, lessens the domination of the majority class over the minority class [17].
The Bonferroni mean is a multi-criteria aggregation operator proposed by the Italian mathematician Carlo Bonferroni in 1950. The arithmetic mean can be nonoptimal in classifiers; the Bonferroni mean offers a solution because it has parameters that can be tuned to the problem at hand, and this flexibility in calculating the local mean vectors can enhance the classification performance. The Bonferroni mean also reduces to other well-known means, such as the arithmetic mean, the geometric mean and the power mean, for appropriate parameter values. Although there exist studies that have used either the NCN concept or a local Bonferroni mean vector separately to find k-nearest neighbors that better represent the underlying statistics of the dataset, none have used both concepts together to increase the classification accuracy by obtaining an even more representative set of k-nearest neighbors of the unclassified sample.
In this paper, we propose a local Bonferroni Mean based Fuzzy K-Nearest Centroid Neighbor (BM-FKNCN) classifier that assigns the class label of an unclassified sample based on the nearest local mean vector, obtained using the NCN concept, to better represent the underlying statistics of the dataset. The proposed classifier is robust towards outliers because the NCN concept also considers the spatial distribution and geometrical placement of the neighbors. It can also overcome class domination of its neighbors in datasets with class imbalance because it averages the centroid vectors of each class to adequately capture the distribution of the classes. The proposed method is effective and significantly improves the performance of KNN in the presence of outliers, especially in small sample size datasets, and also for imbalanced learning. This paper provides a comparison of classification accuracy against the classic KNN algorithm and other KNN-based algorithms, namely the FKNN, BM-FKNN and FKNCN algorithms.

Methods
In this study, the classification process of the BM-FKNCN is based on four main stages, namely 1) preprocessing, 2) finding the nearest neighbors using the NCN concept, 3) calculating the local mean vectors from the set of k-nearest centroid neighbors for each class using the Bonferroni mean and 4) assigning fuzzy membership based on the local Bonferroni mean vectors. The unclassified sample is grouped into the class that has the highest degree of membership. The flowchart for the BM-FKNCN algorithm is shown in figure 1, and an end-to-end sketch of the prediction step is given below.
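To make the four stages concrete, the following is a minimal sketch of the prediction step in Python, assuming the helper functions `nearest_centroid_neighbors`, `local_bonferroni_mean_vector` and `fuzzy_memberships` sketched in the subsections below; all function names are ours, not from the paper, and stage 1 (preprocessing) is assumed to have already been applied to the data.

```python
import numpy as np

def bm_fkncn_predict(X_train, y_train, query, k, p=1.0, q=1.0, m=2.0):
    """Sketch of stages 2-4 of BM-FKNCN for a single (preprocessed) query."""
    # Stage 2: find the k nearest centroid neighbors of the query.
    idx = np.array(nearest_centroid_neighbors(X_train, query, k))
    neigh_X, neigh_y = X_train[idx], y_train[idx]
    # Stage 3: one local Bonferroni mean vector per class in the neighborhood.
    classes = np.unique(neigh_y)
    V = np.array([local_bonferroni_mean_vector(neigh_X[neigh_y == c], p, q)
                  for c in classes])
    # Stage 4: fuzzy membership degrees; predict the class with the highest one.
    u = fuzzy_memberships(query, V, m)
    return classes[int(np.argmax(u))]
```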

Datasets
We selected datasets that can assist the performance analysis of our proposed BM-FKNCN classifier in the presence of outliers in small sample size datasets and on datasets with class imbalance. In this study, we used eight datasets taken from the KEEL repository [30]. To evaluate the performance on imbalanced datasets, we selected three datasets whose training instances exhibit class imbalance: the E.coli, Glass and Appendicitis datasets. Also, to validate that the proposed method overcomes the presence of outliers in small sample size datasets, we selected datasets with a low number of training instances (100-1000 instances). The Glass and Appendicitis datasets represent both imbalanced and small sample size datasets. We also selected datasets with various numbers of classes: among the eight datasets, five have two classes and the others have more than two classes. The characteristics of the selected datasets are shown in table 1.

Parameters
There are four input parameters for the BM-FKNCN: the number of nearest neighbors $k$, the fuzzy strength parameter $m$ used in the calculation of the fuzzy membership degree, and $p$ and $q$, which are the parameters of the Bonferroni mean. For the value of $m$ in assigning the fuzzy membership degree, we used $m = 2$ as recommended in [17]. We searched for the optimal value of $k$ for each dataset by ranging $k$ over $[2, 10]$. We restricted $k$ to $[2, 10]$ because when $k > 10$ the computational complexity is high but makes no significant difference in investigating the performance of the classifiers compared to $1 < k \le 10$. In the process of finding the optimal $k$, we used $p = q = 1$ for the Bonferroni mean parameters. This means that for each dataset, we carried out 10 simulations to find the optimal $k$. After the optimal $k$ was obtained, we used this value of $k$ to find the optimal combination of $p$ and $q$. The optimal parameters obtained are shown in table 2.
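The two-stage search can be sketched as follows; `cv_accuracy` is a hypothetical helper returning the 10-fold cross-validation accuracy of BM-FKNCN for given parameters, and the grid of candidate $(p, q)$ values is our assumption, since the text does not state which combinations were tried.

```python
from itertools import product

def tune_parameters(X, y, cv_accuracy, k_range=range(2, 11), pq_grid=(1, 2, 3)):
    """Stage 1: pick k with p = q = 1; stage 2: pick (p, q) with that k."""
    best_k = max(k_range, key=lambda k: cv_accuracy(X, y, k=k, p=1, q=1))
    best_p, best_q = max(product(pq_grid, repeat=2),  # candidate grid is assumed
                         key=lambda pq: cv_accuracy(X, y, k=best_k,
                                                    p=pq[0], q=pq[1]))
    return best_k, best_p, best_q
```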

K-Nearest Centroid Neighbor
The difference in concept between KNN and KNCN is that the KNCN method considers both the closeness in distance and the geometrical placement of the nearest neighbors to the unclassified sample [31]. Let $T = \{x_1, \dots, x_N\}$ be the set of training instances and $y$ be an unclassified sample. The Euclidean distance metric, where $d$ is the dimension of the dataset, is used to calculate the distance from a point $x$ in the training set to $y$ using equation (1):

$$D(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2} \quad (1)$$

The first step is to find the first neighbor $x_1^{NCN}$ of $y$, which is the closest point in the training set to $y$; in the case of a tie between more than one point, one is chosen arbitrarily. The next step is to find the $i$-th neighbor $x_i^{NCN}$, where $2 \le i \le k$. The selected $x_i^{NCN}$ is the point for which the centroid of itself and the previously selected neighbors $x_1^{NCN}, \dots, x_{i-1}^{NCN}$ is closest to $y$. The centroid $x^c$ of a given set of points $X = (x_1, \dots, x_n)$ can be calculated using equation (2):

$$x^c = \frac{1}{n} \sum_{i=1}^{n} x_i \quad (2)$$
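A minimal sketch of this selection procedure, assuming NumPy arrays, $k \le N$, and function and variable names of our own choosing:

```python
import numpy as np

def nearest_centroid_neighbors(X_train, query, k):
    """Select the k nearest centroid neighbors of query from X_train."""
    # First neighbor: the closest training point by Euclidean distance.
    dists = np.linalg.norm(X_train - query, axis=1)
    first = int(np.argmin(dists))
    neighbors = [first]
    candidates = [i for i in range(len(X_train)) if i != first]
    # Each subsequent neighbor minimizes the distance between the query and
    # the centroid of the already-selected neighbors plus the candidate.
    for _ in range(1, k):
        best, best_dist = None, np.inf
        for c in candidates:
            centroid = X_train[neighbors + [c]].mean(axis=0)
            d = np.linalg.norm(centroid - query)
            if d < best_dist:
                best, best_dist = c, d
        neighbors.append(best)
        candidates.remove(best)
    return neighbors  # indices into X_train
```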
The difference in concept between KNN and KNCN is shown in figure 2 and the process of finding the $i$-th nearest neighbor is shown in figure 3.

Local Bonferroni mean vector
The Bonferroni mean is a multi-criteria aggregation function used to aggregate information in multi-criteria problems [32], with parameters that can be set depending on the problem. Let $X = \{x_1, \dots, x_n\}$, $x_i \in [0, 1]$, be a vector with at least one $x_i \ne 0$, and let $p, q \ge 0$ be parameters; then the Bonferroni mean of $X$ can be calculated using equation (3):

$$B^{p,q}(X) = \left( \frac{1}{n(n-1)} \sum_{\substack{i, j = 1 \\ i \ne j}}^{n} x_i^p x_j^q \right)^{\frac{1}{p+q}} \quad (3)$$

After obtaining the k-nearest centroid neighbors, the set of k-nearest centroid neighbors is split and grouped by class. The local mean vector of each class is then calculated from the grouped k-nearest centroid neighbors, feature by feature, using the Bonferroni mean in equation (3).
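A sketch of equation (3) and its feature-wise application to one class's neighbors follows; the single-neighbor fallback is our assumption (equation (3) is undefined for $n = 1$), and the definition presumes features scaled to $[0, 1]$ during preprocessing.

```python
import numpy as np

def bonferroni_mean(x, p=1.0, q=1.0):
    """Bonferroni mean of a 1-D array x, per equation (3)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n < 2:
        return float(x[0])  # fallback for a single value (our assumption)
    # Sum x_i^p * x_j^q over all pairs, then remove the i == j terms.
    total = np.sum(np.outer(x ** p, x ** q)) - np.sum(x ** (p + q))
    return (total / (n * (n - 1))) ** (1.0 / (p + q))

def local_bonferroni_mean_vector(class_neighbors, p=1.0, q=1.0):
    """Feature-wise Bonferroni mean over one class's centroid neighbors
    (rows are neighbors, columns are features)."""
    return np.array([bonferroni_mean(class_neighbors[:, j], p, q)
                     for j in range(class_neighbors.shape[1])])
```

With $p = 1$ and $q = 0$, this reduces to the arithmetic mean, illustrating the flexibility noted in the introduction.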

Fuzzy membership
The purpose of this stage is to calculate a membership value of the unclassified sample in each of the available classes. The unclassified sample has a degree of association with each class, and the membership degree in each class provides a level of confidence to accompany the resulting classification. By using membership degrees for each class, the assignment is never arbitrary [14]. The membership degree $u_i(y)$ of $y$ in class $i$ is assigned using equation (4):

$$u_i(y) = \frac{\sum_{j=1}^{c} u_{ij} \, \lVert y - v_j \rVert^{-2/(m-1)}}{\sum_{j=1}^{c} \lVert y - v_j \rVert^{-2/(m-1)}} \quad (4)$$

where $c$ is the number of classes, $v_j$ is the local Bonferroni mean vector for class $j$, and $u_{ij}$ is 1 for the known class ($i = j$) and 0 for the other classes. The class with the highest membership degree is the predicted class. In the rare case of a tie in the membership degrees of classes, in this paper we select the first class.
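Because $u_{ij}$ is crisp, equation (4) reduces to normalized inverse-distance weights on the class mean vectors, as in this sketch (names are ours; the zero-distance guard is our assumption):

```python
import numpy as np

def fuzzy_memberships(query, mean_vectors, m=2.0):
    """Membership of the query in each class, from the local Bonferroni
    mean vectors (one row per class), following equation (4)."""
    d = np.linalg.norm(mean_vectors - query, axis=1)
    d = np.maximum(d, 1e-12)         # guard against a zero distance
    w = d ** (-2.0 / (m - 1.0))      # inverse-distance weights
    return w / w.sum()               # memberships sum to 1
```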

Results and Analysis
In this section, we discuss the experimental results and analysis of the proposed method, using the datasets and parameters described above.
We compare the performance of BM-FKNCN with KNN and other state-of-the-art KNN algorithms, namely FKNN, FKNCN and BM-FKNN. We used the same training and testing data for all classifiers and the same optimal input parameters obtained from the process described above. We compared the average classification accuracy from 10-fold cross-validation for each of the datasets used, and then averaged the classification accuracy over all datasets to obtain the overall performance of each classifier. A classifier that achieves a higher classification accuracy is considered to have a better performance. The comparison of classification accuracy between the classifiers is shown in table 3.
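The evaluation protocol can be sketched as follows; `classify` stands for any of the compared classifiers wrapped behind a common hypothetical interface, and the shuffling seed is our assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_accuracy(X, y, classify, n_splits=10, seed=0):
    """Average accuracy over n_splits folds; classify(X_tr, y_tr, X_te)
    returns predicted labels for X_te."""
    accs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True,
                        random_state=seed).split(X):
        preds = classify(X[tr], y[tr], X[te])
        accs.append(np.mean(preds == y[te]))
    return float(np.mean(accs))
```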
The experiment results show that the proposed method is effective in assigning the class label of unclassified samples in the presence of outliers in the Vehicle, Ionosphere, Glass, Wine, Iris and Appendicitis datasets. On the eight small sample size datasets used in the experiment, the BM-FKNCN achieved the highest classification accuracy on six. On the other two datasets, it was exceeded only by insignificant margins: on the E.coli dataset by the FKNCN and FKNN methods, by a margin of 0.2941%, and on the Mammogram dataset by the BM-FKNN method, by a margin of 0.209407%. The BM-FKNCN also obtained the highest overall average classification accuracy, by a margin of 3.2%, which is quite significant. This shows that the BM-FKNCN is more robust towards outliers in most datasets, because both the distance proximity to the unclassified sample and the geometrical placement of the nearest neighbors are taken into consideration in the NCN concept, and the class labeling process is based on the nearest local mean vectors obtained with the centroid criterion.
The experiment results also show that the proposed method can overcome class domination of the nearest neighbors in datasets with class imbalance, namely the Glass and Appendicitis datasets. On the three datasets with class imbalance, the BM-FKNCN achieved the highest classification accuracy on two. On the E.coli dataset, the BM-FKNCN was exceeded by the FKNCN and FKNN methods by a margin of 0.2941%, which is insignificant. The proposed method is better able to relieve class domination in a dataset because it averages the centroid vectors of each class to adequately capture the distribution of the classes.

Conclusion
This research proposes a classifier based on the KNN algorithm, named the BM-FKNCN classifier, which assigns the class label of an unclassified sample based on the nearest local mean vector obtained using the NCN concept to better represent the underlying statistics of the dataset. The experimental results show that the proposed BM-FKNCN method achieved the highest overall average classification accuracy of 89.8653% on the eight datasets used, surpassing the previous methods by a margin of 3.2%. This shows that the proposed method can better overcome the problems of outliers and class imbalance in most datasets and achieve a better overall accuracy. In the cases where the accuracy of the proposed method is surpassed by other state-of-the-art KNN algorithms, the margin is insignificant. The BM-FKNCN can be further developed by finding a different method of calculating the mean vector for each class, other than the Bonferroni aggregation operator, in an attempt to achieve better classification results. In future work, the computational complexity of the proposed method can be investigated and analyzed.