Facial Expression Recognition using Residual Convnet with Image Augmentations

During the COVID-19 pandemic, many offline activities were turned into online activities via video meetings to prevent the spread of the COVID-19 virus. In online video meetings, some micro-interactions are missing compared to direct social interactions. The use of machines to assist facial expression recognition in online video meetings is expected to increase understanding of the interactions among users. Many studies have shown that CNN-based neural networks are quite effective and accurate at image classification. In this study, several open facial expression datasets were used to train CNN-based neural networks, with a total of 342,497 training images. This study obtains its best results using the ResNet-50 architecture with the Mish activation function and the Accuracy Booster Plus block. This architecture is trained using the Ranger and Gradient Centralization optimization methods for 60,000 steps with a batch size of 256. The best training runs achieve an accuracy of 0.5972 on the AffectNet validation data, 0.8636 on the FERPlus validation data, 0.8488 on the FERPlus test data, and 0.8879 on the RAF-DB test data. In this study, the proposed method outperformed plain ResNet in all test scenarios without transfer learning, and there is potential for better performance with a pre-trained model. The code is available at https://github.com/yusufrahadika/facial-expressions-essay.


Introduction
Human behavior recognition is one of the growing research topics in computer vision and pattern recognition. It is usually applied in machine learning to monitor human activities and gain insight from them [1]. Behavioral examination can help solve many problems in indoor as well as outdoor surveillance systems. The number of video surveillance systems deployed to monitor, track, and analyze behavior in different areas has been increasing every day [2]. There are several applications of human behavior detection, among them motion detection and facial expression recognition. Facial expression analysis is one of the most prominent clues for determining the behavior of an individual. However, it is very challenging due to the many variations in face poses, illumination, and facial tones [3]. Facial expression recognition itself can be applied to images or to videos that are extracted into images. Emotional facial images are directly or indirectly associated with other human behaviors such as kindness, decision-making, awareness, memory, and learning. These emotions can be read efficiently, mainly through facial expressions [4].
Facial expression recognition is a technique to understand human emotions from expressions shown as a reaction to something that occurs in the environment. In this digital world, facial expression recognition can be widely applied. For example, it can be used to understand human expressions in online video meetings, where some micro-interaction aspects are missing compared to direct social interactions. Facial expression recognition in online video meetings is expected to increase understanding of users' interactions. The use of online video meetings currently reaches 300 million meetings per day. This indicates that the video meeting has become commonplace in today's digital world, especially during the COVID-19 pandemic [5].
Video meetings are generally preferred to audio-only meetings because they provide several benefits that audio alone cannot. Users can understand a speaker better by seeing the speaker's lip movements, tongue, jaw, and facial expressions, which help convey the speaker's intention [6] [7]. The problem faced in video meetings is that humans cannot focus on many things at once. For example, when a teacher is delivering learning material in a video meeting, the teacher cannot simultaneously observe the reactions of all of their students, even though those reactions need to be understood to gain insight into the learning methods used. Facial expressions are one of the most important non-verbal means of understanding human emotions and can be analyzed into meaningful insight [8]. In the example above, by gathering insights from student reactions, teachers can immediately look for the learning method best suited to their students, making online learning more effective.
A facial expression recognition competition was once held using the FER2013 dataset, and its results were released as a scientific paper. The best result of that competition combined three methods: sparse filtering for feature learning, random forests for feature selection, and a support vector machine for classification, with 70.22% accuracy [9]. Other research using Convolutional Neural Network architectures for facial expression recognition on the same dataset achieved a best accuracy of 72.7% with the VGG architecture [10]. Meanwhile, research using other deep neural networks, VGG and ResNet with various training methods, resulted in a best accuracy of 84.986% [11]. Recent research also shows that image augmentation has become an effective training method for improving model accuracy: a novel augmentation method called FMix can increase model accuracy by up to 2% on the ImageNet dataset [12].
An accurate model built with deep learning, i.e., a deep artificial neural network, can be applied to solve the problems stated above by recognizing and classifying human facial expressions. For example, it can be used to detect student facial expressions in online learning via video meetings. A good classification model can help teachers observe their students and obtain feedback or insights about the learning methods. It is also possible to apply a facial expression classification model in other fields and cases, such as the online hiring process.
The model must be light enough yet accurate, because many facial expressions need to be classified simultaneously. Furthermore, the model must be easy to apply in various environmental conditions. In this study, we used a deep learning model; deep learning is a machine learning field inspired by the human neural network, in which units arranged in a chain each perform a specific function [13]. In addition, image augmentation can be used in the training phase to improve model accuracy and to train models to adapt and generalize better to new data. Moreover, open facial datasets on the internet are generally imbalanced across classes, so class weighting is required in the loss function or in the sampling process during training. Thus, our contributions in this research are threefold:

Datasets
The datasets used in this paper are collected from many popular facial expression datasets such as AffectNet [8], FERPlus [11], facial expressions [14], and RAF-DB [15] [16]. This merged dataset is divided into eight classes: neutral, happy, surprise, sad, anger, disgust, fear, and contempt. Image samples from each class are shown consecutively in Figure 1.
AffectNet.
AffectNet is the largest facial expression image dataset to date. It consists of 1 million face images: approximately 420 thousand images manually labeled by humans and 580 thousand images labeled automatically by models trained on the human-labeled images. The annotations come in two types: categorical expression classes and dimensional (numeric) values representing each facial expression [8].

FERPlus.
The FERPlus dataset is an improved version of FER2013 in which the data is re-labeled by ten annotators, achieving a higher agreement percentage of up to 90% [11]. The dataset is provided as vote counts from the annotators. In this study, a majority voting scheme is used to decide the final label.
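As a sketch of the majority voting scheme, the snippet below picks the final label from a hypothetical vote tally. The vote counts here are illustrative, not taken from the actual FERPlus files; only the class names follow the eight classes used in this study.

```python
# Hypothetical FERPlus-style vote counts from ten annotators for one image.
votes = {
    "neutral": 1, "happy": 7, "surprise": 1, "sad": 0,
    "anger": 0, "disgust": 0, "fear": 1, "contempt": 0,
}

def majority_vote(votes):
    """Return the class with the most annotator votes."""
    return max(votes, key=votes.get)

print(majority_vote(votes))  # -> happy
```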

facial expressions.
The facial expressions dataset is an open dataset in a public GitHub repository. It is not explicitly partitioned; in this study, all of its data is used as training data [14].

RAF-DB.
The Real-world Affective Faces Database (RAF-DB) is a facial expression dataset with around 30 thousand images retrieved from the internet. The data was independently labeled by 40 annotators [16].

Proposed Method
In this study, we used the residual network because its shortcut connections are intended to solve the vanishing gradient problem [17]. Using the residual network as the base network, we extend it with the Accuracy Booster Plus block and replace the original activation with the Mish function.

Accuracy Booster.
Accuracy Booster is an additional block appended to the residual block in the ResNet architecture. It is a development of SENet in which the fully connected layers are replaced with CNN and batch normalization layers. We used this block to recalibrate the features extracted from each residual block, as described by the original authors. In experiments on the ImageNet dataset, Accuracy Booster showed a performance increase over SENet while keeping computation costs almost the same, outperforming SENet by about 0.3% on ImageNet classification with 1000 classes [18]. There are two variants of this block: Accuracy Booster (using a depth-wise CNN, Figure 2a) and Accuracy Booster Plus (using a standard CNN, Figure 2b).
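The recalibration idea can be sketched as a squeeze-and-excite step. The NumPy snippet below is a simplified stand-in: a plain matrix `w` replaces the CNN and batch normalization layers that Accuracy Booster actually uses, so this only illustrates the channel-gating mechanism, not the exact block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_recalibration(feature_map, w):
    """SE-style recalibration sketch: squeeze by global average pooling,
    excite with a learned transform (here a plain matrix `w` standing in
    for conv + batch norm), then rescale each channel by its gate."""
    # feature_map: (C, H, W)
    squeezed = feature_map.mean(axis=(1, 2))      # (C,) channel descriptors
    scale = sigmoid(w @ squeezed)                 # (C,) gates in (0, 1)
    return feature_map * scale[:, None, None]     # broadcast gates over H, W

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))   # toy feature map: 4 channels, 8x8
w = np.eye(4)                        # identity stand-in for learned weights
y = channel_recalibration(x, w)
print(y.shape)  # -> (4, 8, 8)
```

Because the gates lie in (0, 1), recalibration can only attenuate channels, never amplify them; the learned weights decide which channels are suppressed.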

Fig. 2. Illustration of Accuracy Booster block [18]
Mish.
Mish is a novel self-regularized non-monotonic activation function that can replace the ReLU activation function commonly found in many neural network architectures. Mish is related to the Swish function, and the two have similar formulas. We chose this activation because, in several experiments, especially on the ImageNet dataset, Mish outperformed Swish and generalized better [19]. The Mish function is written in Equation 1:

f(x) = x · tanh(ln(1 + e^x)),

where x is an input value.
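A minimal NumPy implementation of the Mish function, f(x) = x · tanh(softplus(x)), using a numerically stable softplus:

```python
import numpy as np

def softplus(x):
    # Numerically stable ln(1 + e^x).
    return np.logaddexp(0.0, x)

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * np.tanh(softplus(x))

print(float(mish(0.0)))              # -> 0.0
print(round(float(mish(1.0)), 4))    # -> 0.8651
```

Unlike ReLU, Mish is smooth everywhere and allows small negative outputs, which is part of why it tends to generalize better.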

Image Augmentation.
In addition to the proposed method, we used image augmentation to prevent the neural network from fitting the training data too quickly and to generate more varied training data. Image augmentation can be done by directly manipulating the pixels of an image, for example by flipping, rotating, cropping, manipulating color, or modifying the image perspective. We also used an advanced image augmentation method called FMix.
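A minimal sketch of such pixel-level augmentation, covering only a horizontal flip and a brightness shift; the parameter ranges here are illustrative, not the values listed in Table 2.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Randomly apply a horizontal flip and a brightness shift.
    A minimal pixel-level sketch; the study also used rotation,
    shear, contrast, hue, and saturation changes."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                     # horizontal flip
    shift = rng.uniform(-0.2, 0.2)             # brightness offset
    return np.clip(img + shift, 0.0, 1.0)      # keep pixels in [0, 1]

img = rng.random((48, 48))    # toy grayscale face crop
out = augment(img)
print(out.shape)  # -> (48, 48)
```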

FMix.
FMix is a form of image augmentation that combines two images into one, also known as Mixed Sample Data Augmentation (MSDA). The merge is based on a mask of binary random numbers that forms a single continuous region. Pixels where the mask is 0 are filled with values from image 1, and pixels where the mask is 1 are filled with values from image 2, or vice versa [12]. We chose this method because it produces asymmetric merging patterns, which can help the artificial neural network learn important features and details better. Illustrations of FMix augmentation on the face image dataset are shown in Figure 3. To evaluate how the proposed model performs during the training process, we used the most common loss function for classification problems, namely log (cross-entropy) loss [20]. The cross-entropy loss is written in Equation 2:

L(y, o) = -Σ_i y_i log(o_i),

where y is the classification target and o is the model output.
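The cross-entropy loss can be sketched as follows, with a one-hot target and a small epsilon for numerical stability (the epsilon is an implementation detail, not part of the formulation above):

```python
import numpy as np

def cross_entropy(y, o, eps=1e-12):
    """Cross-entropy loss: -sum_i y_i * log(o_i),
    where y is a one-hot target and o is the softmax output."""
    return -np.sum(y * np.log(o + eps))

y = np.array([0.0, 1.0, 0.0])   # target: class 1
o = np.array([0.1, 0.8, 0.1])   # model output probabilities
print(round(float(cross_entropy(y, o)), 4))  # -> 0.2231
```

The loss only depends on the probability assigned to the true class, so it falls toward zero as that probability approaches 1.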
Class Weighting.
The datasets we used have imbalanced data in each class. To overcome this problem, we apply class weighting to the loss function. The weighting formula we used comes in two forms: class weighting with normalization, written in Equation 3, and class weighting without normalization, written in Equation 4, where W_i is the weight of class i.
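Since Equations 3 and 4 are not reproduced here, the snippet below uses a common inverse-frequency weighting as an assumed stand-in, applied to hypothetical class counts; the paper's exact formulas may differ.

```python
import numpy as np

# Hypothetical per-class image counts for the eight expression classes
# (illustrative only, not the real dataset statistics).
counts = np.array([4000, 9000, 1500, 1200, 1600, 300, 500, 250], dtype=float)

# Inverse-frequency weighting (an assumption, standing in for Equation 4):
w_raw = counts.sum() / counts      # without normalization
# Normalized variant (standing in for Equation 3): weights sum to 1.
w_norm = w_raw / w_raw.sum()

print(np.round(w_norm, 4))
```

Either way, the rarest class receives the largest weight, so its few samples contribute more to the loss.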

Validation and Evaluation
Validation and evaluation are the final stage of this study. This stage determines whether the proposed method performs better than the baselines.
We will evaluate our proposed method result on validation and test set using accuracy metrics and confusion matrix.
During the experiments, we calculated accuracy in two ways. First, in the mixed setting, we combined all validation and test sets from every dataset into one and passed it through the network in small batches. Second, for the AffectNet, FERPlus, and RAF-DB datasets, we calculated accuracy on the original partitions of each dataset. While developing the network, we used the AffectNet and FERPlus validation sets separately and calculated accuracy for each set. Then, in the final test, we used the AffectNet validation set, the FERPlus test set, and the RAF-DB test set separately and calculated accuracy for each.
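The per-partition accuracy computation can be sketched as follows; the labels are hypothetical and stand in for the real partitions.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for two separate partitions, evaluated independently
# as in the final test (e.g. AffectNet validation vs. FERPlus test).
affectnet_val = ([0, 1, 2, 1], [0, 1, 1, 1])
ferplus_test = ([3, 3, 0, 2, 2], [3, 3, 0, 2, 1])

print(accuracy(*affectnet_val))  # -> 0.75
print(accuracy(*ferplus_test))   # -> 0.8
```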

Research Flow
This research conducts experiments on the implementation of a residual ConvNet and image augmentation for facial expression classification. The research flow is shown in Figure 5. The hyperparameters we tune in this research are the learning rate (the step size, or how fast the network weights are updated), beta1 (the exponential decay rate for the first-moment estimates), and beta2 (the exponential decay rate for the second-moment estimates) [21]. Various hyperparameters and optimization methods were used during training and testing to find the best combination. The optimization methods used in this study are Stochastic Gradient Descent, Lookahead [22] + Rectified Adam [23] (a combination also known as Ranger), and Lookahead [22] + Rectified Adam [23] + Gradient Centralization [24]. After training with a given combination, the model is evaluated using accuracy metrics on the validation and test data.
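The Lookahead mechanism can be illustrated on a toy problem. The sketch below uses plain gradient descent as the inner optimizer in place of RAdam, with k = 5 fast steps and alpha = 0.5 interpolation; these defaults come from the Lookahead paper and are not necessarily the values used in this study.

```python
def lookahead_gd(grad, w0, lr=0.1, k=5, alpha=0.5, outer_steps=40):
    """Toy Lookahead sketch: run k inner gradient steps ("fast weights"),
    then move the slow weights a fraction alpha toward the result.
    The study pairs Lookahead with RAdam (Ranger); plain gradient
    descent stands in for the inner optimizer here."""
    slow = w0
    for _ in range(outer_steps):
        fast = slow
        for _ in range(k):                   # k fast updates
            fast = fast - lr * grad(fast)
        slow = slow + alpha * (fast - slow)  # slow-weight interpolation
    return slow

# Minimize f(w) = (w - 3)^2, so grad(w) = 2 * (w - 3); optimum at w = 3.
w_star = lookahead_gd(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_star, 4))  # -> 3.0
```

The slow-weight interpolation damps the oscillations of the fast optimizer, which is the stabilizing effect Lookahead contributes to Ranger.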
As stated before, the proposed architecture block is based on the residual network. We changed the default activation of the residual network from ReLU to Mish, and the Accuracy Booster Plus block is appended after the residual branch. We preserved the two original forms of the residual block, the basic block and the bottleneck, shown in Figures 4a and 4b. In this research, we used only two residual network variants: ResNet-18 and ResNet-50. We also preserved the original architecture of the residual network, as shown in Table 1.

Result and Discussion
During the training and testing of hyperparameters and architectures, weighted loss with normalization is used. All training and testing runs use 60,000 steps, so every model and configuration receives the same treatment. Several image augmentations are also applied in this study, such as brightness, contrast, hue, saturation, rotation, shear, and FMix. FMix is applied in all training processes, as we found that it helps reduce overfitting in the loss value and increases accuracy on the mixed validation set when the number of training steps is large, as shown in Figure 6. The FMix parameters used in all experiments are the default parameters of the official implementation: decay power (decay power for frequency decay prop 1/f^d) = 3 and alpha (alpha value for the beta distribution from which the mask mean is sampled) = 1 [12]. The augmentation methods used during training and the value of each image augmentation are shown in Table 2. The evaluation results of these configurations are shown in Tables 3, 4, and 5.

Fig. 5. Research flow
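A very rough sketch of mask-based mixing in the spirit of FMix follows. Real FMix samples its mask from low-frequency Fourier noise governed by the decay power and alpha parameters above; this toy version merely thresholds a block-smoothed random field, so it only illustrates the contiguous-region merge, not the actual mask distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_binary_mask(shape, coarse=4):
    """Rough stand-in for an FMix mask: threshold a block-smoothed random
    field so the mask forms large contiguous regions."""
    field = rng.standard_normal((coarse, coarse))
    # Upscale the coarse field by block replication to get smooth regions.
    field = np.kron(field, np.ones((shape[0] // coarse, shape[1] // coarse)))
    return (field > np.median(field)).astype(float)

def mix(img1, img2, mask):
    """Fill mask==1 pixels from img1 and mask==0 pixels from img2 (MSDA merge)."""
    return mask * img1 + (1.0 - mask) * img2

a = np.zeros((32, 32))   # toy "image 1"
b = np.ones((32, 32))    # toy "image 2"
m = smooth_binary_mask(a.shape)
out = mix(a, b, m)
print(sorted({float(v) for v in out.ravel()}))  # -> [0.0, 1.0]
```

With constant test images, every output pixel comes verbatim from one of the two sources, which is exactly the hard (non-blended) merge that distinguishes FMix-style masks from interpolation methods like MixUp.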

Model Architecture Testing and Evaluation
The model architecture determines the capacity of the neural network for learning. Better architecture can learn patterns better from the data. Moreover, a model that has been previously trained using more extensive data (pre-trained) can help the model get better accuracy. The evaluation results of changing model architecture are shown in Table 6.
From the evaluation results in Table 6, pre-trained ResNet-18 produces better results on the AffectNet and FERPlus validation sets than the model initialized with random values, although on the mixed dataset it still has lower accuracy than the other models. Note that pre-trained ResNet-18 was trained on the ImageNet dataset with 1000 classes and about 14 million images in total. This result shows that fine-tuning from a larger dataset to a smaller one helps produce a better-performing model, since the feature extraction layers in the pre-trained model have better capabilities than layers initialized with random weights. However, although the pre-trained model shows better results, it still seems to overfit. On the other hand, the proposed model showed positive results. Compared to standard ResNet-18, the proposed model outperforms it by a small margin when tested against training and test data, for example on the AffectNet and FERPlus datasets. However, we also found that the proposed methods sometimes show no improvement in accuracy metrics, and Figure 8 shows that the overfitting phenomenon also appears when compared to standard ResNet-18. Based on these results, we conclude that this model shows potential for further development toward more significant improvements. For example, regularization methods could be added to the proposed model to reduce overfitting. Another suggestion is transfer learning from larger datasets, as mentioned above. Thus, the models could reduce overfitting, find more optimal weights, and improve testing accuracy.

Specific Dataset Testing and Evaluation
Referring to the evaluation results in Table 6, the best training model is obtained using ResNet-18 + Mish + Accuracy Booster Plus. Furthermore, to increase model capacity, ResNet-50 + Mish + Accuracy Booster Plus was chosen for testing and evaluation on each specific dataset. When training the AffectNet evaluation model, weighted loss without normalization is used; when training the FERPlus and RAF-DB evaluation models, no weighted loss is used. Comparisons of our network to previous studies can be seen in Table 7. The evaluation results using the confusion matrix for each dataset, namely the AffectNet validation data, FERPlus test data, and RAF-DB test data, can be seen in Table 8, Table 9, and Table 10, respectively. From these tables, all testing scenarios still show the worst results in the classes with fewer data.

Discussion
Based on the experiments in this study, many aspects can be improved. First, adding a pre-training step before the modified model is trained to classify the original dataset, so that the model has better weights at the start of training. Second, adding an augmentation exploration scheme so that the best combination of augmentations can be found.

Conclusion
During the experiments in this study, the best result using normalized weighted loss, an accuracy of 0.7641, is obtained using Lookahead + RAdam + Gradient Centralization with a learning rate of 0.001, beta1 of 0.9, and beta2 of 0.999. We also observe that transfer learning from the ImageNet dataset brings an accuracy improvement over models initialized with random values. The evaluation results show that the model produces fairly good accuracy even though the data is imbalanced, where facial expressions that rarely appear in the dataset also rarely appear in the real world. Meanwhile, the addition of the Mish activation function and the Accuracy Booster Plus block shows an improvement over the original model on the ResNet-18 architecture on all validation and test data used in the study. The best evaluation results of the ResNet-50 model with the Mish activation function and the Accuracy Booster Plus block are an accuracy of 0.5972 on the AffectNet validation data, 0.8636 on the FERPlus validation data, 0.8488 on the FERPlus test data, and 0.8879 on the RAF-DB test data.