FULLY CONVOLUTIONAL VARIATIONAL AUTOENCODER FOR FEATURE EXTRACTION OF FIRE DETECTION SYSTEM

This paper proposes a fully convolutional variational autoencoder (VAE) for feature extraction from a large-scale dataset of fire images. The dataset will be used to train a deep learning algorithm to detect fire and smoke. Feature extraction is used to tackle the curse of dimensionality, which is a common issue when training deep learning models on huge datasets. Feature extraction aims to reduce the dimension of the dataset significantly without losing too much essential information. Variational autoencoders (VAEs) are powerful generative models that can be used for dimension reduction. VAEs work better than other available methods for this purpose because they can explore variations in the data in a specific direction.


Introduction
Fire detection is commonly performed visually using an ultraviolet (UV) camera [1], an infrared (IR) camera [2], or a visible light camera [3]. UV-based and IR-based fire detection has high sensitivity and fast response, yet it is prone to disturbance from other UV and IR light sources [4] [5]. Hence, this paper focuses on a fire detection system based on the visible light camera, which uses a charge-coupled device (CCD) sensor. The CCD sensor records the fire in the form of video or static images as the data. Computer vision techniques then preprocess the data prior to training. The training exploits a deep learning algorithm to detect whether or not a flame exists in the captured video or images. Deep learning algorithms have been implemented to solve many complex problems [6] [7]. Such an algorithm can increase the accuracy of detecting any kind of fire and smoke in the captured videos or images [8]. However, the algorithm needs a huge amount of data in order to obtain high detection accuracy and is hence computationally expensive. Therefore, it is desirable to extract only the important features of the captured videos or images, such that the dimension of the dataset can be reduced while most of the information in the data is still preserved.
Volume 13, Issue 1, February 2020
One of the methods for fire detection using a CCD sensor is a rule-based generic color model, which uses the YCbCr color space to separate the luminance from the chrominance [9]. Using YCbCr is indeed more effective than using the RGB color space for separating the luminance. This method produces high fire detection accuracy and a reasonable false alarm rate. However, this method relies only on the color of fire. Other important features of fire, for example smoke, and other colors of fire, e.g., blue fire, cannot be detected well. The proposed method in this paper aims to detect all important features of fire. Many techniques have been developed for feature extraction, such as the auto-encoder [10], Isomap [11], Nonlinear Dimensionality Reduction (NLDR) [11], Multifactor Dimensionality Reduction (MDR) [13], and Principal Component Analysis (PCA) [14].
Principal Component Analysis (PCA) has been widely used for feature extraction. However, PCA only attempts to discover a lower-dimensional hyperplane which describes the original dataset. In other words, PCA only tries to learn linear manifolds of datasets. As a result, the extracted features can lose much information. On the other hand, neural network-based feature extraction, for example the auto-encoder, is capable of learning nonlinear relationships or manifolds from datasets [15]. The auto-encoder has been used widely for feature extraction in image datasets because of its robustness to noise and disturbance in the images [14]. As one example, the stacked denoising auto-encoder has been implemented for feature extraction and classification of hyperspectral images [15]. Furthermore, the introduction of stacked convolutional autoencoders has been shown to significantly reduce the computation cost of the feature extraction process [16].
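The linear limitation of PCA can be illustrated with a small sketch. It is not part of the proposed method: the toy 2-D data, the function names pca_1d_reconstruct and mse, and the closed-form principal axis for a 2×2 covariance matrix are all assumptions chosen to keep the example self-contained. Points lying on a line are recovered almost exactly from one principal component, whereas points lying on a parabola (a nonlinear manifold) are not.

```python
import math

def pca_1d_reconstruct(points):
    """Project 2-D points onto their first principal component and map back."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form angle of the largest-variance axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    ux, uy = math.cos(theta), math.sin(theta)
    recon = []
    for x, y in points:
        t = (x - mx) * ux + (y - my) * uy  # the single extracted feature
        recon.append((mx + t * ux, my + t * uy))
    return recon

def mse(points, recon):
    """Mean squared distance between original and reconstructed points."""
    return sum((x - rx) ** 2 + (y - ry) ** 2
               for (x, y), (rx, ry) in zip(points, recon)) / len(points)

xs = [i / 10 for i in range(-10, 11)]
line = [(x, 2 * x + 1) for x in xs]   # toy linear manifold
parabola = [(x, x * x) for x in xs]   # toy nonlinear manifold

line_error = mse(line, pca_1d_reconstruct(line))
parabola_error = mse(parabola, pca_1d_reconstruct(parabola))
print(f"line: {line_error:.6f}, parabola: {parabola_error:.6f}")
```

In this toy setting the one-dimensional linear feature is enough for the line but leaves a clearly nonzero error for the parabola, which is the kind of information loss that motivates a nonlinear encoder.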
An auto-encoder consists of a pair of connected networks: an encoder and a decoder. The encoder takes an input and converts it into a hidden representation which has a significantly smaller dimension than the input vector. This hidden representation constitutes the features extracted from the given input. It is then mapped back by the decoder to obtain the output of the network, which reconstructs or generates the given input with high probability [17]. The autoencoder output will not exactly reconstruct the input; the difference is the reconstruction error. The reconstruction error function is usually either the mean-squared error [18] or the cross-entropy [19], which penalizes the network for creating outputs different from the input. The reconstruction error depends on the dimension of the hidden representation, i.e., the extracted features. The smaller the dimension of the hidden representation, the bigger the reconstruction error becomes. This creates a tradeoff between the dimension of the features and the information loss. It is desirable that the dimension of the features be minimized while most of the information in the data is still retained.
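The two reconstruction error functions mentioned above can be sketched as follows. This is a minimal illustration, assuming pixel values normalized to the range [0, 1] and the binary form of the cross-entropy; the function names and the sample values are hypothetical and not taken from the paper.

```python
import math

def mean_squared_error(x, x_hat):
    """Average squared difference between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Average binary cross-entropy for values scaled to [0, 1]."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total / len(x)

pixels = [0.0, 0.5, 1.0, 0.25]   # hypothetical normalized pixel values
good = [0.05, 0.5, 0.95, 0.25]   # reconstruction close to the input
poor = [0.5, 0.5, 0.5, 0.5]      # reconstruction that ignores the input

print(mean_squared_error(pixels, good), mean_squared_error(pixels, poor))
print(binary_cross_entropy(pixels, good), binary_cross_entropy(pixels, poor))
```

Both functions assign a lower value to the reconstruction that is closer to the input, which is the penalty behavior described in the text.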
The standard plain auto-encoder is indeed able to generate a dense representation and reconstruct the input well. However, its applicability is limited. The fundamental problem with the standard auto-encoder is that the latent space (the dense hidden representation in which the encoded vectors lie) may not be continuous, or, even if it is continuous, may be difficult to interpolate [20]. For example, the auto-encoder works well for replicating the MNIST [21] or Fashion-MNIST [22] datasets. This is because the images in MNIST and Fashion-MNIST are relatively simple, and the foreground is easy to distinguish from the background. However, when dealing with more complex image datasets and with generative modeling, i.e., generating variations of the input dataset from the latent space, the standard autoencoder does not work well because of the discontinuities in the latent space [23]. Fire has no standard distinguishable form, has many colors, and is sometimes covered by smoke, which makes it difficult to extract useful features using a plain auto-encoder. For this reason, this paper proposes a feature extraction method using the variational auto-encoder (VAE).
The organization of this paper is as follows. The fundamental concept of the Variational Auto-Encoder (VAE) is introduced in Section 2. Section 3 presents the proposed architecture of the VAE used for feature extraction from fire images, the Fully Convolutional Auto-Encoder. The implementation results of the proposed Fully Convolutional Auto-Encoder for fire feature extraction are shown in Section 4. In Section 5, the implications of the proposed method are presented. Finally, the conclusion is given in Section 6.

Variational Autoencoder (VAE)
A variational autoencoder network is a pair of connected networks: a network that takes in an input and produces a smaller representation (the encoder), and a network that converts the smaller representation back to the original input (the decoder). Unlike a plain auto-encoder, a VAE has a continuous latent space that allows easy random sampling and interpolation, because its encoder outputs two vectors: a vector of means (μ) and a vector of standard deviations (σ), as illustrated in Figure 2. As the encoding has far fewer units than the input, the challenge is getting the model to learn a meaningful and generalizable latent space. The VAE encoder describes a probability distribution for each latent attribute. The two vectors form the parameters of a vector of random variables of length n, with the i-th elements of μ and σ being the mean and standard deviation of the i-th random variable, X_i, from which we sample to obtain the sampled encoding. For the same input, although the mean and standard deviation remain the same, the actual encoding will vary somewhat on every pass simply due to sampling. The mean vector controls where the encoding of an input is centered, while the standard deviation controls the "area", i.e., how far from the mean the encoding can vary. The Kullback-Leibler divergence is used in the loss function as a regularizer, allowing smooth interpolation and enabling the construction of new samples.
The decoder network then takes a random sample from each latent state distribution, forming a vector that serves as its input, and attempts to recreate the original input. Backpropagation, which is used to calculate the gradient of the final output loss with respect to each parameter in the network, cannot pass through a random sampling step; thus, the reparameterization trick is used instead. Using reparameterization, the parameters of the distribution are optimized while the ability to randomly sample from that distribution is maintained.
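The sampling step and its reparameterization can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' implementation: it assumes a Gaussian latent distribution, and the function name reparameterize as well as the values of mu and sigma are hypothetical. The sample is written as z = μ + σ·ε with ε drawn from a standard normal, so the randomness is isolated in ε and z remains a deterministic function of μ and σ.

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    # The randomness lives entirely in eps, so z is a deterministic
    # (and differentiable) function of mu and sigma.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)   # fixed seed only to make the sketch repeatable
mu = [0.5, -1.0, 2.0]    # hypothetical encoder means
sigma = [0.1, 0.2, 0.05] # hypothetical encoder standard deviations

z_first = reparameterize(mu, sigma, rng)
z_second = reparameterize(mu, sigma, rng)              # same input, new encoding
z_no_noise = reparameterize(mu, [0.0, 0.0, 0.0], rng)  # collapses to the mean
print(z_first, z_second, z_no_noise)
```

Two passes with the same μ and σ give different encodings, while a zero standard deviation returns the mean itself, matching the roles of the two vectors described above.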

Proposed Fully Convolutional Autoencoder (FCAE) Architecture
The architecture of our network is summarized in Figure 3. It contains three main structures: the encoder, the bottleneck, and the decoder. Since the architecture proposed in this paper is a fully convolutional variational autoencoder, all layers in the network are convolutional layers. The input of the network is a fire image taken randomly from the fire image dataset, while the output of the network is the reconstruction of the given input image. Both the input and output images are RGB images.

The Encoder Structure
The encoder structure is a sequential network consisting of four convolutional layers, each followed by a ReLU non-linearity. The kernel size used in the convolutional layers is 4, with stride 1 and no padding. Figure 3 shows an illustration of the encoder structure.
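Consider an input image with a dimension of $n_h \times n_w$ pixels. The output dimensions of a convolutional layer are then

$$n_h^{\text{out}} = \left\lfloor \frac{n_h + 2p - f}{s} \right\rfloor + 1 \qquad (1)$$

$$n_w^{\text{out}} = \left\lfloor \frac{n_w + 2p - f}{s} \right\rfloor + 1 \qquad (2)$$

where $p$ defines the padding size, $f$ specifies the filter/kernel size, and $s$ specifies the stride size. From equations (1) and (2), because no padding is used ($p = 0$), the output of this sequential encoder network has a lower dimension than the input image.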
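The effect of the encoder on the spatial dimension can be checked with a short sketch. It assumes the standard convolution output-size relation, (n + 2p − f) / s + 1 rounded down, with n the input size along one axis, p the padding, f the kernel size, and s the stride; the function names are hypothetical, and the 480-pixel input size is taken from the dataset described in the results section.

```python
def conv_output_size(n, f, s=1, p=0):
    """Output length along one axis for input n, kernel f, stride s, padding p."""
    return (n + 2 * p - f) // s + 1

def encoder_sizes(n, num_layers=4, f=4, s=1, p=0):
    """Sizes along one axis after each stacked convolutional layer."""
    sizes = [n]
    for _ in range(num_layers):
        sizes.append(conv_output_size(sizes[-1], f, s, p))
    return sizes

sizes = encoder_sizes(480)
print(sizes)  # [480, 477, 474, 471, 468]
```

With kernel size 4, stride 1, and no padding, each layer removes three pixels per axis, so under these assumptions most of the dimension reduction has to come from the channel count of the bottleneck rather than from the spatial size.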

The Bottleneck Structure
The bottleneck structure is what distinguishes the variational auto-encoder (VAE) from the plain auto-encoder. While in a plain auto-encoder the encoder gives a single latent vector as output, the VAE gives two outputs, a mean vector and a variance vector, with the same dimension. In the fully convolutional architecture, both the mean and variance vectors are produced by convolutional layers. The dimension of the mean and variance vectors specifies the dimension of the feature points extracted from the given image. To find the mean and variance vectors, most literature uses the Kullback-Leibler divergence (KL divergence [24]) as the loss function. Minimizing the KL divergence means optimizing the parameters of the probability distribution so that it closely resembles the target distribution. For the VAE, with a standard normal target distribution, the KL divergence loss function is shown in the following equation:

$$D_{KL} = \frac{1}{2} \sum_{i=1}^{n} \left( \sigma_i^2 + \mu_i^2 - \log \sigma_i^2 - 1 \right)$$

where $\sigma_i^2$ specifies the $i$-th element of the variance vector, and $\mu_i$ specifies the $i$-th element of the mean vector.
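The KL divergence term can be sketched numerically as follows. This is an illustration only: it assumes the commonly used closed form for a diagonal Gaussian measured against a standard normal target, 0.5 · Σ (σ_i² + μ_i² − log σ_i² − 1), and the function name and test values are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(s * s + m * m - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

kl_at_target = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])  # matches the target
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])    # mean moved away
kl_narrow = kl_to_standard_normal([0.0, 0.0], [0.5, 1.0])     # variance shrunk
print(kl_at_target, kl_shifted, kl_narrow)
```

The term is zero only when every mean is 0 and every standard deviation is 1, and it grows as the encoded distribution moves away from the target, which is what makes it act as a regularizer on the latent space.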

The Decoder Structure
The decoder structure is a transpose of the encoder structure. It is also a sequential network consisting of four convolutional layers, each followed by a ReLU non-linearity. The same stride and padding used in the encoder are used in the decoder as well. However, the kernel size of each convolutional layer must be set carefully using equations (1) and (2) so that the reconstructed image has the same pixel size as the input image. Figure 4 shows an illustration of the decoder structure.
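Why the decoder kernel sizes need care can be seen from a size round trip. The sketch below assumes that the decoder layers are transposed convolutions without output padding or dilation, which the text does not state explicitly, and uses the relation (n − 1)·s − 2p + f for their output size. The function names and the stride-2 variant are hypothetical and serve only as an illustration.

```python
def conv_output_size(n, f, s=1, p=0):
    """Output length of a convolution along one axis."""
    return (n + 2 * p - f) // s + 1

def transposed_conv_output_size(n, f, s=1, p=0):
    """Output length of a transposed convolution (no output padding, no dilation)."""
    return (n - 1) * s - 2 * p + f

def round_trip(n, enc_kernels, dec_kernels, s):
    """Encode then decode one spatial axis and return the final size."""
    for f in enc_kernels:
        n = conv_output_size(n, f, s)
    for f in dec_kernels:
        n = transposed_conv_output_size(n, f, s)
    return n

# Setting from the text: kernel 4, stride 1, no padding, four layers each way.
same_kernels = round_trip(480, [4] * 4, [4] * 4, s=1)

# Hypothetical stride-2 variant: reusing kernel 4 everywhere misses the input
# size, while enlarging one decoder kernel to 5 restores it.
naive_stride2 = round_trip(480, [4] * 4, [4] * 4, s=2)
tuned_stride2 = round_trip(480, [4] * 4, [4, 4, 5, 4], s=2)
print(same_kernels, naive_stride2, tuned_stride2)  # 480 478 480
```

With stride 1 the mirrored kernels recover the 480-pixel axis directly. The stride-2 case shows how the rounding in the encoder can make a mirrored decoder fall short, so that one kernel size has to be adjusted.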

Feature Extraction: Result and Analysis
Unlike a face or a landscape background, fire is by nature hard to characterize, which makes it difficult to extract its features from an image or video. Fire has many distinctive forms and colors. Sometimes the fire is covered by smoke, which makes feature extraction even more difficult. In this section, we present the result of feature extraction on a dataset containing 10793 RGB images, most of which contain fire, with 10% outlier images, i.e., images which do not contain fire. The image resolution is 480×480 pixels. Thus, each raw image initially has 3×230400 feature points. This is a relatively large feature dimension compared to that of a small-resolution image, which shows the importance of feature reduction to save computation power.
Figure 6 shows the extracted feature points taken from randomly chosen images from the dataset. Each box in Figure 6 (a) consists of feature points from 64 randomly chosen images (8×8 image matrix) for a certain VAE training iteration. Comparing the features across iterations, we can observe that the VAE learns which important features should be kept and which information may be omitted. At iteration 12, the extracted features of every image differ from one another.
The resulting extracted features can be used as a substitute for the initial images in a deep learning-based fire detector. However, we should first confirm that the initial images can be reconstructed from these features. If images with and without fire can be distinguished in the reconstructed images, the features can be used as a substitute for the initial images in the dataset. Figure 7 shows the reconstruction results using the features obtained at each iteration in Figure 6. As the VAE learns to extract important features, the reconstructed images become clearer and more distinguishable.
From the reconstruction results using the features from iteration 12, we can distinguish images with fire from images without fire. This means that the feature points at iteration 12 contain enough important information. Therefore, they can be used to substitute the initial images from the dataset.

Implications of the Proposed Method
This method can be implemented for fire detection in buildings, as an addition to existing fire detection sensors. CCTV cameras are already commonly placed around buildings and can therefore be used as part of a fire detection system.

Conclusion and Future Research
The fully convolutional variational autoencoder (VAE) is suitable for extracting features from a given dataset of fire images. Even though the nature of fire makes it hard to extract its features from an image, the fully convolutional VAE can extract enough important features. The images reconstructed from the extracted features can still be distinguished as containing fire or not. From these reconstructed images, we can determine the suitable latent vector, i.e., the one with the fewest feature points that does not lose too much important information. This latent vector can then be used to substitute the initial images in the dataset. By nature, it has a significantly smaller dimension than the initial image. For future work, it would be interesting to compare this algorithm with other feature extraction methods such as Isomap, nonlinear dimensionality reduction (NLDR), or multifactor dimensionality reduction (MDR).