ANALYZING DEPTHWISE CONVOLUTION BASED NEURAL NETWORK: STUDY CASE IN SHIP DETECTION AND LAND COVER CLASSIFICATION

Various methods are available to perform feature extraction on satellite images. Among the available alternatives, the deep convolutional neural network (ConvNet) is the state-of-the-art method. Although previous studies have reported successful attempts at developing and implementing ConvNets for remote sensing applications, several issues are not well explored, such as the use of depthwise convolution, the final pooling layer size, and the comparison between grayscale and Red Green Blue (RGB) settings. The objective of this study is to perform analyses that address these issues. Two feature extraction algorithms were compared: ConvNet, which represents the current state of the art for satellite image classification, and the Gray Level Co-occurrence Matrix (GLCM), which represents a classic unsupervised feature extraction method. The experiment demonstrated results consistent with previous studies: ConvNet is superior to GLCM in most cases, especially with a 3x3xn final pooling layer. The performance of the learning algorithms is much higher on features from RGB channels, except for the ConvNet with a relatively small number of features.


Introduction
The use of earth surface photographs obtained via satellite serves a wide range of applications. For instance, in meteorology, satellite images provide useful information for analyzing cloud cover [1]. In oceanography, some examples are coastal hazard analysis and sea surface temperature estimation [2].
Other examples include ship detection and land-use recognition. (Jurnal Ilmu Komputer dan Informasi / Journal of Computer Science and Information, volume 12, issue 2, June 2019.)
In land-use identification, many feature extraction methods are available. The first example is dictionary learning with mutual incoherence K-Singular Value Decomposition [3]. Secondly, texture feature extractors serve as useful predictors, as shown by several studies. An experiment comparing the Gray Level Co-occurrence Matrix (GLCM) with other texture feature extractors was performed on inhabited region identification; the study showed that GLCM is comparable to Gabor and wavelet features while producing a compact feature vector [4]. GLCM combined with object-based classification was proposed to analyze TerraSAR-X satellite images and proved superior to texture followed by pixel-based classification [5].
Several studies on satellite image ship detection also demonstrated that texture features provide useful information. Incorporating gray level non-uniformity, selected through feature selection, was proposed for the first stage of small ship classification [6]. A texture-based ship representation using GLCM was used after fuzzy c-means based segmentation for classification [7].
Although a low level representation such as texture is useful in practice, efforts have been made to narrow the gap between low and high level representations. An example of a successful attempt is object detectors based on the histogram of oriented gradients, which successfully outperformed other methods [8].
Besides object detectors, a method that systematically learns from low to high level representations is the deep convolutional neural network. In a deep neural network, the first layer learns a simple low level representation of the image. Each following layer incorporates information from the previous layer to learn a higher level feature representation.
Deep convolutional neural networks (ConvNets) have been studied for many applications in satellite image analysis. Evaluated on two remote sensing land use datasets, a study confirmed that a fine-tuned GoogLeNet outperformed CaffeNet and other learning algorithms [9]. Another study confirmed that ConvNets outperformed methods such as the Spatial Pyramid Matching Kernel (SPMK), Sparse Coding, and Bag of Visual Words (BoVW) [10]. A ConvNet was proposed to identify terrains and structures, which is useful for poverty mapping [11]. In synthetic aperture radar (SAR) based maritime target detection, ConvNets are useful for land masking [12] and object detection (such as cargo ships, harbors, and tankers) [13].
Most previous studies presented the convolutional neural network as a robust method for segmenting and classifying satellite images. Despite this success, certain issues have not been addressed.
Firstly, there are many methods whose performance has not been reported. For example, although [14] and [9] discussed popular architectures such as Xception, DenseNet, and ResNet, other networks have not been studied. One example is MobileNet, an architecture that utilizes depthwise separable convolution to improve computational efficiency [15]. Another example is the Gray Level Co-occurrence Matrix, whose performance was not discussed in studies such as [9] and [10].
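The depthwise separable convolution mentioned above can be sketched in a few lines of PyTorch. This is a minimal illustration of the MobileNet-style building block, not the exact block used in this study; the channel counts and input size are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution in the MobileNet style:
    a per-channel 3x3 (depthwise) convolution followed by a 1x1
    (pointwise) convolution that mixes channels."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # groups=in_channels gives each input channel its own 3x3 filter
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# 32 -> 64 channels costs 32*3*3 + 32*64 = 2336 weights here, versus
# 32*64*3*3 = 18432 for a standard 3x3 convolution: the source of the
# computational efficiency claimed for MobileNet.
block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 80, 80))
```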
Secondly, although reducing the feature map to a 1x1xn tensor before the classification layer is a common option in ConvNet implementations, the impact of maintaining some spatial resolution before the classification layer is still unknown.
Finally, learning on multiband / multichannel images results in better model performance in most cases. However, previous studies have not specifically discussed how learning on multichannel images affects model performance compared to learning on a single grayscale channel.
The objective of this study is to perform experiments and analyses that address these issues. This paper is organized as follows. Section 1 presents the background and objective. The methodology is explained in Section 2. Next, the experiment results are presented and discussed in Section 3. Finally, the conclusion is given in Section 4.

Dataset
Two datasets were used for evaluation. The problem posed by the first dataset is recognizing an object, while the second dataset addresses a more general image classification of earth surface photographs.
2.1.1. Ship Detection. The task in ship detection is to detect the presence of a ship in an image patch. The photos were taken from Planet Open California satellite imagery, depicting areas of San Francisco Bay and San Pedro Bay. The dataset is available via the Kaggle dataset repository. With the PlanetScope visual scene, the image spatial resolution is 3 meters [16] [17].
The images are classified into positive and negative samples. The first 1000 images, identified by their IDs, are ship images. The remaining 3000 samples are negative class images, divided evenly into (1) landcover (such as buildings and water), (2) partially captured ships, and (3) instances previously misclassified by machine learning algorithms. A few samples are shown in Figure 1. Figure 2 [18] shows one sample for each class. The features from each image were extracted with three methods. The first two are convolutional neural networks (ConvNet-1 and ConvNet-2). The last is GLCM.
The ConvNets used the training subset for training and the validation subset to validate the model. The weights were obtained by learning only from the datasets, without any pre-training process. The models with the lowest validation error were selected for feature extraction. No data augmentation was performed during training and validation. In the case of GLCM, the training, validation, and testing subsets were processed directly because GLCM does not require any supervised training process.
After the features had been extracted, classifiers were trained to evaluate each feature extractor. The features were first normalized with mean normalization, as shown in Equation 1, before applying the learning algorithm. The normalization was performed feature-wise. The term ε = 10⁻³⁰ was added to the denominator to avoid division by zero.
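The normalization step can be sketched as follows. Note that the exact form of Equation 1 is not reproduced in this text, so the zero-mean / unit-scale form below is an assumption; what the text does specify is that normalization is feature-wise and that ε = 10⁻³⁰ guards the denominator.

```python
import numpy as np

def mean_normalize(features, eps=1e-30):
    """Feature-wise normalization (a sketch; the zero-mean, std-scaled
    form is assumed). eps prevents division by zero for constant features."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)

X = np.array([[1.0, 10.0],
              [3.0, 10.0],
              [5.0, 10.0]])
Xn = mean_normalize(X)  # second column is constant yet causes no error
```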
The training and validation subsets were joined to train a linear support vector machine (SVM) model. The SVM model was then evaluated on the testing subset.
Several metrics were evaluated, namely accuracy, precision, recall, and F1-score. Besides performance evaluation, principal component analysis (PCA) of the features was also performed to visualize the test results. In addition, the histograms of the principal components were observed.
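The classifier-evaluation step described above can be sketched with scikit-learn. The feature arrays below are synthetic placeholders, not the study's actual data; only the pipeline shape (linear SVM on train+validation features, four metrics on the test split) follows the text.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic stand-ins for extracted features (e.g. 48 ConvNet-2 features)
rng = np.random.default_rng(0)
X_trainval = rng.normal(size=(200, 48))
y_trainval = (X_trainval[:, 0] > 0).astype(int)  # toy, linearly separable labels
X_test = rng.normal(size=(50, 48))
y_test = (X_test[:, 0] > 0).astype(int)

# Train on the joined train + validation features, evaluate on the test split
clf = LinearSVC().fit(X_trainval, y_trainval)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0)
```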
Both ConvNet-1 and ConvNet-2 used the same architecture; they differ only in the final pooling layer. The input image batch is first processed by four convolutional blocks. After convolution, the process continues with adaptive pooling and flattening to obtain the feature vector. The feature vector is then used to predict the class label via the fully connected layer. The architecture is illustrated in Figure 3. The networks were implemented using the PyTorch deep learning library [19].
Among the four convolutional blocks, only Block 1 is different, as shown in Table 1. The parameter used for adaptive maximum pooling is the only factor that contrasts ConvNet-1 and ConvNet-2. Adaptive pooling in ConvNet-1 reduces the n-channel output from Block 4 into a 3 x 3 x n tensor. In contrast, adaptive pooling in ConvNet-2 is computed entirely per channel, resulting in a 1 x 1 x n tensor. Consequently, ConvNet-1 still retains some spatial location information of the feature (in a 3x3 grid) while ConvNet-2 does not. Moreover, the former has nine times more features than the latter. Different parameter settings were applied for the Ship recognition and EuroSAT datasets. Nevertheless, some parameters are identical across networks in this experiment: the dropout probability is set to 10% and the maximum pooling size is set to 2 x 2. The details of the parameters are shown in Table 3.
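The pooling difference between the two networks can be demonstrated directly in PyTorch. The 64-channel, 20x20 feature map below is a hypothetical stand-in for the Block 4 output; only the pooling sizes follow the text.

```python
import torch
import torch.nn as nn

# Hypothetical Block-4 output: batch of 1, 64 channels, 20x20 spatial grid
feature_map = torch.randn(1, 64, 20, 20)

convnet1_pool = nn.AdaptiveMaxPool2d((3, 3))  # ConvNet-1: keeps a coarse 3x3 layout
convnet2_pool = nn.AdaptiveMaxPool2d((1, 1))  # ConvNet-2: per-channel global maximum

features_3x3 = torch.flatten(convnet1_pool(feature_map), 1)  # 3 * 3 * 64 = 576 values
features_1x1 = torch.flatten(convnet2_pool(feature_map), 1)  # 64 values
```

This makes the "nine times more features" observation concrete: for the same number of channels, the 3x3 output yields nine times the feature-vector length of global pooling.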
The third feature extraction method is the gray level co-occurrence matrix (GLCM). GLCM was selected because, according to the previously discussed studies, texture is a reliable predictor and GLCM was among the presented texture extractors. GLCM works by constructing a co-occurrence matrix whose values represent the spatial relationships among pixel values. Several features can be computed from the co-occurrence matrix [20].
GLCM has several parameters that must be set for the algorithm. The GLCM parameters were set to be identical across datasets. First, the image pixels were converted from 256 levels into 4 levels of intensity per channel. The co-occurrences were computed for pixel pairs with distances of 1, 2, 4, and 8. The angles considered for co-occurrence are 0, π/2, 3π/4, and π. The order of each value pair was ignored (resulting in a symmetric matrix), and the matrix was normalized before feature computation.
With four pixel distances, four angles, and six types of features, 96 features are extracted from a single channel. For the experiment with RGB images, the number of evaluated features is 96 x 3 = 288. The GLCM implementation from the scikit-image library was used to implement the method [21].

Results
Both the Ship and EuroSAT datasets were evaluated in grayscale and RGB. In each case, three feature extraction methods were tested. Thus, there are twelve models in total. The accuracy of each model is summarized in Table 4.
Principal component visualization for the EuroSAT dataset is provided in Figure 5. The only case where GLCM performed nearly as well as a convolutional neural network is on the Ship recognition dataset, where the resulting accuracy is approximately equal to that of ConvNet-2. Although the performance is similar, ConvNet-2 utilized a much smaller number of features (48) compared to GLCM (96).
Besides measuring performance, principal component analysis was also performed for visualization and for observing feature values. Because the model performances are all relatively high on the Ship dataset, there is no interesting pattern to present and discuss for it. Figure 5 depicts the first two principal components of the features learned on the EuroSAT dataset. The visualization clearly shows that the convolutional neural networks create a more separable pattern than GLCM. This separability is consistent with the performance measures, where the convolutional neural network based methods performed better than GLCM.
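The visualization step amounts to projecting the extracted features onto their first two principal components. A minimal sketch with scikit-learn, using a synthetic stand-in for the learned feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, ReLU-like (non-negative) features standing in for ConvNet output
rng = np.random.default_rng(0)
features = np.maximum(rng.normal(size=(500, 576)), 0.0)

pca = PCA(n_components=2)
projected = pca.fit_transform(features)
# projected[:, 0] versus projected[:, 1] is then scatter-plotted per class
```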
The distributions of the feature principal components are shown by histograms in Figures 6 and 7 for the Ship and EuroSAT datasets, respectively. There is an interesting pattern visualized by the histograms: the principal components of the ConvNet features show a very high frequency at zero, while GLCM shows no clear distribution shape. Possible causes of this distribution shape are the use of the ReLU activation (which sets negative values to zero) or the ability of the convolutional neural network to learn features efficiently (representing the pattern with a minimum number of non-zero components). Nonetheless, these possibilities need verification by further studies with more datasets and network architectures.
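The ReLU explanation offered above is easy to illustrate in isolation: a ReLU maps every negative pre-activation to exactly zero, so roughly half the values of a zero-centered input land in the zero bin of a histogram. This toy check does not prove that ReLU is the cause in the study's networks; it only shows the mechanism is plausible.

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=10_000)          # zero-centered toy inputs
relu_output = np.maximum(pre_activation, 0.0)     # ReLU: negatives -> exactly 0
zero_fraction = float((relu_output == 0.0).mean())  # close to 0.5 here
```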

The effect of Adaptive Pooling Output.
Compared to ConvNet-2, ConvNet-1 achieved better performance, as indicated by higher scores in most cases. For example, the precision and recall of ConvNet-1 are higher in Tables 5 and 6. ConvNet-2 only slightly outperformed ConvNet-1 on some metrics of some classes in the EuroSAT dataset, as indicated by Table 7.
On both the grayscale and RGB Ship datasets in particular, ConvNet-1 significantly outperformed the other methods. In relation to spatial information, this result is reasonable because a positive sample must contain a full ship object, as shown in Figure 1 with IDs 131 and 889. A partial ship object, such as ID 2434, is classified as a negative sample. Therefore, spatial information is useful for detecting the ship boundary.
The reason for the difference in performance is difficult to explain given the limited number of experiments. However, considering the models and the cases, there are several possibilities. First, spatial resolution matters: removing spatial information completely with global maximum pooling (adaptive pooling with a 1x1xn output) results in lower performance than retaining spatial information with 3x3xn adaptive pooling. Second, with a 3x3 pooling output, ConvNet-1 has nine times more features than ConvNet-2, so a more complex pattern can be learned.

RGB and Grayscale Performance
Models trained on the RGB version of each dataset performed better than their grayscale counterparts. The difference lies in the size of the performance gain.
For example, on the Ship dataset, the improvement gained from training on RGB with respect to grayscale is small for the convolutional neural networks (0.964 to 0.971 for ConvNet-1 and 0.919 to 0.925 for ConvNet-2) compared to GLCM (0.89 to 0.92). On grayscale alone, the smallest accuracy among all methods is 0.89, which indicates that single-channel texture already carries most of the discriminative information for this task. On the EuroSAT dataset, the result is rather different. ConvNet-2 gained only a small improvement, likely because identical network architectures were used for both grayscale and RGB; with identical architectures, the number of extracted features is equal. As shown in Table 4, only a very small improvement was gained (from 0.712 to 0.726) because ConvNet-2 provides only 64 features for both grayscale and RGB. ConvNet-1 improved quite significantly (from 0.716 to 0.786), possibly because ConvNet-1 has significantly more features (9 x 64). GLCM features also gained a significant improvement on the RGB dataset, likely for the same reason as ConvNet-1: because GLCM features are extracted per channel, RGB yields three times as many features as grayscale.

Conclusion
This study presented a performance evaluation of models that learned from features produced by ConvNets and GLCM. In contrast to previous studies, the proposed network utilized depthwise separable convolution and was trained without transfer learning. The result is consistent with previous studies: the convolutional neural network is superior to a classic method such as GLCM on most metrics (accuracy, recall, precision, and F1-score) for both evaluated datasets.
Two similar ConvNets were evaluated, differing only in the final adaptive pooling layer. The results show that the network with a 3x3xn pooling output demonstrated better performance than the network with a 1x1xn output.
Training on RGB images improved model performance in most of the evaluated cases. However, the amount of improvement varied across cases and also depends on the complexity of the pattern. In our experiment on the Ship dataset, for example, the ConvNets gained only a small improvement, as they had already learned the pattern from single-channel texture nearly optimally.

Model Performance.
The experiment results indicate consistency across datasets. First of all, the features extracted by the ConvNets are generally more predictive than GLCM features, as indicated by the accuracy, precision, recall, and F1-scores shown in Table 4 to Table 8.

TABLE 1
2.2.1. Research flow. The experiment began by randomly splitting each dataset into three subsets: training, validation, and testing. The sample distributions for both datasets are summarized in Table 2. After splitting the datasets, the process continued with feature extraction and image classification. The images were evaluated in grayscale and RGB.
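The first research-flow step can be sketched as a random three-way partition of sample IDs. The 60/20/20 proportions below are placeholders (the actual distribution is given in Table 2), and 4000 samples matches the Ship dataset size described earlier.

```python
import random

def split_dataset(samples, seed=0, train_frac=0.6, val_frac=0.2):
    """Randomly partition samples into train / validation / test subsets.
    The remaining (1 - train_frac - val_frac) fraction becomes the test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_ids, val_ids, test_ids = split_dataset(range(4000))
```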

TABLE 3

TABLE 4
Table 5 shows the model recall, precision, and F1-score on grayscale samples of the Ship dataset. The results on RGB samples are shown in Table 6.

TABLE 8
The ConvNet with 1x1xn pooling, which consequently has the smallest number of features, exhibited the smallest improvement on both datasets.