Reducing Adversarial Vulnerability through Adaptive Training Batch Size

Neural networks generalize well to the data distribution, to the extent that they can even fit randomly labeled data. Yet they are also known to be extremely sensitive to adversarial examples. Batch Normalization (BatchNorm), a very common component of deep learning architectures, has been found to increase adversarial vulnerability. Fixup Initialization (Fixup Init) has been proposed as an alternative to BatchNorm that can considerably strengthen networks against adversarial examples. This robustness can be improved further by training with a smaller batch size. The latter, however, comes with a tradeoff in the form of a significant increase in training time (up to ten times longer when reducing the batch size from the default 128 to 8 for ResNet-56). In this paper, we propose a workaround to this problem: start training with a small batch size and gradually increase it during training. We empirically show that our proposal can still improve the adversarial robustness (by up to 5.73%) of ResNet-56 with Fixup Init and the default batch size of 128, while keeping the training time considerably shorter (only 4 times longer instead of 10).


Introduction
Deep learning has progressed rapidly over the last couple of years and is capable of achieving superhuman performance. In computer vision, the convolutional neural network (ConvNet) gained momentum after AlexNet [1] won the ImageNet Challenge in 2012 [2], surpassing the performance of traditional computer vision methods. Despite its lack of interpretability, it has made its way into security- or safety-critical systems such as medical analysis [3,4], face recognition [5,6], and autonomous cars [7].
In 2013, Szegedy et al. found surprising properties of neural networks [8]. One of these is that, because of how a ConvNet processes an image, we can craft an adversarial image whose change is imperceptible to the human eye but enough to make the ConvNet misclassify it. In 2014, Goodfellow et al. proposed a simple yet efficient technique, called the Fast Gradient Sign Method, to craft adversarial examples [9]. Several stronger methods have since been proposed (e.g., the Projected Gradient Descent Attack [10], the Carlini & Wagner Attack [11], and the Momentum Iterative Fast Gradient Sign Method Attack [12]).
These stronger attacks with larger perturbation values can reduce the accuracy of undefended ConvNet models to zero. One extreme example is the one-pixel attack, which modifies only a single pixel using a differential evolution algorithm [13]. The one-pixel attack does not require the gradient of the model; only the probability of each label is needed to compute the perturbation, and it achieves a 31.40% success rate against VGG [13]. Transferability of adversarial perturbations between ConvNet architectures has also been observed [14].
In 2019, Galloway et al. studied the effect of Batch Normalization (BatchNorm) [16] on adversarial robustness [15] and found that avoiding BatchNorm increases adversarial robustness. They proposed Fixup Initialization (Fixup Init) [17] as an alternative that increases adversarial robustness.
The use of a larger batch size has been observed to reduce accuracy [18,19]. Similarly, our experiments show that using a smaller batch size reduces adversarial vulnerability, i.e., lessens the accuracy drop caused by adversarial examples. However, it makes training longer (up to 10 times when going from the default batch size of 128 down to 8). Consequently, reducing the batch size may not be feasible in some scenarios, especially when a large dataset such as ImageNet is used. Increasing the batch size has an effect similar to decaying the learning rate [20]. Based on these observations, we raise the following question: can we increase adversarial robustness by reducing the starting batch size and gradually increasing it during training? In this paper, we show that the adversarial robustness of Fixup Init can be further increased by modifying the learning schedule while keeping training time reasonably low.

Adversarial Examples
It has been argued that deep learning works by encoding a non-local generalization prior over the input space [21]. It assumes that the target function is smooth or can be approximated with a smooth function. Given a training example $(x_i, y_i)$, a function $f(x_j)$ will output $y_i$ if $x_j$ is within a small radius of $x_i$. This helps generalization because $x_j$ might represent $x_i$ from a different point of view or at a different scale. By exploiting this smoothness prior, we can craft a perturbation in the input space that is small enough for human eyes not to notice, yet large enough to make the output jump to a low-probability region of the input space that contains no training examples in its vicinity [8]. The perturbation is usually constrained by $\epsilon$ to ensure that it stays small. Szegedy et al. demonstrated this intriguing property by proposing a method to compute the perturbation using the L-BFGS algorithm [22]. This method is able to fool ConvNets, although it is rather weak and, considering the result, expensive to compute by today's standards.
One of the simplest yet most efficient adversarial attack algorithms is the Fast Gradient Sign Method (FGSM) [9]. This method is a pure one-shot optimization: it computes adversarial examples by adding a per-pixel perturbation in the direction of the sign of the gradient of the cost function with respect to the input, i.e., in the direction that increases the loss. The magnitude of the perturbation is scaled by the constraint $\epsilon$, so the magnitude of the gradient is not important and only its direction (sign) is used. FGSM is defined as follows:

$$Z_{adv} = Z + \epsilon \cdot \text{sign}\left(\nabla_Z J(\theta, Z, y)\right)$$

where $Z_{adv}$ is the perturbed image, $Z$ is the original image, $\epsilon$ is the constraint, $J$ is the cost function for the input, $\theta$ are the weights of the model, and $y$ is the original label. This is an untargeted attack because the goal is only to maximize the error so that the prediction changes. Kurakin et al. modified FGSM into an iterative method [10], often called the Basic Iterative Method (BIM) or Projected Gradient Descent (PGD) without random start, which is formulated as follows:

$$Z_{adv}^{0} = Z, \quad Z_{adv}^{N+1} = \text{Clip}_{Z,\epsilon}\left( Z_{adv}^{N} + \alpha \cdot \text{sign}\left(\nabla_Z J(\theta, Z_{adv}^{N}, y)\right) \right)$$

where $\alpha$ is the magnitude of the perturbation for each iteration and $N$ is the number of iterations.
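For concreteness, the following is a minimal PyTorch sketch of both attacks following the formulas above; it assumes a `model` that outputs logits and inputs in the [0, 1] range. In our experiments we use the AdverTorch implementations instead (see Appendix B).

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-shot FGSM: a step of size eps along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def bim(model, x, y, eps, alpha, n_iter):
    """BIM / PGD without random start: iterate FGSM steps of size alpha,
    clipping back into the eps-ball around the original image each time."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)  # project to eps-ball
        x_adv = x_adv.clamp(0, 1)                            # keep a valid image
    return x_adv.detach()
```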

Batch Normalization
BatchNorm can be considered one of the best discoveries for the progress of deep learning [16]. Despite its indisputable success, however, there is no consensus on where the benefit of using BatchNorm comes from [23,24]. The authors of the original paper hypothesized that BatchNorm reduces the internal covariate shift problem, in which the distribution of activations tends to drift during training, affecting subsequent layers. BatchNorm attempts to stabilize the distributions of layer inputs by controlling their mean and variance, and can be formulated as:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \delta}}$$

where $\mu$ is the mini-batch mean, $\sigma^2$ is the mini-batch variance, and $\delta$ is a small constant added for numerical stability. After normalizing, an affine transformation is applied:

$$y = \gamma \hat{x} + \beta$$

where $\gamma$ and $\beta$ are learnable parameters used to scale and shift the values, respectively. Santurkar et al. argue that BatchNorm might not even reduce internal covariate shift [23]; its success instead comes from how BatchNorm regularizes the optimization problem itself, making the gradient smoother and thus more predictive. This is what allows BatchNorm to use larger learning rates and converge faster.

As we can see in Figure 2, in a residual block the skip connection adds the original input to the final value of the block. Combining ReLU [26] with skip connections solves the vanishing gradient problem, in which gradient information is lost during backpropagation as it passes through many layers. The skip connection can be formulated as:

$$x_{l+1} = x_l + F(x_l; \theta_l)$$

where $\theta_l$ are the weights of the layers inside the block.
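As an illustration of the BatchNorm equations above, here is a minimal training-time sketch for 2D feature maps in PyTorch; it mirrors what `nn.BatchNorm2d` computes during training but omits the running statistics used at inference.

```python
import torch

def batchnorm2d_forward(x, gamma, beta, delta=1e-5):
    """Training-time BatchNorm: normalize each channel over the
    (batch, height, width) dimensions, then apply the affine transform."""
    mu = x.mean(dim=(0, 2, 3), keepdim=True)                   # mini-batch mean
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # mini-batch variance
    x_hat = (x - mu) / torch.sqrt(var + delta)                 # normalize
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)  # scale, shift
```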

Fixup Initialization
However, combined with He Initialization [27], the skip connection may double the variance of the input to each layer. This causes the gradient to grow exponentially with depth:

$$\text{Var}(x_{l+1}) \approx 2\,\text{Var}(x_l) \implies \text{Var}(x_L) \approx 2^L\,\text{Var}(x_0)$$

Balduzzi et al. demonstrated that the exploding gradient problem can be solved by scaling down the above input to each subsequent layer [28]:

$$x_{l+1} = \frac{1}{\sqrt{2}}\left(x_l + F(x_l; \theta_l)\right)$$

In the original ResNet, on the other hand, BatchNorm is employed to solve this problem [25].
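The exponential growth is easy to check numerically. The following sketch simulates an idealized network in which each residual branch output has the same variance as its input (an assumption for illustration, not the actual ResNet computation):

```python
import torch

torch.manual_seed(0)
x = torch.randn(100000)  # unit-variance input
for l in range(1, 11):
    # Idealized branch whose output variance matches its input variance,
    # as happens under He Initialization
    branch = torch.randn_like(x) * x.std()
    x = x + branch  # skip connection: the variance roughly doubles
    print("block {}: Var(x) = {:.1f} (~2^{})".format(l, x.var().item(), l))
```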
Fixup Init solves the exploding gradient problem at the early stage of training by scaling down the initialization of the first convolution layer of each residual block by $\sqrt{L}$, where $L$ is the number of residual blocks. Scaling down alone is enough to train a deep network with skip connections, but the network will not perform as well as a ResNet with normalization layers. To match the performance of a ResNet with normalization layers, Fixup Init performs the following further modifications (a sketch of these rules is given after the list):
• Initialize the classification layer and the last convolution layer of each residual block to 0.
• Add a learnable parameter as a multiplier to the last convolution layer of each residual block.
• Add a learnable parameter as a bias to every linear, convolution, and activation layer.
The latter two modifications above are similar to the affine transformation performed by BatchNorm.
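Below is a minimal sketch of how these rules could be applied to a two-convolution residual block; the attribute names (`block.conv1`, `block.conv2`) and the scalar bias/multiplier layout are illustrative assumptions, not the official Fixup implementation.

```python
import math
import torch
import torch.nn as nn

def apply_fixup_init(block, num_blocks):
    """Fixup-style initialization for a residual block with two convolutions.
    `num_blocks` is L, the number of residual blocks in the network.
    (The network's classification layer is zero-initialized separately.)"""
    # Scale down the He-initialized first convolution by 1/sqrt(L)
    nn.init.kaiming_normal_(block.conv1.weight)
    block.conv1.weight.data.div_(math.sqrt(num_blocks))
    # Initialize the last convolution of the block to zero
    nn.init.zeros_(block.conv2.weight)
    # Learnable scalar multiplier for the branch output, plus learnable
    # scalar biases (applied in the block's forward pass, not shown here)
    block.scale = nn.Parameter(torch.ones(1))
    block.bias1 = nn.Parameter(torch.zeros(1))
    block.bias2 = nn.Parameter(torch.zeros(1))
```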

Increasing Adversarial Robustness of Fixup Initialization
Our proposed solution increases the adversarial robustness of ResNet with Fixup Init through two main steps: reducing the initial batch size, then gradually increasing the batch size during training.

Reduce Starting Batch Size
As has been observed before, using a larger batch size tends to reduce accuracy [18,19]. Therefore, we first simply reduce the batch size from 128 (FRN56-BS128) to 32 (FRN56-BS32) and 8 (FRN56-BS8). We also use the Linear Scaling Rule as proposed by Goyal et al., which states that "when the minibatch size is multiplied by k, multiply the learning rate by k" [19]. The following formula is used to calculate the initial learning rate:

$$\eta = \eta_{orig} \times \frac{bs}{bs_{orig}}$$

where $\eta$ is the current learning rate, $\eta_{orig}$ is the original learning rate (0.1), $bs$ is the current batch size, and $bs_{orig}$ is the original batch size (128). The Linear Scaling Rule is also used by the original Fixup Init implementation, although there the intention is to increase the learning rate when a larger batch size is used. Both batch size and learning rate are multiplied by the number of GPUs.
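In code, the rule is a one-liner; a sketch using our defaults as the assumed arguments:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=128):
    """Linear Scaling Rule: scale the learning rate proportionally
    to the ratio between the current and the original batch size."""
    return base_lr * batch_size / base_batch_size

# e.g. scaled_lr(8) == 0.00625 and scaled_lr(32) == 0.025
```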

Increase Batch Size during Training
Reducing the batch size to 8 significantly increases training time: training Fixup ResNet-56 with a batch size of 128 takes 36:53, while a batch size of 8 takes 6:00:26. To reduce the training time, we borrow the idea of increasing the batch size during training from Smith et al. [20], who proposed to increase the batch size whenever the learning rate would otherwise be decayed, keeping the learning rate constant. The difference from what Smith et al. proposed is that we reduce the starting batch size and gradually increase it to the original batch size before epoch 100, when the learning rate starts to be decayed. We consider the following learning schedules:
• Batch Size 8 Schedule 1 (FRN56-BS8-S1): Reduce the starting batch size to 8 and gradually increase it to 512, multiplying by 2 at epochs 20, 40, 60, 80, 100, and 150. The idea is to divide the first 100 epochs evenly among the batch size increases, so we multiply by 2 every 20 epochs.
• Batch Size 32 Schedule 1 (FRN56-BS32-S1): Reduce the starting batch size to 32 and gradually increase it to 512, multiplying by 2 at epochs 20, 40, 100, and 150. In this scenario, we simply increase the batch size as early as possible.
Because the batch size is multiplied by 2 at epochs 100 and 150, the learning rate is decayed by a factor of 5 instead of the factor of 10 used in the original implementations of ResNet and Fixup ResNet [17,25]. This helps speed up training while still adhering to the Linear Scaling Rule.
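A sketch of how such a schedule can be driven in PyTorch is shown below; rebuilding the `DataLoader` with a doubled batch size at the scheduled epochs is one simple way to do it (the function names and the 200-epoch budget are illustrative assumptions).

```python
from torch.utils.data import DataLoader

# FRN56-BS8-S1: start at batch size 8, double at the listed epochs (cap 512)
DOUBLE_AT = {20, 40, 60, 80, 100, 150}

def train_with_schedule(model, train_set, train_one_epoch,
                        total_epochs=200, batch_size=8, max_batch_size=512):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    for epoch in range(total_epochs):
        if epoch in DOUBLE_AT:
            batch_size = min(batch_size * 2, max_batch_size)
            # Rebuild the loader so the remaining epochs use larger batches
            loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
        train_one_epoch(model, loader, epoch)  # usual forward/backward/step
```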

Experiment Configuration
The code for the experiments is written in Python 3.5.2 using PyTorch 1.3.1, and we run the experiments on a machine with an Intel Core i7-8700 CPU and two NVIDIA RTX 2080 Ti GPUs. Unless stated otherwise, we use the same configuration as the original papers for ResNet-56 [25] and Fixup ResNet-56 [17]. The Fixup Init experiments do not use mixup, a data augmentation technique which can be combined with Fixup Init to improve accuracy as proposed in [17]. We set Python's random seed, the NumPy seed, and the PyTorch seed to 0. We use the CIFAR-10 dataset with its commonly used data augmentation: a random horizontal flip and a random 32x32 crop from the 4-pixel-padded image.
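This augmentation pipeline corresponds to the standard torchvision transforms; a minimal sketch (the `root` path is an illustrative assumption):

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# Standard CIFAR-10 augmentation: random 32x32 crop from a 4-pixel-padded
# image plus a random horizontal flip; ToTensor scales inputs to [0, 1]
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
```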
We have observed that networks without BatchNorm are more sensitive to input normalization: normalizing the input using the z-score with the input mean and standard deviation yields higher accuracy than scaling the input to the [0,1] range. However, normalizing each channel with the input mean and standard deviation makes it more difficult to correctly compute the perturbation within the constraint $\epsilon$. To simplify the computation, we normalize the input using a mean of 0. For PGD-∞, we use a step size of $\epsilon/10$, 20 iterations, and disable random initialization. Using both a one-shot attack and an iterative attack with the same epsilons shows that the proposed solution does not exhibit the obfuscated gradient problem [30]. The code snippets used to generate the attacks are listed in Appendix B.
Naively modifying the learning schedule significantly reduces robustness. As seen in Figure 5a and Figure 5b, FRN56-BS32-S1's performance is similar to that of FRN56-BS128.
As we can see from Figures 6a and 6b, delaying the batch size multiplier until test accuracy has stabilized can improve robustness. FRN56-BS8-S2's performance is the closest to FRN56-BS8's, albeit slightly lower; as seen in Figure 6b, for some $\epsilon$, FRN56-BS8-S2 even performs better than FRN56-BS8. Moreover, FRN56-BS8-S2 takes only 2:19:07 to train, significantly faster than FRN56-BS8. FRN56-BS32-S2 is another example of blindly modifying the learning schedule: its starting increment is taken from FRN56-BS8-S2, and the result shows that FRN56-BS32-S2 fails spectacularly, performing worse than even FRN56-BS128. But FRN56-BS32-S3, which was chosen by looking at Figure 4b, surprisingly has better robustness than FRN56-BS32 when attacked with FGSM, and only slightly lower robustness when attacked with PGD-∞. The detailed results of the attacks can be seen in Appendix A. Choosing the correct schedule also affects accuracy, as shown in Table 1: the accuracy of FRN56-BS32-S1 drops 1.16% from FRN56-BS32, while FRN56-BS32-S3 and FRN56-BS32 have similar accuracy (+0.1%).

Related Work
A large number of research works have been conducted to devise defense methods against adversarial examples. Several techniques have been studied, such as adversarial training [9,31], modifying the network architecture [32], and modifying the input [33,34]. Defensive quantization has been proposed to improve both robustness and efficiency [35]. More recently, a number of certified defenses have been proposed [36,37,38]. A certified defense is a type of defense with a proof that the prediction at any point inside a small norm-bounded ball around a point x will be constant.
Stutz et al. disentangled the relationship between robustness and generalization [39]. They also showed that on-manifold adversarial examples are a result of generalization errors: training with the intent to reduce on-manifold adversarial examples would also increase test accuracy.
We do not try to improve robustness by devising a specific defense technique, but by simply increasing it naturally through the training procedure. The increment is not large, but combined with an adversarial defense technique, it might help retain a degree of accuracy when the defense technique fails.
The differences between the proposed schedules are as follows.
• Schedule 1 (S1) with a batch size of 8 divides the first 100 epochs evenly among the batch size increases. S1 with a batch size of 32 simply follows the same schedule but with a larger starting batch size.
• Schedule 2 (S2) with a batch size of 8 delays the first batch size multiplier because test accuracy is still increasing at epoch 40. Again, S2 with a batch size of 32 simply follows the same schedule but with a larger starting batch size.

As our experiments have shown, we can improve the adversarial robustness of Fixup ResNet-56 by simply reducing the batch size. But reducing the batch size from the default 128 to 8 increases training time by a factor of approximately 10 (from 37:02 to 6:00:43). Modifying the learning schedule by increasing the batch size during training can greatly reduce training time while keeping the adversarial robustness close to that of FRN56-BS8. However, picking the learning schedule at random tends to worsen adversarial robustness (i.e., Schedule 1). Table 1 shows the training time and clean accuracy of each proposed schedule and the baseline models. Training time is an average of three runs, with the standard deviation shown next to it. As we can see from Table 1, compared to Fixup ResNet-56 with a batch size of 8, the proposed schedules can halve the training time while only slightly reducing adversarial robustness, and in some cases even improve robustness and accuracy. Compared to RN56, FRN56 has a faster training time due to having fewer layers in the architecture (i.e., the BatchNorm layers are removed).
We have empirically shown that this method increases robustness. We used one architecture (ResNet-56) and one dataset (CIFAR-10), a standard benchmark for training many deep learning architectures. An in-depth analysis is needed to better understand this behaviour. Furthermore, the schedule is handpicked by looking at the progress of test accuracy during normal training; we might be able to automate this with a technique similar to early stopping, increasing the batch size when test accuracy has not improved for a number of epochs.

Appendix A. Details of Adversarial Robustness

Appendix B. Attack Implementation
The code is written in Python using PyTorch, and for the attacks we use AdverTorch by Ding et al. [29]. A code snippet to iterate over the attack strengths and construct the FGSM attack object is shown below.
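The original listing did not survive extraction; the following is a sketch of the equivalent AdverTorch construction. It assumes a trained `model` and a `test_loader` for CIFAR-10, and the $\epsilon$ values listed here are illustrative, not the exact ones from our evaluation.

```python
import torch
import torch.nn as nn
from advertorch.attacks import GradientSignAttack, LinfPGDAttack

epsilons = [1/255, 2/255, 4/255, 8/255]  # illustrative values

for eps in epsilons:
    # One-shot FGSM attack object
    fgsm = GradientSignAttack(
        model, loss_fn=nn.CrossEntropyLoss(reduction="sum"),
        eps=eps, clip_min=0.0, clip_max=1.0, targeted=False)
    # PGD-inf without random start: step size eps/10, 20 iterations
    pgd = LinfPGDAttack(
        model, loss_fn=nn.CrossEntropyLoss(reduction="sum"),
        eps=eps, eps_iter=eps / 10, nb_iter=20, rand_init=False,
        clip_min=0.0, clip_max=1.0, targeted=False)

    correct = 0
    total = 0
    for x, y in test_loader:
        x_adv = fgsm.perturb(x, y)  # or pgd.perturb(x, y) for PGD-inf
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    print("eps={:.4f} adversarial accuracy={:.4f}".format(eps, correct / total))
```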