BN (batch normalization) keeps the input distribution of each layer roughly consistent during the training of a deep neural network.
Are the parameters used by BN during training and testing the same?
For BN, each batch of training data is normalized during training, that is, the mean and variance of the current batch are used.
At test time, for example when predicting a single sample, there is no notion of a batch. The mean and variance used then are those of the whole training data, obtained with a moving-average estimate accumulated during training.
So once a model is trained, all of BN's parameters are fixed, including the mean and variance as well as gamma and beta.
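As a minimal sketch of this train/test difference (NumPy; the class name, momentum value, and layout are illustrative assumptions, not any framework's actual implementation):

```python
import numpy as np

class SimpleBatchNorm:
    """Illustrative 1D batch norm: batch statistics in training, running statistics in testing."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)        # learnable scale
        self.beta = np.zeros(num_features)        # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)   # statistics of the current batch
            # moving-average update of the statistics that will be used at test time
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var  # accumulated training statistics
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```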
Why not use the mean and variance of the whole training set when training BN?
During the first full training epoch, the whole-training-set mean and variance of every layer except the input layer are not yet available, so only the statistics of the current batch can be computed during forward propagation. Could the statistics of the complete data set then be used after one full epoch?
For BN, every batch is normalized toward the same target distribution, but each batch's own mean and variance differ rather than being a fixed value. This difference actually adds noise that increases the robustness of the model and reduces overfitting to some extent.
However, if the mean and variance of a batch differ too much from those of the full data set, the batch no longer represents the distribution of the training set. BN therefore usually requires the training set to be fully shuffled and a reasonably large batch size to be used, which narrows the gap between the batch statistics and those of the whole data.
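A quick way to see why shuffling and a larger batch help (a toy NumPy experiment with made-up numbers, purely illustrative): the spread of per-batch means around the full-dataset mean shrinks as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=8192)   # hypothetical training set

for batch_size in (8, 64, 512):
    shuffled = rng.permutation(data)               # fully shuffle the training set
    batch_means = shuffled.reshape(-1, batch_size).mean(axis=1)
    gap = np.abs(batch_means - data.mean()).max()  # worst gap between batch mean and global mean
    print(f"batch_size={batch_size:4d}  max |batch mean - global mean| = {gap:.3f}")
```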
Dropout deactivates neurons with a certain probability during training, i.e. sets their outputs to 0, in order to improve the generalization ability of the model and reduce overfitting.
Is Dropout needed during both training and testing?
Dropout is used during training to reduce the dependence of neurons on particular upstream neurons, which is similar to ensembling several models with different network structures and thus reduces the risk of overfitting.
At test time the full trained model is used, so no neurons are dropped.
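In PyTorch, for example, this switch is handled by the module mode; a small sketch (the tensor shape and drop probability are arbitrary):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 6)

drop.train()       # training mode: each element is zeroed with probability p
print(drop(x))     # roughly half the entries are 0, the survivors are scaled by 1/(1-p)

drop.eval()        # test mode: Dropout becomes a no-op
print(drop(x))     # identical to x
```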
How is the difference between training and testing balanced for Dropout?
During training, Dropout deactivates neurons with a certain probability, which means the outputs of those neurons are set to zero.
Suppose the drop probability is p, i.e. every neuron in the layer is deactivated with probability p. In the three-layer network shown in the figure below, with a drop probability of 0.5, three neurons are deactivated in each training step, so each neuron in the output layer receives only three inputs; in the actual test no neurons are dropped, and each output-layer neuron receives all six inputs, so the expected sum of inputs to each output-layer neuron would differ in scale between training and testing.
Therefore, during training the outputs of the second layer are divided by (1 - p) before being passed to the output-layer neurons, as compensation for the deactivated neurons, so that the expected inputs of every layer are roughly the same during training and testing.
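A minimal sketch of this "inverted dropout" compensation (NumPy; the function name and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, p, training):
    """Inverted dropout: divide by (1 - p) during training so the expected
    input to the next layer matches test time, where nothing is dropped."""
    if not training or p == 0.0:
        return x                        # test time: pass the activations through unchanged
    mask = rng.random(x.shape) >= p     # each neuron is kept with probability 1 - p
    return x * mask / (1.0 - p)         # compensate for the deactivated neurons

x = np.ones(6)
print(dropout_layer(x, p=0.5, training=True))   # surviving entries become 2.0
print(dropout_layer(x, p=0.5, training=False))  # unchanged at test time
```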
Dropout reference: /program_developer/article/details/80737724
Both BN and Dropout can mitigate overfitting, but using them together does not give a 1 + 1 > 2 effect; on the contrary, the result may be worse than using either one alone.
Related paper: Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift.
The authors of this paper find that the key to understanding the conflict between Dropout and BN is the inconsistent behavior of neural variance when the network switches state. Imagine a neural response X as in Figure 1: when the network switches from training to testing, Dropout scales the response according to its retention probability (i.e. p) and thereby changes the neuron's variance learned during training, while BN still keeps the moving-average (running) variance of X accumulated during training. This variance mismatch can lead to numerical instability (see the red curve in the figure below). As the network gets deeper, the numerical deviation in the final prediction may accumulate, degrading the performance of the system. For simplicity, the authors name this phenomenon "variance shift". In fact, if Dropout is not used, the neuron variance in the actual feedforward pass stays very close to the running variance accumulated by BN (see the blue curve in the figure below), which is what guarantees the high test accuracy.
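The mismatch can be reproduced numerically with a toy experiment (a sketch assuming a unit-variance response and inverted dropout with p = 0.5; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # drop probability
x = rng.normal(size=1_000_000)             # a neural response X with variance ~1

# variance seen during training, after inverted dropout (what BN's running variance tracks)
mask = rng.random(x.shape) >= p
train_var = np.var(x * mask / (1 - p))     # ~ 1 / (1 - p) = 2.0

# variance of the same response at test time, when Dropout is switched off
test_var = np.var(x)                       # ~ 1.0

print(train_var, test_var)                 # this gap is the "variance shift"
```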
The authors explore two strategies for breaking this limitation: one is to apply Dropout only after all BN layers, and the other is to modify the Dropout formula so that it is less sensitive to variance, i.e. Gaussian Dropout.
The first scheme is straightforward: simply place Dropout after all BN layers, so the variance-shift problem never arises; in effect, though, it feels like sidestepping the problem.
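A sketch of what scheme one can look like in practice (PyTorch, with made-up layer sizes): Dropout appears only after the last BN layer, so no BN layer ever accumulates statistics on dropout-perturbed activations.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 64),  nn.BatchNorm1d(64), nn.ReLU(),
    nn.Dropout(p=0.5),          # placed after all BN layers, right before the classifier
    nn.Linear(64, 10),          # no BN follows, so no variance shift is fed into BN
)
```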
The second scheme builds on the Gaussian Dropout mentioned in the original Dropout paper, which is an extension of the standard Dropout form. The authors extend it further and propose a uniform-distribution Dropout, called "Uout". The benefit of this form of Dropout is that it is not very sensitive to variance deviation, so the overall variance shift is much milder.
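Minimal sketches of both noise forms (the function names, the p/(1-p) noise variance for Gaussian Dropout, and the multiplicative-uniform form for Uout are assumptions based on my reading of those papers, not verbatim implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(x, p, training):
    """Multiplicative Gaussian noise with mean 1; p is the drop probability,
    so the assumed noise variance p / (1 - p) mimics standard Dropout's noise level."""
    if not training:
        return x
    sigma = np.sqrt(p / (1.0 - p))
    return x * rng.normal(loc=1.0, scale=sigma, size=x.shape)

def uout(x, beta, training):
    """Sketch of "Uout": multiplicative uniform noise r ~ U(-beta, beta)."""
    if not training:
        return x
    return x * (1.0 + rng.uniform(-beta, beta, size=x.shape))
```

Since both noise forms have mean 1, no extra rescaling is needed at test time; the paper's point is that the uniform form in particular keeps the variance shift small.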