
Day 4: The Importance Of Batch Normalization

What purpose do neural networks serve? Neural networks are learnable models. Their ultimate goal is to approach or even surpass human cognitive abilities. As Richard Sutton puts it, 'The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective'. In his essay, Sutton argues that only models without encoded human knowledge can outperform human-centric approaches. Indeed, neural networks are general enough and they leverage computation. It is not surprising, then, that they can exhibit millions of learnable degrees of freedom.

The biggest challenge with neural networks is two-fold: (1) how to train those millions of parameters, and (2) how to interpret them. Batch normalization (batchnorm for short) was introduced as an attempt to make training more efficient. The method can dramatically reduce the number of training epochs. Moreover, batchnorm is perhaps the key ingredient that made it possible to train certain architectures such as binarized neural networks. Finally, batchnorm is one of the most recent neural network advances.

Previous Posts

Day 1: Learning Neural Networks The Hard Way
Day 2: What Do Hidden Layers Do?
Day 3: Haskell Guide To Neural Networks

The source code from this post is available on Github.

Batch Normalization In Short

What Is A Batch?

Until now, we have looked at toy datasets. These were so small that they could completely fit into memory. However, in the real world there exist huge datasets occupying hundreds of gigabytes, such as Imagenet. Those often would not fit into memory. In that case, it makes more sense to split a dataset into smaller mini-batches. During a forward/backward pass, only one batch is typically processed.
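To make this concrete, here is a minimal sketch, not taken from the project code, of how a dataset represented as a plain list of samples could be split into mini-batches of size m:

-- Split a list of samples into mini-batches of size m (m > 0);
-- the last batch may be smaller if the dataset size is not divisible by m.
toBatches :: Int -> [a] -> [[a]]
toBatches _ [] = []
toBatches m xs = batch : toBatches m rest
  where
    (batch, rest) = splitAt m xs

During one training epoch we would then iterate over these batches, processing one batch per forward/backward pass.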

As the name suggests, the batchnorm transformation acts on individual batches of data. The outputs of linear layers may cause activation function saturation or 'dead neurons'. For instance, in the case of the ReLU (rectified linear unit) activation f(x)=max(0,x), all negative values will result in zero activations. Therefore, it is a good idea to normalize those values by subtracting the batch mean μ. Similarly, division by the standard deviation √var scales the amplitudes, which is especially beneficial for sigmoid-like activations.
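A tiny illustration with made-up numbers (not from the original post): when every pre-activation of a neuron is negative, ReLU outputs only zeros, while first centering the values by the batch mean keeps some of the activations alive.

relu :: Float -> Float
relu = max 0

preActivations :: [Float]
preActivations = [-3.0, -1.5, -0.5]

main :: IO ()
main = do
  -- Without centering, every output is zero: a 'dead' neuron
  print (map relu preActivations)
  -- After subtracting the batch mean, some activations survive
  let mu = sum preActivations / fromIntegral (length preActivations)
  print (map (relu . subtract mu) preActivations)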

Training And Batchnorm

The batch normalization procedure differs between the training and inference phases. During training, for each layer where we want to apply batchnorm, we first compute the mini-batch mean:

μ = (1/m) ∑ᵢ Xᵢ      (1)

where Xᵢ is the i-th feature vector coming from the previous layer, i=1…m, and m>1 is the batch size. We also obtain the mini-batch variance:

var = (1/m) ∑ᵢ (Xᵢ − μ)²      (2)

Now, the batchnorm's heart, the normalization itself:

X̂ᵢ = (Xᵢ − μ) / √(var + ϵ)      (3)

where a small constant ϵ is added for numerical stability. What if normalization of the given layer was harmful? The algorithm provides two learnable parameters that in the worst case can undo the effect of batch normalization: the scaling parameter γ and the shift β. After (optionally) applying those, we obtain the output of the batchnorm layer:

Yᵢ = γ ⊙ X̂ᵢ + β      (4)

Please note that both the mean μ and the variance var are vectors with as many elements as there are neurons in the given hidden layer. The operator ⊙ denotes element-wise multiplication.
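Putting Equations (1)–(4) together, here is a minimal sketch for a single neuron on plain Haskell lists (an illustration of mine, not the massiv-based implementation used later in this post); xs holds the values of one neuron across the whole batch:

batchnormColumn :: Float   -- ^ scaling parameter gamma
                -> Float   -- ^ shift beta
                -> [Float] -- ^ one neuron's values across the batch
                -> [Float]
batchnormColumn gamma beta xs =
    map (\x -> gamma * (x - mu) / sqrt (var + eps) + beta) xs
  where
    m   = fromIntegral (length xs)
    mu  = sum xs / m                            -- Equation (1): batch mean
    var = sum (map (\x -> (x - mu)^2) xs) / m   -- Equation (2): batch variance
    eps = 1e-12                                 -- small constant from Equation (3)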

Inference

During the inference phase it is perfectly normal to have one data sample at a time. So how do we calculate the batch mean if the whole batch is a single sample? To properly handle this, during training we estimate the mean and the variance (E[X] and Var[X]) over the whole training set. Those vectors replace μ and var during inference, thus avoiding the problem of normalizing a singleton batch.
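Here is a minimal sketch of that inference-time rule on plain Haskell lists (names and representation are my own, not the project's): every neuron is normalized with the stored training-set statistics rather than with batch statistics.

import Data.List (zipWith5)

-- Batchnorm at inference time: use the running (training-set) statistics.
batchnormInfer :: [Float]  -- ^ estimated mean E[X], one entry per neuron
               -> [Float]  -- ^ estimated variance Var[X]
               -> [Float]  -- ^ learnable scale gamma
               -> [Float]  -- ^ learnable shift beta
               -> [Float]  -- ^ a single input sample
               -> [Float]
batchnormInfer mean var gamma beta =
    zipWith5 (\mu v g b x -> g * (x - mu) / sqrt (v + eps) + b) mean var gamma beta
  where
    eps = 1e-12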

How Efficient Is Batchnorm

So far we have played with tasks that provided low-dimensional input features. Now, we are going to test neural networks on a bit more interesting challenge. We will apply our skills to automatically recognize handwritten digits from the famous MNIST dataset. This challenge originally came from the need for machine reading of zip codes to make postal services more efficient.

We construct two neural networks, each having two fully-connected hidden layers (300 and 50 neurons). Both networks receive 28×28=784 inputs, the number of image pixels, and give back 10 outputs, the number of recognized classes (digits). As in-between layer activations we apply ReLU f(x)=max(0,x). To obtain the vector of classification probabilities at the output, we use softmax activation. One of the networks in addition performs batch normalization before the ReLUs. Then, we train both networks using stochastic gradient descent with learning rate α=0.01 and batch size m=100.

Neural network training on MNIST data. Training with batchnorm (blue) leads to high accuracies faster than without batchnorm (orange).

From the figure above we see that the neural network with batch normalization reaches about 98% accuracy within ten epochs, whereas the other one struggles to reach comparable performance even in fifty epochs! Similar results can be obtained for other architectures.

It is worth mentioning that we still lack a full understanding of how exactly batchnorm helps. In the original paper, it was hypothesized that batchnorm reduces internal covariate shift. Recently, it was shown that this is not necessarily true. The best up-to-date explanation is that batchnorm makes the optimization landscape smoother, thus making gradient descent training more efficient. This, in turn, allows using higher learning rates than without batchnorm!

Implementing Batchnorm

We will base our effort on the code previously introduced on Day 2. First, we will redefine the Layer data structure, making it more granular:

data Layer a = -- Linear layer with weights and biases
               Linear (Matrix a) (Vector a)
               -- Same as Linear, but without biases
               | Linear' (Matrix a)
               -- Batchnorm with running mean, variance, and two
               -- learnable affine parameters
               | Batchnorm1d (Vector a) (Vector a) (Vector a) (Vector a)
               -- Usually non-linear element-wise activation
               | Activation FActivation

Amazing! Now we can distinguish between several kinds of layers: affine (linear), activation, and batchnorm. Since batchnorm already compensates for a bias, we do not actually need biases in the subsequent linear layers. That is why we define a Linear' layer without biases. We also extend Gradients to accommodate our new layer structure:

data Gradients a = -- Weight and bias gradients
                   LinearGradients (Matrix a) (Vector a)
                   -- Weight gradients
                   | Linear'Gradients (Matrix a)
                   -- Batchnorm parameters and gradients
                   | BN1 (Vector a) (Vector a) (Vector a) (Vector a)
                   -- No learnable parameters
                   | NoGrad

Next, we want to extend the neural network propagation function _pass, depending on the layer. That is easy with pattern matching. Here is how we match the Batchnorm1d layer and its parameters:

    _pass inp (Batchnorm1d mu variance gamma beta:layers)
        = (dX, pred, BN1 batchMu batchVariance dGamma dBeta:t)
      where

As previously, the _pass function receives an input inp and layer parameters. The second argument is the pattern we are matching against, making this equation specific to the Batchnorm1d case. We will also specify _pass for other kinds of Layer, so we obtain a _pass function that is polymorphic with respect to the layers. Finally, the equation results in a tuple of three: gradients to back propagate dX, predictions pred, and the prepended list t with the values BN1 computed in this layer (batch mean batchMu, variance batchVariance, and the learnable parameters' gradients).

The forward pass as illustrated in this post:

        -- Forward
        eps = 1e-12
        b = br (rows inp)  -- Broadcast (replicate) rows from 1 to batch size
        m = recip $ (fromIntegral $ rows inp)

        -- Step 1: mean from Equation (1)
        batchMu :: Vector Float
        batchMu = compute $ m `_scale` (_sumRows inp)

        -- Step 2: mean subtraction
        xmu :: Matrix Float
        xmu = compute $ inp .- b batchMu

        -- Step 3
        sq = compute $ xmu .^ 2

        -- Step 4: variance, Equation (2)
        batchVariance :: Vector Float
        batchVariance = compute $ m `_scale` (_sumRows sq)

        -- Step 5
        sqrtvar = sqrtA $ batchVariance `addC` eps

        -- Step 6
        ivar = compute $ A.map recip sqrtvar

        -- Step 7: normalize, Equation (3)
        xhat = xmu .* b ivar

        -- Step 8: rescale
        gammax = b gamma .* xhat

        -- Step 9: translate, Equation (4)
        out0 :: Matrix Float
        out0 = compute $ gammax .+ b beta

As discussed on Day 2, there is a recursive call that obtains the gradients from the next layer, the neural network prediction pred, and the tail of computed values t:

        (dZ, pred, t) = _pass out0 layers

I prefer to keep the backward pass without any simplifications. That makes it clear which step corresponds to which:

        -- Backward

        -- Step 9
        dBeta = compute $ _sumRows dZ

        -- Step 8
        dGamma = compute $ _sumRows (compute $ dZ .* xhat)
        dxhat :: Matrix Float
        dxhat = compute $ dZ .* b gamma

        -- Step 7
        divar = _sumRows $ compute $ dxhat .* xmu
        dxmu1 = dxhat .* b ivar

        -- Step 6
        dsqrtvar = (A.map (negate. recip) (sqrtvar .^ 2)) .* divar

        -- Step 5
        dvar = 0.5 `_scale` ivar .* dsqrtvar

        -- Step 4
        dsq = compute $ m `_scale` dvar

        -- Step 3
        dxmu2 = 2 `_scale` xmu .* b dsq

        -- Step 2
        dx1 = compute $ dxmu1 .+ dxmu2
        dmu = A.map negate $ _sumRows dx1

        -- Step 1
        dx2 = b $ compute (m `_scale` dmu)

        dX = compute $ dx1 .+ dx2

Note that often we need to perform operations like the mean subtraction X−μ, where in practice we have a matrix X and a vector μ. How do you subtract a vector from a matrix? Right, you don't. You can subtract only two matrices. Libraries like Numpy have broadcasting magic that implicitly converts a vector to a matrix. This broadcasting might be useful, but it can also obscure different kinds of bugs. We instead perform explicit vector-to-matrix transformations. For our convenience, we have a shortcut b = br (rows inp) that expands a vector to the same number of rows as in inp, where the function br ('broadcast') is:

br rows' v = expandWithin Dim2 rows' const v

Here is an example of how br works. First, we start an interactive Haskell session and load the NeuralNetwork.hs module:

$ stack exec ghci
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
Prelude> :l src/NeuralNetwork.hs

Then, we test the br function on the vector [1, 2, 3, 4]:

*NeuralNetwork> let a = A.fromList Par [1,2,3,4] :: Vector Float
*NeuralNetwork> a
Array U Par (Sz1 4)
  [ 1.0, 2.0, 3.0, 4.0 ]

*NeuralNetwork> let b = br 3 a
*NeuralNetwork> b
Array D Seq (Sz (3 :. 4))
  [ [ 1.0, 2.0, 3.0, 4.0 ]
  , [ 1.0, 2.0, 3.0, 4.0 ]
  , [ 1.0, 2.0, 3.0, 4.0 ]
  ]

As we can see, a new matrix with three identical rows has been obtained. Note that a has type Array U Par, meaning that its data are stored in an unboxed array, whereas the result is of type Array D Seq, a so-called delayed array. A delayed array is not an actual array, but rather a promise to compute an array in the future. In order to obtain an actual array residing in memory, use compute:

*NeuralNetwork> compute b :: Matrix Float
Array U Seq (Sz (3 :. 4))
  [ [ 1.0, 2.0, 3.0, 4.0 ]
  , [ 1.0, 2.0, 3.0, 4.0 ]
  , [ 1.0, 2.0, 3.0, 4.0 ]
  ]

You will find more information about manipulating arrays in massiv documentation. Similarly to br, there exist several more convenience functions, rowsLike and colsLike. Those are useful in conjunction with _sumRows and _sumCols:

-- | Sum values in each column and produce a delayed 1D Array
_sumRows :: Matrix Float -> Array D Ix1 Float
_sumRows = A.foldlWithin Dim2 (+) 0.0

-- | Sum values in each row and produce a delayed 1D Array
_sumCols :: Matrix Float -> Array D Ix1 Float
_sumCols = A.foldlWithin Dim1 (+) 0.0

Here is an example of _sumCols and colsLike when computing the softmax activation:

softmax :: Matrix Float -> Matrix Float
softmax x =
  let x0 = compute $ expA x :: Matrix Float
      x1 = compute $ (_sumCols x0) :: Vector Float
      x2 = x1 `colsLike` x
  in (compute $ x0 ./ x2)

Note that softmax is different from element-wise activations: each of its outputs depends on the whole input vector, so, much like a fully-connected layer, it receives a vector and outputs a vector. Finally, we define our neural network with two hidden linear layers and batch normalization as:

  let net = [ Linear' w1
            , Batchnorm1d (zeros h1) (ones h1) (ones h1) (zeros h1)
            , Activation Relu
            , Linear' w2
            , Batchnorm1d (zeros h2) (ones h2) (ones h2) (zeros h2)
            , Activation Relu
            , Linear' w3
            ]
The number of inputs is the total number of 28×28=784 image pixels and the number of outputs is the number of classes (ten digits). We randomly generate the initial weights w1, w2, and w3, and set the initial batchnorm layer parameters as follows: means to zeros, variances to ones, scaling parameters to ones, and translation parameters to zeros:
  let [i, h1, h2, o] = [784, 300, 50, 10]
  (w1, b1) <- genWeights (i, h1)
  let ones n = A.replicate Par (Sz1 n) 1 :: Vector Float
      zeros n = A.replicate Par (Sz1 n) 0 :: Vector Float
  (w2, b2) <- genWeights (h1, h2)
  (w3, b3) <- genWeights (h2, o)

Remember that the number of batchnorm parameters equals the number of neurons. It is common practice to put batch normalization before activations; however, this order is not strict: one can put batch normalization after activations too. For comparison, we also specify a neural network with two hidden layers without batch normalization:

  let net2 = [ Linear w1 b1
             , Activation Relu
             , Linear w2 b2
             , Activation Relu
             , Linear w3 b3
             ]

In both cases the output softmax activation is omitted as it is computed together with loss gradients in the final recursive call in _pass:

    _pass inp [] = (loss', pred, [])
      where
        pred = softmax inp
        loss' = compute $ pred .- tgt

Here, [] on the left-hand side signifies an empty list of input layers, and [] on the right-hand side is the empty tail of computed values at the beginning of the backward pass.

The complete project is available on Github. I recommend playing with different neural network architectures and parameters. Have fun!

Batchnorm Pitfalls

There are several potential traps when using batchnorm. First, batchnorm behaves differently during training and during inference, which makes your implementation more complicated. Second, batchnorm may fail when training data come from different datasets. To avoid the second pitfall, it is essential to ensure that every batch represents the whole dataset, i.e. its data come from the same distribution as the machine learning task you are trying to solve. Read more about that.
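One simple way to make every mini-batch representative of the whole dataset is to shuffle the training data before splitting them into batches. A small sketch of my own (not part of the project code) using System.Random:

import Data.List (sortOn)
import System.Random (randomRIO)

-- Shuffle a list by pairing each element with a random key and sorting by the keys.
shuffle :: [a] -> IO [a]
shuffle xs = do
  keys <- mapM (const (randomRIO (0 :: Double, 1))) xs
  return (map snd (sortOn fst (zip keys xs)))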

Summary

Despite its pitfalls, batchnorm is an important concept and remains a popular method in the context of deep neural networks. Batchnorm's power is that it can substantially reduce the number of training epochs or even help achieve better neural network accuracy. After discussing this, we are prepared for such hot topics as convolutional neural networks and reinforcement learning. Stay tuned!
