Day 7: Real World Deep Learning
So far we have explored neural networks almost in a vacuum. Although we have provided some illustrations for clarity, relying on an existing framework lets us benefit from the knowledge of previous contributors. One such framework is Hasktorch. Among the practical reasons to use Hasktorch is that it builds on the mature Torch tensor library. Another good reason is strong GPU acceleration, which is necessary for almost any serious deep learning project. Finally, using standard interfaces rather than reinventing the wheel helps to reduce boilerplate.
Fun fact: one of the Hasktorch contributors is Adam Paszke, the original author of PyTorch.
Today's post also builds on:
Day 2: What Do Hidden Layers Do?
Day 4: The Importance Of Batch Normalization
Day 5: Convolutional Neural Networks Tutorial

The source code from this post is available on GitHub.
The Basics
The easiest way to start with Hasktorch is via Docker:
docker run --gpus all -it --rm -p 8888:8888 \
-v $(pwd):/home/ubuntu/data \
htorch/hasktorch-jupyter:latest-cu11
Now you may open localhost:8888 in your browser to access JupyterLab notebooks. Note that you need to select the Haskell kernel when creating a new notebook.
If you have never used the Torch library before, you may also want to review this tutorial.
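To get a first feel for the untyped Torch API, you can also evaluate a few warm-up expressions directly in a notebook cell. The snippet below is only a sketch; the values indicated in the comments are approximate.

t = asTensor ([[1, 2, 3], [4, 5, 6]] :: [[Float]])  -- a 2x3 tensor built from a nested list
shape t                       -- [2,3]
w = ones' [3, 2]              -- a 3x2 tensor filled with ones
shape (t `matmul` w)          -- [2,2]
relu (t - 3)                  -- Num instance: subtract a scalar, then zero out negatives
asValue (sumAll t) :: Float   -- 21.0, back to a plain Haskell value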
MNIST Example
Let's take the familiar MNIST example and see how it can be implemented in Hasktorch.
Imports
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE RecordWildCards #-}
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception.Safe
  ( SomeException (..),
    try,
  )
import Control.Monad ( forM_, when, (<=<) )
import Control.Monad.Cont ( ContT (..) )
import GHC.Generics
import Pipes hiding ( (~>) )
import qualified Pipes.Prelude as P
import Torch
import Torch.Serialize
import Torch.Typed.Vision ( initMnist )
import qualified Torch.Vision as V
import Prelude hiding ( exp )
The most notable import is the Torch module itself. There are also related helpers, such as Torch.Vision, to handle image data. The function initMnist has the type
initMnist :: String -> IO (MnistData, MnistData)
The function loads the MNIST train and test datasets, similar to loadMNIST from the previous posts.
It is also worth paying attention to the Pipes module. It is an alternative to the previously used Streamly, and it likewise allows building streaming components.
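As a tiny, self-contained illustration of the Pipes style (unrelated to MNIST and not needed for the rest of the example), a producer is connected to a consumer with (>->):

-- produce the numbers 1..5 and print each of them
runEffect $ each [1 .. 5 :: Int] >-> P.print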
We also import functions from Control.Monad, which are useful for IO operations.
Finally, we hide the Prelude exp function in favor of Torch's exp, which operates on tensors (multidimensional arrays) rather than floating-point scalars:
Torch.exp :: Tensor -> Tensor
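For instance, applied to a small tensor it exponentiates every element (output shown approximately):

exp (asTensor [0.0, 1.0, 2.0 :: Float])
-- Tensor Float [3] [ 1.0000, 2.7183, 7.3891]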
Defining Neural Network Architecture
First, we define a neural network data structure that contains the trainable parameters (the neural network weights). In the simplest case, it can be a multilayer perceptron (MLP).
data MLP = MLP
  { fc1 :: Linear,
    fc2 :: Linear,
    fc3 :: Linear
  }
  deriving (Generic, Show, Parameterized)
This MLP contains three linear layers; each Linear, provided by Torch.NN, holds a weight and a bias parameter. Next, we may define a data structure that specifies the number of neurons in each layer:
data MLPSpec = MLPSpec
  { i :: Int,
    h1 :: Int,
    h2 :: Int,
    o :: Int
  }
  deriving (Show, Eq)
Now we can define the neural network as a function, similarly to what we did on Day 5, using a "reversed" composition operator (~>).
(~>) :: (a -> b) -> (b -> c) -> a -> c
f ~> g = g . f
mlp :: MLP -> Tensor -> Tensor
mlp MLP {..} =
  -- Layer 1
  linear fc1
    ~> relu
    -- Layer 2
    ~> linear fc2
    ~> relu
    -- Layer 3
    ~> linear fc3
    ~> logSoftmax (Dim 1)
We finish with a (log) softmax layer over the tensor's dimension 1 (Dim 1). Derivatives of linear, relu, and logSoftmax are already handled by the Torch library.
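To make the last layer concrete, here is a small sketch of what logSoftmax does to a 1x3 batch of logits; exponentiating the result recovers class probabilities that sum to one (values shown approximately):

logits = asTensor ([[1.0, 2.0, 3.0]] :: [[Float]])
exp (logSoftmax (Dim 1) logits)
-- roughly [[0.0900, 0.2447, 0.6652]]; each row sums to 1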
Initial Weights
How do we generate initial random weights? As you may remember from Day 5, we could create a function such as this one:
randNetwork = do
  let [i, h1, h2, o] = [784, 64, 32, 10]
  fc1 <- randLinear (Sz2 i h1)
  fc2 <- randLinear (Sz2 h1 h2)
  fc3 <- randLinear (Sz2 h2 o)
  return $
    MLP { fc1 = fc1
        , fc2 = fc2
        , fc3 = fc3
        }
In our example we do almost the same, except that we benefit from applicative functors and the Randomizable typeclass.
instance Randomizable MLPSpec MLP where
  sample MLPSpec {..} =
    MLP
      <$> sample (LinearSpec i h1)
      <*> sample (LinearSpec h1 h2)
      <*> sample (LinearSpec h2 o)
We state above that MLP is an instance of the Randomizable typeclass, parametrized by MLPSpec. All we needed to define this instance was to implement the sample function. Later, to generate initial MLP weights, we can simply write
let spec = MLPSpec 784 64 32 10
net <- sample spec
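Because MLP derives Parameterized, the sampled network can be inspected right away, for example to sanity-check the parameter shapes and count the trainable weights. This is only a quick sketch; the shapes and total shown in the comments are what we would expect for this spec, with each Linear storing its weight as out x in plus a bias vector:

params = flattenParameters net
map (shape . toDependent) params
-- [[64,784],[64],[32,64],[32],[10,32],[10]]
sum (map (product . shape . toDependent) params)
-- 52650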
Train Loop
The core of the neural network training is trainLoop, which performs a single training "epoch". Let us first inspect its type signature.
trainLoop :: Optimizer o => MLP -> o -> ListT IO (Tensor, Tensor) -> IO MLP
This signifies that the function accepts an initial neural network configuration, an optimizer, and a dataset. The optimizer can be gradient descent (GD), Adam, or any other optimizer. The result of the function is a new MLP configuration, produced as an IO action. IO is necessary, for instance, if we want to print the loss after each iteration. Now, let's take a look at the implementation:
trainLoop model optimizer = P.foldM step begin done . enumerateData
First, we enumerate the dataset with enumerateData. Then, we iterate over (fold) the batches. The step function is analogous to a step of the gradient descent algorithm:
  where
    step :: MLP -> ((Tensor, Tensor), Int) -> IO MLP
    step model ((input, label), iter) = do
      let loss = nllLoss' label $ mlp model input
      -- Print loss every 50 batches
      when (iter `mod` 50 == 0) $ do
        putStrLn $ "Iteration: " ++ show iter ++ " | Loss: " ++ show loss
      (newParam, _) <- runStep model optimizer loss 1e-3
      return newParam
We calculate the negative log likelihood loss nllLoss' between the ground truth label and the output of our MLP. Note that model is the parameter, i.e. the weights of the MLP network. Then, we take advantage of the iteration number iter to print the loss every 50 iterations. Finally, we perform a gradient descent step using our optimizer via runStep :: ... => model -> optimizer -> Loss -> LearningRate -> IO (model, optimizer) and keep only the new model newParam. The learning rate here is 1e-3, but it can be changed if needed.
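If you want to see nllLoss' in isolation, here is a small sketch on a batch of one example; the target tensor holds the integer class index and the input holds log-probabilities (the printed value is approximate):

nllLoss' (asTensor ([2] :: [Int])) (logSoftmax (Dim 1) (asTensor ([[1.0, 2.0, 3.0]] :: [[Float]])))
-- Tensor Float [] 0.4076   (= -log 0.6652, the log-probability assigned to class 2)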
The done function is the (in this case trivial) finalization of the foldM iterations over the MLP model, and begin is the initial weights (we use pure to satisfy the m x type requirement).
    done = pure
    begin = pure model
Putting It All Together
The remaining part is simple. We load the data into batches, specify the number of neurons in our MLP, choose an optimizer, and initialize the random weights.
main = do
  (trainData, testData) <- initMnist "data"
  let trainMnist = V.MNIST {batchSize = 256, mnistData = trainData}
      testMnist = V.MNIST {batchSize = 1, mnistData = testData}
      spec = MLPSpec 784 64 32 10
      optimizer = GD
  net <- sample spec
Then, we train the network for 5 epochs:
  net' <- foldLoop net 5 $ \model _ ->
    runContT (streamFromMap (datasetOpts 2) trainMnist) $ trainLoop model optimizer . fst
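Since trainLoop only requires an Optimizer instance, GD can be swapped for another optimizer. Below is a hedged sketch using mkAdam from Torch.Optim (its arguments are the initial iteration counter, beta1, beta2, and the parameter list). Note that step above discards the updated optimizer state, which is harmless for stateless GD but means a stateful optimizer such as Adam would need its state threaded through the loop to be fully effective:

  -- hypothetical alternative: train the same network with Adam instead of GD
  let adam = mkAdam 0 0.9 0.999 (flattenParameters net)
  netAdam <- foldLoop net 5 $ \model _ ->
    runContT (streamFromMap (datasetOpts 2) trainMnist) $ trainLoop model adam . fst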
Finally, we may examine the model on test images
forM_ [0 .. 10] $ displayImages net' <=< getItem testMnist
For this purpose, we may use a function such as
displayImages :: MLP -> (Tensor, Tensor) -> IO ()
displayImages model (testImg, testLabel) = do
  V.dispImage testImg
  putStrLn $ "Model : " ++ (show . argmax (Dim 1) RemoveDim . exp $ mlp model testImg)
  putStrLn $ "Ground Truth : " ++ show testLabel
Running
Iteration: 0 | Loss: Tensor Float [] 12.3775
Iteration: 50 | Loss: Tensor Float [] 1.0952
Iteration: 100 | Loss: Tensor Float [] 0.5626
Iteration: 150 | Loss: Tensor Float [] 0.6660
Iteration: 200 | Loss: Tensor Float [] 0.4771
Iteration: 0 | Loss: Tensor Float [] 0.5012
Iteration: 50 | Loss: Tensor Float [] 0.4058
Iteration: 100 | Loss: Tensor Float [] 0.3095
Iteration: 150 | Loss: Tensor Float [] 0.4237
Iteration: 200 | Loss: Tensor Float [] 0.3433
Iteration: 0 | Loss: Tensor Float [] 0.3671
Iteration: 50 | Loss: Tensor Float [] 0.3206
Iteration: 100 | Loss: Tensor Float [] 0.2467
Iteration: 150 | Loss: Tensor Float [] 0.3420
Iteration: 200 | Loss: Tensor Float [] 0.2737
Iteration: 0 | Loss: Tensor Float [] 0.3054
Iteration: 50 | Loss: Tensor Float [] 0.2779
Iteration: 100 | Loss: Tensor Float [] 0.2161
Iteration: 150 | Loss: Tensor Float [] 0.2933
Iteration: 200 | Loss: Tensor Float [] 0.2289
Iteration: 0 | Loss: Tensor Float [] 0.2693
Iteration: 50 | Loss: Tensor Float [] 0.2530
Iteration: 100 | Loss: Tensor Float [] 0.1979
Iteration: 150 | Loss: Tensor Float [] 0.2616
Iteration: 200 | Loss: Tensor Float [] 0.1986
(Each prediction below is preceded by an ASCII rendering of the test digit printed by dispImage; the renderings are omitted here.)

Model : Tensor Int64 [1] [ 7]
Ground Truth : Tensor Int64 [1] [ 7]
Model : Tensor Int64 [1] [ 2]
Ground Truth : Tensor Int64 [1] [ 2]
Model : Tensor Int64 [1] [ 1]
Ground Truth : Tensor Int64 [1] [ 1]
Model : Tensor Int64 [1] [ 0]
Ground Truth : Tensor Int64 [1] [ 0]
Model : Tensor Int64 [1] [ 4]
Ground Truth : Tensor Int64 [1] [ 4]
Model : Tensor Int64 [1] [ 1]
Ground Truth : Tensor Int64 [1] [ 1]
Model : Tensor Int64 [1] [ 4]
Ground Truth : Tensor Int64 [1] [ 4]
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
Model : Tensor Int64 [1] [ 6]
Ground Truth : Tensor Int64 [1] [ 5]
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
Model : Tensor Int64 [1] [ 0]
Ground Truth : Tensor Int64 [1] [ 0]

Ten of the eleven test digits are classified correctly; the 5 is mistaken for a 6.
See the complete project on GitHub. For suggestions about the content, feel free to open a new issue.
Summary
Today we have learned the basics of the Hasktorch library. Most importantly, the principles from our previous days still apply; therefore, the transition to the new library was quite straightforward. With a few minor changes, this example could be run on a GPU accelerator.
Further Reading

Hasktorch tutorial: https://hasktorch.github.io/tutorial/02-tensors.html
Hasktorch examples: https://github.com/hasktorch/hasktorch/tree/master/examples
Hasktorch documentation: http://hasktorch.org/docs.html
Applicative functors: http://learnyouahaskell.com/functors-applicative-functors-and-monoids