10 дневный курс: Машинное обучение

День 1: Изучение Нейронной сети сложным путем.
Нейронная сеть - тема которая переодически появляется в моей жизни. Однажды, когда я был студентом, я был поглощен идеей построить интеллектуальную машину. Я провел пару бессонных ночей в думах. Я прочел несколько эссе проливающих свет на этот философский вопрос, среди которых самый видны, возможно, были записки Марвина Мински(Marvin Minsky). Как результат, я наткнулся на идею нейронной сети. Это был 2010 год, и глубокое обучение не было насколько популярным, как сейчас. Более того, ни кто не приложил много усилий, чтобы связать нейронную сеть в курсах математики или линейной алгебры. Даже ребята делая классическую оптимизацию и статистику иногда казались в затруднении слыша о нейронных сетях. Через несколько лет. я реализовал простую нейронную сеть с сигмоидной активацией как часть курса "принятия решения". В это же время я понял. что существующее положение нашего знания был до сих пор недостатком при создании "думающешго компьютера" на практике. 
 Это был 2012, конференция в Крыме, Украина где я присутствовал при разговоре с проф. Лореном Ларгером(Laurent Larger). Он объяснил как построить высоко-скоростное оборудование для распознавания речи используя лазер. Разговор меня вдохновил и годом позже я начал докторскую работу с целью разработать пластовые вычисления, рекурсивные нейронные сети реализуемые напрямую в железе. Наконец, теперь я использую нейронные сети как часть своей работы. 
 В это серии постов, я буду освещать некоторые интерсные детали проблемы решаемые с помощью нейронной сети. Я очень против повторяющихся усилий, поэтому я постараюсь обойти повторения сторон описанных много раз где-то еще. Для тех же, кто новичек в данной теме, вместо вступления в нейронные сети я буду ссылаться на главу 1.2 своей статьи: Theory and Modeling of Complex Nonlinear Delay Dynamics Applied to Neuromorphic Computing . Здесь, я всего лишь подведу итог, что нейронные сети были вдохновлены биологическими нейронами. Походя на своего билогического двойника, искуственный нейрон получает множество входных данных, проивзодит нелинейную транформацию, и делает вывод. Выражение ниже формализует тип поведения: где N это число входов x , w вес синапсиса, а y это результат. Удивительно, как может показаться, что в современной нейронной сети f может быть практически любой нелинейной. Эта нелинейная фнукия часто называемая функцией активации. Поздравляю, вы дошли до нейроной сети 1950 года. 
 Для продолжения чтения этой статьи, будут полезны, базовые математические знания, но не обязательно. Разрабатывать может любой у кого есть интуиция на тему нейронных сетей! 
 Слово о Haskell 
 Это серия постов практически иллюстрирует все концепты на языке программирования Haskell. Чтобы смотивировать вас на это, вот несколько вопросов и ответов, которые вы хотели знать. 
 Вопрос: Есть что, что делает Haskell практически лучше для нейронной сети или вы просто делаете это потому что вы предпочли Haskell? 
 Ответ: Неронные сети это просто "функциональный объекты". Сеть это просто большая композиция функци. А это естественная вещь для функционального языка. 
 
 Вопрос: Так как я абсолютный новичек в Haskell, хочу знать преимуства использвания Haskell перед python или другим языком? 
 Ответ: Вот основые преимущества: 
 
 Прост в понимании что делает ваша программа 
 Он хорош в переработке существующей кодовой базы. 
 Haskell направляет ваше мышление непосредственно к решению проблемы практическим путем. 
 Программы на Haskell быстры. 
 
 
 Вопрос: Мой недостаток знания Haskell будет мешать в чтении кода, но ваши примеры кода до сих пор имеют смысл для меня. 
 Ответ: Спасибо! Я нахожу Haskell интуитивно понятным при объяснени нейронных сеток. 
 Гадиентный спуск: Первый курс CS 
 Основная идея обучения нейронных сетей и глубокого обучения это логические оптимизационные методы известные как "Градиентный спуск". Для тех, кому сложно вспомнить первый курс, просто посмотрите видео объясняющее идею. 
 Как идея, простая как градиентный спуск может работать для нейронной сети? Пожалуй, нейронная сеть это всего лишь функция, сопоставляющая вход и выход. Сравнивая вывод нейронной сети с некоторым желаемым выводом, онам ожет получить другую функцию, известную как ошибочная функция. Это ошибочная фунцкия имеет конкретный ландшафт ошибок , как горы. Используя градиентный спуск, мы изменяем, нашу нейронную сеть так, будто мы спускаемся с этого ландафта. Далее, наша цель найти ошибку минимума. Ключевая идея метода оптимизации это, то что градиент ошибки дает нам направление в котором нам необходимо изменить нашу сеть. Похожим образом мы сможем спустится с холма покрытого густым туманом. В обоих случаях локальный градиент(наклон) нам доступен. 
 Градиентный спуск может быть описан формулойл: где постоянная γ это то что называется скорость обучения в машинном обучении, таким образо количество обучения в одной итерации - n . В простом случае x скалярная величина. Метод градиентного спуска может быть применен в нескольких строках кода. 
 descend gradF iterN gamma x0 = take iterN (iterate step x0)
 where
 step x = x - gamma * gradF(x)
 gradF_test x = 2 * (x - 3) 
 main = print (descend gradF_test iterN gamma 0.0)
where
iterN = 10
gamma = 0.4
 
 На выходе мы получим следующую последовательность: 
 [0.0,2.4000000000000004,2.88,2.976,2.9952,2.99904,2.999808,2.9999616,2.99999232,2.999998464]
 
 Выходит. значение минимизации функции f(x) равно 3, то есть min f(x) = f(3) = (x-3)^2 = 0 . А последовательность постепенно сходится к этому числу. Давайте посмотрим пристально на то, что делает код выше. 
 
 
 Линии 1-3: Мы определяем метод градиентного спуска, который пошагово применяет функцию step реализующую выражение 2. Мы предоставляем промежуточные результаты берущие первые iterN значения. 
 
 
 Линии 5: Предположим, мы хотим оптимизировать функцию f(x) = (x -3)^2 . Тогда этот градиент gradF_test становится: 
 
 
 
 
 Линии 7-10: И вот, мы запустили наш градиентный спуск установив сколкрость обучения γ = 0.4 . 
 
 Ключевой момент - понять как значение γ действует на сходимость. Когда γ слишком маленькое, алгоритм произведет множество итераций для схождения, однако если γ слишком большое, алгоритм никогда не сойдется. Я не пытаюсь показать простой путь как определить γ для заданной проблемы. Отсюда, должны использоваться различные значения γ . Вы можете поиграться с кодом выше и посмотреть что получается. Для примера вы можете попробовать различные значения γ , к примеру: 0.01, 0.1, 1.1. Метод применим ко множеству N . В далнейшем мы заменим gradF_test функцию скорей векторной, чем скалярной. 
 Части нейронной сети для задачи классификации 
 После того как мы поняли, как градиентный спуск работает, мы может быть захотим научить умеренно полезуню сетку. Скажем, мы хотим определять типа ириса используя явные черты: длина чашелистика, ширина чашелистика, длинна лепестка, ширина лепестка. Существует три типа цветов мы хотим узнавать: 
 
 Ирис щетинестый(Setosa) 
 Ирис разноцветный(Versicolour) 
 Ирис вергинский(Virginica) Теперь проблема закодировать эти классы чтобы наша нейронная сеть могда обработать их. 
 
 Наивное решение и почему оно не работает. 
 Самое простое решение указать на отличия это использовать натуральные числа. Для примера, ирис щетинистый может быть зашифрован как 1 , разноцветный 2 , виргинский 3 . Однако, проблема с этим типом кодирования заключается в том что он предвзятый. Первое кодируются в качестве числа, мы навязали линейный порядок для трех класов. Это значит, что мы начинаем наш отсчет с ириса щетинистого, затем разноцветный и только потом виргинский. Однако, в реальности это не важно чем мы заканчиваем виргинским или разноцветным. Второе, мы так же предполагаем, что расстояние между виргинским и щетинистым ирисами: 3-1=2 - больше чем между виргинским и разноцветным ирисами 3-2=1 , что априори не правильно. 
 Одно горячее кодирование. 
 Какой же тип кодировки нам нужен? Первое, мы не хотим влиять на ограничения порядка, и второе, мы хотим уровнять разницу расстояний между типами. Далее, мы предпочитаем кодировать каждый тип ортогонально, то есть независимо от двух других. Это становится возможно если мы используем вектор трех измерений(так как у нас три типа). Теперь, ирис щетинистый закодирован как [1,0,0], разноцветный [0,1,0] и виргинский [0,0,1]. Эвклидово расстояние между любыми парми классов равны sqrt(2) Для примера, расстояние между щетинистым и разноцветным вычисляется как: 
 Складываем всё вместе. 
 Так как мы уже знакомы с базой нейронной сети и градиентного спуска и так же имеем некоторые данные, чтобы поиграться, пусть веселуха начнется! 
 Сначала мы создаем сеть из трех нейронов. Чтобы это сделать, мы обощим формулу (1): 
 
 где четырех мерный входной вектор, веса матрицы синопсов, и результирующий трехмерный вектор. Говоря в общем, мы производим матрично-векторную манипуляцию с последующей поэлементной активацией. 
 
 Как не линейную активационную функцию f мы будем использовать функцию сигмоид Мы будем использовать hmatrix Haskell библиотеку для операций линейной алгебры таких как работа с матрицами. С помощью hmatrix , выражение (4) может быть переписано как: 
 import Numeric.LinearAlgebra as LA
 sigmoid = cmap f
where
f x = recip $ 1.0 + exp (-x) 
 forward x w =
let h = x LA.<> w
y = sigmoid h
in [h, y]
 
 где <> обозначает матричный результат функции из модуля LA . Отметим что x может быть вектором, но он так же может быть матричным набором. В последнем случае, forward будет преобразовывать наш полный набор данных. Отметим, что мы предоставляем не только результат наших вычислений y , но так же промежуточный шаг h так как он будет использован повтороно позднее для w градиентных вычислений. 
 Какждый из нейронов y предполжительно сработает когда он "думает" что он заметил один из трех видов. Напрмер, когда мы имеет вывод [0.89, 0.1, 0.2] , мы предполагаем что первый нейрон самый "правилный", то есть, мы интерпретируем результат как ирис щетинистый. Другими словами, этот вывод воспринимаетя как [1,0.0] . Как вы можете увидеть, максимальный элемент был установлен еденицей, а остальные два нулями. Это так называемое правило "Победитель забирает всё". 
 Перед обучением нейронной сети, нам нужно некоторые измерения для уменьшения ошибок или функции потерь. Для примера, мы можем использовать Эвклидову потерю: где - предсказание, а реальный ответ из нашего набора данных: 
 loss y tgt =
 let diff = y - tgt
 in sumElements $ cmap (^2) diff
 
 Для иллюстрации градиентного скуска, мы повторно используем descend функцию определенную выше. Теперь, мы можем определить градиентную функцию для выражения (4) нашей неронной сети. Мы используем. то что называется методом обратного распространия , который естественно результат дифференцирования сложной функции и показан для отдельного нейрона на рискунке ниже: 
 Обратный проход 
 Обратное распространение для единичного нейрона.
Первое, в прямом прохождении, начальные выводы `y` вычеслены. Затем, этот вывод сравнивается с некоторым желаемым выводом, и градиент ошибки `dy` передается назад. Затем, градиент активационной функции `df`использует полученное `dy`. Это приводит к оставшимся градиентам: 
 
 
 Теперь мы можем посчитать веса градиента dW используя метод обратного распространения выше: 
 grad (x, y) w = dW
 where
 [h, y_pred] = forward x w
 dE = loss' y_pred y
 dY = sigmoid' h dE
 dW = linear' x dY
 
 Здесь linear' , sigmoid' , loss' градиенты ленейной операции(перемножения), активация сигмоида и функции потерь. Отметим что операциями над матрицами мы обычно вычисляем не скалярную величину а градиент вектора dW обозначенный как любой вес синопса . Ниже эти "направленные" функциональный определения в haskell используют hmatrix библиотеку: 
 linear' x dy = cmap (/ m) (tr' x LA.<> dy)
 where
 m = fromIntegral $ rows x
 sigmoid' x dY = dY * y * (ones - y)
where
y = sigmoid x
ones = (rows y) >< (cols y) $ repeat 1.0 
 loss' y tgt =
let diff = y - tgt
in cmap (* 2) diff
 
 Чтобы проверить нашу сетку, мы качаем набор данных (два файла: x.dat и y.dat ) и сам код . Как указано в коментарии файла, мы запускаем нашу программу: 
 $ stack --resolver lts-10.6 --install-ghc runghc --package hmatrix-0.18.2.0 Iris.hs
Initial loss 169.33744797846379
Loss after training 61.41242708538934
Some predictions by an untrained network:
(5><3)
[ 8.797633210095851e-2, 0.15127581829026382, 0.9482351750129188
, 0.11279346747947296, 0.1733431584272155, 0.9502442520696124
, 0.10592462402394615, 0.17057190568339017, 0.9367875655363787
, 0.10167941966201806, 0.20651101803783944, 0.9300343579182122
, 8.328154248684484e-2, 0.15568011758813116, 0.940816298954776 ]
Some predictions by a trained network:
(5><3)
[ 0.6989749292681016, 0.14916793398555747, 0.1442697900857393
, 0.678406436711954, 0.1691062984304366, 0.2052955124240905
, 0.6842327447503195, 0.16782087736820395, 0.16721778476233148
, 0.6262988163006756, 0.19656943129188192, 0.17521133197774072
, 0.6905553549763312, 0.15299944611286123, 0.12910826989854146 ]
Targets
(5><3)
[ 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0
, 1.0, 0.0, 0.0 ]
 
 На сегодня всё. Вы можете поиграться с кодом. Подсказка: вы можете заметить что grad вызывает sigmoid дважды на один и тот же набор данных: один раз в forward и один раз в sigmoid . Попробуйте оптимизировать код чтобы избежать эту избыточность. 
 Как только вы поймете основы нейронных сетей, можете переходит ко второму дню. В следующей статье, вы научитесь использовать ваше нейронную сеть. Во-первых, мы выделим важность многослоной структуры. МЫ так же показывать как нелинейные активаторы важны. Во-вторых, мы улучшим обучение нейронной сети и обсудим начальные веса синопсов.

День 2: Что делают скрытые уровни?
На прошлом дне, мы ввели понятие обучения одноуровней нейронной сети. Сегодня мы узнаем преимущества многослойной нейронной сети, как правильно её организовывать и обучать. 
 Когда я обсуждаю нейронную сетку со студентами, которые только начали открывать технику машшиног обучения: 
 -Я сделал сетку распознавания цифр рукописного ввода. Но моя точность всего лишь Y 
 -Кажется это горазно меньше чем последнее слово техники, - размышляю я. 
 -Вот именно. Может быть прроблема в X ? 
 Обычно X это не причина. Реальная причина должна быть более простая: вместо многослойной нейронной сети студент сделал однослойную нейронную сетку или её эквивалент. Эта сеть работает как линейный классификатор, следовательно она не может выучить не ленейные отношения между входом и желаемым выводом. 
 Так что такое многослойная нейронная сеть и как избежать ловушки с линейной классификацией. 
 Многослойная нейронная сеть. 
 Многослойная нейронная сеть, может быть описана как:
 
где x и y это входные и выходные векторы соответственно, - веса матриц и смещение векторов, и функции активации в i'том слое. Нелинейная функция активации f применяется поэлемментно. 
 В качестве примера мы сегодня будем использовать двумерный набор данных как показано ниже. 
 
 Оба набора свойств двху классов изображена оранжевыми и синими точками. Цель обучения нейронной сети понять, как описать класс новой точки зная её координаты (x1, x2) . Выхоит что это простое задание не может быть решено с помощью однослойной архитектуры, так как оди слой может вмещать только рисование прямой линии во входяем пространестве (x1,x2) . Давайте добавим еще одни слой. 
 Двуслойная сеть. 
 
 Ввод x1 и x2 - меремножаются весами матрицы W1 , затем функция активации f1 применяется поэлементно. Наконец, данные преобразуются с помощью W2 и затем следующей функцией активации f2 (не отображена) чтобы получить вывод y . 
 Картинка выше показывает однослойный скрытый нейронную архитектуру, которую мы будем взаимозамеяемо вызывать двух уровневую сетку:
 
где f1(x) = max(0,x) так же известно как выпрямитель, будет первой активационной функцией и сигмоид f2(x)=σ(x)=[1+e^-1]^-1 - будет второй активационной функцией, которую вы уже видели в прошлой статье. Специфика сигмоида приводить вохдные данные к виду от 0 до 1. Этот вывод имеет тендецию быть ползеным в нашем случае где мы только имеет два класса: синие точки обозначенные как 0 и оранжевые точки как 1. Для минимизации функции потери во время обучения нейронной сети, мы будем использовать функция доичной перекресной энтропии:
 
где y желаемый вывод а ^y предсказание нейронной сети. Эта функция потери была специально создана что эффективного научения нейронной сети. Другое преимущество перекресной энтрпии в том, что в соединении с активацией сигмоида, градиент ошибки является просто разницей между предсказанным и желаемым выводом. 
 С каких значений начать? 
 Первая ошибка может закрастся в полное игнорирование важности первоначального задания весов сети. Естественно, вы хотите созать нейрон способный обучаться различным свойствам. Далее, вы хотите указать веса нейронной сети случайно.
 
 Проблема круга.
Решение о размере границ нейронной сети обучено с помощью метода градиентного спуска с заданными нулевыми весами(1), одинаковыми весами(2) и разными весами(3) 
 Выше мы можем увидеть разрешение границы для этих трех начальных случаем. Первое, когда веса заданы нулевыми, сеть не может чему-то научиться. Во втором случае, веса заданы не нулями а одинаковыми значениями, в этом случае нет ясности между отдельными нейронами. Далее, вся сеть становится одним большим нейроном. Наконец когда в нейронной сети заданы случайные числа, она может учиться. 
 Как обучать? 
 Универсальная теорема приближения говорит, что под конкретным предположением, однослойная скрытая архитектура должна быть достаточной для приближения любой ризонной функции. Однако, она ничего не говорит о том, как получить или насколько сложно получить веся для нейронной сети. Как мы видели на примере задания весов, использование простого обучения градиентного спуска для обучения, наша сеть смогла различить два круга. Теперь, что на счет более сложного примера: спираль, которая сложна в силу своей нелинейности. Посмотрим на анимацию ниже, слева, чтобы увидать как обучение простого градиентного спуска может легко застрять в неоптимальных настройках.
 
 Проблемы спирали. Обе нейронные сети имеют однлослойную скрытую архитектуру с 512 нейронами. Левый обучен с помощью простого градиентного спуска, правый обучен спомощью Adam. 
 Сеть не может определится между классами. Кто-то может предположить что нужна более адекватная нейронная сеть. Что от части верно. Однако, упомянутая выше теорема подсказывает что это может быть не нужным. 
 Выходит, что много разработок нейронной сети и глубого обучения именно том, как настроить веса, это все что есть в неронном алгоритме. Недостаток прямого градиентного спуска, который алгоритр часто затыкается задолго до приемлимого решения. Одна из воможностей облегчить это ввести импельс. Возможно, самый популярный среди алгоритмов с импульсом - Adam . Как мы можем увидеть из картинки выше справа, алгоритм Adam может эффективно строить подходящие рамки решения. Мне кажется более показательным, если мы изобразим вывод нейронной сети во вмеря обучения. 
 
 Проблемы спирали. Вывод двух обучаемых моделение перед пороговым значением. Можно увидеть, что алгоритм GD частично застрял в промежуточном состоянии, где алгоритм Adam эффективно ведет к разделению двух классов. 
 Таким образом мы видим, что сходимость алгоритма Adam быстрее при более детальном рассмотрении. Если вы хотите изучить по подробнее раличные алгоритмы из семейства градиентного спуска, вот вам хорошее сравнение. 
 Как избежать ловушки линейного классификатора? 
 Теперь, что если мы по ошибке предоставили функцию линейной активации. Что случится? Если активация f будет линейной, тогда модел в формуле 2 урежится до однослойной нейроннйо сети. 
 
 где c - константа. W - линейный классификатор. Это может быть видно на результате границ решения. 
 
 Двуслойная сеть с активационной функцией f(x)=x. 
 Мы использовали ту же архитектуру что и выше, за исключением функции активации f(x)=x. Теперь мы видит, что двуслойная нейронная сеть действует как простая однослойная сеть: она может только делить входящее пространство используя прямые линии. Чтобы избежать этого, все функции активации должны оставаться нелинейными. 
 Сколько уровней? 
 Мы решили проблему спирали с помощью одного спрятанного уровня сети. Почему бы не добавить больше уровнеий? Чтобы задать этот вопрос, давайте нарисуем границы реления во время обучения нейронной сети для трех различных архитектур.
Все нейронные сети имеют ReLU активацию вне терминальных уровней и сигмоид на последнем уровне. Как можно увидеть из кратинки ниже, сеть с тремя спрятанными уровнями и с несколькими нейронами в каждом уровне, обучается быстрее. Причина может быть ясна интуитивно. Нейронная сеть производит топологическое преобразование входного пространства. Составляя несколько простых преобразований ведущих к одному более сложному. Отсюда, для таких нелинейных разделенных классов, как в проблеме спиралей, имеем только преимущество от добавления уровней. 
 
 Неронная сеть обучает сходимост(Adam). Архитектуры:(1) один спрятанный уровень с 128 нейронами(513 параметров), (2) один спрятанный уровень с 512 нейронами(2049 параметров), (3) три спрятанных уровня с 40, 25 и 10 нейронами(1416 параметров). Наибольшую скорость схождения показывает последняя архитектура. 
 Из рисунка мы можем заключить, что больше нейроном не всегда лучший выбор. Самая яркая граница указывает на высшую уверенность в меньшем количестве шагов, полученных с помощью трех спрятанных уровней нейронной сети. Эта нейронная сеть имеет 1416 параметров считающих оба веса и отклонения. Это число около 30 процентов, меньше чем во второй архитектуре с одним одним спрятанным уровнем. Конечно, возможно нужно аккуратно выбирать архитектуру (три ссылки) рассматривая различные условия такие как размер набора данных, отклонение, входную размерность, тип задания и другое. С большим количеством уровней, нейронная сеть может более эффективно отражать сложные отношения . С другой стороны, если количество слоев очень большое, возможно придется применить различные приемы чтобы избежать такие проблемы как затухающий градиент. 
 Дана вся вводная информация, как на счет поэкспериментировать с нейронной сетью? В оставшейся части этой статьи мы научимся как строить многослойные нейронные сети на Haskell. 
 Реализация обратного распространения 
 В прошлой статье , мы реализовали обратное распространение для однослойной(без скрытого уровня) нейронной сети. Этот метод легко расширяется на пути движения к многослойное сетке. Есть пример хорошего видео , где можно найти много технических деталей. Ниже мы вспомним как обратное распространение работает для каждого нейрона. 
 
 *Обратное распространение для нейрона. Вычисления нейрона известно как forward pass показано черным цветом. backward pass обратное распространение градиента в обратном направлении обозначено красным. 
 Если вы найдете любой другой пример обратного распространения, шансы что вы сначала реализуете так называемое forward pass и затем, как отдельную функцию backward pass . Проход вперед вычисляет вывод нейронной сети, где обратный обход ведет учет градиента. В прошый раз мы показали пример прохода вперед в однослоной архитектуре: 
 forward x w =
 let h = x LA.<> w
 y = sigmoid h
 in [h, y]
 
 где h=x LA.<> w вычисляет ![] и y= sigmoid h это поэлементная активация . Здесь мы предоставили оба результата вычесления нейронной сети y и промежуточное значение h , которое было последовательно использовано в обратном проходе для вычисления весов градиента dW : 
 dE = loss' y y_target
 dY = sigmoid' h dE
 dW = linear' x dY
 
 Если есть несколько слоев, то два прохода(вперед и назад) обычно вычисляют все промежуточные результаты 
Затем эти средние значения используются в обратном порядке для вычисления dW .

День 3: Haskell путеводитель по нейронным сетям
После того как мы посмотрели, как работает сеть, стало ясно, что понимание градиента жизненно необходимо. Отсюда, пересмотрим нашу стратегию на уровне ниже. Однако, так как нейронные сети становятся сложнее, вычисления градиента в ручном режиме становится еще тем делом. Но всё еще есть выход! Я очень рад, что сегодня мы наконец познакомимся автоматической дифференциацией, естественным инструментом в изучении арсенала глубокого обучнеия. Эта статья написана под впечатлением от Hacker's guide to Neural Networks. Для сравнения так же стоит посмотреть Python версию. 
 Почему случайный локальный поиск не подходит 
 Следуя инструкции от Картпатого, для начала рассмотрим простую цепь умножений. Haskell не Javascript, поэтому перепишем явным образом. 
 forwardMultiplyGate = (*)
 
 Or we could have written 
 forwardMultiplyGate x y = x * y
 
 чтобы сделать функцию более понятной f(x,y)=x⋅y . В любом случае, 
 forwardMultiplyGate (-2) 3
 
 Возвращает -6. Отлично! 
 Теперь вопрос: есть ли возможность изменить (x,y) чтобы улучшить вывод? Один из способов это произвести локальный случайный поиск. 
 _search tweakAmount (x, y, bestOut) = do
 x_try <- (x + ). (tweakAmount *) <$> randomDouble
 y_try <- (y + ). (tweakAmount *) <$> randomDouble
 let out = forwardMultiplyGate x_try y_try
 return $ if out > bestOut
 then (x_try, y_try, out)
 else (x, y, bestOut)
 
 Не удивительно, функция выше отражает простую итерацию цикла for . Что он делает: случайным образом выбирает точки вокруг начальных (x,y) и проверяет увеличился ли вывод. Если да, тогда он обновляет лучшие известные входные и максимальные выходные данные. Чтобы пройтись по значениям, мы можем использовать foldM :: (b -> a -> IO b) -> b -> [a] -> IO b . Эта фукнция удобна так как ожидаем взаимодействие с "реальным миром" в виде случайно сгенерированных чисел. 
 localSearch tweakAmount (x0, y0, out0) =
 foldM (searchStep tweakAmount) (x0, y0, out0) [1..100]
 
 Код говорит нам, что мы наполняем код с какими-то начальными значениями x0 , y0 и out0 и проходимся от 1 до 100. Ядро алгоритма - searchStep 
What the code essentially tells us is that we seed the algorithm with some initial values of x0 , y0 , and out0 and iterate from 1 till 100. The core of the algorithm is searchStep: 
 searchStep ta xyz _ = _search ta xyz
 
 что есть довольно удобная функция, которая склеивает 2 части вместе. Она просто игнорирует итерационные числа и вызывает _search . Теперь, нам нужно случайное число в промежутке [-1; 1) . Из документации, мы знаем, что randomIO производит числа между 0 и 1. Проскалируем его умножая на 2 и вычитая 1. 
 randomDouble :: IO Double
randomDouble = subtract 1. (*2) <$> randomIO
 
 Функция <$> это синоним fmap . Она применяет чистую функцию substract 1. (*2) тип которой Double-Double ко "внешнему" действию randomIO , тип которой IO Double (yes, IO = input/output)1. 
 Хитрость для числа минус бесконечность: 
 inf_ = -1.0 / 0
 
 Теперь запускаем localSearch 0.01 (-2, 3, inf_) несколько раз: 
 (-1.7887454910045664,2.910160042416705,-5.205535653974539)
(-1.7912166830200635,2.89808308735154,-5.19109477484237)
(-1.8216809458018006,2.8372869694452523,-5.168631610010152)
 
 На деле мы видим как вывод изменился с -6 до -5.2. Но улучшение только 0.008 едениц на итерацию. Это очень не эффективный метод. Проблема со случайным поиском в том, что каждый раз он пытается изменить входные данные в случайных направлениях. Если алгоритм делает ошибку, он должен сбросить результат и начать с последней лучшей позиции. Не правда ли лучше было бы, если вместо каждой итерации результат улучшался пусть даже по чуть чуть но постоянно и не приходилось откатываться? 
 Автоматическое дифференцирование 
 Вместо случайного поиска в случайном направлении, мы можем использовать точное направление и количество для изменения входных данных таким образом, чтобы улучшался вывод. И это то что градиент нам говорит. Вместо ручного вычисления градиента каждый раз, мы можем использовать умный алгоритм. Есть можноество подходов: цифровой, символический и автоматическое дифференцирование. В этой статье, Доминик Стейнтц объясняет разницу между ними. Последений подход, автоматическое дифференцирование, именно то что нам нужно: точные градиенты с мнимальным количеством переработок. Кратко поясним идею. 
 За идеей автоматического дифференцирования стоит ясно определенный градиент только для простых базовых операторов. Затем, мы составляем цепь правил комбинируя операторы в нейронную сеть как хотим. Такая стратегия будет влиять на сам градиент. Давайте посмотрим на метод через пример. 
 Ниже можно посмотреть оператор умножения и его градиент используя правило. 
 (x, x') *. (y, y') = (x * y, x * y' + x' * y)
 
 Тоже самое можно сделать со сложением, вычитанием, делением и экспонентой: 
 (x, x') +. (y, y') = (x + y, x' + y')

x -. y = x +. (negate1 y)

negate1 (x, x') = (negate x, negate x')

(x, x') /. (y, y') = (x / y, (y * x' - x * y') / y^2)

exp1 (x, x') = (exp x, x' * exp x)
 
 Мы так же имеем constOp для констант: 
 constOp :: Double -> (Double,Double)
constOp x = (x, 0.0)
 
 Наконец, мы можем определить наш любимый сигмоид σ(x) объединяя те операторы, что были выше: 
 sigmoid1 x = constOp 1 /. (constOp 1 +. exp1 (negate1 x))
 
 теперь давайте посчитаем нейрон f(x,y)=σ(ax+by+c) , где x и y это ввод а a , b и c паарметры. 
 neuron1 [a, b, c, x, y] = sigmoid1 ((a *. x) +. (b *. y) +. c)
 
 Теперь можно получить градиент для a в точке (a=1, b=2, c=−3, x=−1, y=3) 
 abcxy1 :: [(Double, Double)]
abcxy1 = [(1, 1), (2, 0), (-3, 0), (-1, 0), (3, 0)]
 
 neuron1 abcxy1
(0.8807970779778823,-0.1049935854035065)
 
 Вотт первый пример результата вывода нейронно йсети и второй градиент относительно a 
Проверим математику результата:
 
Первое выражение это результат вычислений нейронов, а второй точное аналитическое выражение
Вот и вся магия за словами автоматическая дифференциация. Похожим образом, мы можем получить остаток градиента: 
 neuron1 [(1, 0), (2, 1), (-3, 0), (-1, 0), (3, 0)]
(0.8807970779778823,0.3149807562105195)

neuron1 [(1, 0), (2, 0), (-3, 1), (-1, 0), (3, 0)]
(0.8807970779778823,0.1049935854035065)

neuron1 [(1, 0), (2, 0), (-3, 0), (-1, 1), (3, 0)]
(0.8807970779778823,0.1049935854035065)

neuron1 [(1, 0), (2, 0), (-3, 0), (-1, 0), (3, 1)]
(0.8807970779778823,0.209987170807013)
 
 Введение библиотеки обратного распределения 
 Библиотека обратного распределения была написана специально для дифференциального программирования. Она предоставляет комбинаторов для уменьшения нашей головной боли. В добавок, самые полезные операции арфиметические и тригонометрические, уже были определены в библиотеке. Можно взглянуть на hmatrix-backprop для линейной алгебры. Всё что вам нужно для дифференциального программирования определить несколько функций: 
 neuron
 :: Reifies s W
 => [BVar s Double] -> BVar s Double
neuron [a, b, c, x, y] = sigmoid (a * x + b * y + c)

sigmoid x = 1 / (1 + exp (-x))
 
 Тут BVar обернут в маркер того, что функция дифференцируемая. 
 forwardNeuron = BP.evalBP (neuron. BP.sequenceVar)
 
 Используем изоморфимз sequenceVar для преобразования BVar список в список BVar ов, как того требует выражение neuron . И передаем дальше. 
 backwardNeuron = BP.gradBP (neuron. BP.sequenceVar)

abcxy0 :: [Double]
abcxy0 = [1, 2, (-3), (-1), 3]

forwardNeuron abcxy0
-- 0.8807970779778823

backwardNeuron abcxy0
-- [-0.1049935854035065,0.3149807562105195,0.1049935854035065,0.1049935854035065,0.209987170807013]
 
 Заметим, что все градиенты в одном списке, тип аргумента первого нейрона. 
 Выводы 
 Современная нейронная сеть тяготеет к сложности. Написание градент обратного распределения в ручну может легко стать ужасом. В этом посте мы посмотрели как можно автоматизировать этот процесс при надобности. 
 В следующем посту мы применим автоматическую дифференциацию к реальной сетке. Поговорим о нормализации, других важных методахв глубоком обучении. И затронем сверточные нейнонные сети, котороые помогут нам решить интересные задачи. 
 Что можно почитать. 
 Графический путеводитель по нейронным сетям 
 Документация обратного распределения 
 Article on backpropagation by Dominic Steinitz

День 4: Важность пакетной нормализации
Для какой цели существуют нейронные сети? Нейронные сети - это обучаемые модели. Их главная цель - приблизиться или даже превзойти человеческие способности восприятия. Ричард Суттон говорит, что самый большой уро, который может быть получен с 70 годов разработки ИИ, в том, что общие методы использующие вычисления очень эффективны. В его эссе, Суттон говорит, что только модели без зашифрованного человеческого знания могут превзойти человеко-ориентированные подходы. Так, что нейронные сети в общем будут достаточны для использования вычислений. Затем, не удивительно, что они могут выставить миллионы обучаемых степеней свободы. 
 Самый большой вызов для нейронных сетей: 1) как обучить эти миллионы параметров, 2) как их интерпритировать. Пакетная нормализация(batchnorm) был представлен в качестве попытки сделать обучение еще эффективнее. Метод может сильно сократить число итераций обучения. Даже больше, batchnorm - возможо является ключем, который даст возможность обучать определенные архитектуры такие как бинарная нейронная сеть. Наконец, batchnorm одна из самых последних преимуществ нейронной сети. 
 Пакетная нормализация в кратце 
 Что такие пакет?
До сих пор, вы смотрели на игрущечные наборы данных. Они были настолько малы, что могли быть легко помещены в память. Однако, в реальности существует огромные наборы занимающие сотни гигабайтов памяти, такие как Imagenet. Они часто не помещаются в память. В этом случае, имеет смысл разделить наборы данных в маленькие пакеты. Во время обучения обрабатывается только один пакет. 
 Как предполагает имя, batchnorm преобразование это действие над отдельными пакетами данных. Вывод линейного слоя может быть причиной деградации активационной функции. Например: в случае ReLU f(x)=max(0,x) активация, все негативные значения будут приводить к нулевой активации. Отсюда, хорошая идея - нормализовать эти значения с помощью вычитания среднего μ пакета. Похожим образом, деление по стандартному отклоненю √var масштабирует амплитуту, которая особенно выгодня для активация сигмоидного вида. 
 Обучение и Batchnorm 
 Процеедура пакетной нормализации имеет отличия на этапах обучения и вывода. Во время обучения, каждый слой, где мы хотим применить batchnorm, сначала вычисляем средний мини пакет: 
 
где Xi это i вектор-свойство идущий из прошлого слоя; i=1…m , где m>1 это размер пакета. Так же получаем ди:сперсию для пакета
 
 Теперь batchnorm ядро, сама нормализация:
 
где маленькая постоянная ϵ добавляет числовую стабильность. Что если нормализация данного слоя была вредна? Алгоритм предоставит два обучаемых параметра, которые в худшем случае могут отменить эффект пакетной нормализации: смасштабирует параметр γ и увеличит β . После применения таковых мы получим вывод слоя batchnorm:
 
Отметим, что оба: среднее значениее μ и распределение var векторы количество которых такое же большое как и нейронов в данном скрытом слое. Оператор ∗ говорит о поэлементном умножении. 
 Вывод 
 В стадии вывода лучше всего иметь одну выборку данных за раз. Как же сосчитать среднюю величину пакета если целый пакет это и есть выборка данных? Для правильной обраотки, во время обучения мы указываем среднее значение и распределение ( E[X] and Var[X]) для всех наборов обучения. Эти вектора заменят μ и var` во время вывода, таким образом избегая проблемы нормализации единичного пакета. 
 Насколько эффективен Batchnorm 
 
До этого момента мы успели поиграться с задачами, которые предоставляют свойство низкоразмерного ввода. Теперь, мы собираемся протестировать нейронную сеть проблемой по серьезнее. Мы применим наши навыки для автоматического распознавания написанных человеком цифр на основе известного MNIST набора данных. Эта проблем появляется в следствии нужды чтения кодов с конветов на почте. 
 Мы создадим две нейронные сети, каждая бдует иметь два полностью связанных скрытых слоя (300 и 50 нейронов). Обе сети получают картинку размером 28×28=784 пикселей, и возвращает 10 чисел, распознанное число. Для внутненних слоев мы будем использовать ReLU f(x)=max(0,x) . Для получения возможных классификаторов вектора в результате, в качестве функции активации мы будем использовать . Одна из сетей в дополнение будет производить пакетную нормализацию перед ReLU. Затем, мы будем обучать их использую стохастический градиентный спуск с коэфиициентом обучения α=0.01^3 и размером пакета m=100 . 
 
 Обучение нейронной сети на MNIST данных. 
 Обучение с помощью пакетной нормализации (синий график) ведет к высокой точности быстрее чем без нее(ораньжевый график). 
 Из графика мы видим, что нейронная сеть с пакетной нормализацией достикает точности в 98% за 10 эпох, где второй пыатеся сделать это в течении 50!Похожие результаты могут быть получены другими архитектурами. 
 Надо упомянуть, что мы до сих пор может слабо недостаточно понимать как именно batchnorm помогает. В его оригинальном описание, предполагается, что batchnorm сокращает внутренее ковариантное смещение. Недавно, было показано, что это не обязательно правда. Лучшее на сегодняшний день объяснение то, что batchnorm делает оптимизационные сглаживания, которые делают обучение градиентным спуском эффективнее. Что, в свою очередь, позволяет ускорить процесс обучение с помощью batchnorm! 
 Реализация batchnorm 
 Мы воспользуемся кодом приведенным в дне 2. Первое, мы переопределим структура дынных Layer сделав его более детализированным: 
 data Layer a = -- Linear layer with weights and biases
 Linear (Matrix a) (Vector a)
 -- Same as Linear, but without biases
 | Linear' (Matrix a)
 -- Batchnorm with running mean, variance, and two
 -- learnable affine parameters
 | Batchnorm1d (Vector a) (Vector a) (Vector a) (Vector a)
 -- Usually non-linear element-wise activation
 | Activation FActivation
 
 Превосходно! Теперь мы можем отделить несколько типов слоев: родственный(линейный), активационный и batchnorm. Так как batchnorm уже компенсирует смещения, нам не нужно использовать смещения на последующих слоях. Поэтому мы определяем Linear слой без смещений. Так же расширим Gradients для использования новой структуры слоёв: 
 data Gradients a = -- Weight and bias gradients
 LinearGradients (Matrix a) (Vector a)
 -- Weight gradients
 | Linear'Gradients (Matrix a)
 -- Batchnorm parameters and gradients
 | BN1 (Vector a) (Vector a) (Vector a) (Vector a)
 -- No learnable parameters
 | NoGrad
 
 Теперь, мы хотим расширить функцию распространения нейронной сети _pass , в зависимости от слоя. Это проще всего сделать с помощью с сопоставления с образцом. Вот как мы это сделаем в Batchnorm1d слое и его параметрах: 
 _pass inp (Batchnorm1d mu variance gamma beta:layers)
 = (dX, pred, BN1 batchMu batchVariance dGamma dBeta:t)
 where
 
 Как и раньше, _pass функция получает ввод inp и параметры слоя. Второй аргумент это образец с который мы и проводим сопоставление, создавая наш алгоритм, в данном случае, Batchnorm1D. Мы также указываем _pass для других видово слоев. Отсуда, мы получаем полиморфную функцию _pass для каждого слоя. Наконец, результат выражения является кортеж: градиенты для обратного распространения dX , предсказание pred и дополнительные данные t с вычисленными значениями BN1 в этом слое(средний пакет batchMu , распределение batchVariance и параметры градиентов для обучения). 
 Следующий шаг показан ниже: 
 -- Forward
 eps = 1e-12
 b = br (rows inp) -- Broadcast (replicate) rows from 1 to batch size
 m = recip $ (fromIntegral $ rows inp)

 -- Step 1: mean from Equation (1)
 batchMu :: Vector Float
 batchMu = compute $ m `_scale` (_sumRows inp)

 -- Step 2: mean subtraction
 xmu :: Matrix Float
 xmu = compute $ inp .- b batchMu

 -- Step 3
 sq = compute $ xmu .^ 2

 -- Step 4: variance, Equation (2)
 batchVariance :: Vector Float
 batchVariance = compute $ m `_scale` (_sumRows sq)

 -- Step 5
 sqrtvar = sqrtA $ batchVariance `addC` eps

 -- Step 6
 ivar = compute $ A.map recip sqrtvar

 -- Step 7: normalize, Equation (3)
 xhat = xmu .* b ivar

 -- Step 8: rescale
 gammax = b gamma .* xhat

 -- Step 9: translate, Equation (4)
 out0 :: Matrix Float
 out0 = compute $ gammax .+ b beta
 
 Обсуждая в статье "День 2", мы повторяли вызов получения градиентов следующего слоя, нейронная сеть предсказывала pred и вычисляла значение хвоста t : 
 (dZ, pred, t) = _pass out layers
 
 Я предпочитаю хранить обратную передачу без упрощений. Это помогает прояснить какой из шагов за что отвечает: 
 -- Backward

 -- Step 9
 dBeta = compute $ _sumRows dZ

 -- Step 8
 dGamma = compute $ _sumRows (compute $ dZ .* xhat)
 dxhat :: Matrix Float
 dxhat = compute $ dZ .* b gamma

 -- Step 7
 divar = _sumRows $ compute $ dxhat .* xmu
 dxmu1 = dxhat .* b ivar

 -- Step 6
 dsqrtvar = (A.map (negate. recip) (sqrtvar .^ 2)) .* divar

 -- Step 5
 dvar = 0.5 `_scale` ivar .* dsqrtvar

 -- Step 4
 dsq = compute $ m `_scale` dvar

 -- Step 3
 dxmu2 = 2 `_scale` xmu .* b dsq

 -- Step 2
 dx1 = compute $ dxmu1 .+ dxmu2
 dmu = A.map negate $ _sumRows dx1

 -- Step 1
 dx2 = b $ compute (m `_scale` dmu)

 dX = compute $ dx1 .+ dx2
 
 Часто, нам нужно произвести операции типа вычитания среднего X−μ , где, на практике, у нас есть матрица X и вектор μ . Как вычесть вектор из матрицы? Правильно, никак. Вычитать можно только 2 матрицы. Такие библиотеки как Numpy могут говорить о магии, которая упростит превращение вектора в матрицу. Это может быть полезно, но может выявить различные виды багов. Мы же взамен произведем явное преобразование вектора в матрицу. Для наших нужд, мы имеем скращение b = br (rows inp) , которое мы расширим вектор на то же число рядов как в inp . Где функция br ('broadcast') : 
 br rows' v = expandWithin Dim2 rows' const v
 
 Это пример того, ка кработае br. Для начала, интерактивно запустим сессию и загрузим модуль NeuralNetwork.hs : 
 $ stack exec ghci
GHCi, version 8.2.2: http://www.haskell.org/ghc/ :? for help
Prelude> :l src/NeuralNetwork.hs
 
 Затем проверим br функцию на векторе [1, 2, 3, 4] : 
 *NeuralNetwork> let a = A.fromList Par [1,2,3,4] :: Vector Float
*NeuralNetwork> a
Array U Par (Sz1 4)
 [ 1.0, 2.0, 3.0, 4.0 ]

*NeuralNetwork> let b = br 3 a
*NeuralNetwork> b
Array D Seq (Sz (3 :. 4))
 [ [ 1.0, 2.0, 3.0, 4.0 ]
 , [ 1.0, 2.0, 3.0, 4.0 ]
 , [ 1.0, 2.0, 3.0, 4.0 ]
 ]
 
 Как можно увидеть, была получения новая матрица с тремя идентичными строками. Отметим, что a имеет тип Array U Seq , что значит, что данные хранятся простым распакованным масивом.Где результат типа Array D Seq так называемый отложенный массив. Этот отложенный массив, на самом деле не массив, но так понятнее становится, что это будет массивом в дальнейшейм. Чтобы разместить массив в памяти используйте compute : 
 *NeuralNetwork> compute b :: Matrix Float
Array U Seq (Sz (3 :. 4))
 [ [ 1.0, 2.0, 3.0, 4.0 ]
 , [ 1.0, 2.0, 3.0, 4.0 ]
 , [ 1.0, 2.0, 3.0, 4.0 ]
 ]
 
 Подробную информацию можно получить из большой документации. Так же как и br , существует несколько более удобных функций, rowsLike и colsLike . Они полезны в связке с _sumRows и _sumCols : 
 -- | Sum values in each column and produce a delayed 1D Array
_sumRows :: Matrix Float -> Array D Ix1 Float
_sumRows = A.foldlWithin Dim2 (+) 0.0

-- | Sum values in each row and produce a delayed 1D Array
_sumCols :: Matrix Float -> Array D Ix1 Float
_sumCols = A.foldlWithin Dim1 (+) 0.0
 
 Пример использования _sumCols и colsLike при вычислении softmax активационной функции 
 softmax :: Matrix Float -> Matrix Float
softmax x =
 let x0 = compute $ expA x :: Matrix Float
 x1 = compute $ (_sumCols x0) :: Vector Float
 x2 = x1 `colsLike` x
 in (compute $ x0 ./ x2)
 
 Заметьте, что softmax отличается от поэлементной активации. Взамен, softmax действует как полностью связанный слой, который получает и отдает вектор. Наконец, мы определяем нашу нейронную сеть, с двумя скрытыми слоями и пакетной нормализацией так: 
 let net = [ Linear' w1
 , Batchnorm1d (zeros h1) (ones h1) (ones h1) (zeros h1)
 , Activation Relu
 , Linear' w2
 , Batchnorm1d (zeros h2) (ones h2) (ones h2) (zeros h2)
 , Activation Relu
 , Linear' w3
 ]
 
 Число входов это произведение 28×28=784 пикселей картинки и а выход это число классов т.е. цифр. Мы случайным образом сгенерировали начальные веса w1 , w2 , and w3 . И указали начальный состояние слоя пакетной нормализации следующим образом: среднее число - 0, распределение - 1, параметры масштабирования - 1, и смещение параметров - 0: 
 let [i, h1, h2, o] = [784, 300, 50, 10]
 (w1, b1) <- genWeights (i, h1)
 let ones n = A.replicate Par (Sz1 n) 1 :: Vector Float
 zeros n = A.replicate Par (Sz1 n) 0 :: Vector Float
 (w2, b2) <- genWeights (h1, h2)
 (w3, b3) <- genWeights (h2, o)
 
 Помним, что число пакетной нормализации должно быть равно числу нейронов. Это распространенная практика ставить пакетную нормализацию перед функцие активации, однако эта последовательность не обязательна, можно и на оборот. Для сравнения, мы составим нейронную сеть с двумя спрятанными слоями без пакетной нормализации. 
 let net2 = [ Linear w1 b1
 , Activation Relu
 , Linear w2 b2
 , Activation Relu
 , Linear w3 b3
 ]
 
 В обоих случаях выход активационной softmax избегается так как вычисляются совместно с градиентами потерь в конечном рекурсивном вызове _pass : 
 _pass inp [] = (loss', pred, [])
 where
 pred = softmax inp
 loss' = compute $ pred .- tgt
 
 Здесь [] с левой стороны значения это пустой список входных слоёв, а [] с правой стороны это пустой конец вычисленных значений в начале обратного прохода. 
 Подводные камни пакетной нормализации 
 Есть несколько потенциальных ловушек при использовании пакетной нормализации. 
 
 Пакетная нормализация отличается во время обучения и взаимодействия. Что делает вашу реализацию более сложной. 
 Пакетная нормализация может не получиться при обучении данных которые идут из разных наборов. Чтобы избежать этого, нужно просто убедиться, что любой пакет отражает целый набор данных. То есть все пакеты имеют одинаковую форму. 
 
 Выводы 
 Не смотря на подводные камни, пакетная нормализация важная идея и остается популярным методом в контексте глубого обучения. Сила пакетной нормализации в том, что она может существенно сократить число обучаемых эпох или даже помочь достичь лучше точности нейронной сети. После этого обсуждения, мы готовимся приступь к понятиям сверточные нейронные сети и обучение с подкреплением. Оставайтесь на связи. 
 Что почитать. 
 
 
 Richard Sutton. The Bitter Lesson 
 
 
 Using neural nets to recognize handwritten digits 
 
 
 Batch normalization paper 
 
 
 How Does Batch Normalization Help Optimization? 
 
 
 On The Perils of Batch Norm 
 
 
 Haskell 
 
 
 Why I think Haskell is the best general purpose language 
 
 
 An opinionated beginner's guide to Haskell in mid 2019 
 
 
 Massiv Array Library 
 
 
 Alexey's presentation about massiv

День  5: Сверточные нейронные сети
Сегодня мы поговорим об одной из самых важных архитектур глубокого обучения, так сказать алгоритм всех алгоритвом в компьютерном зрении. Вот как Франсуа Холлет, автор Karas, называет сверточные нейронные сети (CNNs). Сверточные сети - архитектура, как и другие исскуственные нейронные сети, имеют нейрон в качестве наименьшего строительного блока. Оно отличается так, что сеть удобно обучаемая через обратное распределение. Явное свойство CNNs, однако, это связанная структура, выражающаяся в редких связях сверточных слоёв с нейронами которые делят их веса. 
 Для начала, мы собираемся понять основу CNNs. Затем, подробнее рассмотрим на классической архитектуре CNN. После обсуждения различий между сверточными типами слоёв, мы собираемся реализовать сверточную нейронную сеть на Haskell. Посмотрим, как на рукописных цифрах CNN достигает в два раза меньшую ошибку, сравним с полносвязанной структурой. Построим то что мы изучили в прошлых статьях, не стесняйтесь перечитать прошлые статьи. 
 Сверточный оператор 
 В прошлом, мы изучили полносвязанную нейронную сеть. Так же, теоретически они могут приближать реальную функцию и имеют ограничения. Одна из проблем - достижение трансляционной симметрии. Чтобы объяснить это, давайте взглянем на две картинки ниже.
 
 Трансляционная симметрия: Тот же объект в различных местах. 
 Для нас, людей, не важно, если кошка в правом нижнем углу или где-то в верху картинки. В обоих случаях мы найдем кошку. Мы можмем сказать, что наш человеческий детектор котов трансляционно инвариантен. 
 Однако, если мы посмотрим на архитектуру типичной полносвязанной сети, мы можем понять, что нет ничего, что на самом деле может предостеречь эту сеть работать верно только в какой-то части картинки. Вопрос который мы задаем: Есть ли способ сделать нейронную сеть трансляционно инвариантной? 
 
 Полносвязанная нейронная сеть с двумя скрытыми слоями. 
 Давайте взглянем ближе на картинку кошки. Скоро мы поймем, что пиксели отражающие котячую голову более связанны, чем другие пиксели связанные с кошачим хвтостом. Отсюда, мы так же можем захотеть сделать нашу нейронную сеть такой редкой, что нейроны в следующем слое будут связаны только с соотвествующими соседними пикселями. Таким образом, каждый нейрон в следующем слое будут отвечать только за маленькие свойства оригинальной картинки. Участок который видит нейрон назовем полем восприятия.
 
 Соседние пиксели дают более связную информацию чем отдаленные.
Увеличим картинку кошки. 
 Сверточная нейронная сеть CNN или просто ConvNets была спроектированна для решения двух проблем: трансляционная симметрия и положение изображения. Первое, давайте определим понятие свёрточный оператор. 
 Возможно вы не знаете, но очень похоже вы уже встречали сверточные фильтры. Вспомните когда вы впервые играли с графическим редактором типа GIMP или Photoshop. Возможно вы были восхищены возможностью эффектов таких как корректировка четкости, замыливание или определение краёв. Если нет, тогда вы должны это попробовать. Секрет этих фильтров в том, что это сверточное применение ядра картинки. Ядро картинки это матрица 3x3, как представленно ниже.
 
 Простой сверточный шаг:
Не путайте значение пикселя и ядро. 
 Ниже представлен простой сверточный шаг. Этот шаг - скалярное произведение ядра и значений пикселей. Так как все значения ядер исключая вторые единицы в первом нуле, то результат равняется второму значению в первой строке зеленой рамки. т.о. 40⋅0+42⋅1+46⋅0+…+58⋅0=42. Сверточный оператор берет картинку и действует внутри зеленого перемещающегося окна для того чтобы произвести скалярное произвдение всех частей картинки. Результат новая отфильтрованная картинка. Метематически, прямой сверточный оператор (*) между картинкой and a kernel можно выразить следующим образом
 где and Для лучшего понимания что происходит, можно поиграться с другой картинкой. 
 В чем идея применения оператора двигающегося окна или свертки? В основании идеи лежит биология. Человеческий глаз имеет ограниченное поле четкого зрения. Мы воспринимаем объекты в целом постоянно передвигая взгляд по объекту. Эти быстрые движения называются саккады. Отсюда, сверточный оператор может быть описан как простая модель естественное восприятие объекта глазом. Важный момент, что свертка достигает постоянство сдвига благодаря методу движемого окна. Даже больше, так как каждый результат связан через ядро с очень ограниченным числом пикселей в начальном изображении, сверточная связь очень редки. Поэтому, используя свертки в нейронных сетях мы достигаем инвариантность и разреженость связей. Давайте посмотрим, как это работает на практике. 
 Архитектура сверточной нейронной сети. 
 
 Интересное своейтсво сверточного слоя заключается в том, что если мы подаем на вход смещенную картинку, то выходная карта будет свдинута на ту же величину, в противно случае оно останется неизменным. Это свойство является базовым в надежности сверточной сети при сдвиге или искажении вводных данных. 
 
 
 Как только свойство было замечено, его точно место положение становится не важным. А важным становится, только примерная относительная позиция. 
 
 
 Lecun et al. Gradient-based learning applied to document recognition (1998) 
 
 Прототип того, что сегодня мы называем сверточной нейронной сетью была предложена в 1970 году Фукусимой. Было предложено можество методов обучения с учителем и без, но сегодня CNN обучается только благодаря обратному распространению. Давайте взглянем на одну из известных архитектур известных как LeNet-5.
 
 LeNet-5 architecture from Lecun et al. Gradient-based learning applied to document recognition. 
 Архитектура очень близка к современным CNN. LeNet-5 был создан для распознавания ручных цифр с чернобелых картинок размером 32x32. Два основных блока, как мы сейчас их называем: экстрактор и классификатор свойств. 
 
 С локальными восприимчивыми полянми нейронов можно извлечь простые визуальные свойства, как ориентация граней, концы, углы... 
 Lecun et al. Gradient-based learning applied to document recognition (1998) 
 
 Экстрактор свойств состоит из двух сверточных слоёв. Первый слой имеет 6 сверточных фильтров с ядром 5x5. Применение этих фильтров с подпоследовательными добавочными смещениями и гиперболической танценциальной активацией производят карты, естественно новые, чуть меньшего размера 28x28. По соглашению, результат описывается следующим объёмом 28x28x6. Чтобы сократить разреженное разрешени, производится подрезание. Оно выдает карты свойств 14x14x6. 
 
 Все юниты в карте свойств делят тот же набор 25 весов и тем же количеством смещений, так чтобы можно было заметить те же свойства в любом месте на входе.
Lecun et al. Gradient-based learning applied to document recognition (1998) 
 
 Следующая свертка округляет результат уже в размер карт 10x10x16. В отичии от первого слоя, мы используем ядро 5x5x6. Это значит, что каждый из 16 сверток одновременно обрабатывает все 6 свойств карты полученных на прошлом шаге. После разделения мы получаем результирующи размер 5x5x16. 
 Классификатор состоит из трехмерно связанных слоёв с 120, 84б и 10 нейронами каждый. Последний слой предоставляет зашифрованный ответ. Небольшое различие между современными архитектурами в последнем слое, который состоит из 10 эвклидовых радиальных смещеных функциональных единиц, где сегодня это будет обычный полносвязный слой следующим за слоем softmax. 
 Важно понять, что единичный сверточный фильтр может понять только одно свйоство. Например, он может понять горизонатальную линию. Поэтому мы используем несколько больше фильтров с различными ядрами, чтобы иметь возможность понимать свойства: вертикальные линии, простые текстуры, или углы. Как вы уже видели, число фильтров обычно отображается в ConvNet диаграмма как объем. Слои, что глубже в сети будут объединять базовые свойства что были найдены в слоях выше в более абстрактные сущности как глаз, ухо или даже полные фигуры. Чтобы лучше понять этот механизм, давайте мы взглянем на поле восприятия на визуализации ниже. 
 
 Поле графического восприятия унаследованно от Арун Маллуа. Проведем курсором мыши над любым нейроном в верхнем слое, чтобы посмотреть как расширится поле восприятия в слое дальше. 
 Как можно увидеть проверяя нейроны в последнем слое, даже маленькое поле 3x3 ратстет как происходит движение в первом слое. Мы можем предвидеть, что нейроны "глубже" будут иметь лучшее понимание, что происходит на картинке. 
 Красота и величие достигаемое глубоким обучением в том, что ядро фитрации самообучаемо самой сетью, достигается оно даже лучшую точность в сравнении с человеческими способностями. Странности сверточных слоев в том, что результат получаемые после повторяющихся применений маленького количества веса как определено в ядре. Благодаря весам, сверточные слои имеют резко сокращенное число обучаемых параметров, сравнимых с полносвязанными слоями. 
 Сверточные типы 
 Я решил включить эту часть для интересующихся. Если вы впервые встретились с CNN, можете пропустить эту часть и вернуться к ней позже. 
 Можно встрерить большое количество слухов вокруг сверточных сетей сегодня. Однако, это не делает яснее, что те слои компьютерного зрения, что опускаются ниже, часто отличаются от тех, что исследуются для ConvNet. Даже в ConvNet есть множество сверточных слоев воодушевлящих зарождение ConvNet архитектуры и окруженных Xception и Mobilenet работами. Я верю, что вы заслуживаете знать, что существует множество видов применения сверточной сети в разных контекстах и я должен вам подсказать путь. 
 Свертка компьютерного зрения 
 
Низкоуровневое компьютерное зрение, например графические редакторы, обычно работают с одним, тремя, четырмя каналами картинки(красный, зеленый, синий, прозрачность). Каждое отдельное ядро обычно применяется на каждый канал. В этом случае получаем столько каналов, сколько каналов на входном изображении. В редких случаях когда одно ядро применяется к каждому каналу(размытие). Иногда, результирующие каналы суммируются производя одноканальную картинку(определение границ). 
 Свертка похожая на LeNet 
 
Чистая свертка CV отличается от этих ConvNet по 2м причинам: CV ядра определяются в ручную, где сила нейронной сети идет от обучения, и в нейронной сети мы строим глубокую сткруктуру составляя множество сверток друг на друга. Отсюда, нам нужно пересобрать информацию идущую от прошлого слоя. Оно позволяет нам обучать высший уровень понимания свойств. На конец свертки в нейронных сетях могут содежрать определения смещения, т.е. константы добавленые в результате работы каждой свертки. 
 A pure CV-style convolution is different from those in ConvNets due to two reasons: (1) in CV kernels are manually defined, whereas the power of neural networks comes from training, and (2) in neural networks we build a deep structure by stacking multiple convolutions on top of each other. Therefore, we need to recombine the information coming from previous layers. That allows us to train higher-level feature detections8. Finally, convolutions in neural networks may contain bias terms, i.e. constants added to results of each convolution. 
 Recombination of features coming from earlier layers was previously illustrated in LeNet-5 example. As you remember, in the second convolutional layer we would apply 3D kernels of size 5×5×6 computing dot products simultaneously on all six feature maps from the first layer. There were sixteen different kernels thus producing sixteen new channels. 
 
 Each pixel in the feature map is obtained as a dot product between the RGB color channels and the sliding kernel. Image credit: Wikimedia.
Convolution filter with three input channels.
Each pixel in the feature map is obtained as a dot product between the RGB color channels and the sliding kernel. Image credit: Wikimedia. 
 To summarize, a single LeNet-like convolution operates simultaneously on all input channels and produces a single channel. By having an arbitrary number of kernels, any number of output channels is obtained. It is not uncommon to operate on volumes of 512 channels! The computation cost of such convolution is DK×DK×M×N×DF×DF where M is the number of input channels, N is the number of output channels, is the kernel size and is the feature map size9. 
 Depthwise Separable Convolution 
 
 LeNet-style convolution requires a large number of operations. But do we really need all of them? For instance, can spatial and cross-channel correlations be somehow decoupled? The Xception paper largely inspired by Inception architecture shows that indeed, one can build more efficient convolutions by assuming that spatial correlations and cross-channel correlations can be mapped independently. This principle was also applied in Mobilenet architectures. 
 The depthwise separable convolution works the following way. First, like in low-level computer vision, individual kernels are applied to each individual channel. Then, after optional activation10, there is another convolution, but this time exclusively in-between channels. That is typically achieved by applying a 1×1 convolution kernel. Finally, there is a (ReLU) activation. 
 This way, depthwise separable convolution has two distinct steps: a space-only convolution and a channel recombination. This reduces the number of operations to 
 To summarize, the main difference between low-level image processing (Photoshop) and neural networks is that image processing operates on lots of pixels, but the image depth remains unchanged (three-four channels). On the other hand, convolutional neural networks tend to operate on images of moderate width and height (e.g. 230×230 pixels), but can achieve depth of thousands of channels, while simultaneously decreasing the number of "pixels" in feature maps towards to the end of processing pipeline. 
 Implementing Convolutional Networks in Haskell 
 Today, we will implement and train a convolutional network inspired by LeNet-5. We will witness that indeed this ConvNet has about twice lower error on the MNIST handwritten digit recognition task, while being four times smaller compared to the previous model! The source code from this post is available on Github. 
 Convolutional Layer Gradients 
 First of all, we need to know how to obtain convolutional layer gradients. Although convolution in neural networks is technically performed by the cross-correlation operator11, this does not matter since kernel parameters are learnable. To understand gradient derivation, let us take a look at the cross-correlation operation in single dimension:
 
 For instance, if we fix 1D kernel W size to be k=3, equation above would simply mean that vector produces an output 
 
Denoting error from the previous layer as δ, convolutional layer gradients w.r.t. input X are 
 
 So it can be seen that it is a cross-correlation operation again, but with flipped kernel. Hence 
 
 Similarly, gradients w.r.t. kernel W are computed as
 
Thus we have yet another cross-correlation 
 
 These formulas will become handy when implementing the backward pass from convolutional layers. 
 Convolutions in Haskell 
 First, let us start with two most relevant imports: modules from array library massiv and automatic differentiation tools from backprop. 
 -- Multidimensional arrays to store learnable
-- parameters and represent image data
import Data.Massiv.Array hiding ( map, zip, zipWith, flatten )
import qualified Data.Massiv.Array as A
-- Automatic heterogeneous back-propagation library
import Numeric.Backprop
 
 Let us brush up our data structures to match our convolutional needs. First, we need to represent a batch of images. We will use the following convention:
batch size × channels × height × width. For instance, if we have a batch of 16 RGB images with dimensions 32 × 32
, then we get a volume of 16 × 3 × 32 × 32. Here we have a Volume4 type: 
 type Volume4 a = Array U Ix4 a
 
 It is just an alias to a four-dimensional unboxed Array coming from the massiv package. In deep learning frameworks those n-dimensional arrays are conventionally called tensors. Note that during transformation in a neural network, the shape of data will change, except the batch dimension that will always remain the same. Similarly, we need a way to represent our convolutional filters. In LeNet, for instance, we can represent the first convolutional layer as a volume of 6×1×5×5. In order to be able to distinguish between parameter and data volumes, we will introduce Conv2d data structure: 
 data Conv2d a = Conv2d { _kernels :: !(Volume4 a) }
 
 Note that for the sake of simplicity we decided not to implement biases. As previously, Linear will represent fully-connected layer parameters: 
 data Linear a = Linear { _weights :: !(Matrix a)
 , _biases :: !(Vector a)
 }
 
 As usual, type parameter a means that we may later decide whether we need a Float, a Double or some other weights encoding. Now, we have everything to describe learnable parameters (weights) in LeNet: 
 data LeNet a =
 LeNet { _conv1 :: !(Conv2d a)
 , _conv2 :: !(Conv2d a)
 , _fc1 :: !(Linear a)
 , _fc2 :: !(Linear a)
 , _fc3 :: !(Linear a)
 }
 
 Now, it becomes obvious that with every new layer type managing backpropagation data structures as we did on Days 2 and 4 quickly becomes tedious. Moreover, last time when we introduced a Batchnorm1d layer, our forward-backward pass function already oversized 100 lines of code. Today, we will make use of the backprop library introduced on Day 3 to better structure our project. Thus, LeNet can be represented as a differentiable function lenet, which is nothing more than a composition of other functions aka neural network layers: 
 lenet :: (Reifies s W)
 => BVar s (LeNet Float)
 -> Volume4 Float -- ^ Batch of images
 -> BVar s (Matrix Float)
lenet l = constVar

 -- Feature extractor
 -- Layer (layer group) #1
 ~> sameConv2d (l ^^. conv1)
 ~> relu
 ~> maxpool
 -- Layer #2
 ~> validConv2d (l ^^. conv2)
 ~> relu
 ~> maxpool

 ~> flatten

 -- Classifier
 -- Layer #3
 ~> linear (l ^^. fc1)
 ~> relu
 -- Layer #4
 ~> linear (l ^^. fc2)
 ~> relu
 -- Layer #5
 ~> linear (l ^^. fc3)
 
 This description pretty much talks for itself. The (~>) operator is a function composition, i.e. f ~> g = \x -> g(f(x)) . This could be written in other forms, though. We prefer the so-called pointfree notation: 
 (~>) :: (a -> b) -> (b -> c) -> a -> c
f ~> g = g. f
 
 That simply reverses the order of composition, i.e. first is applied f, and only then g12. I find this "backward composition" notation consistent with major neural network frameworks. Compare how an equivalent network can be represented in PyTorch: 
 import torch.nn as nn

class LeNet(nn.Module):
 def __init__(self, num_classes=10):
 super(LeNet, self).__init__()
 # Feature extractor: a (sequential) composition of convolutional,
 # activation, and pooling layers
 self.features = nn.Sequential(
 nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2, bias=False),
 nn.ReLU(),
 nn.MaxPool2d(kernel_size=2, stride=2),
 nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0, bias=False),
 nn.ReLU(),
 nn.MaxPool2d(kernel_size=2, stride=2)
 )

 # Classifier: a composition of linear and activation layers
 self.classifier = nn.Sequential(
 nn.Linear(16 * 5 * 5, 120, bias=True),
 nn.ReLU(),
 nn.Linear(120, 84, bias=True),
 nn.ReLU(),
 nn.Linear(84, num_classes, bias=True)
 )

 # Composition of feature extractor and classifier
 def forward(self, x):
 x = self.features(x)
 # Flatten
 x = x.view(x.size(0), -1)
 x = self.classifier(x)
 return x
 
 Despite some noise in the Python code above13, we see that nn.Sequential object acts simply as a function composition with (~>) combinators. Now, let us return back to lenet signature: 
 lenet :: (Reifies s W)
 => BVar s (LeNet Float)
 -> Volume4 Float -- ^ Batch of images
 -> BVar s (Matrix Float)
 
 We have already seen the BVar s ("backprop variable") type meaning that we deal with a differentiable structure. The first argument has type BVar s (LeNet Float). That conventionally means that LeNet is the model optimized by backpropagation and shortly we will show how to achieve that. The second function argument is a batch of images (Volume4), and finally the result is a bunch of vectors (Matrix) containing classification results performed by our model. 
 Functions constVar and (^^.) are special ones. Both are coming from the Numeric.Backprop import. The first function "lifts" a batch to be consumed by the sequence of differentiable functions. The second one (^^.) allows us to conveniently access records from our LeNet data type. For instance, sameConv2d (l ^^. conv1) means that sameConv2d has access to the _conv1 field. 
 Layers known from the previous days (linear, ReLU) will be reused in accordance to backprop conventions. What is left to implement are sameConv2d, validConv2d, and maxpool layers. To be able to do that simply and efficiently we will learn a neat tool called stencils. 
 Fun With Stencils 
 Let us come back to 1D convolution from Equation (2). If we look closer, we notice a pattern in convolution/cross-correlation, i.e. 
Indeed, every time we multiply the same set of weights and the only thing that changes is source data (masked with bullets above). Therefore, to implement convolutions we only need to define some sort of pattern that processes given points. This fixed pattern applied to array elements is called a stencil. The stencil convolution approach can be also used to perform subsampling operations, e.g. max pooling. Moreover, the method allows for efficient data parallelism. 
 Previously, we have introduced Massiv library featuring parallel arrays computation. This library was preferable to hmatrix due to two reasons: multidimensional arrays and parallel computation. Today, we have yet one more reason: stencils. Good news: massiv already implements them for us! Those are based on this work. 
 First, we warm up with a 1D cross-correlation from Equation (2). Suppose, we have a "delay" kernel [1,0,0]. Here is how this can be applied in an interactive shell (aka REPL): 
 > import Data.Massiv.Array
> type Vector a = Array U Ix1 a -- Convenience alias
> k = fromList Par [1, 0, 0] :: Vector Int
> sten = makeCorrelationStencilFromKernel k
> dta = fromList Par [1, 2, 5, 6, 2, -1, 3, -2] :: Vector Int
> mapStencil (Fill 0) sten dta
Array DW Par (Sz1 8)
 [ 0, 1, 2, 5, 6, 2, -1, 3 ]
 
 Note that the first value was replaced with zero since we have applied mapStencil (Fill 0) that uses zero-filling padding strategy to preserve the size of the array. Our array was processed to look internally like this: [0,1,2,5,6,2,−1,3,−2,0], with fake zeros inserted. Alternatively, we could use applyStencil noPadding as below: 
 > applyStencil noPadding sten dta
Array DW Par (Sz1 6)
 [ 1, 2, 5, 6, 2, -1 ]

> dta' = fromList Par [0, 1, 2, 5, 6, 2, -1, 3, -2, 0] :: Vector Int
> applyStencil noPadding sten dta'
Array DW Par (Sz1 8)
 [ 0, 1, 2, 5, 6, 2, -1, 3 ]
 
 Here is a more interesting computer vision example. For comparison, we define an identity (no transformation) stencil and a box stencil (blurring effect). The identity kernel has one in its center (coordinate 0 :. 0) and zeroes everywhere else. The box transformation computes averaged value over nine neighboring pixels including the center pixel itself (see also this for more information), therefore the kernel has weights 1/9 everywhere. 
 import Data.Massiv.Array
import Data.Massiv.Array.IO
import Graphics.ColorSpace

identity :: Elevator a => Stencil Ix2 (Pixel RGB a) (Pixel RGB a)
identity =
 makeStencil sz c $ \get -> get (0 :. 0)
 -- `get` is a function that receives a coordinate relative to the
 -- stencil's center, i.e. (0 :. 0) is the center itself.
 where
 sz = Sz (3 :. 3) -- Kernel size: 3 x 3
 c = 1 :. 1 -- Center coordinate
{-# INLINE identity #-}

box :: (Elevator a, Fractional a) => Stencil Ix2 (Pixel RGB a) (Pixel RGB a)
box =
 makeStencil sz c $ \get ->
 ( get (-1 :. -1) + get (-1 :. 0) + get (-1 :. 1)
 + get (0 :. -1) + get (0 :. 0) + get (0 :. 1)
 + get (1 :. -1) + get (1 :. 0) + get (1 :. 1)) / 9
 where
 sz = Sz (3 :. 3)
 c = 1 :. 1
{-# INLINE box #-}

main :: IO ()
main = do
 frog <- readImageAuto "files/frog.jpg" :: IO (Image S RGB Double)

 -- Identity transformation
 writeImageAuto "files/frog_clone.png" $ computeAs S (mapStencil Edge identity frog)
 -- Box filtering (blur)
 writeImageAuto "files/frog_blurred.png" $ computeAs S (mapStencil Edge box frog)
 
 In main function, we load an image and apply those two stencils. All we need to do is to specify the stencil pattern, i.e. convolution kernel, and massiv takes care to apply it. This is what we get: 
 
 Original frog (left) and blurred (right). Image credit.
While in the last example above we have defined stencils manually with makeStencil, in practice we may prefer to use 
 makeCorrelationStencilFromKernel
 :: (Manifest r ix e, Num e) => Array r ix e -> Stencil ix e e
 
 The function has a single argument, stencil's kernel. 
 Convolutional Layers With Stencils 
 Before actually diving into implementation, we have to understand that convolution decreases the number of "pixels" in the image, similar to the "delayed" 1D kernel above. Moreover, some information next to the image border is lost due to the fact that the sliding window "visits" bordering pixels less than those closer to the center. To avoid the information loss, often the area around the image is "padded" (filled) with zeros. Thus we can perform a "same" convolution, i.e. one that does not change the convolved dimensions. As of version 0.4.3, Massiv stencils already support different padding modes. In case of LeNet 5×5 kernels, we define padding of 2 pixels on every side (top, right, bottom, and left) using top left and bottom right corners: 
 sameConv2d = conv2d (Padding (Sz2 2 2) (Sz2 2 2) (Fill 0.0))
 
 "Valid" convolution is a convolution with no padding: 
 validConv2d = conv2d (Padding (Sz2 0 0) (Sz2 0 0) (Fill 0.0))
 
 In both cases, we will use generic conv2d function with a padding argument. Supporting automatic differentiation, our conv2d combines both forward and backward passes: 
 conv2d :: Reifies s W
 => Padding Ix2 Float
 -> BVar s (Conv2d Float)
 -> BVar s (Volume4 Float)
 -> BVar s (Volume4 Float)
 
 The signature tells us that it is a differential function that takes a padding, Conv2d parameters, and a batch of images Volume4 and produces another batch of images. The most efficient way to define the function is to manually specify both forward and backward passes; although it is also possible to specify only the forward pass composing smaller functions operating on BVars and letting backprop to figure out their gradients. The general pattern in backprop is to provide the forward pass (Conv2d w) x -> ... and the backward pass using a lambda expression containing previously computed gradients dz as an argument. Those forward and backward passes are glued together with liftOp2. op2 (or liftOp1. op1 for layers with no learnable parameters) as below: 
 conv2d p = liftOp2. op2 $ \(Conv2d w) x ->
 (conv2d_ p w x, \dz -> let dw = conv2d'' p x dz
 -- ... Compute padding p1
 dx = conv2d' p1 w dz
 in (Conv2d dw, dx) )
 
 Now, forward pass is a cross-correlation between the input channels and a kernel set from the filter bank. Recall that the number of kernel sets defines the number of output channels. We simultaneously perform cross-correlation operations on all images in the batch, therefore we concatenate the results over the channel dimension 3. 
 conv2d_ pad w x = res
 where
 -- Some code omitted for simplicity

 -- Create a stencil ("sliding window") from the given set
 -- in the filter bank
 sten = makeCorrelationStencilFromKernel. resize' (Sz4 1 cin w1 w2). (w !>)

 -- For each kernel set, create and run stencils on all images in the batch.
 -- Finally, concatenate (append) the results over the channel dimension
 res = foldl' (\prev ch -> let conv = computeAs U $ applyStencil pad4 (sten ch) x
 in computeAs U $ append' 3 prev conv) empty [0..cout - 1]
 
 To better illustrate how this works on a batch of images, here is a visualization what is going on in the first convolutional layer. 
 
 Folding over six filter sets in the first convolutional layer. Currently active convolutional filter set (highlighted in red) contributes to so far processed results (in color) on the right.
Folding over six filter sets in the first convolutional layer. Currently active convolutional filter set (highlighted in red) contributes to so far processed results (in color) on the right. 
 In MNIST we have black and white images, thus a single-channel input. During each step in foldl', a single filter is applied to all images. There are six filters hence resulting in six new channels in each convolved image on the right. A similar operation is performed in the second convolutional layer, this time resulting in 16-channel images (see also Convolution Types. LeNet-like Convolution above for the illustration on a single image). Below is a Python/NumPy equivalent of foldl'/applyStencil combo from above. 
 import numpy as np

# ... Get current batch, stride, and padding parameters
# ... Using those parameters, determine
# output image width, height, and number of channels

# Initialize resulting array
res = np.zeros([len(batch), out_channels, out_im_height, out_im_width])

# Iterate over all four dimensions
for i in range(len(batch)):
 for j in range(out_im_height):
 for k in range(out_im_width):
 for c in range(out_channels):
 # Define the sliding window position
 y1 = j * stride
 y2 = y1 + kernel_height
 x1 = k * stride
 x2 = x1 + kernel_width
 im_slice = batch[i, :, y1:y2, x1:x2]

 # Get current channel weights
 Wcur = W[:,c,:,:]

 # Out pixel = dot product
 res[i, c, j, k] = np.sum(im_slice * Wcur)

return res
 
 Now that we have accomplished convolution in the forward direction, we need to compute gradients for the backward pass. 
 Gradients Are Also Convolutions 
 The image plane gradients are obtained by formulas (3) and (4). However, in practice we also have to take into account that we deal with multiple images that have multiple channels. In fact, I leave it to you to figure out exact formulas. Those can be inferred from Haskell equations below. Yes, any Haskell line is already written in a mathematical notation. That is essentially what Haskell "purity" is about. You might have realized by now that programming in Haskell is all about finding your own patterns. Now, let us compute gradients w.r.t. input 
 conv2d' p w dz = conv2d_ p w' dz
 where
 w' = (compute. rot180. transposeInner) w

 -- Rotate kernels by 180 degrees. This is expressed in terms
 -- of tensor reversal along width and height dimensions.
 rot180 = reverse' 1. reverse' 2
 
 Let us remind that these compute functions evaluate delayed arrays into actual representation in memory. Please do not hesitate to find reverse and transposeInner in massiv documentation. Finally, here we have gradients w.r.t. convolution kernel weights 
 conv2d''
 :: Padding Ix2 Float -> Volume4 Float -> Volume4 Float -> Volume4 Float
-- Iterate over images in a batch
conv2d'' p x dz = res -- computeMap (/fromIntegral bs) res
 where
 -- Batch size
 Sz (bs :> _) = size x

 -- Code omitted

 res = foldl' (\acc im ->
 let cur = _conv2d'' p (compute $ dzd !> im) (compute $ xd !> im)
 in acc + cur) base [1..bs-1]
 
 We simply accumulate gradients from individual images in the batch using left strict fold. Finally, we generalize Equation (4) to get those transformations: 
 _conv2d'' :: Padding Ix2 Float
 -> Volume Float -- ^ Gradients \delta (dz)
 -> Volume Float -- ^ X
 -> Volume4 Float -- ^ dL/dW
_conv2d'' (Padding (Sz szp1) (Sz szp2) pb) dz x = ...
 
 More Stencils: Max Pooling Layers 
 Stencils are handy not only for convolutional layers. Another use case are pooling (subsampling) layers. To be able to perform max pooling, first we have to define the corresponding stencil pattern (or just use maxStencil from the library instead). 
 maxpoolStencil2x2 :: Stencil Ix4 Float Float
maxpoolStencil2x2 = makeStencil (Sz4 1 1 2 2) 0 $ \ get ->
 let max4 x1 x2 x3 x4 = max (max (max x1 x2) x3) x4
 in max4 <$> get (0 :> 0 :> 0 :. 0)
 <*> get (0 :> 0 :> 1 :. 1)
 <*> get (0 :> 0 :> 0 :. 1)
 <*> get (0 :> 0 :> 1 :. 0)
 
 Here max4 is a function that receives four values and computes the maximal one. This function is applied to a patch of four neighboring pixels. Do not forget that we operate in a 4D space, therefore each point in the given patch has four coordinates. Here is how we interpret a coordinate: 
 (0 :> 0 :> 0 :. 0)
 ^ ^ ^ ^
 4 | | 1
 3 2
1 and 2 are coordinates in the image plane,
3 is the channel dimension, and
4 is the batch dimension.
 
 Note the :. and :> operators. The first one :. constructs a coordinates (index) "list" with two elements and :> just adds more coordinates for higher dimensions. To learn more about <$> and <*> functions, see Applicative Functors. 
 We would like to have non-overlapping sliding windows in max pooling, therefore, we want to use 2×2 stride in the image plane. However, we would like to perform the same operation over every channel in every image. Therefore, we use computeWithStride and the stride is 1 :> 1 :> 2 :. 2 (batch dim×channel dim×height×width). Note that we do not want any padding since the goal of pooling is actually to reduce the number of pixels: four times in this case. 
 maxpool_ :: Volume4 Float -> Volume4 Float
maxpool_ = computeWithStride (Stride (1 :> 1 :> 2 :. 2)).
 applyStencil noPadding maxpoolStencil2x2
 
 Finally, we compute gradients using pixels that had maximal values in the forward pass. And we obtain a differentiable maxpool function lifting forward-backward passes with liftOp1. op1: 
 maxpool :: Reifies s W
 => BVar s (Volume4 Float)
 -> BVar s (Volume4 Float)
maxpool = liftOp1. op1 $ \x ->
 let out = maxpool_ x
 s = Stride (1 :> 1 :> 2 :. 2)
 outUp = computeAs U $ zoom s out
 maxima = A.zipWith (\a b -> if a == b then 1 else 0) outUp x
 in (out, \dz -> let dzUp = computeAs U $ zoom s dz
 -- Elementwise product
 in maybe (error $ "Dimensions problem") compute (maxima .*. delay dzUp))
 
 Putting It All Together 
 First, we fetch the MNIST data and randomly generate the initial model: 
 main :: IO ()
main = do
 trainS <- mnistStream 64 "data/train-images-idx3-ubyte" "data/train-labels-idx1-ubyte"
 testS <- mnistStream 1000 "data/t10k-images-idx3-ubyte" "data/t10k-labels-idx1-ubyte"

 net <- randNetwork
 
 As previously, data streams are loaded from binary MNIST files; however, the initial model generation has to take into account the new structure, LeNet. 
 randNetwork :: IO (LeNet Float)
randNetwork = do
 -- Generate a new conv layer weights:
 -- 6 out channels, 1 input channel, kernel size: 5 x 5
 _conv1 <- randConv2d (Sz4 6 1 5 5)
 -- 16 out channels, 6 input channels, kernel size: 5 x 5
 _conv2 <- randConv2d (Sz4 16 6 5 5)
 let [i, h1, h2, o] = [16 * 5 * 5, 120, 84, 10]
 _fc1 <- randLinear (Sz2 i h1)
 _fc2 <- randLinear (Sz2 h1 h2)
 _fc3 <- randLinear (Sz2 h2 o)
 return $
 LeNet { _conv1 = _conv1
 , _conv2 = _conv2
 , _fc1 = _fc1
 , _fc2 = _fc2
 , _fc3 = _fc3
 }
 
 Finally, we train our model 
 -- ... `main` function
 net' <- train TrainSettings { _printEpochs = 1
 , _lr = 0.01
 , _totalEpochs = 30
 } net (trainS, testS)
 
 The train function is essentially a wraper around the sgd stochastic gradient descent function. 
 train TrainSettings { _printEpochs = printEpochs
 , _lr = lr
 , _totalEpochs = totalEpochs
 } net (trainS, testS) = do
 (net', _) <- iterN (totalEpochs `div` printEpochs) (\(net0, j) -> do
 net1 <- sgd lr printEpochs net0 trainS

 -- ... Compute and print accuracies

 ) (net, 1)
 return net'
 
 In sgd we compute gradients over a mini-batch and subtract them to optimize the model parameters. 
 sgd lr n net0 dataStream = iterN n epochStep net0
 where
 -- Fold over the stream of all batches
 epochStep net = S.foldl' _trainStep net dataStream
 -- Update gradients based on a single batch
 _trainStep net (x, targ) = trainStep lr x targ net
 
 Now, the training step is a few lines of code thanks to Numeric.Backprop.gradBP function that applies the chain rule, thus replacing the entire pass function we had before. 
 trainStep
 :: Float -- ^ Learning rate
 -> Volume4 Float -- ^ Images batch
 -> Matrix Float -- ^ Targets
 -> LeNet Float -- ^ Initial network
 -> LeNet Float
trainStep lr !x !targ !n = n - lr * (gradBP (crossEntropyLoss x targ) n)
 
 If we ever want to actually use our model to identify a handwritten digit, we call evalBP: 
 forward :: LeNet Float -> Volume4 Float -> Matrix Float
forward net dta = evalBP (`lenet` dta) net
 
 Now, we compile and run our program: 
 $ ./run.sh
1 Training accuracy 10.4 Validation accuracy 10.3
2 Training accuracy 85.7 Validation accuracy 86.3
3 Training accuracy 95.9 Validation accuracy 96.2
4 Training accuracy 97.6 Validation accuracy 97.4
5 Training accuracy 98.4 Validation accuracy 98.1
6 Training accuracy 98.3 Validation accuracy 98.0
7 Training accuracy 99.0 Validation accuracy 98.8
8 Training accuracy 99.0 Validation accuracy 98.7
9 Training accuracy 99.1 Validation accuracy 98.7
...
26 Training accuracy 99.8 Validation accuracy 99.0
27 Training accuracy 99.8 Validation accuracy 98.7
28 Training accuracy 99.9 Validation accuracy 98.9
29 Training accuracy 99.7 Validation accuracy 98.6
30 Training accuracy 99.9 Validation accuracy 98.9
 
 Thus, after 30 training epochs, we have obtained a ∼1% validation error, which is about twice as low compared to the simple fully-connected architecture. See the complete project on Github. 
 Model Parameters Comparison 
 Our three-layer model from Day 4 consumed (784+1)⋅300+(300+1)⋅50+(50+1)⋅10=251,060 learnable parameters (not counting batchnorm parameters). In contrast, the convolutional neural network model that we have implemented today requires only 1 ⋅5⋅5⋅6+16⋅6⋅5⋅5+(400+1)⋅120+(120+1)⋅84++(84+1)⋅10=61,684 parameters.
Note also that 1⋅5⋅5⋅6+16⋅6⋅5⋅5=2,550 are convolutional layers parameters, whereas the majority (59,134) reside in classifier's fully-connected layers. 
 Although, we did not apply batch normalization today, like was in the original LeNet, feel free to experiment with adding batchnorm layers. Please also note that there is a difference between batch normalization after convolution and after fully-connected layers (see the batchnorm article). 
 Citation 
 @article{penkovsky2019CNN,
 title = "Convolutional Neural Networks Tutorial",
 author = "Penkovsky, Bogdan",
 journal = "penkovsky.com",
 year = "2019",
 month = "November",
 url = "https://penkovsky.com/neural-networks/day5/"
}
 
 Summary 
 Convolutional neural networks aka ConvNets achieve translation invariance and connection sparsity. Thanks to weight sharing, ConvNets dramatically reduce the number of trained parameters. The power of ConvNet comes from training convolution filters, in contrast to manual feature engineering. In future posts, we will gain the power of GPU to face bigger challenges. We will save the planet. And we might even try to read someone's mind with convolutional networks. Stay tuned! 
 Further reading 
 Learned today: 
 
 
 Interactive Image Kernels 
 
 
 Excellent tutorial on ConvNets 
 
 
 The Ancient Secrets of Computer Vision (online course) 
 
 
 Why I prefer functional programming 
 
 
 Efficient Parallel Stencil Convolution in Haskell 
 
 
 Deeper into neural networks: 
 
 
 LeNet-5 paper 
 
 
 Inception paper 
 
 
 Xception paper 
 
 
 Mobilenet paper

Day 6: Saving Energy with Binarized Neural Networks
Last week Apple has acquired XNOR.ai startup for amazing $200 million. The startup is known for promoting binarized neural network algorithms to save the energy and computational resources. That is definitely a way to go for mobile devices, and Apple just acknowledged that it is a great deal for them too. I feel now is a good time to explain what binarized neural networks are so that you can better appreciate their value for the industry. 
 
 Today's post is based on 
 
 Day 1: https://notepad.gasick.ru/books/10-дневный-курс-машинное-обучение/page/день-1-изучение-нейронной-сети-сложным-путем 
 Day 2: https://notepad.gasick.ru/books/10-дневный-курс-машинное-обучение/page/день-2-что-делают-скрытые-уровни 
 Day 4: https://notepad.gasick.ru/books/10-дневный-курс-машинное-обучение/page/den-4-vaznost-paketnoi-normalizacii 
 
 The source code from this post is available on Github. 
 
 The Basics 
 Knowing what a (real-weight) neural network is you already have the most essential information. In contrast, a binarized neural network (BNN) has its weights and activations limited to a single bit precision. Meaning that those have values of either +1 or −1 . Below is a comparison table. 
 Binarized neural network vs conventional neural network with real weights
Binarized neural network vs conventional neural network with real weights 
 Having single-bit values can drastically reduce hardware requirements to operate those networks. As you remember from Day 1, neural networks are built around activations of dot products. When every value is +1 or −1, multiplication can be replaced with
XNOR binary operation, thus eliminating the need in hardware multipliers. The dot product sum is then replaced by a simple bit counting operation (Popcount). And finally the activation is merely a thresholding (Sign) function. 
 Overall, the architecture can speed up conventional hardware or even better, drastically reduce the area of dedicated ASICs. Reduced area means smaller manufacturing cost. Not less important is another implication, reduced energy consumption. That is crucial for edge devices. To further save the energy one can even merge computation and memory giving rise to emerging in-memory computing technology. 
 As we have discussed before, batch normalization is helpful to prevent neurons from saturation and to faster train the network. It turns out that for binarized networks batchnorm is actually indispensable. Indeed, neurons with
Sign
activations tend to be useless ("dead") otherwise. Moreover, for an efficient training, to make the error landscape smooth, it is advisable to also include batch normalization immediately before the softmax layer. Yes, the most important lesson to learn here is is actually how to train a BNN. I attempt to answer the most frequently encountered questions in the following section. 
 Binarized Neural Networks FAQ 
 Here are some typical questions that people tend to ask. 
 Q: Do I need to provide binarized inputs to a BNN? 
 A: No, BNN inputs can remain non-binarized (real). That results in the first layer slightly different from the others: having binary weights (+1 and -1), yet producing real pre-activation values. After activation, those values are converted to binary ones. Here is a hint: 
 
 It is relatively easy to handle continuous-valued inputs as fixed point numbers, with m bits of precision.
Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks... (2016) 
 
 You should also be aware that indeed there exist methods to have binary inputs as well. 
 Q: Do I need more neurons? 
 A: Yes, typically one needs 3 to 4 times more binary neurons to compensate for information loss. 
 Q: Do I need a softmax layer? 
 A: Only for BNN training. For inference (and simpler hardware) it can be removed. 
 Q: Do binary neurons have learnable biases? 
 A: No, biases are redundant as normally you want to use batch normalization layers. Those layers already learn bias-equivalent parameters. 
 Q: Why BNNs are sometimes called "integer networks"? 
 A: Because bit count (before activation) results in an integer value. 
 Q:
Sign activation gradient is zero 
 A: We approximate Sign(x) derivative with Hardtanh(x)1 derivative, i.e. 1|x|≤1. 
 Q: Are gradients binarized in backprop training? 
 A: No, during the training one typically deals with real gradients and thus, real weights. Therefore, it is only in the forward pass where the network applies its binarized weights. 
 Q: So what is the interest then? 
 A: BNNs are interesting to develop dedicated hardware hosting pretrained networks for inference, i.e. performing only the forward pass. Think about handwriting recognition-based automated systems as a use case. Think about energy-harvesting sensors. Solar-powered smart cameras. Drones. Smart wearable devices and implants. Binarized networks facilitate smaller, faster, and more energy-efficient inference devices. Though, there is also an effort towards (on-chip) binarized networks training. 
 Implementing Binarized Neural Networks in Haskell 
 There is no better way to understand a binarized network than to create one ourselves. This easy demo is built on top of Day 4. Here are a few code highlights. 
 First of all, the layers types we will use 
 data Layer a = -- A linear layer (BNN training)
 | BinarizedLinear (Matrix a)
 -- Batch normalization with running mean, variance, and two
 -- learnable affine parameters
 | Batchnorm1d (Vector a) (Vector a) (Vector a) (Vector a)
 | Activation FActivation
 
 Second, we provide the Sign activation 
 sign :: Matrix Float -> Matrix Float
sign = computeMap f
 where
 f x = if x <= 0
 then -1
 else 1
 
 and its gradient approximation: 
 sign' :: Matrix Float
 -> Matrix Float
 -> Matrix Float
sign' x = compute. A.zipWith f x
 where
 f x0 dy0 = if (x0 > (-1)) && (x0 < 1)
 then dy0
 else 0
 
 Finally, we accommodate the forward and backward passes of the network through the BinarizedLinear layer. 
 _pass inp (BinarizedLinear w : layers) = (dX, pred, BinarizedLinearGradients dW : t)
 where
 -- Binarize the weights (!)
 wB = sign w

 -- Forward
 lin = maybe (error "Incompatible shapes") compute (inp |*| wB)

 (dZ, pred, t) = _pass lin layers

 -- Backward
 dW = linearW' inp dZ
 -- Gradient w.r.t. wB (!)
 dX = linearX' wB dZ
 
 In the main function we define the architecture, starting with the number of neurons in each layer. Good news: on the MNIST handwriting recognition benchmark we can achieve accuracy comparable to those of a 32 bit network. However, we may need more neurons (by infl factor) in the hidden layers. 
 let infl = 4

 let [i, h1, h2, o] = [784, infl * 300, infl * 50, 10]
 
 Here is the network specification. Note the BinarizedLinear layers defined above. The first BinarizedLinear has real inputs, however every subsequent one receives binary inputs coming from Sign activation. 
 let net =
 [ BinarizedLinear w1
 , Batchnorm1d (zeros h1) (ones h1) (ones h1) (zeros h1)
 , Activation Sign

 , BinarizedLinear w2
 , Batchnorm1d (zeros h2) (ones h2) (ones h2) (zeros h2)
 , Activation Sign

 , BinarizedLinear w3
 -- NB this batchnorm (!)
 , Batchnorm1d (zeros o) (ones o) (ones o) (zeros o)
 ]
 
 A final technical note is that dividing by the batch variance (and rescaling by a learnable parameter gamma ) in batchnorm actually makes little sense since division (multiplication) by a positive constant does not change the sign. If you would like to get your hands dirty with the code, that is a good place to optimize. Another suggestion is to transfer your newly acquired skills to a convolutional network architecture2. 
 See the complete project on Github. If you have any questions, remarks, or any typos found please send me an email. For suggestions about the code, feel free to open a new issue. 
 Summary 
 Binarized neural networks (BNNs) are valuable for low-power edge devices. Starting with energy-harvesting sensors and finishing with smart wearable medical devices and implants. Binarized networks facilitate smaller, faster, and more energy-efficient hardware. 
 We have illustrated how to train a BNN solving a handwritten digits recognition task. Our binarized network can achieve accuracy comparable to a full-precision 32 bit network provided that we moderately increase the number of binary neurons. To better grasp today's concept, train your own BNNs. Pay attention to small details. Do not forget the batch normalization before the softmax. And good luck! 
 Further Reading
Binarized neural networks: 
 
 [http://arxiv.org/abs/1602.02830](Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1) 
 [https://arxiv.org/abs/1603.05279]XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks 
 [https://penkovsky.com/project/edge-ai/]Emerging technology: Edge AI 
 [https://penkovsky.com/publication/medical-bnn/]In-Memory Resistive RAM Implementation of Binarized Neural Networks for Medical Applications

Day 7: Real World Deep Learning
So far we have explored neural networks almost in the vacuum. Although we have provided some illustrations for better clarity, relying an existing framework would allow us to benefit from the knowledge of previous contributors. One such framework is called Hasktorch. Among the practical reasons to use Hasktorch is relying on a mature Torch Tensor library. Another good reason is strong GPU acceleration, which is necessary for almost any serious deep learning project. Finally, standard interfaces rather than reinventing the wheel will help to reduce the boilerplate. 
 Fun fact: one of Hasktorch contributors is Adam Paszke, the original author of Pytorch. 
 
 Today's post is also based on 
 Day 2: What Do Hidden Layers Do?
Day 4: The Importance Of Batch Normalization
Day 5: Convolutional Neural Networks Tutorial
The source code from this post is available on Github. 
 
 The Basics 
 The easiest way to start with Hasktorch is via Docker: 
 docker run --gpus all -it --rm -p 8888:8888 \
 -v $(pwd):/home/ubuntu/data \
 htorch/hasktorch-jupyter:latest-cu11
 
 Now, you may open localhost:8888 in your browser to access Jupyterlab notebooks. Note that you need to select Haskell kernel when creating a new notebook. 
 If you have never used Torch library before, you may also want to review this tutorial. 
 MNIST Example 
 Let's take the familiar MNIST example and see how it can be implemented in Hasktorch. 
 Imports
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE RecordWildCards #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception.Safe
 ( SomeException (..),
 try,
 )
import Control.Monad ( forM_, when, (<=<) )
import Control.Monad.Cont ( ContT (..) )
import GHC.Generics
import Pipes hiding ( (~>) )
import qualified Pipes.Prelude as P
import Torch
import Torch.Serialize
import Torch.Typed.Vision ( initMnist )
import qualified Torch.Vision as V
import Prelude hiding ( exp )
 
 The most notable import is the Torch module itself. There are also related helpers such Torch.Vision to handle image data. The function initMnist has type 
 initMnist :: String -> IO (MnistData, MnistData)
 
 The function is loading MNIST train and test datasets, similar to loadMNIST from previous posts. 
 It might be also useful to pay attention to Pipes module. It is an alternative to previously used Streamly , which also allows building streaming components. 
 We also import functions from Control.Monad , which are useful for IO operations. 
 Finally, we hide exp function in favor of Torch exp , which operates on tensors (arrays)1 rather than floating point scalars: 
 Torch.exp :: Tensor -> Tensor
 
 Defining Neural Network Architecture 
 First we define a neural network data structure that contains trained parameters (neural network weights). In the simplest case, it can be a multilayer perceptron (MLP). 
 data MLP = MLP
 { fc1 :: Linear,
 fc2 :: Linear,
 fc3 :: Linear
 }
 deriving (Generic, Show, Parameterized)
 
 This MLP contains three linear layers. Next, we may define a data structure that specifies the number of neurons in each layer: 
 data MLPSpec = MLPSpec
 { i :: Int,
 h1 :: Int,
 h2 :: Int,
 o :: Int
 }
 deriving (Show, Eq)
 
 Now, we can define a neural network as a function, similar as we did on Day 5 with a "reversed" composition operator (~>) . 
 (~>) :: (a -> b) -> (b -> c) -> a -> c
f ~> g = g. f

mlp :: MLP -> Tensor -> Tensor
mlp MLP {..} =
 -- Layer 1
 linear fc1
 ~> relu

 -- Layer 2
 ~> linear fc2
 ~> relu

 -- Layer 3
 ~> linear fc3
 ~> logSoftmax (Dim 1)
 
 We finish by a (log) softmax layer reducing the tensor's dimension 1 (Dim 1). Derivatives of linear, relu, and logSoftmax are already handled by Torch library. 
 Initial Weights 
 How do we generate initial random weights? As you may remember from Day 5, we could create a function such as this one: 
 randNetwork = do
 let [i, h1, h2, o] = [784, 64, 32, 10]
 fc1 <- randLinear (Sz2 i h1)
 fc2 <- randLinear (Sz2 h1 h2)
 fc3 <- randLinear (Sz2 h2 o)
 return $
 MLP { fc1 = fc1
 , fc2 = fc2
 , fc3 = fc3
 }
 
 In our example we do almost the same, except we benefit from applicative functors and Randomizable . 
 instance Randomizable MLPSpec MLP where
 sample MLPSpec {..} =
 MLP
 <$> sample (LinearSpec i h1)
 <*> sample (LinearSpec h1 h2)
 <*> sample (LinearSpec h2 o)
 
 We say above that MLP is an instance of the Randomizable typeclass, parametrized by MLPSpec . All we needed to define this instance was to implement a sample function. To generate initial MLP weights, later we can simply write 
 let spec = MLPSpec 784 64 32 10
net <- sample spec
 
 Train Loop 
 The core of the neural network training is trainLoop , which enables a single training "epoch". Let us first inspect its type signature. 
 trainLoop :: Optimizer o => MLP -> o -> ListT IO (Tensor, Tensor) -> IO MLP
 
 This signifies that the function accepts an initial neural network configuration, an optimizer, and a dataset. The optimizer can be a gradient descent (GD), Adam, or other optimizer. The result of the function is a new MLP configuration, as a result of IO call. IO is necessary for instance if we want to print the loss after each iteration. Now, let's take a look at the implementation: 
 trainLoop model optimizer = P.foldM step begin done. enumerateData
First, we enumerate the dataset with enumerateData . Then, we iterate over (fold) the batches. The step function is an analogy to a step in the gradient descent algorithm: 
 where
 step :: MLP -> ((Tensor, Tensor), Int) -> IO MLP
 step model ((input, label), iter) = do
 let loss = nllLoss' label $ mlp model input
 -- Print loss every 50 batches
 when (iter `mod` 50 == 0) $ do
 putStrLn $ "Iteration: " ++ show iter ++ " | Loss: " ++ show loss
 (newParam, _) <- runStep model optimizer loss 1e-3
 return newParam
 
 We calculate a negative log likelihood loss nllLoss' between the ground truth label and the output of our MLP. Note that model is the parameter, i.e. weights of the MLP network. Then, we take advantage of the iteration number iter to print the loss every 50 iterations. Finally, we perform a gradient descent step using our optimizer via runStep :: ... => model -> optimizer -> Loss -> LearningRate -> IO (model, optimizer) and keep only new model newParam . The learning rate here is 1e-3 , but can be eventually changed. 
 The done function is (trivial in this case) finalization of foldM iterations over the MLP model and begin are the initial weights (we use pure to satisfy the type m x requirement). 
 done = pure
 begin = pure model
 
 Putting It All Together 
 The remaining part is simple. We load the data into batches, specify the number of neurons in our MLP, choose an optimizer, and initialize the random weights. 
 main = do
 (trainData, testData) <- initMnist "data"
 let trainMnist = V.MNIST {batchSize = 256, mnistData = trainData}
 testMnist = V.MNIST {batchSize = 1, mnistData = testData}
 spec = MLPSpec 784 64 32 10
 optimizer = GD
 net <- sample spec
 
 Then, we train the network for 5 epochs: 
 net' <- foldLoop net 5 $ \model _ ->
 runContT (streamFromMap (datasetOpts 2) trainMnist) $ trainLoop model optimizer. fst
 
 Finally, we may examine the model on test images 
 forM_ [0 .. 10] $ displayImages net' <=< getItem testMnist
 
 For this purpose may use a function such as 
 displayImages :: MLP -> (Tensor, Tensor) -> IO ()
displayImages model (testImg, testLabel) = do
 V.dispImage testImg
 putStrLn $ "Model : " ++ (show. argmax (Dim 1) RemoveDim. exp $ mlp model testImg)
 putStrLn $ "Ground Truth : " ++ show testLabel
 
 Running 
 Iteration: 0 | Loss: Tensor Float [] 12.3775 
Iteration: 50 | Loss: Tensor Float [] 1.0952 
Iteration: 100 | Loss: Tensor Float [] 0.5626 
Iteration: 150 | Loss: Tensor Float [] 0.6660 
Iteration: 200 | Loss: Tensor Float [] 0.4771 
Iteration: 0 | Loss: Tensor Float [] 0.5012 
Iteration: 50 | Loss: Tensor Float [] 0.4058 
Iteration: 100 | Loss: Tensor Float [] 0.3095 
Iteration: 150 | Loss: Tensor Float [] 0.4237 
Iteration: 200 | Loss: Tensor Float [] 0.3433 
Iteration: 0 | Loss: Tensor Float [] 0.3671 
Iteration: 50 | Loss: Tensor Float [] 0.3206 
Iteration: 100 | Loss: Tensor Float [] 0.2467 
Iteration: 150 | Loss: Tensor Float [] 0.3420 
Iteration: 200 | Loss: Tensor Float [] 0.2737 
Iteration: 0 | Loss: Tensor Float [] 0.3054 
Iteration: 50 | Loss: Tensor Float [] 0.2779 
Iteration: 100 | Loss: Tensor Float [] 0.2161 
Iteration: 150 | Loss: Tensor Float [] 0.2933 
Iteration: 200 | Loss: Tensor Float [] 0.2289 
Iteration: 0 | Loss: Tensor Float [] 0.2693 
Iteration: 50 | Loss: Tensor Float [] 0.2530 
Iteration: 100 | Loss: Tensor Float [] 0.1979 
Iteration: 150 | Loss: Tensor Float [] 0.2616 
Iteration: 200 | Loss: Tensor Float [] 0.1986 
 
 
 
 
 #%%***** 
 ::: % 
 %: 
 :% 
 #: 
 :% 
 %. 
 #= 
 :%. 
 =# 
Model : Tensor Int64 [1] [ 7]
Ground Truth : Tensor Int64 [1] [ 7]
 
 
 %%%# 
 %# % 
 . #% 
 :%: 
 %+ 
 *% 
 %= 
 %% 
 %%%%++%%%=
 ==%%=. 
 
 
Model : Tensor Int64 [1] [ 2]
Ground Truth : Tensor Int64 [1] [ 2]
 
 
 .- 
 = 
 % 
 .# 
 =: 
 @ 
 # 
 ++ 
 %: 
 % 
 
 
Model : Tensor Int64 [1] [ 1]
Ground Truth : Tensor Int64 [1] [ 1]
 
 
 %. 
 *%- 
 %%%%# 
 :%%+:%- 
 %% -%. 
 % .@+ 
 % %%. 
 % #%* 
 %%%%%% 
 :%%%- 
 
 
Model : Tensor Int64 [1] [ 0]
Ground Truth : Tensor Int64 [1] [ 0]
 
 
 
 = + 
 % % 
 +. % 
 % %: 
 + % 
 %--=*% 
 :: +% 
 =% 
 =% 
 * 
 
Model : Tensor Int64 [1] [ 4]
Ground Truth : Tensor Int64 [1] [ 4]
 
 
 
 %@ 
 @: 
 =@ 
 @% 
 @ 
 :@ 
 %# 
 @ 
 @ 
 + 
 
Model : Tensor Int64 [1] [ 1]
Ground Truth : Tensor Int64 [1] [ 1]
 
 
 
 % % 
 % % 
 +# -+ 
 +%*::*% 
 :%==%+ 
 % 
 ++ 
 % 
 %-+ 
 * 
 
Model : Tensor Int64 [1] [ 4]
Ground Truth : Tensor Int64 [1] [ 4]
 
 
 
 + 
 %%+ 
 .%*%% 
 -: *% 
 -#-%%. 
 %% =# 
 % 
 .% 
 #. 
 % 
 
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
 
 
 ..=. 
 .%%%%%% 
 ::%+: 
 % 
 % 
 %= 
 %%%%%%+ 
 :%%%% 
 %%%% 
 %# 
 
 
Model : Tensor Int64 [1] [ 6]
Ground Truth : Tensor Int64 [1] [ 5]
 
 
 
 
 +%%%# 
 +%* .%% 
 :%. .#%+ 
 %@%%%%* 
 +%- 
 -%# 
 %% 
 %% 
 %= 
 @ 
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
 
 
 ==: 
 %%**%% 
 .% %: 
 *- +# 
 % :# 
 # :# 
 -# +# 
 -# .% 
 # +%: 
 #%%%%= 
 
 
Model : Tensor Int64 [1] [ 0]
Ground Truth : Tensor Int64 [1] [ 0]
 
 See the complete project on Github. For suggestions about the content feel free to open a new issue. 
 Summary 
 Today we have learned the basics of Hasktorch library. The most important is that the principles from our previous days still apply. Therefore, the transition to the new library was quite straightforward. With a few minor changes, this example could be run on a graphics processing unit accelerator. 
 Further Reading
Hasktorch: 
 Hasktorch tutorial https://hasktorch.github.io/tutorial/02-tensors.html
Hasktorch examples https://github.com/hasktorch/hasktorch/tree/master/examples
Hasktorch documentation http://hasktorch.org/docs.html
Applicative functors http://learnyouahaskell.com/functors-applicative-functors-and-monoids

Day 8: Model Uncertainty Estimation
Wouldn't it be nice if the model also told us which predictions are not reliable? Can this be done even on unseen data? The good news is yes, and even on new, completely unseen data. It is also simple to implement in practice. A canonical example is in a medical setting. By measuring model uncertainty, the doctor can learn how reliable is their AI-assisted patient's diagnosis. This allows the doctor to make a better informed decision whether to trust the model or not. And potentially save someone's life. 
 Today we build upon Day 7 and we continue our journey with Hasktorch: 
 
 We will introduce a Dropout layer. 
 We will compute on a graphics processing unit (GPU). 
 We will also show how to load and save models. 
 We will train with Adam optimizer. 
 And finally we will talk about model uncertainty estimation. 
 
 The complete project is also available on Github. 
 Dropout Layer 
 Neural networks, as any other model with many parameters, tend to overfit. By overfitting I mean "fail to fit to additional data or predict future observations reliably". Let us consider a classical example below. 
 Overfitting.
Overfitting. Credit Ignacio Icke, CC BY-SA 4.0 
 The green line is a decision boundary created by an overfitted model. We see that the model tries to memorize every possible data point. However, it fails to generalize. To ameliorate the situation, we perform a so-called regularization. That is a technique that helps to prevent overfitting. In the image above, the black line is a decision boundary of a regularized model. 
 One of regularization techniques for artificial neural networks is called dropout or dilution. Its principle of operation is quite simple. During neural network training, we randomly disconnect a fraction of neurons with some probability. It turns out that dropout conditioning results in more reliable neural network models. 
 A Neural Network with Dropout 
 The data structures MLP (learnable parameters) and MLPSpec (number of neurons) remain unchanged. However, we will need to modify the mlp function (full network) to include a Dropout layer. If we inspect dropout :: Double -> Bool -> Tensor -> IO Tensor type, we see that it accepts three arguments: a Double probability of dropout, a Bool that turns this layer on or off, and a data Tensor . Typically, we turn the dropout on during the training and off during the inference stage. 
 However, the biggest distinction between e.g. relu function and dropout is that relu :: Tensor -> Tensor is a pure function, i.e. it does not have any 'side-effects'. This means that every time when we call a pure function, the result will be the same. This is not the case with dropout that relies on an (external) random number generator, and therefore returns a new result each time. Therefore, its outcome is an IO Tensor . 
 One has to pay a particular attention to those IO functions, because they can change the state in the external world. This can be printing text on the screen, deleting a file, or launching missiles. Typically, we prefer to keep functions pure whenever possible, as function purity improves the reasoning about the program: It is a child's play to refactor (reorganize) a program consisting only of pure functions. 
 I find the so-called do-notation to be the most natural way to combine both pure functions and those with side-effects. The pure equations can be grouped under let keyword(s), while the side-effects are summoned with a special <- glue. This is how we integrate dropout in mlp . Note that now the outcome of mlp also becomes an IO Tensor . 
 mlp :: MLP -> Bool -> Tensor -> IO Tensor
mlp MLP {..} isStochastic x0 = do
 -- This subnetwork encapsulates the composition
 -- of pure functions
 let sub1 =
 linear fc1
 ~> relu

 ~> linear fc2
 ~> relu

 -- The dropout is applied to the output
 -- of the subnetwork
 x1 <- dropout
 0.1 -- Dropout probability
 isStochastic -- Activate Dropout when in stochastic mode
 (sub1 x0) -- Apply dropout to
 -- the output of `relu` in layer 2

 -- Another linear layer
 let x2 = linear fc3 x1

 -- Finally, logSoftmax, which is numerically more stable
 -- compared to simple log(softmax(x2))
 return $ logSoftmax (Dim 1) x2
 
 For model uncertainty estimation, it is empirically recommended to keep the dropout probability anywhere between 0.1 and 0.2. 
 Computing on a GPU 
 To transfer data onto a GPU, we use toDevice :: ... => Device -> a -> a . Below are helper methods to traverse data structures containing tensors (e.g. MLP) to convert those between devices. 
 toLocalModel :: forall a. HasTypes a Tensor => Device -> DType -> a -> a
toLocalModel device' dtype' = over (types @Tensor @a) (toDevice device')

fromLocalModel :: forall a. HasTypes a Tensor => a -> a
fromLocalModel = over (types @Tensor @a) (toDevice (Device CPU 0))
 
 Below is a shortcut to transfer data to cuda:0 device, assuming the Float type. 
 toLocalModel' = toLocalModel (Device CUDA 0) Float 
 
 The train loop is almost the same as in the previous post, except a few changes. First, we convert training data to GPU with toLocalModel' (assuming that the model itself was already converted to GPU). Second, predic <- mlp model isTrain input is an IO action. Third, we manage optimizer's internal state1. 
 trainLoop
 :: Optimizer o
 => (MLP, o) -> LearningRate -> ListT IO (Tensor, Tensor) -> IO (MLP, o)
trainLoop (model0, opt0) lr = P.foldM step begin done. enumerateData
 where
 isTrain = True
 step :: Optimizer o => (MLP, o) -> ((Tensor, Tensor), Int) -> IO (MLP, o)
 step (model, opt) args = do
 let ((input, label), iter) = toLocalModel' args
 predic <- mlp model isTrain input
 let loss = nllLoss' label predic
 -- Print loss every 100 batches
 when (iter `mod` 100 == 0) $ do
 putStrLn
 $ printf "Batch: %d | Loss: %.2f" iter (asValue loss :: Float)
 runStep model opt loss lr
 done = pure
 begin = pure (model0, opt0)
 
 We also modify the train function to use Adam optimizer with mkAdam : 
 
 0 is the initial iteration number (then internally increased by the optimizer). 
 We provide beta1 and beta2 values. 
 flattenParameters net0 are needed to get the shapes of the trained parameters momenta. See also Day 2 for more details. 
 
 train :: V.MNIST IO -> Int -> MLP -> IO MLP
train trainMnist epochs net0 = do
 (net', _) <- foldLoop (net0, optimizer) epochs $ \(net', optState) _ ->
 runContT (streamFromMap dsetOpt trainMnist)
 $ trainLoop (net', optState) lr. fst
 return net'
 where
 dsetOpt = datasetOpts workers
 workers = 2
 lr = 1e-4 -- Learning rate
 optimizer = mkAdam 0 beta1 beta2 (flattenParameters net0)
 beta1 = 0.9
 beta2 = 0.999
 
 Here is a function to get model accuracy: 
 accuracy :: MLP -> ListT IO (Tensor, Tensor) -> IO Float
accuracy net = P.foldM step begin done. enumerateData
 where
 step :: (Int, Int) -> ((Tensor, Tensor), Int) -> IO (Int, Int)
 step (ac, total) args = do
 let ((input, labels), _) = toLocalModel' args
 -- Compute predictions
 predic <- let stochastic = False
 in argmax (Dim 1) RemoveDim 
 <$> mlp net stochastic input

 let correct = asValue
 -- Sum those elements
 $ sumDim (Dim 0) RemoveDim Int64
 -- Find correct predictions
 $ predic `eq` labels

 let batchSize = head $ shape predic
 return (ac + correct, total + batchSize)

 -- When done folding, compute the accuracy
 done (ac, total) = pure $ fromIntegral ac / fromIntegral total

 -- Initial errors and totals
 begin = pure (0, 0)

testAccuracy :: V.MNIST IO -> MLP -> IO Float
testAccuracy testStream net = do
 runContT (streamFromMap (datasetOpts 2) testStream) $ accuracy net. fst
 
 Below we provide the MLP specification: number of neurons in each layer. 
 spec = MLPSpec 784 300 50 10
 
 Saving and Loading the Model 
 Before we can save the model, we have to make the weight tensors dependent first: 
 save' :: MLP -> FilePath -> IO ()
save' net = save (map toDependent. flattenParameters $ net)
 
 The inverse is true for model loading. We also replace parameters in a newly generated model with the one we have just loaded: 
 load' :: FilePath -> IO MLP
load' fpath = do
 params <- mapM makeIndependent <=< load $ fpath
 net0 <- sample spec
 return $ replaceParameters net0 params
 
 Load the MNIST data: 
 (trainData, testData) <- initMnist "data"
 
 Train a new model: 
 -- A train "loader"
trainMnistStream = V.MNIST { batchSize = 256, mnistData = trainData }
net0 <- toLocalModel' <$> sample spec

epochs = 5
net' <- train trainMnistStream epochs net0
 
 Saving the model: 
 save' net' "weights.bin"
 
 To load a pretrained model: 
 net <- load' "weights.bin"
 
 We can verify the model's accuracy: 
 -- A test "loader"
testMnistStream = V.MNIST { batchSize = 1000, mnistData = testData }

ac <- testAccuracy testMnistStream net
putStrLn $ "Accuracy " ++ show ac
 
 Accuracy 0.9245
 
 The accuracy is not tremendous, but it can be improved by introducing batch norm, convolutional layers, and training longer. We are about to discuss model uncertainty estimation and this accuracy is good enough. 
 Predictive Entropy 
 Model uncertainties are obtained as:
 
 where y is label, x – input image, c – class, p – probability.
We call H predictive entropy. And it is the very dropout technique that helps us to estimate those uncertainties. All we need to do is to collect several predictions in the stochastic mode (i.e. dropout enabled) and apply the formula from above. 
 predictiveEntropy :: Tensor -> Float
predictiveEntropy predictions =
 let epsilon = 1e-45
 a = meanDim (Dim 0) RemoveDim Float predictions
 b = Torch.log $ a + epsilon
 in asValue $ negate $ sumAll $ a * b
 
 Visualizing Softmax Predictions 
 To get a better feeling what model outputs look like, it would be nice to visualize the softmax output as a histogram or a bar chart. For instance 
 bar ["apples", "oranges", "kiwis"] [50, 100, 25]
 
 apples ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 50.00
oranges ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 100.00
kiwis ▉▉▉▉▉▉▉▉▉▉▉▉▋ 25.00
 
 Now, we would like to display an image, the predictive entropy, and the softmax output, followed by prediction and ground truth. To transform logSoftmax into softmax, we use the following identity:
 
that is softmax = exp. logSoftmax. 
 displayImage :: MLP -> (Tensor, Tensor) -> IO ()
displayImage model (testImg, testLabel) = do
 let repeatN = 20
 stochastic = True
 preds <- forM [1..repeatN] $ \_ -> exp -- logSoftmax -> softmax
 <$> mlp model stochastic testImg
 pred0 <- mlp model (not stochastic) testImg
 let entropy = predictiveEntropy $ Torch.cat (Dim 0) preds

 -- Select only the images with high entropy
 when (entropy > 0.9) $ do
 V.dispImage testImg
 putStr "Entropy "
 print entropy
 -- exp. logSoftmax = softmax
 bar (map show [0..9]) (asValue $ flattenAll $ exp pred0 :: [Float])
 putStrLn $ "Model : " ++ (show. argmax (Dim 1) RemoveDim. exp $ pred0)
 putStrLn $ "Ground Truth : " ++ show testLabel
 
 Note that below we show only some of those images the model is uncertain about (entropy > 0.9) 
 testMnistStream = V.MNIST {batchSize = 1, mnistData = testData}
forM_ [0 .. 200] $ displayImage (fromLocalModel net) <=< getItem testMnistStream
 
 +% 
 % 
 * 
 #- +%%= 
 % %% % 
 % %+ # 
 % % * 
 % % :% 
 #*:=%# 
 -%=. 
 
 
Entropy 1.044228
0 ▉▏ 0.01
1 ▏ 0.00
2 ▋ 0.01
3 ▏ 0.00
4 ▉ 0.01
5 ▍ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.70
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.21
9 ▉▉▉▋ 0.05
Model : Tensor Int64 [1] [ 6]
Ground Truth : Tensor Int64 [1] [ 6]
 
 
 .#%#. 
 %%+: 
 % 
 %.. 
 ##-#%. 
 -% 
 :% 
 + 
 - .% 
 @%+*%%+ 
 
 
Entropy 1.2909155
0 ▏ 0.00
1 ▏ 0.00
2 ▍ 0.00
3 ▉▉▉▉▉▉▉▉ 0.07
4 ▏ 0.00
5 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.44
6 ▏ 0.00
7 ▍ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.47
9 ▉▏ 0.01
Model : Tensor Int64 [1] [ 8]
Ground Truth : Tensor Int64 [1] [ 5]
 
 
 
 =- = 
 #- =# 
 %- # 
 +% % 
 %. .% 
 ## .* 
 %%%%%#%#. 
 . % 
 
 
 
Entropy 1.3325933
0 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.19
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.46
5 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▊ 0.18
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▊ 0.16
7 ▏ 0.00
8 ▏ 0.00
9 ▏ 0.00
Model : Tensor Int64 [1] [ 4]
Ground Truth : Tensor Int64 [1] [ 4]
 
 
 *: 
 :%%* 
 #- -+ 
 - 
 # 
 +: 
 # =. 
 #. =%: 
 *.*%- 
 #%%: 
 
 
Entropy 1.2533671
0 ▉ 0.01
1 ▉▉▍ 0.03
2 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.38
3 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.54
4 ▏ 0.00
5 ▋ 0.01
6 ▏ 0.00
7 ▏ 0.00
8 ▉▉▋ 0.03
9 ▏ 0.00
Model : Tensor Int64 [1] [ 3]
Ground Truth : Tensor Int64 [1] [ 2]
 
 
 
 +##- 
 * : 
 = 
 % = 
 % % 
 -= @ 
 = % 
 % 
 % 
 % 
 
Entropy 0.9308149
0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▉ 0.01
4 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.29
5 ▍ 0.00
6 ▏ 0.00
7 ▎ 0.00
8 ▉▎ 0.02
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.67
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
 
 
 
 # 
 % # 
 % * 
 % = 
 %%@% 
 * % 
 % 
 % 
 % 
 = 
 
Entropy 1.39582
0 ▏ 0.00
1 ▉▍ 0.01
2 ▏ 0.00
3 ▉▉▉▉▉▊ 0.06
4 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.48
5 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▋ 0.17
6 ▉▉▉▉ 0.04
7 ▏ 0.00
8 ▉▋ 0.02
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.22
Model : Tensor Int64 [1] [ 4]
Ground Truth : Tensor Int64 [1] [ 4]
 
 
 
 .#%@ 
 %%%%= 
 +%. %# 
 %%%%: 
 %%% 
 -%% 
 -%% 
 .%% 
 %%- 
 %* 
 
Entropy 1.0009595
0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▉▊ 0.02
4 ▏ 0.00
5 ▎ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.35
8 ▉ 0.01
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.62
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
 
 
 
 
 %##% 
 :%+%%. 
 -% %: 
 -% %+ 
 + %+ 
 %+ 
 %+ 
 %# 
 %% 
 .+ 
Entropy 1.0057298
0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▉▉▍ 0.03
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.33
8 ▏ 0.00
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.63
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 7]
 
 
 
 %%%%% 
 .% 
 %. 
 =%%%+ 
 % %# - 
 %%. 
 *%- 
 %:% 
 %-%= 
 %%- 
 
Entropy 1.0500848
0 ▉▉▉▉▍ 0.07
1 ▎ 0.00
2 ▎ 0.00
3 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.79
4 ▉▉▊ 0.04
5 ▉▉▉▎ 0.05
6 ▏ 0.00
7 ▍ 0.01
8 ▎ 0.00
9 ▉▊ 0.03
Model : Tensor Int64 [1] [ 3]
Ground Truth : Tensor Int64 [1] [ 3]
 
 
 
 :* 
 % 
 %% 
 :% 
 %* 
 +* 
 % 
 % 
 % 
 = 
 
Entropy 1.590256
0 ▏ 0.00
1 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.36
2 ▏ 0.00
3 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.10
4 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.32
5 ▉▉▉▎ 0.02
6 ▏ 0.00
7 ▎ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.12
9 ▉▉▉▉▉▉▉▉▉▉▍ 0.07
Model : Tensor Int64 [1] [ 1]
Ground Truth : Tensor Int64 [1] [ 1]
 
 
 
 = = 
 %%%%%. 
 :%% 
 %* 
 .%%%%%%%%+
 %%%*: 
 %% 
 %% 
 %% 
 %% 
 
Entropy 0.9592192
0 ▏ 0.00
1 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▊ 0.28
2 ▋ 0.01
3 ▍ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▍ 0.01
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.67
8 ▏ 0.00
9 ▉▉▏ 0.03
Model : Tensor Int64 [1] [ 7]
Ground Truth : Tensor Int64 [1] [ 7]
 
 
 
 =%#* 
 :%%- .# 
 %% :% 
 .% #= 
 % 
 %%# 
 -%%%% 
 %%%.% 
 #% *+ 
 : 
 
Entropy 1.0005924
0 ▍ 0.00
1 ▏ 0.00
2 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.48
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.47
8 ▉▉▉▋ 0.03
9 ▉▎ 0.01
Model : Tensor Int64 [1] [ 2]
Ground Truth : Tensor Int64 [1] [ 2]
 
 
 - 
 :%%%- 
 :% % 
 +: :%- 
 -% *% 
 *: %* 
 == *% 
 * :% 
 #::..:*%% 
 :%*%%-: 
 
 
Entropy 1.3647958
0 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.50
1 ▏ 0.00
2 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.23
3 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.23
4 ▏ 0.00
5 ▉▉▉▏ 0.03
6 ▏ 0.00
7 ▏ 0.00
8 ▏ 0.00
9 ▉▍ 0.01
Model : Tensor Int64 [1] [ 0]
Ground Truth : Tensor Int64 [1] [ 0]
 
 
 
 %- 
 :% 
 # 
 -%#%* 
 :: @%. 
 * % #. 
 %% % 
 % 
 % 
 
 
Entropy 1.1518966
0 ▉▉▉▎ 0.06
1 ▍ 0.01
2 ▊ 0.01
3 ▏ 0.00
4 ▉▉▊ 0.05
5 ▏ 0.00
6 ▏ 0.00
7 ▏ 0.00
8 ▍ 0.01
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.86
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 2]
 
 
 
 =%%%%+ 
 .#. =#% 
 %* %# 
 #. .% 
 .# *%: 
 .%%%- = 
 # 
 # 
 -%% =% 
 =%%# 
 
Entropy 1.1256037
0 ▉▊ 0.02
1 ▏ 0.00
2 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▏ 0.29
3 ▎ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.59
9 ▉▉▉▉▉▉▉▉▎ 0.10
Model : Tensor Int64 [1] [ 8]
Ground Truth : Tensor Int64 [1] [ 9]
 
 
 --%: 
 . % 
 %: 
 ** .% 
 *%%. 
 %%*% 
 %* % 
 % % 
 % %: 
 %%%: 
 
 
Entropy 1.0862491
0 ▏ 0.00
1 ▉▉▋ 0.03
2 ▉▉▉▉▉ 0.05
3 ▏ 0.00
4 ▏ 0.00
5 ▋ 0.01
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.42
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.50
9 ▏ 0.00
Model : Tensor Int64 [1] [ 8]
Ground Truth : Tensor Int64 [1] [ 8]
 
 
 
 %% 
 %% 
 *%# 
 :%%- 
 .%% 
 %%+ 
 +%% 
 *%+ 
 =%= 
 =: 
 
Entropy 1.0085171
0 ▏ 0.00
1 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.81
2 ▎ 0.00
3 ▍ 0.01
4 ▎ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▏ 0.16
8 ▎ 0.01
9 ▍ 0.01
Model : Tensor Int64 [1] [ 1]
Ground Truth : Tensor Int64 [1] [ 1]
 
 
 
 -@@: 
 -# +: 
 #- % 
 %: ..- 
 +%=*% 
 .%% 
 %* 
 %% 
 %% 
 %. 
 
Entropy 1.5438546
0 ▏ 0.00
1 ▏ 0.00
2 ▉▉▉▉ 0.03
3 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.14
4 ▉▉▉▉▉▊ 0.05
5 ▊ 0.01
6 ▏ 0.00
7 ▉▉▉▉▊ 0.04
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▎ 0.31
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.42
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 9]
 
 Reflecting on softmax outputs above we can state that 
 
 Softmax output alone is not enough to estimate the model uncertainty. We can observe wrong predictions even when the margin between the top and second-best guess is large. 
 Sometimes prediction and ground truth coincide. So why the entropy is high? We actually need to inspect such cases in more details.
The first point is well illustrated by this example: 
 
 
 
 
 %- 
 :% 
 # 
 -%#%* 
 :: @%. 
 * % #. 
 %% % 
 % 
 % 
 
 
Entropy 1.1518966
0 ▉▉▉▎ 0.06
1 ▍ 0.01
2 ▊ 0.01
3 ▏ 0.00
4 ▉▉▊ 0.05
5 ▏ 0.00
6 ▏ 0.00
7 ▏ 0.00
8 ▍ 0.01
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.86
Model : Tensor Int64 [1] [ 9]
Ground Truth : Tensor Int64 [1] [ 2]
 
 To illustrate the last point, let us take a closer look at cases with high entropy. By running several realizations of the stochatic model, we can verify if the model has any "doubt" by selecting different answers. 
 displayImage' :: MLP -> (Tensor, Tensor) -> IO ()
displayImage' model (testImg, testLabel) = do
 let repeatN = 10
 pred' <- forM [1..repeatN] $ \_ -> exp -- logSoftmax -> softMax
 <$> mlp model True testImg
 pred0 <- mlp model False testImg
 let entropy = predictiveEntropy $ Torch.cat (Dim 0) pred'

 V.dispImage testImg
 putStr "Entropy "
 print entropy
 forM_ pred' ( \pred ->
 putStrLn ""
 >> bar (map show [0..9]) (asValue $ flattenAll pred :: [Float]) )
 putStrLn $ "Model : " ++ (show. argmax (Dim 1) RemoveDim. exp $ pred0)
 putStrLn $ "Ground Truth : " ++ show testLabel
 
 The first example from above (dataset index 11) gives this: 
 (displayImage' (fromLocalModel net) <=< getItem testMnistStream) 11
 
 
 +% 
 % 
 * 
 #- +%%= 
 % %% % 
 % %+ # 
 % % * 
 % % :% 
 #*:=%# 
 -%=. 
 
 
Entropy 1.1085687

0 ▎ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.90
7 ▏ 0.00
8 ▉▉▉▉▉▍ 0.10
9 ▏ 0.00

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▋ 0.01
1 ▏ 0.00
2 ▏ 0.00
3 ▎ 0.01
4 ▉▉▉▏ 0.05
5 ▏ 0.00
6 ▉▉▎ 0.04
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.86
9 ▉▎ 0.02

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▉▉▉▉▍ 0.04
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▉▉▉▉▉▉▉▉▉▉▏ 0.09
5 ▉▏ 0.01
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▋ 0.30
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.12
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.43

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▋ 0.01
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▎ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.74
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.20
9 ▉▉▋ 0.04

0 ▉▏ 0.02
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▋ 0.01
5 ▏ 0.00
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.80
7 ▏ 0.00
8 ▉▉▉▉▉▉▋ 0.10
9 ▉▉▉▉▎ 0.07

0 ▉▉▉▉▍ 0.04
1 ▏ 0.00
2 ▎ 0.00
3 ▏ 0.00
4 ▉▉▉▉▉▉▉▉▉▉▏ 0.09
5 ▉▏ 0.01
6 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▋ 0.30
7 ▏ 0.00
8 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▍ 0.12
9 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0.43
Model : Tensor Int64 [1] [ 6]
Ground Truth : Tensor Int64 [1] [ 6]
 
 Wow! The model sometimes "sees" digit 6, sometimes digit 8, and sometimes digit 9! For the contrast, here is how predictions with low entropy typically look like. 
 (displayImage' (fromLocalModel net) <=< getItem testMnistStream) 0
 
 #%%***** 
 ::: % 
 %: 
 :% 
 #: 
 :% 
 %. 
 #= 
 :%. 
 =# 
Entropy 4.8037423e-4

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00

0 ▏ 0.00
1 ▏ 0.00
2 ▏ 0.00
3 ▏ 0.00
4 ▏ 0.00
5 ▏ 0.00
6 ▏ 0.00
7 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 1.00
8 ▏ 0.00
9 ▏ 0.00
Model : Tensor Int64 [1] [ 7]
Ground Truth : Tensor Int64 [1] [ 7]
 
 The model always "sees" digit 7. That is why the predictive entropy is low. Note that the results are model-dependent. Therefore we also share the weights for reproducibility. However, every realization of the stochastic model might still be different, especially in those cases where the entropy is high. 
 Find the complete project on Github. For suggestions about the content feel free to open a new issue. 
 Summary 
 I hope you are now convinced that model's uncertainty estimation is an invaluable tool. This simple technique is essential when applying deep learning for real-life decision making. This post also develops on how to use Hasktorch library in practice. Notably, it is very straightforward to run computations on a GPU. Overall, Hasktorch can be used for real-world deep learning. The code is well-structured and relies on a mature Torch library. On the other hand, it would be desirable to capture high-level patterns so that the user does not need to think about low-level concepts such as dependent and independent tensors, for example. The end user should be able to simply apply save net "weights.bin" and mynet <- load "weights.bin" without any indirections. The same reasoning applies to the trainLoop , i.e. the user does not need to reinvent it every time. Eventually, a higher-level package on top of Hasktorch should capture the best practices, similar to PyTorch Lightning or fast.ai. 
 Now your turn: explore image recognition with AlexNet convolutional network and have fun! 
 Edit 27/04/2022: The original version from 23/04 did not correctly handle optimizer's internal state. Therefore, train and trainLoop were fixed. You will find the updated notebook on Github. 
 Learn More 
 
 Improving neural networks by preventing co-adaptation of feature detectors https://arxiv.org/pdf/1207.0580.pdf 
 Dropout: A Simple Way to Prevent Neural Networks from Overfitting https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf 
 Tutorial: Dropout as Regularization and Bayesian Approximation https://xuwd11.github.io/Dropout_Tutorial_in_PyTorch/ 
 Two Simple Ways To Measure Your Model’s Uncertainty https://towardsdatascience.com/2-easy-ways-to-measure-your-image-classification-models-uncertainty-1c489fefaec8 
 Uncertainty in Deep Learning, Yarin Gal http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf 
 AlexNet example in Hasktorch https://github.com/hasktorch/hasktorch/tree/master/examples/alexNet

Day 9: Roaming The Latent Space
The Secret Space 
 
 latent - present and capable of emerging or developing but not now visible, obvious, active, or symptomatic
—Webster's Dictionary 
 
 Autoencoders
 A simple autoencoder with latent dimension L.
A simple autoencoder with latent dimension L. 
 How do we implement an autoencoder? Let us take a multilayer network from the image above: 784→400→L→400→784. Where the latent space dimension L is some smaller number such as 20 or maybe even 2. The L-sized latent vector z is often called a bottleneck. The left part from the bottleck is called an encoder qϕ and the part on the right, a decoder pθ. Where ϕ and θ are trainable parameters of encoder and decoder, respectively. 
 The encoder takes an input (like an image) and generates a compact representation, typically a vector. It is also not a mistake to call it a compressed representation. The decoder takes this compact representation and creates an output as close as possible to the original input. Hence the name, autoencoder. Of course, some information is lost due to the dimensionality reduction. Therefore, the goal of a autoencoder is to find the most relevant features to preserve as much information about the input object as possible. 
 data AE = AE
 { -- Encoder parameters
 l1 :: Linear,
 l2 :: Linear,
 -- Decoder parameters
 l3 :: Linear,
 l4 :: Linear
 }
 deriving (Generic, Show, Parameterized)
 
 Then, the whole autoencoder network is 
 ae :: AE -> Tensor -> (Tensor, Tensor)
ae AE {..} =
 linear l1
 -- Encoder
 ~> relu
 ~> linear l2
 ~> relu
 -- Decoder
 ~> linear l3
 ~> relu
 ~> linear l4
 ~> sigmoid
 
 We can also specify the exact dimensions 
 aeConfig =
 AESpec
 (LinearSpec 784 400)
 (LinearSpec 400 latent_size)
 (LinearSpec latent_size 400)
 (LinearSpec 400 784)
 
 Of course, a smaller L (latent_size), the more information is lost. Therefore, depending on the application we may want to change this value. As a rule of thumb, L contains the smallest number of neurons to enforce the compression. In this article we set L=2 so that we can simply reveal our latent space in two dimensions. 
 Variational autoencoder 
 The principal difference of variational autoencoders (VAE) from normal autoencoders is in the bottleneck. Instead of a compressed input, it estimates a distribution. In practice, VAE estimates the mean μ and the standard deviation σ -- normal distribution parameters. By sampling from that distribution, a new unseen before object can be generated. Like a new font or a new piece of cloth. Or a face. Or a melody. Isn't that nice?
 
 Variational autoencoder. Arrows signify fully-connected layers and vertical bars are data vectors. 
 data VAESpec = VAESpec
 {
 -- Encoder trainable parameters (phi)
 fc1 :: LinearSpec,
 fcMu :: LinearSpec,
 fcSigma :: LinearSpec,

 -- Decoder trainable parameters (theta)
 fc5 :: LinearSpec,
 fc6 :: LinearSpec
 }
 deriving (Show, Eq)

myConfig =
 VAESpec
 (LinearSpec 784 400)
 (LinearSpec 400 latent_size)
 (LinearSpec 400 latent_size)
 (LinearSpec latent_size 400)
 (LinearSpec 400 784)

data VAE = VAE
 { l1 :: Linear,
 lMu :: Linear,
 lSigma :: Linear,
 l4 :: Linear,
 l5 :: Linear
 }
 deriving (Generic, Show, Parameterized)
 
 It can be useful to have separate encode qϕ(z|x) and decode pθ(x|z) functions. 
 encode :: VAE -> Tensor -> (Tensor, Tensor)
encode VAE {..} x0 =
 let enc_ =
 linear l1
 ~> relu

 x1 = enc_ x0
 mu = linear lMu x1
 logSigma = linear lSigma x1
 in (mu, logSigma)

decode :: VAE -> Tensor -> Tensor
decode VAE {..} =
 linear l4
 ~> relu
 ~> linear l5
 ~> sigmoid
 
 Then, the complete variational autoencoder will be 
 vaeForward :: VAE -> Bool -> Tensor -> IO (Tensor, Tensor, Tensor)
vaeForward net@(VAE {..}) _ x0 = do
 let (mu, logSigma) = encode net x0
 sigma = exp (0.5 * logSigma)

 eps <- toLocalModel' <$> randnLikeIO sigma

 let z = (eps `mul` sigma) `add` mu
 reconstruction = decode net z

 return (reconstruction, mu, logSigma)
 
 Pay a special attention to the reparametrization:
 
Where z is our latent vector, noise ε is sampled from the normal distribution (randnLikeIO1), and ⊙ is elementwise product. Thanks to this trick, we can backpropagate through a stochastic layer. See this excellent post for more details. There are two differences between variational and ordinary autoencoder training: (1) the reparametrization and (2) the loss function, which we cover below. 
 VAE Loss Function 
 The loss function consists of two parts:
 
L(x,^x) is the reconstruction loss. It decreases when the decoded output ^x is closer to the original input x. This is basically the loss of an ordinary autoencoder. For instance, it can be binary cross-entropy loss or L2 loss. 
 And the second term DKL(⋅) is the Kullback-Leibner divergence. It tells how much the distribution pθ(z) is different from the distribution qϕ(z). Or even more informative is to say that KL divergence tells how much information is lost if the distribution pθ is used to represnt qϕ. From the original paper,
 
Therefore, the complete VAE loss is 
 vaeLoss :: Float -> Tensor -> Tensor -> Tensor -> Tensor -> Tensor
vaeLoss beta recon_x x mu logSigma = reconLoss + asTensor beta * kld
 where
 reconLoss = bceLoss recon_x x
 kld = -0.5 * sumAll (1 + logSigma - pow (2 :: Int) mu - exp logSigma)
 
 We also include the β≥0 term. When β=0 the networks is trained as an ordinary autoencoder. When β=1, we have a classical VAE. And when β>1, we force latent vector representations disentanglement. As we can see from the image below, in case of disentanglement, there are separate latent variables that encode position, rotation, and scale. Whereas entangled variables tend to encode all object properties at the same time. I recommend the excellent article by Higgins et al., which is featuring some insights from neuroscience. 
 Disentangled vs entangled latent representations.
Disentangled vs entangled latent representations. Image source: Higgins et al. 2016. 
 The binary cross-entropy loss is defined as 
 bceLoss target x =
 -sumAll (x * Torch.log(1e-10 + target) + (1 - x) * Torch.log(1e-10 + 1 - target))
We add a small term 10−10 to avoid numerical errors due to log(0).

# Visualizing the Latent Space
Here is how our latent space looks for different values of β.

[![](https://notepad.gasick.ru/uploads/images/gallery/2026-03/scaled-1680-/image-1772481462502.png)](https://notepad.gasick.ru/uploads/images/gallery/2026-03/image-1772481462502.png)
[![](https://notepad.gasick.ru/uploads/images/gallery/2026-03/scaled-1680-/image-1772481486907.png)](https://notepad.gasick.ru/uploads/images/gallery/2026-03/image-1772481486907.png)
[![](https://notepad.gasick.ru/uploads/images/gallery/2026-03/scaled-1680-/image-1772481494113.png)](https://notepad.gasick.ru/uploads/images/gallery/2026-03/image-1772481494113.png)
Test data distributions for different β values.

And here is how we compute that
```haskell
main = do
 (trainData, testData) <- initMnist "data"
 net0 <- toLocalModel' <$> sample myConfig

 beta_: _ <- getArgs
 putStrLn $ "beta = " ++ beta_

 let beta = read beta_
 trainMnistStream = V.MNIST { batchSize = 128, mnistData = trainData }
 testMnistStream = V.MNIST { batchSize = 128, mnistData = testData }
 epochs = 20
 cpt = printf "VAE-Aug2022-beta_%s.ht" beta_
 logname = printf "beta_%s.log" beta_

 putStrLn "Starting training..."
 net' <- time $ train beta trainMnistStream epochs net0
 putStrLn "Done"

 -- Saving the trained model
 save' net' cpt

 -- Restoring the model
 net <- load' cpt

 -- Test data distribution in the latent space
 putStrLn $ "Saving test dataset distribution to " ++ logname
 _ <- testLatentSpace logname testMnistStream net
 
 Where 
 testLatentSpace :: FilePath -> V.MNIST IO -> VAE -> IO ()
testLatentSpace fn testStream net = do
 runContT (streamFromMap (datasetOpts 2) testStream) $ recordPoints fn net. fst

recordPoints :: FilePath -> VAE -> ListT IO (Tensor, Tensor) -> IO ()
recordPoints logname net = P.foldM step begin done. enumerateData
 where
 step :: () -> ((Tensor, Tensor), Int) -> IO ()
 step () args = do
 let ((input, labels), i) = toLocalModel' args
 (encMu, _) = encode net input
 batchSize = head $ shape encMu

 let s = toStr $ Torch.cat (Dim 1) [reshape [-1, 1] labels, encMu]
 appendFile logname s

 return ()

 done () = pure ()
 begin = pure ()

toStr :: Tensor -> String
toStr dec =
 let a = asValue dec :: [[Float]]
 b = map (unwords. map show) a
 in unlines b
 
 Now, let's take a walk around our latent space to get an idea how it looks like inside. 
 main = do
 net <- load' "VAE-Aug2022-beta_1.ht"

 let xs = [-3,-2.7..3::Float]

 -- 2D latent space as a Cartesian product
 zs = [ [x,y] | x<-xs, y<-xs ]

 decoded = Torch.cat (Dim 0) $
 map (decode net. toLocalModel'. asTensor. (:[])) zs

 writeFile "latent_space_2D.txt" (toStr decoded)
 
 Some examples from 2D latent space.
Some examples from 2D latent space. 
 Pretty neat! We see gradual transitions between different digits. Note that these digits are actually generated by VAE. 
 ConvNet VAE 
 While we have built a simple variational autoencoder based on MLP (multilayer perceptron), nothing prevents us from using other architectures. In fact, let's build a convolutional VAE! 
 data VAESpec = VAESpec
 {
 -- Encoder trainable parameters
 conv1 :: Conv2dSpec,
 conv2 :: Conv2dSpec,
 conv3 :: Conv2dSpec,
 fcMu :: LinearSpec,
 fcSigma :: LinearSpec,

 -- Decoder trainable parameters
 fc :: LinearSpec,
 deconv1 :: ConvTranspose2dSpec,
 deconv2 :: ConvTranspose2dSpec,
 deconv3 :: ConvTranspose2dSpec
 }
 deriving (Show, Eq)

myConfig =
 VAESpec
 (Conv2dSpec 1 32 4 4) -- 1 -> 32 channels; 4 x 4 kernel
 (Conv2dSpec 32 64 4 4) -- 32 -> 64 channels; 4 x 4 kernel
 (Conv2dSpec 64 128 3 3) -- 64 -> 128 channels; 3 x 3 kernel
 (LinearSpec (2 * 2 * 128) latent_size)
 (LinearSpec (2 * 2 * 128) latent_size)
 (LinearSpec latent_size 1024)
 (ConvTranspose2dSpec 1024 256 4 4)
 (ConvTranspose2dSpec 256 128 6 6)
 (ConvTranspose2dSpec 128 1 6 6)

data VAE = VAE
 { c1 :: Conv2d,
 c2 :: Conv2d,
 c3 :: Conv2d,
 lMu :: Linear,
 lSigma :: Linear,
 l :: Linear,
 t1 :: ConvTranspose2d,
 t2 :: ConvTranspose2d,
 t3 :: ConvTranspose2d
 }
 deriving (Generic, Show, Parameterized)

instance Randomizable VAESpec VAE where
 sample VAESpec {..} =
 VAE
 <$> sample conv1
 <*> sample conv2
 <*> sample conv3
 <*> sample fcMu
 <*> sample fcSigma
 <*> sample fc
 <*> sample deconv1
 <*> sample deconv2
 <*> sample deconv3

encode :: VAE -> Tensor -> (Tensor, Tensor)
encode VAE {..} x0 =
 let enc_ =
 -- Reshape vectors [batch_size x 784]
 -- into grayscale images of [batch_size x 1 x 28 x 28]
 reshape [-1, 1, 28, 28]
 -- Stride 2, padding 0
 ~> conv2dForward c1 (2, 2) (0, 0)
 ~> relu
 ~> conv2dForward c2 (2, 2) (0, 0)
 ~> relu
 ~> conv2dForward c3 (2, 2) (0, 0)
 ~> relu
 ~> flatten (Dim 1) (Dim (-1))

 x1 = enc_ x0
 mu = linear lMu x1
 logSigma = linear lSigma x1
 in (mu, logSigma)

decode :: VAE -> Tensor -> Tensor
decode VAE {..} =
 linear l
 ~> relu
 ~> reshape [-1, 1024, 1, 1]
 -- Stride 2, padding 0
 ~> convTranspose2dForward t1 (2, 2) (0, 0)
 ~> relu
 ~> convTranspose2dForward t2 (2, 2) (0, 0)
 ~> relu
 ~> convTranspose2dForward t3 (2, 2) (0, 0)
 ~> sigmoid
 ~> reshape [-1, 784] -- Reshape back
 
 And that is all we need. 
 Here is the latent space for β=1 (normal VAE) using our CNN architecture: 
 Disentanglement 
 To better illustrate how parameter β encourages disentanglement between latent representations, let us first increase the latent dimension to L=10. For β=1 and β=4 we perform scan along each individual z coordinate. 
 
 
 Indeed, the latent space under β=4 looks more disentangled compared to β=1. We can see e.g. that z1 is the parameter that defines how "light" or how "bold" is the digit, whereas z2 controls how wide is the digit. Whereas such individual components for β=1 are hard to identify. For instance when β=1, z4 controls not only how "bold" is the digit, but also its shape. 
 For more details, see this notebook. 
 Find the complete project and associated data on Github. For suggestions about the content feel free to open a new issue. 
 Summary 
 Variational autoencoder is a great tool in modern deep learning. Manipulating the latent space allows us not only to "interpolate" between different images or other objects, but also to perform inpainting (adding details to incomplete images) or even zero shot learning. The last one is crucial for the so-called artificial general intelligence. 
 The loss function is important for VAE training. If your VAE does not work as expected, the odds are that the loss function is not implemented correctly. Also check if the random noise
ε
is drawn from the normal distribution. 
 Citation 
 @article{penkovsky2022VAE,
 title = "Roaming The Latent Space",
 author = "Penkovsky, Bogdan",
 journal = "penkovsky.com",
 year = "2022",
 month = "August",
 url = "https://penkovsky.com/neural-networks/day9/"
}
 
 Learn More 
 Auto-Encoding Variational Bayes https://arxiv.org/abs/1312.6114
A Beginner's Guide to Variational Methods: Mean-Field Approximation https://blog.evjang.com/2016/08/variational-bayes.html
Early Visual Concept Learning with Unsupervised Deep Learning https://arxiv.org/abs/1606.05579
Pytorch VAE Implementation https://github.com/pytorch/examples/blob/main/vae/main.py
Interpolating Music https://magenta.tensorflow.org/music-vae
β-VAE: Learning Basic Visual Concepts With A Constrained Variational Framework https://openreview.net/pdf?id=Sy2fzU9gl
The Reparameterization Trick https://gregorygundersen.com/blog/2018/04/29/reparameterization/
KL is All You Need https://blog.alexalemi.com/kl-is-all-you-need.html
Generating Diverse High-Fidelity Images with VQ-VAE-2 https://arxiv.org/abs/1906.00446
Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE https://openaccess.thecvf.com/content/CVPR2021/papers/Peng_Generating_Diverse_Structure_for_Image_Inpainting_With_Hierarchical_VQ-VAE_CVPR_2021_paper.pdf
Variational Image Inpainting https://vincentcartillier.github.io/papers/variational-image-inpainting.pdf 
 
 A Technical Sidenote 
 Compared to the previous day, the trainLoop is slightly modified: First, we rescale the images between 0 and 1. Second, we include our new loss function in the step function. 
 step :: Optimizer o => (VAE, o) -> ((Tensor, Tensor), Int) -> IO (VAE, o)
 step (model, opt) args = do
 let ((x, _), iter) = toLocalModel' args
 -- Rescale pixel values [0, 255] -> [0, 1.0].
 -- This is important as the sigmoid activation in decoder can
 -- reach values only between 0 and 1.
 x' = x / 255.0
 (recon_x, mu, logSigma) <- vaeForward model False x'
 let loss = vaeLoss beta recon_x x' mu logSigma
 -- Print loss every 100 batches
 when (iter `mod` 100 == 0) $ do
 putStrLn
 $ printf "Batch: %d | Loss: %.2f" iter (asValue loss :: Float)
 runStep model opt loss lr
 
 The train function now uses Adam optimizer from Torch.Optim.CppOptim, which tends to be faster compared to mkAdam we used previously. This is not very different from mkAdam-based training, except that the learning rate is specified as Cpp.adamLr parameter and not as a trainLoop parameter (ignored when passed to runStep). 
 train :: Float -> V.MNIST IO -> Int -> VAE -> IO VAE
train beta trainMnist epochs net0 = do
 optimizer <- Cpp.initOptimizer adamOpt net0

 (net', _) <- foldLoop (net0, optimizer) epochs $ \(net', optState) _ ->
 runContT (streamFromMap dsetOpt trainMnist)
 $ trainLoop beta (net', optState) 0.0 . fst

 return net'
 where
 dsetOpt = datasetOpts workers
 workers = 2
 -- Adam optimizer parameters
 adamOpt =
 def
 { Cpp.adamLr = learningRate,
 Cpp.adamBetas = (0.9, 0.999),
 Cpp.adamEps = 1e-8,
 Cpp.adamWeightDecay = 0,
 Cpp.adamAmsgrad = False
 } ::
 Cpp.AdamOptions
 
 learningRate :: Double
learningRate = 1e-3
 
 I used a compiled version instead of a notebook since the network training worked much faster (the bottleneck was in training data mini-batches loading). Also I have trained networks with Torch.Optim.CppOptim. It is slightly faster compared to mkAdam. 
 I was also wondering why I get large values out of the encoder. It turns out that this is because relu function is unbounded. You may want to replace relu with Torch.tanh and visualize the latent space again.

Day 10: Beyond Supervised Learning
Ever wondered how machines defeated the best human Go player Lee Sedol in 2016? A historical moment for the game that was previously considered to be very tough. What is reinforcement learning that generates so much buzz recently? Superhuman performance playing arcade Atari games, real-time Tokamak plasma control, Google data center cooling, and autonomous chemical synthesis are all the recent achievements behind the approach. Let's dive to learn what empowers deep reinforcement learning. 
 Introduction 
 Reinforcement learning (RL) is a machine learning technique that is all about repetitive decision making. Well, just like when you go camping. During your first trip you realize that you need more food and you lack some tools. Next time, you take some more cereals and dried fruit, a better knife, and a pot. And after yet several more attempts you invest in better shoes or a backpack and learn how to arrange your stuff in a neat way. The journeys become longer and more enjoyable. As you can see, you iterate. You learn from experience. And you get rewarded. 
 No matter how much you can relate yourself to the experiences above, they contain everything that reinforcement learning has. An agent is learning from experience (or from mistakes if you want) by interacting with its environment. The agent receives some sort of a reward signal in response to its actions. And that is basically the idea. An idea from psychology? Perhaps. 
 In a supervised learning setting1 you would be instructed what equipment to take with you. And you would need to act exactly as instructed. However, the real life is not like that. You might have watched a camping channel on YouTube and still struggle to start a fire. Trying different kinds of approaches yourself, however, improves your survival skills over time. 
 
 The difference between supervised and reinforcement learning
The biggest difference between supervised learning and reinforcement learning is that in reinforcement learning the agent creates its own dataset by interacting with the environment.
You may also notice that reinforcement learning is actually more general than supervised learning. Indeed, in supervised learning we usually talk about a single iteration and apply some sort of ground truth or target signal. However, let's take a look at the following example. Imagine, you are a director of AI at Tesla (those hipsterish cars, you know) and your team trains a self-driving car based on videos that were previously recorded. Sounds like a supervised problem with images and labels, right? The car performs well during the same day, but tomorrow it rains and the car refuses to drive well. What do you do? You say, perhaps the distribution of images has changed. Let's collect some more for a rainy weather. And the model training is repeated. The third day is very foggy and your team has to collect the data again. And train the model again. So what happens? Exactly! Iteration. You have an implicit reinforcement learning loop where it is you who acts as an agent trying to outsmart the weather conditions2. 
 
 More often than not present decisions will influence the situation, and thus they will influence future decisions too. This is perhaps the biggest difference from supervised learning. Supervised learning conveniently omits to confess that its main goal is to help making decisions in real world. The most boring vision task you have encountered, classifying MNIST images, at some point was actually useful for post offices to automate mail delivery. Similarly, predicting time series might be helpful getting an idea about climate change or predicting economic downturns. Whatever supervised learning benchmark you take, in its origin there is some sort of implication in the real world. 
 Unlike supervised learning, reinforcement learning is all about decisions. It explicitly states the goal of achieving the best overall consequences -- by maximizing total reward. For instance, to arrive from point A to point B you may take a car. Depending on which route you will take and how frequently you will rest, this will affect your trip duration and how tired you will arrive. Those circumstances may further affect your plans. Decisions almost never occur in isolation. They will inevitably lead to new decisions. That is why real-life challenges are often accompanied by reinforcement learning in some form. 
 A quick note. People often talk about deep reinforcement learning. This is to highlight the presence of neural networks. The fact does not change the essence of the method. These days almost everyone is doing deep reinforcement learning anyway, so I prefer to omit the term. Just wanted to make sure we stay on the same page. Now, we will focus our attention on the nitty-gritty details of reinforcement learning. 
 Reinforcement Learning Concepts 
 By interacting with environment, an RL agent actually creates its own unique dataset, whereas in supervised learning the dataset is predefined. As the agent explores its environment and the dataset is being created we apply backprop to teach our agent similarly to supervised learning. It turns out there is a large variety of ways how to do that. The concepts below will give you a taste about the key ideas from reinforcement learning. 
 Policies, States, Observations 
 A policy (denoted by π) is simply a strategy that an agent uses when responding to changes in the environment. The following notation reads "a policy parametrized by parameters ϕ to take an action a given a state s": 
 
 Typically ϕ signifies weights in a neural network (the deep reinforcement learning story). In the notation above, by s people usually mean "state". And sometimes they actually mean "observations" o. The difference between the two is that observations are a projection of the complete state describing the environment. For example, when learning robotic manipulations from camera pixels, observations of a robotic hand are only a representation of the actual environment state from a certain angle. Sometimes people talk about "completely observable" environments, then s=o. The distinction between s and o does not affect the RL algorithms we are learning about today. 
 On-policy vs Off-policy 
 Reinforcement learning algorithms can be split between on-policy and off-policy. The first ones use trajectories (experiences)3 generated by the policy itself, while the second ones use trajectories from some other policies. For instance, algorithms that use a replay buffer (such as deep Q networks playing Atari games) to store transitions are off-policy algorithms. Typically, off-policy algorithms have better sample efficiency compared to on-policy ones, which makes them attractive when it is costly to generate new experiences. For instance, when training a physical robot. On the other hand, on-policy algorithms are often used when acquiring new experiences are cheap. For instance, in simulation. 
 Offline Reinforcement Learning 
 Offline RL is a family of algorithms where an RL agent observes past experiences and tries to figure out a better policy without interaction with an environment. Of course this is a very challenging task. Nevertheless, it has many real-world applications where it could be dangerous or illegal to apply reinforcement learning experiments. For instance, finding the best treatment in the medical setting. A doctor would not experiment on a patient using different drugs because this can lead to health deterioration. Instead, they may use reinforcement learning techniques applied to previous records of other patients to figure out the best prescription (offline setting). 
 Discrete vs Continuous Actions 
 Playing an arcade game using a manipulator with four buttons implies a discrete action setting (four possible actions). However, in reality there are settings with infinite or very large number of actions. For instance, robot joints positions. Certain algorithms can be applicable only to one action type (DQN can be applied only for discrete and DDPG only for continuous actions), whereas certain others can be used for both (e.g. A2C, PPO). 
 Deterministic vs Stochastic Policies 
 Certain policies inherently produce deterministic actions (DQN), while others sample from a learned distribution producing stochastic actions (PPO). Of course, it is possible to make a deterministic algorithm to produce stochastic actions (epsilon-greedy exploration) and vice versa. Why use a stochastic policy? Actually, under stochastic environments optimal policies are often stochastic. The simplest example is Rock-Paper-Scissors game. 
 Model-free vs Model-based 
 Model-based approaches assume a certain environment model. The advantage of this approach is better sample efficiency. On the other hand, model-free approaches may better generalize to unknown environments. In some cases, environment's model is learned. 
 Sample Efficiency 
 By sample efficiency we mean using less examples for training. 
 Reinforcement Learning Hints 
 The goal of reinforcement learning is typically to solve a real-life challenge. For that purpose you will need to define some interaction protocol between your agent and its environment and also a reward function. The reward function provides a score how well your agent performs. Typically, it is beneficial having a "dense" reward so that the agent gets some information at every time step. This is not always possible and there are methods to circumvent this issue. One way to look at reward shaping is that RL is your "compiler" and the reward is your understanding of the problem you are trying to solve. 
 To gain some expertise, you may want to start with existing environments4 and RL algorithms. If your goal is to train a physically embodied robot, you may want to create your own environment to try out some ideas in simulation. This will save you some time since running a simulator is typically faster than real-time robot training. 
 The more accurate is the simulator, the better the agent is likely to perform in real-world. For instance, in Sim-to-Real: Learning Agile Locomotion For Quadruped Robots the authors have obtained better results by creating accurate actuator models and performing latency modeling. 
 However, there is always a difference between running an agent on any simulator and reality. The problem known as reality gap. There are different techniques how to reduce this difference. One of them is domain randomization. The idea is to train the agent using environments with randomized parameters. In active domain randomization, the active learning approach is combined: agents are trained more often on environments where they performed poorly. 
 Another technique domain adaptation suggests making real data look like those in simulation or vice versa. You may also want to perform parameters fine-tuning on your real robot. 
 Note also that if you are training an agent in a simulated environment, the agent may "overfit" to the environment. That is, it may exhibit unrealistic (and often undesired) behaviors that still lead to high reward scores. The agent may try to exploit environment glitches if there are any. 
 Having those practical strategies in mind, we are going to the meat of RL. Below we are discussing an algorithm which constitutes the basis to many modern RL strategies such as TRPO and PPO. 
 REINFORCE 
 REINFORCE, also called Monte Carlo Policy Gradient, is the simplest from the policy gradient algorithms family. The advantage of policy gradients is that they can learn stochastic policies. This also addresses the exploration-exploitation dilemma. 
 REINFORCE algorithm: 
 Initialize policy parameters
ϕ
.
Generate trajectories
s
0
,
a
0
,
r
1
,
.
.
.
,
s
T
−
1
,
a
T
−
1
,
r
T
by interacting with environment.
For every step
t 
 0
,
1
,
.
.
.
,
T
−
1
, estimate returns
R
t 
 ∑
T
k 
 t
+
1
γ
k
−
t
−
1
r
k
.
Optimize agent parameters
ϕ
←
ϕ
+
α
R
t
∇
log
π
ϕ
(
a
t
|
s
t
)
.
Repeat steps 2-5 until convergence.
Note that when using universal function approximators (that is neural networks), convergence is not guaranteed. However, in practice you still have a good chance to train your agent. 
 numEpisodes = 400 
 main :: IO ()
main = do
putStrLn $ "Num episodes " ++ show numEpisodes 
 -- Seed Torch for reproducibility.
-- Feel free to remove.
manual_seed_L 10
let seed = 42 -- Environment seed 
 -- Step 1: Initialize policy parameters
agent <- mkAgent obsDim actionDim
-- We also initialize the optimizer
let trainer = mkTrainer agent 
 -- Repeat steps 2-4 for the number of episodes (trajectories)
(agent', trainer') <- foldM (\at i -> (reinforce conf) seed at i) (agent, trainer) [1..numEpisodes] 
 return () 
 reinforce cfg@Config {..} seed (agent, trainer) i = do
-- Step 2: Trajectories generation (rollout)
let s0 = Env.reset (seed + i)
(_, _, !rs, !logprobs_t) <- rollout maxSteps agent s0
let logprobs' = cat (Dim 0) logprobs_t
putStrLn $ "Episode " ++ show i ++ " - Score " ++ show (sum rs) 
 -- Step 3: Estimating returns
let returns_t = asTensor $ returns γ rs
returnsNorm = (returns_t - mean returns_t) / (std returns_t + 1e-8) 
 -- Step4: Optimize
optimize cfg logprobs' returnsNorm (agent, trainer)
Trajectory Generation
The agent generates a trajectory by interacting with the environment. We call this a rollout. By observing the function signature below, we can tell that the function consumes an integer number, an agent, and environment state. This integer is simply the maximal number of steps per episode. As a result, the function will provide the trajectory: observations, actions, and rewards. In addition, it also provides log probabilities, which we will use to compute our objective (or loss) before we can optimize parameters in Step 4. 
 rollout :: Int -> Agent -> Env.State
-> IO ([Observation], [[Int]], [Reward], [Tensor])
-- ^ Observations, actions, rewards, log probabilities
Here is how we implement it: we benefit from the excellent unfoldM function that has type Monad m => (s -> m (Maybe (a, s))) -> s -> m [a]. That is we iterate as long as the function in the first argument (s -> m (Maybe (a, s))) provides a Just value (and stop when Nothing). 
 Haskell libraries provide lots of "pieces of code" or "programming templates": functions like map, foldl, scanr, unfoldr, zip, zipWith, etc. All those are compact versions of loops! Some of those concepts gradually diffuse into imperative languages (such as C++ and Python). 
 rollout _maxSteps agent s0 = L.unzip4 <$> unfoldM f (_maxSteps, agent, s0)
where
-- Reached max number of steps. Nothing = stop.
f (0, _, _) = pure Nothing 
 f (_maxSteps, _agent@Agent{..}, _s) = do
 if Env.isDone _s
 -- The environment is done: stop.
 then do
 pure Nothing
 else do
 let ob = Env.observations _s
 (ac@(Action ac_), logprob) <- getAction ϕ ob
 let (r, s') = Env.step ac _s

 pure $ Just ((ob, ac_, r, logprob), (_maxSteps - 1, _agent, s'))
 
 The heart of iteration is in the end: First, observe the environment and sample actions. 
 let ob = Env.observations s
(ac@(Action ac ), logprob) <- getAction ϕ ob
Then, simulate the environment step: 
 let (r, s') = Env.step ac _s
And finally, prepare the next iteration step by reducing the maximal number of steps and passing the agent and the new environment state. 
 pure $ Just ((ob, ac_, r, logprob), (_maxSteps - 1, _agent, s'))
Here is how we get an action and a log probability. 
 getAction ϕ obs = do
let obs_ = unsqueeze (Dim 0) $ asTensor obs
(ac, logprob) <- evaluate ϕ obs_ Nothing 
 -- Return a single discrete action
return (Action [asValue ac], logprob)
Evaluate the policy
π
ϕ
: If no action provided, sample a new action from the learned distribution. Also get log probabilities for the action. 
 evaluate ϕ obs a = do
let probs = policy ϕ obs
dist = Categorical.fromProbs probs
action <- _getAct a dist
let logProb = D.logProb dist action
pure (action, logProb)
where
-- Sample from the categorical distribution:
-- get a tensor of integer values (one sample per observation).
_getAct Nothing dist = D.sample dist [head $ shape obs]
_getAct (Just a') _ = pure a'
Estimating Returns
After a rollout we (retrospectively) compute how good were our decisions during the whole trajectory. The trick is that we start from the last step. That is, our return
R
at time step
t
is our current reward plus a discounted future return. 
 R
t 
 r
t
+
γ
R
t
+
1
. 
 This expands into 
 R
t 
 r
t
+
γ
(
r
t
+
1
+
γ
(
r
t
+
2
+
γ
(
.
.
.
)
)
)
. 
 where
r
t
is the reward at time
t
. The meaning of this expression is the essence of policy gradient: We evaluate how good were our past decisions with respect to the future outcomes. 
 An intriguing way to look at discount factor
γ
is by using the concept of terminal state (to put simply, death). At each future step, the agent can get a reward with probability
γ
and can die with probability
1
−
γ
. Therefore, discounting the reward is akin to incorporating the "fear of death" into RL. In his lectures, S. Levine gives an example of receiving 1000 dollars today or in million years. In the latter case, the reward is hugely discounted as it is unlikely that we will be able to receive it. 
 We compute the returns as a function of rewards rs. We also use next terminal states indicators and future value estimators in special cases. 
 returns γ rs = f rs
where
f (r:[]) = [r] 
 f (r:xs) =
 let y = f xs
 -- Discounting a future return
 in r + γ * (head y) : y
 
 In policy gradient algorithms we evaluate how good are our past decisions with respect to the future outcomes. 
 Note that we compute the return recursively as in equation above: reward plus a discounted future return. Since Haskell is a lazy programming language, it is not critical that this value does not exist yet. Essentially, we promise to compute the list of future values y = f xs. Finally, we prepend current return to the list of future returns r + γ * (head y) : y. Fascinating! 
 f (r:xs) =
let y = f xs
in r + γ * (head y) : y
Optimizing parameters
Computing the loss.
Running a gradient step.
optimize :: Config
-> Tensor -> Tensor
-> (Agent, Trainer)
-> IO (Agent, Trainer)
optimize Config {..} logprobs_t returns_t (Agent {..}, Trainer {..}) = do
let loss = Torch.sumAll $ -logprobs_t * returns_t
(ϕ', opt') <- runStep ϕ opt loss (asTensor lr)
pure (Agent ϕ', Trainer opt')
Final Bits
There are still a few other things to complete the project. Let's define our policy network type
Φ
. Here we have three fully-connected layers. In other words, two hidden layers. 
 data Phi = Phi
{ pl1 :: Linear
, pl2 :: Linear
, pl3 :: Linear
}
deriving (Generic, Show, Parameterized)
Now we can implement the forward pass in a Policy Network: 
 policy :: Phi -> Tensor -> Tensor
policy Phi {..} state =
let x = ( linear pl1 ~> tanh
~> linear pl2 ~> tanh
~> linear pl3 ~> softmax (Dim 1)) state
in x
An agent is simply a policy network in Reinforce. We will update this data type to also accommodate the critic network in improved policy gradient later on: 
 data Agent = Agent
{ ϕ :: Phi
}
deriving (Generic, Show)
REINFORCE Trainer type is a single optimizer. 
 data Trainer = Trainer
{ opt :: Adam
}
deriving Generic
A new, untrained agent with random weights 
 mkAgent :: Int -> Int -> IO Agent
mkAgent obsDim actDim = do
let hiddenDim = 16
ϕ <- samplePhi obsDim actDim hiddenDim
pure $ Agent ϕ
Parameters
ϕ
∈
Φ
initialization 
 samplePhi :: Int -> Int -> Int -> IO Phi
samplePhi obsDim actDim hiddenDim =
Phi <$> sample (LinearSpec obsDim hiddenDim)
< > sample (LinearSpec hiddenDim hiddenDim)
< > sample (LinearSpec hiddenDim actDim)
Initializing the trainer 
 mkTrainer :: Agent -> Trainer
mkTrainer Agent {..} =
let par = flattenParameters ϕ
opt = mkAdam 0 0.9 0.999 par
in Trainer opt
Now, let's train the agent to solve the classic CartPole environment! See the complete REINFORCE project on Github. 
 Proximal Policy Optimization
REINFORCE algorithm is nice because it is simple. On the other hand, in practice it has a problem: finding the best meta-parameters, such as learning rate. Choose a value to low and training needs more samples, too high and it does not converge. To alleviate this training instability multiple variations of this algorithm have been proposed. One popular variation is called Proximal Policy Optimization (PPO). Below we upgrade REINFORCE to PPO. 
 Actor-Critic Style
In REINFORCE we only had a policy network
π
ϕ
. A more advanced version would be not only to use a policy network (so-called actor), but also a value network (critic). This value network would estimate how good is the state we are currently in. Naturally, the output of the critic network is a scalar with this estimated state value. 
 -- Value (Critic) Network type: Three fully-connected layers
data Theta = Theta
{ l1 :: Linear
, l2 :: Linear
, l3 :: Linear
}
deriving (Generic, Show, Parameterized) 
 -- | Forward pass in a Critic Network
critic :: Theta -> Tensor -> Tensor
critic Theta {..} state =
let net = linear l1 ~> tanh
~> linear l2 ~> tanh
~> linear l3
in net state
Here is how we will sample initial network weights: 
 sampleTheta :: Int -> Int -> IO Theta
sampleTheta obsDim hiddenDim =
Theta <$> sample (LinearSpec obsDim hiddenDim)
< > sample (LinearSpec hiddenDim hiddenDim)
< > sample (LinearSpec hiddenDim 1)
To make our life a bit simpler, we can also use the following wrapper when dealing with raw observations. 
 -- | Get value: Convenience wrapper around critic function.
value :: Theta -> [Float] -> Float
value θ ob =
let ob_ = unsqueeze (Dim 0) $ asTensor ob
in asValue $ critic θ ob_
The state value given by the critic network will be used for advantage estimation, instead of simply calculating discounted returns. 
 Advantage Estimation
As you remember, in policy gradients we optimize neural network parameters using
A
⋅
∇
log
π
ϕ
(
a
t
|
s
t
)
. In REINFORCE, this term
A 
 R
t
is discounted returns. This totally makes sense. However, this is not the only option5. 
 Let me introduce advantage
A 
 r
−
v
where
r
is reward and
v
is value, i.e. how good is our action. The advantage tells us how current action is better than an average action. Therefore, by optimizing our policy with respect to advantage, we would improve our actions compared to average. This would help to reduce variance of policy gradient. 
 In PPO, we typically use a fancy way to estimate the advantage called generalized advantage estimator (GAE). 
 advantages γ γ' rs dones vs = f $ L.zip4 rs (L.drop 1 dones) vs (L.drop 1 vs)
-- vs are current values (denoted by v) and
-- (L.drop 1 vs) are future values (denoted by v')
where
-- Not necessary to reverse the list if using lazy evaluation
f :: [(Float, Bool, Float, Float)] -> [Float]
-- End of list to be reached: same as terminal (auxiliary value)
f ((r, _, v, _):[]) = [r - v] 
 -- Next state terminal
f ((r, True, v, _):xs) = (r - v) : f xs

-- Next state non-terminal
f ((r, False, v, v'):xs) =
 let a = f xs
 delta = r + γ * v' - v
 in delta + γ * γ' * (head a) : a
 
 PPO Specifics
In REINFORCE, the policy gradient loss function was simply 
 L
(
ϕ
) 
 −
∑
t 
 log
π
ϕ
(
a
t
|
s
t
)
⋅
R
t
. 
 Note the negation sign in front: without it, the loss (to be minimized) becomes an objective (to be maximized) - earning highest returns6: 
 L
(
ϕ
) 
 ∑
t 
 log
π
ϕ
(
a
t
|
s
t
)
⋅
R
t
. 
 Now, what distinguishes PPO, it its objective. It consists of three components: a clipped policy gradient objective (actor network), value loss (critic network), and entropy bonus. Since the value function is a loss term, it has a minus sign: 
 L
P
P
O
t 
 L
C
L
I
P
t
−
c
1
L
V
F
+
c
2
S
, 
 where
c
1 
 0.5
and
c
2 
 0.01
are constants. If we are going to minimize loss instead, 
 L
P
P
O
t 
 −
L
C
L
I
P
t
+
c
1
L
V
F
−
c
2
S 
 =
L
P
G
t
+
c
1
L
V
F
−
c
2
S
. 
 loss = pg + vfC mulScalar vLoss - entC mulScalar entropyLoss
Now, the most interesting part of PPO. If we denote
r
t
(
ϕ
)
the probability ratio
r
t
(
ϕ
) 
 π
ϕ
(
a
t
|
s
t
)
π
ϕ
old
(
a
t
|
s
t
)
. Then PPO is maximizing the following objective 
 L
C
L
I
P
(
ϕ
) 
 ^
E
t
(
min
(
r
t
(
ϕ
)
^
A
t
,
clip
(
r
t
(
ϕ
)
,
1
−
ϵ
,
1
+
ϵ
)
^
A
t
)
)
. 
 First, let's take a look at the clipped objective
clip
(
r
t
(
ϕ
)
,
1
−
ϵ
,
1
+
ϵ
)
^
A
t
. Here, we try to restrict how far our policy will move during the policy gradient update. If advantage
A
is positive, the objective is clipped at value
1
+
ϵ
. If advantage
A
is negative, then the objective is clipped at
1
−
ϵ
. Finally, we take a min between a clipped and unclipped objective. Therefore, the final objective is a pessimistic bound on the unclipped objective (see Schulman et al.). 
 Now, transforming the objective into a loss: we add a minus sign and replace min with max: 
 L
P
G
(
ϕ
) 
 ∑
t 
 max
(
−
r
t
(
ϕ
)
^
A
t
,
−
clip
(
r
t
(
ϕ
)
,
1
−
ϵ
,
1
+
ϵ
)
^
A
t
)
. 
 pgLoss clipC logratio advNorm =
let ratio = exp logratio
ratio' = clamp (1 - clipC) (1 + clipC) ratio 
 pgLoss1 = -advNorm * ratio
 pgLoss2 = -advNorm * ratio'
 
 in mean $ max' pgLoss1 pgLoss2
Note that for better numerical properties we typically use logratio, instead of ratio (a log of a ratio is the difference of logs): 
 logratio = newlogprobs - logprobs_t
where newlogprobs are calculated by evaluating the new policy. 
 Next term
L
V
F
t
is often a squared-error loss
(
V
θ
(
s
t
)
−
V
targ
t
)
2
. 
 vLoss = mean (newvalues - returns)^2
However, we use clipped value loss. 
 clippedValueLoss clipC val newval ret =
let lossUnclipped = (newval - ret)^2
clipped = val + (clamp (-clipC) clipC (newval - val))
lossClipped = (clipped - ret)^2
lossMax = max' lossUnclipped lossClipped
in mean lossMax
Finally, the entropy bonus
S
is used in order to stimulate exploration, i.e. performing the same task by as many ways as possible. 
 entropyLoss = mean entropy
Here is updated evaluate function: 
 -- | Get action, logProb, and entropy tensors
evaluate
:: Phi -- ^ Policy weights
-> Tensor -- ^ Observations
-> Maybe Tensor -- ^ Action
-> IO (Tensor, Tensor, Tensor)
evaluate ϕ obs a = do
let probs = policy ϕ obs
dist = Categorical.fromProbs probs
action <- _getAct a dist
let logProb = D.logProb dist action
entropy = D.entropy dist
pure (action, logProb, entropy)
where
_getAct :: Maybe Tensor -> Categorical.Categorical -> IO Tensor
_getAct Nothing dist = D.sample dist [head $ shape obs]
_getAct (Just a') _ = pure a'
Putting it all together, we get this optimize function: 
 optimize :: Config
-> Tensor -> Tensor -> Tensor -> Tensor -> Tensor -> Tensor
-> (Agent, Trainer)
-> IO (Agent, Trainer)
optimize Config {..} obs_t acs_t val_t logprobs_t advantages_t returns_t (Agent {..}, Trainer {..}) = do
(_, newlogprobs, entropy) <- evaluate ϕ obs_t (Just acs_t) 
 let newvalues = critic θ obs_t
logratio = newlogprobs - logprobs_t
-- Normalized advantages
advNorm = (advantages_t - mean advantages_t) / (std advantages_t + 1e-8) 
 pg = pgLoss clipC logratio advNorm

 vLoss = clippedValueLoss clipC val_t newvalues returns_t

 entropyLoss = mean entropy

 loss = pg + vfC `mulScalar` vLoss - entC `mulScalar` entropyLoss
 
 ((θ', ϕ'), opt') <- runStep (θ, ϕ) opt loss (asTensor lr) 
 pure (Agent θ' ϕ', Trainer opt')
See the complete code on Github. 
 AlphaGo vs Lee Sedol
We started this article with the defeated Go champion Lee Sedol. He played against software called AlphaGo (version Lee, after him). How actually did AlphaGo win? What is self-play and how does it relate to reinforcement learning? Below we will address those questions. 
 Overview
AlphaGo generates data by playing games against an opponent (since the game of Go is a two-player game). This opponent is another machine player randomly chosen from a pool of opponents. In the end of the game one player wins (reward
r 
 
 
 
 1
) and another loses (reward
r 
 −
1
). 
 After several games have been played, the player is updated. Then, this new player is compared to the old one. If the new one wins sufficiently often, it is then accepted. Iteration by iteration, and the AlphaGo is improved by playing against itself. This is called a self-play. Below, we are going into mode details. 
 Monte Carlo Tree Search
We should introduce a new concept called Monte Carlo Tree Search. Monte Carlo, if you didn't know, is the area in the city-state of Monaco. It is also the European gambling capital. Some rich people like to waste money in the Monte Carlo casino. The author had a chance to visit it once and can confirm that is absolutely true. 
 Anyway, I digress. Statisticians simply love calling everything Monte Carlo when there are random simulations involved. For example, have you noticed that REINFORCE is also named Monte Carlo Policy Gradient? This is related to stochastic trajectories generated during rollouts. 
 Monte Carlo Tree Search. Source: Silver et al. 
Figure 1
Monte Carlo Tree Search. Source: Silver et al. 
 Monte Carlo Tree Search (MCTS) explores a tree of possible moves and is continuously refining its understanding of the game state by simulating random rollouts (also called playouts because of the game aspect). In Monte Carlo tree, each node is a board state (see Fig. 1). For each board state, we estimate the value of this state
Q
. That is how good it is or, in other words, whether we believe this board position is leading to a win or loss. The innovation behind AlphaGo was in combining MCTS with convolutional neural networks for state value estimation and for selecting the move to play. 
 In the first MCTS stage called Selection (Fig. 1a), an action is selected to maximize the value
Q
plus some bonus
u
(
P
)
that encourages exploration. This bonus is designed such that in the beginning the algorithm prefers actions with high prior probability
P
and low visit count; and eventually it prefers actions with high action value
Q
. This is achieved by weighting the bonus term by an exploration constant. 
 After a certain depth of Monte Carlo tree is reached, the second stage Evaluation (Fig. 1c) is performed: current position (leaf node in tree) is evaluated using the value network. And the actions (game moves) are selected according to the policy network
π
until the end of the game (terminal state), which leads to the outcome
±
r
. 
 Finally, the Backup is performed (Fig. 1d): the rollout statistics are updated by adding the outcome in a backward pass through the Monte Carlo tree. In the end, the value estimate
Q
(
s
,
a
)
becomes a weighted sum of the value network and just obtained rollout statistics. At the end of the search the algorithm selects an action with the maximum visit count. The authors state that this is less sensitive to outliers as compared to maximizing action value. 
 One more thing: when the number of certain node visits is frequent, the successor state is added to the tree. This is called Expansion (Fig. 1b). 
 Coming back to the neural networks you might be pleased to learn that the policy network
π
was trained using the REINFORCE algorithm you already know. Each iteration consisted of a minibatch of several games played between the current policy network
π
ϕ
and an opponent
π
ϕ
−
using parameters from the previous iteration. Every 500 iterations, parameters
ϕ
were saved to the opponent pool (so that there is always a variety of opponents to choose from). 
 The value network was trained to approximate the value function of the RL policy network
π
ϕ
. This was a regression task. The network was trained on a dataset of 30 million positions drawn from the self-play. There are many interesting technical aspects I suggest to read about in the original publication. 
 To summarize, Monte Carlo Tree Search was employed in AlphaGo because an exhaustive search is simply impossible as the game tree quickly grows too large. The idea of MCTS combined with neural networks is conceptually simple and provided sufficient computational resources, it wins. 
 Instead of Conclusion
Beating the strongest human player is an amazing feat by the DeepMind team. However, we should not forget that behind the scenes there was operating a whole data center to support AlphaGo computation. This is about six orders of magnitude more power consumption compared to the human brain (~20W)! Devising energy-efficient AI hardware is therefore our next milestone we are heading to. 
 Citation
@article{penkovsky2023RL,
title = "Beyond Supervised Learning",
author = "Penkovsky, Bogdan",
journal = "penkovsky.com",
year = "2023",
month = "May",
url = "https://penkovsky.com/neural-networks/beyond/"
}
If you like the article please consider sharing it. 
 Goodbye?
I have to admit that after all these days we have barely scratched the surface. Yet, I am happy about the journey we have made. We have seen how to create neural networks and to benefit from them; how to estimate model uncertainty; how to generate new things. And today we have learned how to continuously make decisions. There are so many more ideas to explore! Reinforcement learning is a rabbit hole in itself. By the way, did you know that ChatGPT is secretly using PPO to get better answers? 
 Let me know if you have any remarks. 
 Learn More
D. Silver. Policy Gradient - Lecture
Understanding the role of the discount factor
Hugging Face Deep RL Course
Learning dexterous in-hand manipulation
Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience
Noise and The Reality Gap
Policy Gradient with PyTorch
Proximal Policy Optimization Algorithms
Implementation matters in deep policy gradients
Mastering the game of Go with deep neural networks and tree search
Mastering the game of Go without human knowledge
This is what we were doing all the days before. Supervised learning. That is how it is called. One of my favourite teachers used to quote Molière, "Par ma foi, il y a plus de quarante ans que je dis de la prose, sans que j'en susse rien" ("These forty years now I've been speaking in prose without knowing it!"). ^
Honestly, I have no clue how a director of AI works at Tesla. But if you Andrej Karpathy and you are reading this, please share your past experiences. I would appreciate. ^
A trajectory is a repetitive sequence of observations and actions. ^
In Haskell it is possible to benefit from existing OpenAI Gym environments via Gym HTTP API. ^
See D. Silver's Lecture about Policy Gradients. ^
PyTorch seems to be better at minimizing rather than maximizing. ^