Lecture 'Deep Learning'

All dates for winter term 17/18

Note! Changes in the course plan:

Thursday, 11.01.18 --> 14:00 lecture
Thursday, 11.01.18 --> 15:45 no exercise!
Thursday, 18.01.18 --> 14:00 lecture
Thursday, 18.01.18 --> 15:45 lecture
Thursday, 25.01.18 --> 14:00 exercise
Thursday, 25.01.18 --> 15:45 exercise

Organizational issues

Script (last updated: 18.01.18)
(password-protected; the password will be provided only to students in the first lecture)


Please do exercise 10 (CNN in TensorFlow) and exercise 11 (speech recognition with a CNN in Keras) from the exercise script for the exercise session on 25.01.18. All students who have not yet presented an exercise should present one on that day.

For some exercises, code will be provided as a starting point. This code will also be published in the Deep Learning Book GitHub repository.



  • Video of the original experiments by Hubel and Wiesel from 1959, showing the existence of "simple" and "complex" cells.
  • Video of a Convolutional Neural Network demo from 1993. Yes, the CNN model is not really new!
  • Mini-batch gradient descent by Andrew Ng. Explains what mini-batches are compared to batches (11m28s); a minimal code sketch follows after this list
  • Understanding Mini-Batch Gradient Descent by Andrew Ng. Explains what the path to a local minimum looks like for Batch GD, SGD and Mini-batch GD (11m18s)
  • Gradient descent with momentum by Andrew Ng. Gives a good intuition in a short time (9m20s)
  • RMSProp by Andrew Ng. Again gives a good intuition in a short time (7m41s)
  • Adam Optimization Algorithm by Andrew Ng. Mainly presents the formulas and shows that Adam (="Adaptive Moment Estimation") is a combination of the momentum optimizer with the RMSProp optimizer (7m07s)
  • Exponentially Weighted Averages by Andrew Ng. RMSProp+Adam use exponentially weighted (moving) averages (EWMA) of squared gradients. For this, Andrew Ng introduces EWMA as well (5m58s)
  • Bias Correction of Exponentially Weighted Averages by Andrew Ng. EWMAs are actually quite bad estimates in the initial phase: they tend to underestimate the average. Here Andrew Ng explains how to correct for this artefact by multiplying the EWMA by 1/(1-beta^t) (4m11s); a combined code sketch of momentum, RMSProp, EWMA, bias correction and Adam follows after this list
  • Learning Rate Decay by Andrew Ng. Explains the motivation for decreasing the learning rate and shows some formulas that are typically used to decrease it as a function of the number of training epochs (6m44s); a small example follows after this list
  • Tuning Process by Andrew Ng. Tells us not to use a grid sampling strategy, but random sampling, and to follow a coarse-to-fine search (7m10s); see the sampling sketch after this list
  • Normalizing Activations in a Network by Andrew Ng. Explains the key idea of activation normalization: introducing a normalization step for the activation or output values of the neurons in a layer, such that they have a certain mean value and a certain variance. Thus two new learnable parameters are introduced per layer, which allow the network to learn a good mean and variance for the activations / output values of the neurons in that layer (8m54s)
  • Fitting Batch Norm Into Neural Networks by Andrew Ng. Explains that the normalization is normally computed per layer on the basis of the activations produced by a mini-batch of samples, and shows that the usual bias vectors are no longer needed, since they are already incorporated in the batch normalization step, where we learn the best mean for the activation of each neuron in the layer under consideration (12m55s)
  • Why Does Batch Norm Work? by Andrew Ng. Explains why batch normalization helps: 1. it gives later layers a more stable input; 2. it acts as a regularizer, similarly to dropout, since it adds noise to the activation values of the neurons: the normalization statistics are estimated only on the current mini-batch and therefore vary from mini-batch to mini-batch (11m39s)
  • Batch Norm At Test Time by Andrew Ng. Explains an important difference between the training and inference steps. During training, the normalization of the neurons' activation values is computed on the basis of a mini-batch. And what do we do at test time? Here Andrew Ng explains that we usually keep track of an exponentially weighted moving average of the means and variances computed during training and use these estimates for the normalization step at test (inference) time (5m46s); a sketch of this train/test difference follows after this list
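
As a small complement to the two mini-batch videos above, here is a minimal NumPy sketch of mini-batch gradient descent for a toy linear-regression problem. The data, model and hyperparameter values are made up for illustration and are not taken from the lecture; setting batch_size to 1 gives SGD, setting it to the full training-set size gives batch GD.

    import numpy as np

    # Toy data for a linear model y = X @ w_true + noise (made up for illustration)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

    def gradient(w, X_part, y_part):
        # Gradient of the mean squared error 0.5 * mean((X w - y)^2) with respect to w
        return X_part.T @ (X_part @ w - y_part) / len(y_part)

    w = np.zeros(5)
    lr = 0.1
    batch_size = 32   # 1 -> SGD, len(y) -> batch gradient descent

    for epoch in range(10):
        perm = rng.permutation(len(y))                # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]      # indices of the current mini-batch
            w -= lr * gradient(w, X[idx], y[idx])     # one parameter update per mini-batch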
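
The videos on momentum, RMSProp, EWMA, bias correction and Adam all build on the same two exponentially weighted moving averages. The following sketch combines them into a single Adam-style update step; the variable names and default values (beta1=0.9, beta2=0.999, eps=1e-8) are the common choices, not something specific to the lecture.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # EWMA of the gradients (the momentum part)
        m = beta1 * m + (1 - beta1) * grad
        # EWMA of the squared gradients (the RMSProp part)
        v = beta2 * v + (1 - beta2) * grad**2
        # Bias correction: both EWMAs start at 0 and underestimate the average early on
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Parameter update combining both averages
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    # Usage on a toy quadratic loss 0.5 * sum((w - 0.5)^2); t counts update steps from 1
    w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
    for t in range(1, 101):
        grad = w - 0.5
        w, m, v = adam_step(w, grad, m, v, t)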
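
For the learning-rate-decay video, this is one typical schedule of the kind discussed there, written as a small helper; the concrete numbers are just example values.

    def decayed_learning_rate(epoch, alpha0=0.2, decay_rate=1.0):
        # Learning rate shrinks with the epoch number: alpha = alpha0 / (1 + decay_rate * epoch)
        return alpha0 / (1 + decay_rate * epoch)

    for epoch in range(5):
        print(epoch, decayed_learning_rate(epoch))   # approx. 0.2, 0.1, 0.067, 0.05, 0.04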
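
For the tuning-process video, a minimal way to implement "random sampling instead of a grid" is to draw each hyperparameter from a range, with scale-sensitive ones such as the learning rate drawn on a log scale. The ranges and candidate values below are illustrative assumptions, not values from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_hyperparameters():
        lr = 10 ** rng.uniform(-4, -1)                    # learning rate on a log scale, 1e-4 .. 1e-1
        batch_size = int(rng.choice([32, 64, 128, 256]))  # mini-batch size from a small candidate set
        return {"learning_rate": lr, "batch_size": batch_size}

    candidates = [sample_hyperparameters() for _ in range(20)]
    # Coarse-to-fine: evaluate these candidates, then narrow the ranges around the best ones and resample.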
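
Finally, for the three batch-norm videos, here is a NumPy sketch of the forward pass of a batch-normalization step for one layer that makes the training/test difference explicit: batch statistics during training versus an exponentially weighted moving average of these statistics at inference time. The momentum value and the shapes are illustrative assumptions.

    import numpy as np

    def batch_norm_forward(z, gamma, beta, running_mean, running_var,
                           training, momentum=0.9, eps=1e-5):
        # z: pre-activations of one layer, shape (batch_size, num_neurons)
        if training:
            mean = z.mean(axis=0)                   # statistics of the current mini-batch
            var = z.var(axis=0)
            # EWMA of the batch statistics, to be used later at test time
            running_mean = momentum * running_mean + (1 - momentum) * mean
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mean, var = running_mean, running_var   # fixed estimates at inference time
        z_norm = (z - mean) / np.sqrt(var + eps)    # zero mean, unit variance
        out = gamma * z_norm + beta                 # learned scale (gamma) and shift (beta)
        return out, running_mean, running_var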


This is old material from my Deep Learning lecture held in winter term 16/17:

Slides: Slides of all lectures up to January 19, 2017