Lecture 'Deep Learning'


All dates for winter term 18/19

Slides (password protected)


Tablet writings

1. A first simple neural network (SOM)
2. The perceptron neuron and network model
3. Training machine learning models
4. Automatic differentiation
5. A concrete example of reverse-mode autodiff
Step-by-step introduction to TensorFlow
6. The role of the transfer function
7. Long Short-Term Memory (LSTM)
8. Optimizers: Momentum, Nesterov, AdaGrad, RMSProp, Adam (blackboard notes)
9. LSTM training


01 - Introduction to Python (for 18.10.18)
02 - Introduction to NumPy (for 25.10.18)
03 - Automatic Differentiation (for 08.11.18)
Solution for exercise 03
04 - Applications of Deep Learning
05 - MLP for regression: Pandas (part 1/4)
06 - MLP for regression: MLP in TensorFlow (part 2/4)
07 - MLP for regression: MLP in Keras, NaN Processing, Scaling (part 3/4)
08 - MLP for regression: Working with categorical features (part 4/4)
09 - CNN for image classification in Keras
10 - LSTM for time series prediction
11 - LSTM for Natural Language Processing (NLP)



  • Video of original experiments of Hubel and Wiesel from 1959 showing the existence of "simple" and "complex" cells.
  • Video of a Convolutional Neural Network demo from 1993. Yes, the CNN model is not really new!
  • Mini-batch gradient descent by Andrew Ng. Explains what mini-batches are compared to batches (11m28s)
  • Understanding Mini-Batch Gradient Descent by Andrew Ng. Explains what the path to a local minimum looks like for batch GD, SGD, and mini-batch GD (11m18s)
  • Gradient descent with momentum by Andrew Ng. Gives a good intuition in a short time (9m20s)
  • RMSProp by Andrew Ng. Again gives a good intuition in a short time (7m41s)
  • Adam Optimization Algorithm by Andrew Ng. Mainly presents the formulas and shows that Adam (="Adaptive Moment Estimation") is a combination of the momentum optimizer with the RMSProp optimizer (7m07s)
  • Exponentially Weighted Averages by Andrew Ng. RMSProp and Adam use exponentially weighted (moving) averages (EWMAs) of squared gradients. For this reason, Andrew Ng introduces EWMAs here as well (5m58s)
  • Bias Correction of Exponentially Weighted Averages by Andrew Ng. EWMAs are actually quite bad estimates in the initial phase: they tend to underestimate the average. Here Andrew Ng explains how to correct for this artefact by multiplying the EWMA by 1/(1 - beta^t) (4m11s)
  • Learning Rate Decay by Andrew Ng. Shows the motivation why to decrease the learning rate and shows some formulas that are typically used to decrease the learning rate as a function of the number of training epochs (6m44s)
  • Tuning Process by Andrew Ng. Advises against a grid sampling strategy, recommending random sampling and a coarse-to-fine search instead (7m10s)
  • Normalizing Activations in a Network by Andrew Ng. Explains the key idea of activation normalization: introducing a normalization step for the activation or output values of the neurons in a layer, such that they have a certain mean and a certain variance. This introduces two new learnable parameters per layer, which allow the network to learn a good mean and variance for the activations / output values of the neurons in that layer (8m54s)
  • Fitting Batch Norm Into Neural Networks by Andrew Ng. Explains that the normalization statistics are normally computed from the per-layer activations of a batch of samples, and shows that the usual bias vectors are no longer needed, since they are already absorbed into the batch normalization step, where we learn the best mean for the activation of each neuron in the current layer (12m55s)
  • Why Does Batch Norm Work? by Andrew Ng. Explains why batch normalization helps: 1. it gives later layers a more stable input; 2. it acts as a regularizer, similarly to dropout, because the normalization statistics are computed on each mini-batch separately and thus add noise to the activation values of the neurons (11m39s)
  • Batch Norm At Test Time by Andrew Ng. Explains an important difference between training and inference. During training, the normalization of the neurons' activation values is computed on the basis of a mini-batch; so what do we do at test time? Andrew Ng explains that we usually keep track of an exponentially weighted moving average of the means and variances computed during training and use these estimates for the normalization step at test (inference) time (5m46s)
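The EWMA mechanics and bias correction described in the videos above can be sketched in a few lines of plain Python (a hedged illustration, not taken from the lecture; the constant signal and beta = 0.9 are arbitrary example choices):

```python
# Sketch of an exponentially weighted moving average (EWMA) with bias
# correction. The signal and beta are example choices for illustration.
beta = 0.9
values = [1.0, 1.0, 1.0, 1.0, 1.0]  # constant signal; the true average is 1.0

v = 0.0
for t, x in enumerate(values, start=1):
    v = beta * v + (1 - beta) * x       # raw EWMA underestimates at the start
    v_corrected = v / (1 - beta ** t)   # bias correction: multiply by 1/(1 - beta^t)

# After 5 steps the raw EWMA is still far below the true average of 1.0,
# while the bias-corrected estimate recovers it exactly for a constant signal.
print(v, v_corrected)
```

For a constant signal the raw EWMA after t steps equals 1 - beta^t, which is exactly the factor the bias correction divides out.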
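How the momentum and RMSProp ideas combine in Adam can be sketched as a single scalar update step (a hedged sketch; the hyperparameter names and values follow common convention, not the lecture slides):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter: a momentum-style first moment
    plus an RMSProp-style second moment, each with bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # EWMA of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2   # EWMA of squared gradients (RMSProp part)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 starting from theta = 5.0.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * theta                          # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Note that dropping the second-moment terms recovers (bias-corrected) momentum, and dropping the first-moment terms recovers RMSProp, which is the combination the Adam video points out.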


This is old material from my Deep Learning lecture held in winter term 16/17:

Slides: Slides of all lectures until January 19, 2017