Lectures

Here, you can find the recordings of the lecture videos.

  • Lecture 0: Course Overview and Logistics
    Overview: In this lecture, we go through the course logistics. The audio quality is poor, as the teaching station did not work.
    [link]

    Lecture Notes:

  • Lecture 1: Why Deep Learning
    Introduction: This lecture gives a short motivating introduction to deep learning.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 2: Machine Learning vs Analysis
    ML - Part I: In this lecture, we understand the concept of data-driven analysis and machine learning. We look at a toy example in which we want to find the hitting distance of a projectile. We solve this problem using both analytic and machine learning approaches.
    [link]

    Lecture Notes:

  • Lecture 3: ML Component 1 - Data
    ML Part II - Data: We saw that each ML solution consists of three components: data, model, and loss. We now get to know data.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 4: Supervised, Unsupervised and Semi-supervised
    ML Part II - Data: We give a simple example of unsupervised learning. We also take a look at other possible cases.
    [link]

    Lecture Notes:

  • Lecture 5: Components 2 and 3: Model and Loss
    ML Part III - Model and Loss: We now formally define the notions of a computational model and a loss function. In doing so, we understand what hyperparameters and learnable parameters are.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 6: First Example -- Classification by Perceptron
    Classification: We now take up our first example, image classification. We see what the data looks like in this case. We further build a computational model for it, called the perceptron, and specify the loss.
    [link]

    Lecture Notes:

    Further Reads:

    • Binary Classification: Chapter 5 - Sections 5.1 and 5.2 of [BB]
    • McCulloch-Pitts Model: Paper A logical calculus of the ideas immanent in nervous activity published in the Bulletin of Mathematical Biophysics by Warren McCulloch and Walter Pitts in 1943, proposing a computational model for the neuron. This paper is regarded as the pioneering study leading to the idea of the artificial neuron
  • Lecture 7: Recap -- Law of Large Numbers
    Recap: We briefly review the main notions of probability theory and the law of large numbers (LLN). This will help us better understand training through risk minimization.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 8: Training via Empirical Risk Minimization
    Training: We formulate the training problem by considering our main objective, i.e., to build a model that, on average, determines any label accurately. We then see that, using the LLN, we can estimate the objective of this formulation by the so-called empirical risk. This explains why we always minimize the average loss over our dataset. A short code sketch at the end of this item illustrates the idea.
    [link]

    Lecture Notes:

    Further Reads:

    • Overview on Risk Minimization: Paper An overview of statistical learning theory published in the IEEE Transactions on Neural Networks by Vladimir N. Vapnik in 1999, surveying his lifelong developments in ML
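
    For illustration (not part of the lecture material), here is a minimal sketch of the idea: the risk of a fixed model is estimated by its empirical risk, i.e., the average loss over a sampled dataset. The linear toy model, the squared loss and the synthetic data are assumptions made only for this example.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def model(x, w):
        # a fixed linear model f(x) = w * x (an arbitrary illustrative choice)
        return w * x

    def loss(y_hat, y):
        # squared loss as an example of a per-sample loss
        return (y_hat - y) ** 2

    # synthetic data: y = 2x + noise
    x = rng.uniform(-1.0, 1.0, size=1000)
    y = 2.0 * x + 0.1 * rng.standard_normal(1000)

    w = 1.5
    empirical_risk = np.mean(loss(model(x, w), y))  # LLN: this average approximates the true risk
    print(f"empirical risk at w={w}: {empirical_risk:.4f}")
    ```
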
  • Lecture 9: Training Perceptron Machine
    Perceptron Algorithm: We next train the perceptron machine. We see that training in this case minimizes the error rate on the dataset. We study the Perceptron Algorithm by Rosenblatt, which is the simplest form of a gradient-based training algorithm; a small code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

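    A minimal sketch of Rosenblatt's update rule, assuming a toy linearly separable dataset with labels in {-1, +1}; the learning rate, the data and the stopping rule are illustrative choices, not the lecture's exact setup.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # toy linearly separable data with labels in {-1, +1}
    X = rng.standard_normal((200, 2))
    y = np.sign(X @ np.array([2.0, -1.0]) + 0.5)

    w = np.zeros(2)
    b = 0.0
    lr = 1.0

    for epoch in range(100):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:        # misclassified sample
                w += lr * yi * xi             # Rosenblatt's update
                b += lr * yi
                errors += 1
        if errors == 0:                       # stop once every sample is classified correctly
            break

    print(f"stopped after {epoch + 1} epochs, w = {w}, b = {b:.2f}")
    ```
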
  • Lecture 10: From Perceptron to NNs -- Universal Approximation
    Intro to NNs - Part I: We are now ready to study NNs. We look at the example of learning XOR and see that a single perceptron is not able to do it. We then build a network of perceptrons to learn XOR (a small sketch of such a network follows below). This gives birth to the idea of NNs. We see that NNs are universal approximators, i.e., they can approximate any complicated function as accurately as we want if they are large enough.
    [link]

    Lecture Notes:

    Further Reads:

    • Universal Approximation: Paper Approximation by superpositions of a sigmoidal function published in Mathematics of Control, Signals and Systems by George V. Cybenko in 1989
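
    To make the XOR example concrete, here is one possible hand-crafted two-layer network of threshold units that computes XOR; the particular weights and thresholds are just one of many valid choices and are not taken from the lecture.

    ```python
    import numpy as np

    def step(z):
        # hard-threshold activation of a perceptron unit
        return (z >= 0).astype(float)

    def xor_net(x1, x2):
        x = np.array([x1, x2], dtype=float)
        # hidden layer: one unit fires for OR, one for AND
        h = step(np.array([x @ np.array([1.0, 1.0]) - 0.5,    # OR
                           x @ np.array([1.0, 1.0]) - 1.5]))  # AND
        # output unit: OR and not AND  ->  XOR
        return step(h @ np.array([1.0, -2.0]) - 0.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, int(xor_net(a, b)))
    ```
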
  • Lecture 11: Deep Neural Networks
    Intro to NNs - Part II: We now formally define deep neural networks. We understand the notions of depth and layers, as well as the key components of these architectures.
    [link]

    Lecture Notes:

    Further Reads:

    • DNNs: Chapter 6 - Sections 6.2 and 6.3 of [BB]
  • Lecture 12: Iterative Optimization by Gradient Descent
    Function Optimization - Part I: Gradient descent is the key algorithm enabling the training of DNNs. We take a look at its foundation to understand how and why it works; a minimal code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

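    A minimal sketch of the plain gradient descent iteration on a simple one-dimensional function; the objective, the learning rate and the number of steps are illustrative assumptions.

    ```python
    def f(w):
        # a simple convex function with minimizer at w = 3
        return (w - 3.0) ** 2

    def grad_f(w):
        return 2.0 * (w - 3.0)

    w = 0.0       # initial guess
    lr = 0.1      # learning rate (step size)

    for t in range(50):
        w = w - lr * grad_f(w)   # move against the gradient

    print(f"w after 50 steps: {w:.4f}, f(w) = {f(w):.6f}")
    ```
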
  • Lecture 13: More on Gradient Descent
    Function Optimization - Part II: We take a look at the behavior of the gradient descent algorithm. We see that it converges to a minimizer if we choose a suitable learning rate. We also learn how to handle cases with discontinuous or non-differentiable functions.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 14: Forward Propagation in MLPs
    Forward Pass: We start with our first architecture, i.e., fully-connected FNNs, which are also known as MLPs. We try to understand how these NNs computationally extract what we look for from the input, i.e., how inference occurs. This is done by what we call forward propagation of data through the NN; a small code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

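    A minimal sketch of forward propagation through a small fully-connected network; the layer sizes, the ReLU activation and the random weights are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(z, 0.0)

    # layer sizes: 4 inputs -> 8 hidden -> 3 outputs (arbitrary)
    sizes = [4, 8, 3]
    weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    def forward(x):
        a = x
        for l, (W, b) in enumerate(zip(weights, biases)):
            z = a @ W + b                                   # affine map of layer l
            a = relu(z) if l < len(weights) - 1 else z      # no activation on the output layer
        return a

    x = rng.standard_normal(4)   # one input sample
    print(forward(x))
    ```
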
  • Lecture 15: Training Neural Networks via GD
    Backward Pass - Part I: We give the big picture of training an NN on a dataset via gradient descent. We see that we need to compute sample gradients to be able to train NNs.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 16: Chain Rule on Computation Graph
    Backward Pass - Part II: We now define the notion of a computation graph and see how we can numerically compute any gradient on this graph using the chain rule.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 17: Backward Pass on Computation Graph
    Backward Pass - Part III: We start with the computation graph of the perceptron. We see that computing the sample gradient is equivalent to propagating computations backward on the graph. This gives us an idea of how we can compute sample gradients on much larger networks.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 18: Backpropagation over MLP
    Backpropagation: A neural network can be described by a computation graph. We use this fact to develop an algorithmic approach for gradient computation, known as backpropagation. We see how we can backpropagate over an MLP and how we can use it to train; a short code sketch follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • Backpropagation: Chapter 8 of [BB]
    • Backpropagation of Error Paper Learning representations by back-propagating errors published in Nature by D. Rumelhart, G. Hinton and R. Williams in 1986 advocating the idea of systematic gradient computation over a computation graph
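
    A minimal sketch of backpropagation over a one-hidden-layer MLP with squared loss, written out by hand via the chain rule; the architecture, the loss and the random data are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # one-hidden-layer MLP: x -> ReLU(x W1 + b1) -> scalar output
    W1, b1 = rng.standard_normal((3, 5)) * 0.1, np.zeros(5)
    W2, b2 = rng.standard_normal((5, 1)) * 0.1, np.zeros(1)

    x = rng.standard_normal((1, 3))
    y = np.array([[1.0]])

    # forward pass (keep intermediate values for the backward pass)
    z1 = x @ W1 + b1
    a1 = np.maximum(z1, 0.0)
    y_hat = a1 @ W2 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # backward pass: propagate gradients from the loss to the parameters
    d_yhat = y_hat - y                 # dL/dy_hat
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T
    d_z1 = d_a1 * (z1 > 0)             # ReLU derivative
    dW1 = x.T @ d_z1
    db1 = d_z1.sum(axis=0)

    print(loss, dW1.shape, dW2.shape)
    ```
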
  • Lecture 19: First Neural Classifier
    Backpropagation: We now use our knowledge to build our first neural classifier for a binary task. We see that our initial choices of loss and activation simply do not work because they do not have useful gradients. We replace them with the cross-entropy loss and sigmoid activation to build a trainable model; a small code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

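    As a concrete companion, a minimal sketch of the sigmoid activation and the binary cross-entropy loss; the clipping constant is a standard numerical safeguard, and the example values are arbitrary. The gradient of their composition with respect to the pre-activation takes the simple form p - y.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bce_loss(p, y, eps=1e-12):
        # binary cross-entropy for labels y in {0, 1}
        p = np.clip(p, eps, 1.0 - eps)
        return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    # gradient of BCE(sigmoid(z)) w.r.t. z has the simple form (p - y)
    z = np.array([-2.0, 0.5, 3.0])
    y = np.array([0.0, 1.0, 1.0])
    p = sigmoid(z)
    print(bce_loss(p, y), p - y)
    ```
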
  • Lecture 20: Multiclass Classification
    Multiclass: We develop a neural multiclass classifier with a vector-activated neuron. We learn how to view it as a probability-computing machine, and how to backpropagate over it; a small code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

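    A minimal sketch of the softmax output and the multiclass cross-entropy loss; the shift by the maximum is a standard numerical-stability trick, and the example logits are arbitrary.

    ```python
    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def cross_entropy(p, label, eps=1e-12):
        # negative log-probability assigned to the correct class
        return -np.log(p[label] + eps)

    logits = np.array([2.0, -1.0, 0.5])   # raw scores for 3 classes
    p = softmax(logits)
    print(p, cross_entropy(p, label=0))
    ```
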
  • Lecture 21: Stochastic Gradient Descent
    SGD - Part I: We see that full-batch training is in general computationally hard. We come up with a simple remedy, called sample-level training. However, it can lead to repetitive behavior. We hence add stochastic shuffling to this algorithm. This leads to the so-called Stochastic Gradient Descent.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 22: Mini-batch SGD and Complexity-Variance Tradeoff
    SGD - Part II: We see that full-batch training and basic SGD are the two extreme points of a spectrum. We can balance them using the idea of mini-batch training, which leads to a complexity-variance tradeoff; a small code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

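    A minimal sketch of mini-batch SGD with per-epoch shuffling on a linear-regression toy problem; the model, the loss, the batch size and the learning rate are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # synthetic linear-regression data
    X = rng.standard_normal((1000, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)

    w = np.zeros(3)
    lr, batch_size = 0.1, 32

    for epoch in range(20):
        perm = rng.permutation(len(X))            # stochastic shuffling each epoch
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]  # one mini-batch
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # mini-batch gradient of the MSE
            w -= lr * grad

    print(w)   # should be close to [1.0, -2.0, 0.5]
    ```
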
  • Lecture 23: Evaluation and Generalization Measures
    Testing: Once we are done with training, we can evaluate how well our trained model generalizes to new data. We study this procedure and get familiar with typical learning curves.
    [link]

    Lecture Notes:

    Further Reads:

    • Generalization: Chapter 6 of the Book Patterns, predictions, and actions: A story about machine learning by Moritz Hardt and B. Recht published in 2021
  • Lecture 24: Linear and Sub-linear Convergence Speed
    Optimizers - Part I: We briefly discuss the convergence of optimizers. There are two regimes of convergence speed: linear and sub-linear. While linear convergence is highly desirable, it typically does not happen in practice. This motivates us to study efficient optimizers.
    [link]

    Lecture Notes:

    Further Reads:

    • Notes on Optimizers Lecture notes of the course Optimization for Machine Learning by Ashok Cutkosky at Boston University: a good resource on optimizers
  • Lecture 25: Optimizer Boosting -- Scheduling, Momentum and Rprop Ideas
    Optimizers - Part II: We discuss three key ideas for improving stochastic optimizers: learning rate scheduling, moving averages (momentum), and automatic dimension-dependent learning rate adaptation via Rprop. These schemes are key components of most current robust optimizers.
    [link]

    Lecture Notes:

    Further Reads:

    • Learning Rate Scheduling Paper Cyclical Learning Rates for Training Neural Networks published in Winter Conference on Applications of Computer Vision (WACV) by Leslie N. Smith in 2017 discussing learning rate scheduling
    • Rprop Paper A direct adaptive method for faster backpropagation learning: the RPROP algorithm published in IEEE International Conference on Neural Networks by M. Riedmiller and H. Braun in 1993 proposing Rprop algorithm
  • Lecture 26: RMSprop and Adam
    Optimizers - Part III: We discuss two important optimizers, Root Mean Square Propagation (RMSprop) and Adam. We see that they both use the ideas of moving averages and dimension-dependent scheduling; a small code sketch follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • RMSprop Lecture note by Geoffrey Hinton proposing RMSprop
    • RMSprop Analysis Paper RMSProp and equilibrated adaptive learning rates for non-convex optimization by Y. Dauphin et al. published in 2015 discussing RMSprop and citing Hinton’s lecture notes
    • Adam Paper Adam: A Method for Stochastic Optimization published in 2014 by D. Kingma and J. Ba proposing Adam
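
    A minimal sketch of the Adam update on a toy quadratic objective, showing the two moving averages and the per-dimension scaling; the hyperparameter values are the commonly used defaults, assumed here for illustration.

    ```python
    import numpy as np

    def grad(w):
        # gradient of the toy objective 0.5 * ||w - target||^2
        return w - np.array([3.0, -1.0])

    w = np.zeros(2)
    m = np.zeros(2)            # moving average of gradients (momentum)
    v = np.zeros(2)            # moving average of squared gradients
    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

    for t in range(1, 501):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)                 # bias correction
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-dimension scaled step

    print(w)   # approaches [3.0, -1.0]
    ```
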
  • Lecture 27: Overfitting
    Overfitting: We talk about the fundamental problem of overfitting. This motivates us to learn approaches such as validation, regularization and data augmentation.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 28: Sources of Overfitting
    Overfitting: We talk about three key factors, namely model complexity, data size, and co-adaptation, which all contribute to overfitting. We discuss solutions to handle each of these situations.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 29: Regularization
    Regularization: We learn how to restrict the co-adaptation behavior of the model parameters. This is called regularization. We talk about the well-known approaches of weight decay and Lasso regularization; a small code sketch follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • Overfitting and Regularization: Chapter 9 - Sections 9.1 to 9.3 of [BB]
    • Tikhonov Paper Tikhonov Regularization and Total Least Squares published in 1999 by G. Golub et al. illustrating Tikhonov regularization
    • Lasso Paper Regression Shrinkage and Selection Via the Lasso published in 1996 by R. Tibshirani proposing the legendary Lasso
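
    A minimal sketch of how the two penalties enter the gradient of a regularized training objective; the mean-squared-error base loss, the synthetic data and the regularization strength are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def grad_mse(w, X, y):
        # gradient of the (unregularized) mean squared error
        return 2.0 * X.T @ (X @ w - y) / len(y)

    lam = 0.01   # regularization strength (assumed value)

    def grad_l2(w, X, y):
        # weight decay / ridge: adds lam * ||w||^2 to the loss
        return grad_mse(w, X, y) + 2.0 * lam * w

    def grad_l1(w, X, y):
        # Lasso: adds lam * ||w||_1 to the loss (subgradient at 0 taken as 0)
        return grad_mse(w, X, y) + lam * np.sign(w)

    X = rng.standard_normal((100, 3))
    y = X @ np.array([1.0, 0.0, -0.5])
    w = rng.standard_normal(3)
    print(grad_l2(w, X, y), grad_l1(w, X, y))
    ```
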
  • Lecture 30: Dropout
    Dropout: Dropout is a special form of regularization in which we randomly drop neurons in each iteration. We talk about dropout and learn how we can implement it; a small code sketch follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • Dropout 1 Paper Improving neural networks by preventing co-adaptation of feature detectors published in 2012 by G. Hinton et al. proposing Dropout
    • Dropout 2 Paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting published in 2014 by N. Srivastava et al. providing some analysis and illustrations on Dropout
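
    A minimal sketch of inverted dropout applied to a vector of activations; the keep probability is an assumed example value, and the rescaling by 1/p keeps the expected activation unchanged between training and test time.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(a, p_keep=0.8, training=True):
        """Inverted dropout: zero each activation with prob 1 - p_keep during training."""
        if not training:
            return a                       # no change at test time
        mask = rng.random(a.shape) < p_keep
        return a * mask / p_keep           # rescale so the expected value is unchanged

    a = rng.standard_normal(10)
    print(dropout(a))                      # training mode: some entries zeroed, rest scaled up
    print(dropout(a, training=False))
    ```
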
  • Lecture 31: Statistical Viewpoint on Data
    Data: We go over the notion of data from a statistical viewpoint. This enables us to understand Data Augmentation, Data Generation and Data Distribution. We also briefly discuss data cleaning.
    [link]

    Lecture Notes:

    Further Reads:

    • Data: Chapter 8 of the Book Patterns, predictions, and actions: A story about machine learning by Moritz Hardt and B. Recht published in 2021
    • Data Processing in Python Open Book Minimalist Data Wrangling with Python by Marek Gagolewski going through data processing in Python
  • Lecture 32: Normalization
    Normalization: We go over the idea of normalization and its impact on training. This motivates us to learn the Batch Normalization scheme.
    [link]

    Lecture Notes:

    Further Reads:

    • Normalization Paper Is normalization indispensable for training deep neural network? published in 2020 by J. Shao et al. discussing the meaning and effects of normalization
  • Lecture 33: Batch Normalization
    Batch-Norm: We study Batch Normalization. We see how backpropagation is affected when we apply batch normalization; a small code sketch follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • Batch-Norm Paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift published in 2015 by S. Ioffe and C. Szegedy proposing Batch Normalization
    • Batch-Norm Meaning Paper How Does Batch Normalization Help Optimization? published in 2018 by S. Santurkar et al. discussing why Batch Normalization works: they claim that the main reason is that it makes the loss landscape much smoother
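
    A minimal sketch of the batch-normalization forward pass for a fully-connected layer; gamma, beta and the epsilon constant follow the usual formulation, and the mini-batch here is randomly generated for illustration.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch_size, features); normalize each feature over the mini-batch
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
        return gamma * x_hat + beta             # learnable scale and shift

    x = 5.0 + 2.0 * rng.standard_normal((32, 4))
    gamma, beta = np.ones(4), np.zeros(4)
    out = batch_norm_forward(x, gamma, beta)
    print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
    ```
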
  • Lecture 34: Why Convolution?
    CNN I: We start with CNNs. We see that convolution is a scanning process with a linear filter. This operation gives us a form of shift invariance, enabling us to accomplish visual learning tasks at significantly lower complexity.
    [link]

    Lecture Notes:

    Further Reads:

    • Hubel and Wiesel Study Paper Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex published in 1962 by D. Hubel and T. Wiesel elaborating on their findings on visual processing
    • Neocognitron Paper Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position published in 1980 by K. Fukushima proposing the Neocognitron as a computational model for visual learning
    • Backpropagating on LeNet Paper Backpropagation Applied to Handwritten Zip Code Recognition published in 1989 by Y. LeCun et al. developing backpropagation for LeNet
    • LeNet Paper Gradient-Based Learning Applied to Document Recognition published in 1998 by Y. LeCun et al. discussing LeNet
  • Lecture 35: Quick Preview on CNN
    CNN II: Following the intuition we built on convolution, we get a quick start with CNNs. We see what CNNs are and what components they have.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 36: Convolution Operation and Resampling
    Convolution I: We start by formally defining convolution. We see how it works with stride 1. We further see how convolution with a larger stride can be interpreted as a resampling of the stride-1 convolution; a small code sketch follows below.
    [link]

    Lecture Notes:

    Further Reads:

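    A minimal sketch of a single-channel 2D convolution with an adjustable stride, written in the cross-correlation convention common in deep learning; the input, the kernel and the no-padding choice are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def conv2d(x, k, stride=1):
        """Single-channel 2D convolution (cross-correlation convention), no padding."""
        H, W = x.shape
        kh, kw = k.shape
        out_h = (H - kh) // stride + 1
        out_w = (W - kw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[i, j] = np.sum(patch * k)   # the filter scans the input
        return out

    x = rng.standard_normal((6, 6))
    k = rng.standard_normal((3, 3))
    print(conv2d(x, k, stride=1).shape)   # (4, 4)
    print(conv2d(x, k, stride=2).shape)   # (2, 2)
    ```
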
  • Lecture 37: Padding and Multichannel Convolution
    Convolution II: We further learn about zero padding, which lets us flexibly control the output dimensions. We then extend the basic 2D convolution to the multi-channel case, where a multi-channel input is processed into a multi-channel output.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 38: Pooling and Flattening
    Convolution III: We learn about pooling and its functionality in CNNs. We further learn how we can use basic flattening to perform the final processing on the features extracted by the CNN layers.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 39: Deep CNNs
    Deep CNNs I: We learn about deep CNNs and how they can be interpreted as locally connected MLPs. We also learn about the receptive field and the intuition it provides for building deep CNNs.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 40: Example of VGG-16
    Deep CNNs II: We investigate the example of VGG-16. This helps us understand clearly the details of CNNs.
    [link]

    Lecture Notes:

    Further Reads:

    • VGG Paper Very Deep Convolutional Networks for Large-Scale Image Recognition published in 2014 by K. Simonyan and A. Zisserman proposing VGG Architectures
  • Lecture 41: Backpropagation Through CNNs
    Training CNNs: We learn how to backpropagate through the various layers of a CNN. Interestingly, backpropagation through convolutional layers is again described by convolution. This makes the implementation of backpropagation over CNNs very efficient.
    [link]

    Lecture Notes:

    Further Reads:

    • LeCun’s Paper Gradient-based learning applied to document recognition published in 1998 by Y. LeCun et al. summarizing the learning process in CNNs
    • Efficient Backpropagation on CNN Paper High Performance Convolutional Neural Networks for Document Processing published in 2006 by K. Chellapilla et al. discussing efficient backpropagation on CNNs.
  • Lecture 42: Vanishing Gradient in Deep Networks
    Training CNNs: We learn about the vanishing and exploding gradient effects in deep neural networks. We see the sources of this behavior and understand how we can address them. This motivates us to learn residual learning.
    [link]

    Lecture Notes:

    Further Reads:

    • ResNet Paper Deep Residual Learning for Image Recognition published in 2015 by K. He et al. proposing ResNet
  • Lecture 43: Skip Connection and ResNet
    Training CNNs: We learn residual learning, an approach to handle the vanishing gradient behavior. We then study ResNet, a well-known deep network that uses residual connections; a small code sketch of a residual block follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • ResNet Paper Deep Residual Learning for Image Recognition published in 2015 by K. He et al. proposing ResNet
    • ResNet-1001 Paper Identity Mappings in Deep Residual Networks published in 2016 by K. He et al. demonstrating how deep ResNet can go
    • U-Net Paper U-Net: Convolutional Networks for Biomedical Image Segmentation published in 2015 by O. Ronneberger et al. proposing U-Net
    • DenseNet Paper Densely Connected Convolutional Networks published in 2017 by G. Huang et al. proposing DenseNet
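
    A minimal sketch of a residual (skip) connection around a small two-layer block; the layer shapes, the ReLU activation and the random weights are illustrative assumptions, and the identity shortcut assumes matching input and output dimensions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(z, 0.0)

    d = 8   # feature dimension (input and output sizes match, so the identity shortcut works)
    W1 = rng.standard_normal((d, d)) * 0.1
    W2 = rng.standard_normal((d, d)) * 0.1

    def residual_block(x):
        h = relu(x @ W1)        # first layer of the residual branch
        f = h @ W2              # residual branch F(x)
        return relu(x + f)      # skip connection: output = activation(x + F(x))

    x = rng.standard_normal(d)
    print(residual_block(x).shape)
    ```
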
  • Lecture 44: Processing Sequence Data
    Seq Models: We study sequence data and their processing by NNs. We see that, due to temporal dependencies, we cannot process the samples separately. At the same time, one-shot processing of the whole sequence is not feasible. We hence need to develop a sequence model.
    [link]

    Lecture Notes:

    Further Reads:

    • Jordan Network Paper Attractor dynamics and parallelism in a connectionist sequential machine published in 1986 by M. Jordan proposing his RNN
    • Elman Network Paper Finding structure in time published in 1990 by J. Elman proposing a revision to Jordan Network
  • Lecture 45: Sequence Processing by Recursion
    RNN I: We study the most basic design for sequence processing, i.e., a recurrent NN. We see that finding the right choice of state in these architectures is crucial; a small code sketch of the basic recurrence follows at the end of this item.
    [link]

    Lecture Notes:

    Further Reads:

    • BPTT Paper Backpropagation through time: What it does and how to do it published in 1990 by P. Werbos explaining BPTT
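
    A minimal sketch of the basic recurrence of an Elman-style RNN over a sequence; the dimensions, the tanh nonlinearity and the random parameters are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    d_in, d_state = 3, 5
    W_x = rng.standard_normal((d_in, d_state)) * 0.1      # input-to-state weights
    W_h = rng.standard_normal((d_state, d_state)) * 0.1   # state-to-state (recurrent) weights
    b = np.zeros(d_state)

    def rnn_forward(sequence):
        h = np.zeros(d_state)                      # initial state
        states = []
        for x_t in sequence:                       # process the sequence step by step
            h = np.tanh(x_t @ W_x + h @ W_h + b)   # new state depends on input and previous state
            states.append(h)
        return np.stack(states)

    seq = rng.standard_normal((7, d_in))   # a sequence of 7 time steps
    print(rnn_forward(seq).shape)          # (7, 5)
    ```
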
  • Lecture 46: Different Sequence Problems
    Seq Models II: We now think of general sequence processing and specify the different types of problems we could face. We revisit our basic RNN solution and see how learning is carried out in these models.
    [link]

    Lecture Notes:

    Further Reads:

    • Seq Models Article The Unreasonable Effectiveness of Recurrent Neural Networks written in May 2015 by A. Karpathy discussing different types of sequence problems
  • Lecture 47: Backpropagation Through Time
    RNN II: We see that with sequential processing of data, we need to backpropagate through time. This leads to a finite memory due to the vanishing gradient behavior.
    [link]

    Lecture Notes:

    Further Reads:

    • Vanishing Gradient with BPTT Paper On the difficulty of training recurrent neural networks published in 2013 by R. Pascanu et al. discussing challenges in training with BPTT
    • Truncated BPTT Paper An efficient gradient-based algorithm for on-line training of recurrent network trajectories published in 1990 by R. Williams and J. Peng explaining truncated BPTT
  • Lecture 48: Gating Principle
    Gating: We see the gating principle as a key idea to extend the memory through time. This idea leads to LSTM and GRU architectures.
    [link]

    Lecture Notes:

    Further Reads:

    • Gating Principle Chapter Long Short-Term Memory published in 2012 in the book Supervised Sequence Labelling with Recurrent Neural Networks by A. Graves explaining the gating idea