Lectures

Here, you can find the recordings of the lecture videos.

  • Lecture 0: Course Overview and Logistics
    Overview: In this lecture, we go through the course logistics. The audio quality is poor, as the teaching station did not work.
    [link]

    Lecture Notes:

  • Lecture 1: Why Deep Learning
    Introduction: This lecture gives a short motivating introduction to deep learning.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 2: Machine Learning vs Analysis
    ML - Part I: In this lecture, we understand the concepts of data-driven analysis and machine learning. We look at a toy example in which we want to find the hitting distance of a projectile. We solve this problem using both the analytic and the machine learning approach (a small sketch contrasting the two is given at the end of this entry).
    [link]

    Lecture Notes:
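
    A minimal Python sketch (not from the lecture; all names and constants are illustrative) contrasting the analytic solution with a data-driven fit for the projectile example:

    ```python
    import numpy as np

    g = 9.81  # gravitational acceleration (m/s^2)

    def analytic_range(v, theta):
        """Analytic hitting distance of a projectile launched at speed v and angle theta."""
        return v ** 2 * np.sin(2 * theta) / g

    # Data-driven alternative: fit a model to noisy (angle, distance) measurements.
    rng = np.random.default_rng(0)
    v = 20.0
    thetas = rng.uniform(0.1, 1.4, size=200)                                # launch angles (rad)
    distances = analytic_range(v, thetas) + rng.normal(0.0, 1.0, size=200)  # noisy observations

    # Least-squares fit of distance against sin(2*theta) recovers the physics constant v^2/g.
    slope, intercept = np.polyfit(np.sin(2 * thetas), distances, deg=1)
    print("learned v^2/g:", slope, "analytic v^2/g:", v ** 2 / g)
    ```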

  • Lecture 3: ML Component 1 - Data
    ML Part II - Data: We saw that each ML solution consists of three components: data, model, and loss. We now get to know the first component, data.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 4: Supervised, Unsupervised and Semi-supervised
    ML Part II - Data: We give a simple example of unsupervised learning. We also take a look at other possible settings.
    [link]

    Lecture Notes:

  • Lecture 5: Components 2 and 3: Model and Loss
    ML Part III - Model and Loss: We now formally define the notions of a computational model and a loss function. In this respect, we understand what hyperparameters and learnable parameters are (a small sketch illustrating the difference is given at the end of this entry).
    [link]

    Lecture Notes:

    Further Reads:
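
    A tiny sketch (illustrative choices, not the lecture's example) separating the two kinds of parameters in a polynomial model with a squared loss:

    ```python
    import numpy as np

    # Hyperparameter: chosen by us before training.
    degree = 3

    # Learnable parameters: adjusted by training (here just initialized randomly).
    rng = np.random.default_rng(0)
    w = rng.normal(size=degree + 1)

    def model(x, w):
        """Polynomial computational model y = w_0 + w_1 x + ... + w_d x^d."""
        return sum(w_k * x ** k for k, w_k in enumerate(w))

    def loss(y_pred, y_true):
        """Squared-error loss comparing a prediction to a label."""
        return (y_pred - y_true) ** 2

    print(loss(model(2.0, w), y_true=1.5))
    ```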

  • Lecture 6: First Example -- Classification by Perceptron
    Classification: We now take up our first example, namely image classification. We see what the data looks like in this case. We further build a computational model for it, called the Perceptron, and specify the loss (sketched briefly below).
    [link]

    Lecture Notes:

    Further Reads:

    • Binary Classification: Chapter 5 - Sections 5.1 and 5.2 of [BB]
    • McCulloch-Pitts Model: Paper A logical calculus of the ideas immanent in nervous activity published in the Bulletin of Mathematical Biophysics by Warren McCulloch and Walter Pitts in 1943, proposing a computational model for the neuron. This paper is regarded as the pioneering study that led to the idea of the artificial neuron
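
    A minimal sketch of the Perceptron model and the error-rate loss (toy data; all numbers are illustrative):

    ```python
    import numpy as np

    def perceptron(x, w, b):
        """Perceptron: a thresholded linear map, predicting a label in {-1, +1}."""
        return 1 if np.dot(w, x) + b >= 0 else -1

    def error_rate(X, y, w, b):
        """Fraction of misclassified samples: the classification loss used here."""
        preds = np.array([perceptron(x, w, b) for x in X])
        return np.mean(preds != y)

    X = np.array([[2.0, 1.0], [-1.0, -3.0], [0.5, -0.5]])
    y = np.array([1, -1, 1])
    print(error_rate(X, y, w=np.array([1.0, 1.0]), b=0.0))
    ```
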
  • Lecture 7: Recap -- Law of Large Numbers
    Recap: We briefly review the main notions of probability theory and the law of large numbers (LLN). This will help us better understand training via risk minimization (a tiny numerical illustration of the LLN is given below).
    [link]

    Lecture Notes:

    Further Reads:
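
    A tiny numerical illustration of the LLN (the exponential distribution here is an arbitrary choice): the empirical mean of i.i.d. samples approaches the true expectation as the sample size grows.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.exponential(scale=2.0, size=100_000)  # true mean = 2.0

    for n in (10, 1_000, 100_000):
        print(f"mean of first {n:>7} samples: {samples[:n].mean():.3f}")
    ```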

  • Lecture 8: Training via Empirical Risk Minimization
    Training: We formulate the training problem by considering our main objective, i.e., to build a model that on average determines any label accurately. We then see that, using the LLN, we can estimate the objective of this formulation by the so-called empirical risk. This explains why we always minimize the average loss over our dataset (sketched below).
    [link]

    Lecture Notes:

    Further Reads:

    • Overview on Risk Minimization: Paper An overview of statistical learning theory published in the IEEE Transactions on Neural Networks by Vladimir N. Vapnik in 1999, giving an overview of his lifelong work on statistical learning
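
    A minimal sketch of the empirical risk as the dataset average of the loss (the linear model and squared loss are illustrative assumptions):

    ```python
    import numpy as np

    def empirical_risk(model, loss, X, y):
        """Average loss over the dataset: an LLN-based estimate of the true risk."""
        return np.mean([loss(model(x), t) for x, t in zip(X, y)])

    # Toy setup: a linear model and a squared loss.
    model = lambda x: 0.5 * x
    loss = lambda pred, target: (pred - target) ** 2

    X = np.array([1.0, 2.0, 3.0])
    y = np.array([0.4, 1.1, 1.6])
    print(empirical_risk(model, loss, X, y))
    ```
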
  • Lecture 9: Training Perceptron Machine
    Perceptron Algorithm: We next train the perceptron machine. We see that training in this case minimizes the error rate on the dataset. We study the Perceptron Algorithm by Rosenblatt, which is the simplest form of a gradient-based training algorithm (a minimal implementation is sketched below).
    [link]

    Lecture Notes:

    Further Reads:
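
    A minimal implementation sketch of Rosenblatt's Perceptron Algorithm on a separable toy set (learning rate, epoch count and data are illustrative):

    ```python
    import numpy as np

    def train_perceptron(X, y, epochs=20, lr=1.0):
        """Rosenblatt's rule: nudge the weights only on misclassified samples."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for x, t in zip(X, y):                   # labels t in {-1, +1}
                if t * (np.dot(w, x) + b) <= 0:      # sample is misclassified
                    w += lr * t * x
                    b += lr * t
        return w, b

    # Linearly separable toy data.
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
    y = np.array([1, 1, -1, -1])
    w, b = train_perceptron(X, y)
    print(np.sign(X @ w + b))  # should match y
    ```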

  • Lecture 10: From Perceptron to NNs -- Universal Approximation
    Intro to NNs - Part I: We are now ready to study NNs. We look into the example of learning XOR and see that a single Perceptron is not able to learn it. We then build a network of perceptrons that learns XOR; this gives birth to the idea of NNs. We see that NNs are universal approximators, i.e., they can approximate any complicated function as accurately as we want if they are large enough (the XOR network is sketched below).
    [link]

    Lecture Notes:

    Further Reads:

    • Universal Approximation: Paper Approximation by superpositions of a sigmoidal function published in Mathematics of Control, Signals and Systems by George V. Cybenko in 1989
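
    A sketch of a two-layer network of perceptrons computing XOR (thresholds chosen by hand for illustration):

    ```python
    def step(z):
        """Hard-threshold (perceptron) activation."""
        return 1.0 if z >= 0 else 0.0

    def xor_net(x1, x2):
        """XOR via two perceptron layers: x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2)."""
        h1 = step(x1 + x2 - 0.5)    # first-layer perceptron computing OR
        h2 = step(x1 + x2 - 1.5)    # first-layer perceptron computing AND
        return step(h1 - h2 - 0.5)  # output perceptron: h1 AND NOT h2

    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            print(int(a), int(b), "->", int(xor_net(a, b)))
    ```
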
  • Lecture 11: Deep Neural Networks
    Intro to NNs - Part II: We now formally define deep neural networks. We understand the notions of depth and layer, as well as the key components of these architectures.
    [link]

    Lecture Notes:

    Further Reads:

    • DNNs: Chapter 6 - Sections 6.2 and 6.3 of [BB]
  • Lecture 12: Iterative Optimization by Gradient Descent
    Function Optimization - Part I: Gradient descent is the key algorithm enabling the training of DNNs. We take a look at its foundations to understand how and why it works (a minimal loop is sketched below).
    [link]

    Lecture Notes:

    Further Reads:
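
    A minimal gradient descent loop on a one-dimensional quadratic (function and step size are illustrative):

    ```python
    # Gradient descent on f(w) = (w - 3)^2, whose gradient is f'(w) = 2 (w - 3).
    w = 0.0    # initial point
    lr = 0.1   # learning rate
    for t in range(50):
        grad = 2 * (w - 3)
        w -= lr * grad  # step against the gradient
    print(w)  # approaches the minimizer w* = 3
    ```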

  • Lecture 13: More on Gradient Descent
    Function Optimization - Part II: We take a look at the behavior of the gradient descent algorithm. We see that it converges to a minimizer if we choose the learning rate properly. We also learn how to handle cases with discontinuous or non-differentiable functions (see the subgradient sketch below).
    [link]

    Lecture Notes:

    Further Reads:
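
    A sketch of handling a non-differentiable function with a subgradient and a decaying learning rate (illustrative choices):

    ```python
    # f(w) = |w| is not differentiable at 0; use a subgradient and a decaying step.
    w = 5.0
    for t in range(1, 200):
        subgrad = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
        w -= (1.0 / t) * subgrad  # decaying learning rate 1/t
    print(w)  # close to the minimizer w* = 0
    ```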

  • Lecture 14: Forward Propagation in MLPs
    Forward Pass: We start with our first architecture, i.e., fully-connected FNNs, which are also known as MLPs. We try to understand how these NNs computationally extract what we look for from the input, i.e., how inference occurs. Computationally, this is done by what we call forward propagation of data through the NN (sketched below).
    [link]

    Lecture Notes:

    Further Reads:
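
    A minimal forward-propagation sketch for an MLP (random parameters; applying ReLU at every layer, including the last, is a simplifying assumption):

    ```python
    import numpy as np

    def forward(x, weights, biases):
        """Forward propagation through a fully-connected NN (MLP) with ReLU layers."""
        h = x
        for W, b in zip(weights, biases):
            h = np.maximum(0.0, W @ h + b)  # affine map followed by the activation
        return h

    rng = np.random.default_rng(0)
    # A 3 -> 4 -> 2 MLP with randomly initialized parameters.
    weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
    biases = [np.zeros(4), np.zeros(2)]
    print(forward(np.array([1.0, -2.0, 0.5]), weights, biases))
    ```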

  • Lecture 15: Training Neural Networks via GD
    Backward Pass - Part I: We give the big picture of training an NN on a dataset via gradient descent. We see that we need to find sample gradients to be able to train NNs.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 16: Chain Rule on Computation Graph
    Backward Pass - Part II: We now define the notion of a computation graph and see how we can numerically compute any gradient on this graph using the chain rule (a tiny worked example follows below).
    [link]

    Lecture Notes:

    Further Reads:
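
    A tiny worked example of the chain rule on a two-node computation graph, checked against finite differences:

    ```python
    import numpy as np

    # Computation graph for f(x) = sin(x^2): node u = x^2 feeds node f = sin(u).
    x = 1.3
    u = x ** 2      # forward pass through the graph
    f = np.sin(u)

    # Chain rule along the graph: df/dx = (df/du) * (du/dx).
    df_du = np.cos(u)
    du_dx = 2 * x
    grad = df_du * du_dx

    # Numerical check by central finite differences.
    eps = 1e-6
    numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)
    print(grad, numeric)
    ```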

  • Lecture 17: Backward Pass on Computation Graph
    Backward Pass - Part III: We start with the computation graph of the perceptron. We see that computing the sample gradient is equivalent to propagating computations backward on the graph. This gives us an idea of how we can compute sample gradients on much larger networks.
    [link]

    Lecture Notes:

    Further Reads:

  • Lecture 18: Backpropagation over MLP
    Backpropagation: A neural network can be described by a computation graph. We use this fact to develop an algorithmic approach for gradient computation, known as backpropagation. We see how we can backpropagate over an MLP and how we can use it to train (a manual sketch is given below).
    [link]

    Lecture Notes:

    Further Reads:

    • Backpropagation: Chapter 8 of [BB]
    • Backpropagation of Error: Paper Learning representations by back-propagating errors published in Nature by D. Rumelhart, G. Hinton and R. Williams in 1986, advocating the idea of systematic gradient computation over a computation graph
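
    A manual backpropagation sketch for a one-hidden-layer MLP with a squared loss (shapes and data are illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, -1.0])
    t = 0.5  # target label
    W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
    w2, b2 = rng.normal(size=3), 0.0

    # Forward pass, storing the intermediate nodes of the computation graph.
    z = W1 @ x + b1
    h = np.maximum(0.0, z)       # ReLU
    y = w2 @ h + b2
    loss = 0.5 * (y - t) ** 2

    # Backward pass: propagate derivatives from the loss back to the parameters.
    dy = y - t                   # dloss/dy
    dw2, db2 = dy * h, dy
    dh = dy * w2
    dz = dh * (z > 0)            # gradient gated by the ReLU
    dW1, db1 = np.outer(dz, x), dz
    print(dW1, db1, dw2, db2)
    ```
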
  • Lecture 19: First Neural Classifier
    Backpropagation: We now use our knowledge to build our first neural classifier for a binary task. We see that our initial choices of loss and activation simply do not work, because they do not provide useful gradients. We replace them with the cross-entropy loss and sigmoid activation to build a trainable model (sketched below).
    [link]

    Lecture Notes:

    Further Reads:
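
    A minimal sketch of the sigmoid activation and binary cross-entropy, showing the informative gradient they provide (numbers are illustrative):

    ```python
    import numpy as np

    def sigmoid(z):
        """Smooth squashing to (0, 1): a differentiable replacement for the step."""
        return 1.0 / (1.0 + np.exp(-z))

    def binary_cross_entropy(p, t):
        """Cross-entropy loss for a target t in {0, 1} and predicted probability p."""
        return -(t * np.log(p) + (1 - t) * np.log(1 - p))

    z, t = 0.8, 1.0
    p = sigmoid(z)
    print(binary_cross_entropy(p, t))
    # The gradient w.r.t. the pre-activation z is simply (p - t): nonzero and
    # informative, unlike the step activation / error-rate pair.
    print(p - t)
    ```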

  • Lecture 20: Multiclass Classification
    Multiclass: We develop a neural multiclass classifier with a vector-activated neuron. We learn how we can look at it as a probability-computing machine, and how we can backpropagate over it (a sketch follows below).
    [link]

    Lecture Notes:

    Further Reads:
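
    A sketch of a vector activation for multiclass outputs; softmax is assumed here as the standard choice:

    ```python
    import numpy as np

    def softmax(z):
        """Vector activation mapping class scores to a probability vector."""
        e = np.exp(z - z.max())  # shift for numerical stability
        return e / e.sum()

    z = np.array([2.0, 0.5, -1.0])  # class scores
    p = softmax(z)
    print(p, p.sum())               # a valid probability distribution

    # Cross-entropy against the true class c; its gradient w.r.t. z is p - onehot(c).
    c = 0
    print(-np.log(p[c]))
    ```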

  • Lecture 21: Stochastic Gradient Descent
    SGD - Part I: We see that full-batch training is in general computationally hard. We come up with a simple remedy, called sample-level training. However, it can lead to a repetitive behavior; we hence add stochastic shuffling to this algorithm. This leads to the so-called Stochastic Gradient Descent (sketched below).
    [link]

    Lecture Notes:

    Further Reads:
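
    A minimal SGD sketch with per-epoch shuffling on a toy one-parameter problem (all constants are illustrative):

    ```python
    import numpy as np

    # Sample-level training with shuffling on f(w) = mean_i (w - y_i)^2 / 2.
    rng = np.random.default_rng(0)
    y = rng.normal(loc=3.0, size=100)  # toy "dataset"
    w, lr = 0.0, 0.05

    for epoch in range(20):
        order = rng.permutation(len(y))  # stochastic shuffling each epoch
        for i in order:
            grad = w - y[i]              # sample gradient of (w - y_i)^2 / 2
            w -= lr * grad
    print(w, y.mean())  # w approaches the minimizer, the dataset mean
    ```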

  • Lecture 22: Mini-batch SGD and Complexity-Variance Tradeoff
    SGD - Part II: We see that full-batch training and basic SGD are the two extreme points of a spectrum. We can balance them using the idea of mini-batch training. This leads to a complexity-variance tradeoff (a mini-batch sketch follows below).
    [link]

    Lecture Notes:

    Further Reads:
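
    The same toy problem as in the SGD sketch above, now with mini-batches trading per-step cost against gradient variance (batch size is illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=3.0, size=100)
    w, lr, batch_size = 0.0, 0.05, 10

    for epoch in range(20):
        order = rng.permutation(len(y))
        for batch in np.array_split(order, len(y) // batch_size):
            grad = np.mean(w - y[batch])  # mini-batch gradient: lower variance than SGD
            w -= lr * grad
    print(w, y.mean())
    ```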

  • Lecture 23: Evaluation and Generalization Measures
    Testing: Once training is over, we can evaluate how well our trained model generalizes to new data. We study this procedure and get familiar with typical learning curves.
    [link]

    Lecture Notes:

    Further Reads:

    • Generalization: Chapter 6 of the book Patterns, Predictions, and Actions: A Story About Machine Learning by M. Hardt and B. Recht, published in 2021
  • Lecture 24: Linear and Sub-linear Convergence Speed
    Optimizers - Part I: We have a short discussion on the convergence of optimizers. There are two regimes of convergence speed: linear and sub-linear. While linear convergence is highly desirable, it rarely happens in practice. This motivates us to study efficient optimizers (the two regimes are contrasted numerically below).
    [link]

    Lecture Notes:

    Further Reads:

    • Notes on Optimizers: Lecture notes of the course Optimization for Machine Learning by Ashok Cutkosky at Boston University: a good resource on optimizers
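
    A quick numerical contrast of linear (error shrinking by a constant factor per step, like c^t) versus sub-linear (error shrinking like 1/t) convergence:

    ```python
    c = 0.9  # linear rate: error_t ~ c^t; sub-linear: error_t ~ 1/t
    for t in (1, 10, 100):
        print(f"t={t:>3}  linear: {c ** t:.2e}   sub-linear: {1 / t:.2e}")
    ```
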
  • Lecture 25: Optimizer Boosting -- Scheduling, Momentum and Rprop Ideas
    Optimizers - Part II: We discuss three key ideas for improving stochastic optimizers: learning rate scheduling, moving averages, and automatic dimension-dependent learning rate adaptation via Rprop. These schemes are key components of most current robust optimizers (scheduling and moving averages are sketched below).
    [link]

    Lecture Notes:

    Further Reads:

    • Learning Rate Scheduling: Paper Cyclical Learning Rates for Training Neural Networks published in the Winter Conference on Applications of Computer Vision (WACV) by Leslie N. Smith in 2017, discussing learning rate scheduling
    • Rprop: Paper A direct adaptive method for faster backpropagation learning: the RPROP algorithm published in the IEEE International Conference on Neural Networks by M. Riedmiller and H. Braun in 1993, proposing the Rprop algorithm
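
    A sketch combining a decaying learning-rate schedule with a moving average of gradients (the schedule and constants are illustrative):

    ```python
    # Gradient descent with a moving average of gradients and a decaying schedule.
    grad = lambda w: 2 * (w - 3)  # gradient of f(w) = (w - 3)^2

    w, m, beta = 0.0, 0.0, 0.9
    for t in range(1, 101):
        lr = 0.1 / (1 + 0.01 * t)            # learning rate scheduling
        m = beta * m + (1 - beta) * grad(w)  # moving average of gradients
        w -= lr * m
    print(w)  # approaches the minimizer w* = 3
    ```
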
  • Lecture 26: RMSprop and Adam
    Optimizers - Part III: We discuss two important optimizers, Root Mean Square Propagation (RMSprop) and Adam. We see that they both use the ideas of moving averages and dimension-dependent scheduling (an Adam sketch follows below).
    [link]

    Lecture Notes:

    Further Reads:

    • RMSprop: Lecture note by Geoffrey Hinton proposing RMSprop
    • RMSprop Analysis: Paper RMSProp and equilibrated adaptive learning rates for non-convex optimization by Y. Dauphin et al. published in 2015, discussing RMSprop and citing Hinton's lecture notes
    • Adam: Paper Adam: A Method for Stochastic Optimization published in 2014 by D. Kingma and J. Ba, proposing Adam
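
    A one-parameter sketch of the Adam update with its two moving averages and bias correction (the objective is illustrative; the constants are the usual defaults):

    ```python
    import numpy as np

    grad = lambda w: 2 * (w - 3)  # gradient of f(w) = (w - 3)^2

    w, m, v = 0.0, 0.0, 0.0
    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
    for t in range(1, 501):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # first-moment moving average
        v = beta2 * v + (1 - beta2) * g ** 2  # second-moment moving average
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)  # dimension-wise scaled step
    print(w)  # approaches the minimizer w* = 3
    ```
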
  • Lecture 27: Overfitting
    Overfitting: We talk about the fundamental problem of overfitting. This motivates us to learn approaches such as validation, regularization and data augmentation.
    [link]

    Lecture Notes:

    Further Reads: