Lecture Videos
Here, you can find the recordings of the lecture videos.
-
Lecture 1: Introduction and DL Components
DL Components: This lecture gives a short motivating introduction to deep learning. We then work through a toy example and learn the basic components of DL.
[link]
Lecture Notes:
Further Reads:
- Motivation: Chapter 1 - Section 1.1 of [BB]
- Review on Linear Algebra: Chapter 2 of [GYC]
- ML Components: Chapter 1 - Sections 1.2.1 to 1.2.4 of [BB]
-
Lecture 2: Classification via Perceptron
Classification: We complete the remaining components of ML, i.e., the model and the loss. We then turn to the example of image classification and try to understand the ML components for the example of the perceptron.
[link]
Lecture Notes:
Further Reads:
- Binary Classification: Chapter 5 - Sections 5.1 and 5.2 of [BB]
- McCulloch-Pitts Model: Paper A logical calculus of the ideas immanent in nervous activity published in the Bulletin of Mathematical Biophysics by Warren McCulloch and Walter Pitts in 1943, proposing a computational model for the neuron. This paper is regarded as the pioneering study that led to the idea of the artificial neuron
-
Lecture 3: Training via Empirical Risk Minimization
Risk Minimization: We formulate the training problem by considering our main objective, i.e., to build a model that on average predicts labels accurately. We then see that, using the law of large numbers (LLN), we can estimate the objective of this formulation by the so-called empirical risk. This explains why we always minimize the average loss over our dataset.
[link]
Lecture Notes:
Further Reads:
- Overview on Risk Minimization: Paper An overview of statistical learning theory published as an overview of his lifelong work on ML in the IEEE Transactions on Neural Networks by Vladimir N. Vapnik in 1999
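The idea of empirical risk can be sketched in a few lines: the average loss over the dataset estimates the expected loss. This is a minimal sketch with an illustrative linear model, squared loss, and toy data (all assumptions, not from the lecture):

```python
# Empirical risk: the average loss of the model over the dataset,
# which by the LLN estimates the expected (true) risk.

def empirical_risk(model, loss, data):
    """Average loss of `model` over (x, y) pairs in `data`."""
    return sum(loss(model(x), y) for x, y in data) / len(data)

# Toy setup: a linear model and the squared loss (illustrative choices).
model = lambda x: 2.0 * x
loss = lambda y_hat, y: (y_hat - y) ** 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]

risk = empirical_risk(model, loss, data)  # mean of [0, 0, 1] = 1/3
```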
-
Lecture 4: Multiple Layers of Perceptrons
NNs: We train the perceptron machine. We see that training in this case minimizes the error rate on the dataset. We study the Perceptron Algorithm by Rosenblatt, which is the simplest form of a gradient-based training algorithm. We see that this algorithm works for linear classification; for nonlinear settings, we need to build a network of these neurons. This explains the idea of NNs.
[link]
Lecture Notes:
Further Reads:
- Perceptron Simulation Experiments: Paper Perceptron Simulation Experiments presented by Frank Rosenblatt in the Proceedings of the IRE in 1960
- Perceptron: Chapter 1 - Section 1.2.1 of [Ag]
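Rosenblatt's update rule can be sketched compactly: predict with the current linear rule and correct the weights on every misclassified sample. The toy dataset, learning rate, and epoch count below are illustrative assumptions:

```python
# A minimal sketch of the Perceptron Algorithm on linearly separable data.

def train_perceptron(data, epochs=20, lr=1.0):
    """data: list of (features, label) pairs with label in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            # Current linear decision rule: sign(w.x + b).
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * s <= 0:  # misclassified (or on the boundary): correct
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy separable data: label +1 iff the point lies above the line x0 + x1 = 1.
data = [([0.0, 0.0], -1), ([1.0, 1.0], +1), ([2.0, 0.0], +1), ([0.0, 0.3], -1)]
w, b = train_perceptron(data)
```

On separable data like this, the perceptron convergence theorem guarantees the loop stops making updates after finitely many corrections.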
-
Lecture 5: Universal Approximation Theorem and Deep NNs
DNNs: We see that NNs are universal approximators, i.e., they can approximate any complicated function as accurately as we want, provided they are large enough. We formally define deep neural networks and understand the notions of depth and layer, as well as the key components of these architectures.
[link]
Lecture Notes:
Further Reads:
- Universal Approximation: Paper Approximation by superpositions of a sigmoidal function published in Mathematics of Control, Signals and Systems by George V. Cybenko in 1989
- DNNs: Chapter 6 - Sections 6.2 and 6.3 of [BB]
-
Lecture 6: Iterative Optimization by Gradient Descent
Function Optimization - Part I: Gradient descent is the key algorithm enabling the training of DNNs. We take a look at its foundations to understand how and why it works. (Unfortunately, the teaching station did not work. This is the recording from last year.)
[link]
Lecture Notes:
Further Reads:
- Gradient-based Optimization: Chapter 4 - Sections 4.3 and 4.4 of [GYC]
- Gradient Descent: Chapter 7 - Sections 7.1 and 7.2 of [BB]
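The core iteration of gradient descent is short enough to sketch directly: repeatedly step against the gradient. The quadratic objective and step size below are illustrative assumptions:

```python
# A minimal sketch of gradient descent minimizing f(w) = (w - 3)^2.

def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)  # move against the gradient direction
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and unique minimizer w* = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

For this convex quadratic the error shrinks geometrically by the factor (1 - 2·lr) per step, so the iterate lands essentially on the minimizer.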
-
Lecture 07: More on Gradient Descent
Function Optimization - Part II: We take a look at the behavior of the gradient descent algorithm. We see that it always converges to a minimizer if we choose a suitable learning rate. We also learn how to handle cases with discontinuous or non-differentiable functions. (Unfortunately, the teaching station did not work. This is the recording from last year.)
[link]
-
Lecture 08: Forward Propagation in MLPs
Forward Pass: We start with our first architecture, i.e., fully-connected FNNs, also known as MLPs. We try to understand how these NNs computationally learn what we look for from the input, i.e., how inference occurs. This is done computationally by what we call forward propagation of data through the NN.
[link]
Lecture Notes:
Further Reads:
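Forward propagation through an MLP is just a chain of affine maps and activations. The layer sizes, random weights, and ReLU choice in this sketch are illustrative assumptions:

```python
# A minimal sketch of the forward pass through a fully-connected NN (MLP).
import numpy as np

def forward(x, layers):
    """layers: list of (W, b); ReLU between layers, no activation at the output."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b  # affine map of the previous layer's activations
        a = np.maximum(z, 0.0) if i < len(layers) - 1 else z  # hidden ReLU
    return a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),  # 3 inputs -> 4 hidden units
          (rng.normal(size=(2, 4)), np.zeros(2))]  # 4 hidden  -> 2 outputs
y = forward(np.ones(3), layers)                    # a length-2 output vector
```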
-
Lecture 09: Computing Gradient on Graph
Computation Graph: We give the big picture of training an NN on a dataset via gradient descent. We see that we need to find sample gradients to be able to train NNs. This can be done algorithmically by computing gradients on a graph.
[link]
-
Lecture 10: Backpropagation
Backpropagation: We start with the computation graph of the perceptron. We see that computing the sample gradient is equivalent to propagating computations backward on the graph. This gives us an idea of how to compute sample gradients on much larger networks.
[link]
Lecture Notes:
Further Reads:
- Backpropagation: Chapter 6 - Section 6.5 of [GYC]
- Backpropagation: Chapter 8 of [BB]
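Reverse-mode differentiation on a computation graph can be traced by hand on a tiny example. The function y = (a·b + c)² and its input values below are illustrative assumptions:

```python
# A minimal sketch of backward gradient computation on the graph of
# y = (a * b + c) ** 2, node by node.
a, b, c = 2.0, 3.0, 1.0

# Forward pass through the graph, storing intermediate node values.
u = a * b          # u = 6
v = u + c          # v = 7
y = v ** 2         # y = 49

# Backward pass: multiply local derivatives along the reversed edges.
dy_dv = 2 * v      # d(v^2)/dv            = 14
dy_du = dy_dv * 1  # v = u + c            -> 14
dy_da = dy_du * b  # u = a * b            -> 42
dy_db = dy_du * a  #                      -> 28
dy_dc = dy_dv * 1  # v = u + c            -> 14
```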
-
Lecture 11: Backpropagation over MLP
Backpropagation: A neural network can be described by a computation graph. We use this fact to develop an algorithmic approach for gradient computation, known as backpropagation. We see how we can backpropagate over an MLP and how we can use it to train.
[link]
Lecture Notes:
Further Reads:
- Backpropagation: Chapter 8 of [BB]
- Backpropagation of Error Paper Learning representations by back-propagating errors published in Nature by D. Rumelhart, G. Hinton and R. Williams in 1986 advocating the idea of systematic gradient computation of a computation graph
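Backpropagation over a one-hidden-layer MLP can be sketched explicitly: the forward pass caches intermediate values, and the backward pass chains local derivatives from the output back to the weights. The shapes, ReLU activation, squared loss, and random data are illustrative assumptions:

```python
# A minimal sketch of backpropagation for a one-hidden-layer MLP
# with ReLU hidden activation and squared loss.
import numpy as np

def forward_backward(x, y, W1, W2):
    # Forward pass, caching the values needed for the backward pass.
    z = W1 @ x
    h = np.maximum(z, 0.0)              # hidden ReLU activations
    y_hat = W2 @ h
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass: propagate the error signal from output to input.
    delta2 = y_hat - y                  # dL/dy_hat
    gW2 = np.outer(delta2, h)           # dL/dW2
    delta1 = (W2.T @ delta2) * (z > 0)  # chain rule through the ReLU
    gW1 = np.outer(delta1, x)           # dL/dW1
    return loss, gW1, gW2

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
loss, gW1, gW2 = forward_backward(x, y, W1, W2)
```

A gradient step would then subtract a multiple of gW1 and gW2 from the weights, exactly as in the training loop from the gradient descent lectures.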
-
Lecture 12: Neural Classifier
Classification: We now use our knowledge to build our first neural classifier for a binary task. We see that our initial choices of loss and activation simply do not work because they do not provide useful gradients. We replace them with cross-entropy and sigmoid activation to build a trainable model.
[link]
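The appeal of pairing the sigmoid with cross-entropy is that the gradient with respect to the logit collapses to the simple form (sigmoid(z) - y). The sample logit and label below are illustrative:

```python
# A minimal sketch of the sigmoid output with binary cross-entropy loss.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy between prediction p in (0,1) and label y in {0,1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y = 0.5, 1
p = sigmoid(z)
grad_z = p - y  # gradient of bce(sigmoid(z), y) with respect to the logit z
```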
-
Lecture 13: Multiclass Classification
Multiclass: We develop a neural multiclass classifier with vector-activated neurons. We learn how to view it as a probability-computing machine and how to backpropagate over it.
[link]
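The standard vector activation for multiclass outputs is the softmax, which turns a vector of logits into a probability distribution over classes. The logit values here are illustrative:

```python
# A minimal sketch of the softmax activation mapping logits to probabilities.
import math

def softmax(z):
    m = max(z)  # subtract the max logit for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

probs = softmax([2.0, 1.0, 0.1])  # nonnegative, sums to 1, ordered like the logits
```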
-
Lecture 14: Stochastic Gradient Descent and Learning Curves
SGD - Part I: We see that full-batch training is in general computationally hard. We come up with a simple remedy, namely sample-level training. However, this can lead to repetitive behavior, so we add stochastic shuffling to the algorithm. This leads to the so-called Stochastic Gradient Descent. We see that full-batch training and basic SGD are the two extreme points of a spectrum, which we can balance using the idea of mini-batch training. This leads to a complexity-variance tradeoff. We further talk about evaluating our trained model and its generalization to new data. We study this procedure and get familiar with typical learning curves.
[link]
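The mini-batch SGD loop from the lecture can be sketched in a few lines: shuffle each epoch, split into batches, and step on each batch's average gradient. The scalar model f(x) = w·x with squared loss, the data, and the hyperparameters are illustrative assumptions:

```python
# A minimal sketch of mini-batch SGD for the scalar model f(x) = w * x
# with squared loss 0.5 * (w * x - y)^2.
import random

def sgd(data, w=0.0, lr=0.1, batch_size=2, epochs=50, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)  # stochastic reshuffling each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Average gradient of the loss over the mini-batch.
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * g
    return w

data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # ground truth: w = 2
w = sgd(data)
```

Setting batch_size to 1 recovers sample-level SGD; setting it to len(data) recovers full-batch gradient descent, the two ends of the spectrum discussed above.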
-
Lecture 15: Linear and Sub-linear Convergence Speed
Optimizers - Part I: We briefly discuss the convergence of optimizers. There are two classes of convergence speed, linear and sub-linear. While linear convergence is desirable, it rarely occurs in practice. This motivates us to study efficient optimizers.
[link]
Lecture Notes:
Further Reads:
- Notes on Optimizers Lecture notes of the course Optimization for Machine Learning by Ashok Cutkosky at Boston University: a good resource for optimizers
-
Lecture 16: Practical Optimizers
Optimizers - Part II: We discuss three key ideas for improving stochastic optimizers, namely learning rate scheduling, moving averages, and automatic dimension-dependent learning rate scheduling via Rprop. These schemes are key components of most current robust optimizers. We discuss two important optimizers, Root Mean Square propagation (RMSprop) and Adam, and see that both use the ideas of moving averages and dimension-dependent scheduling.
[link]
Lecture Notes:
Further Reads:
- Learning Rate Scheduling Paper Cyclical Learning Rates for Training Neural Networks published in Winter Conference on Applications of Computer Vision (WACV) by Leslie N. Smith in 2017 discussing learning rate scheduling
- Rprop Paper A direct adaptive method for faster backpropagation learning: the RPROP algorithm published in IEEE International Conference on Neural Networks by M. Riedmiller and H. Braun in 1993 proposing Rprop algorithm
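The Adam update combines both ideas from the lecture: a moving average of gradients and a moving average of squared gradients (the RMSprop-style per-dimension scaling). This scalar sketch uses the common default hyperparameters; the quadratic objective and step count are illustrative assumptions:

```python
# A minimal sketch of the Adam update rule, shown on the scalar
# objective f(w) = w^2 with gradient 2 * w.
import math

def adam(grad, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g  # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the warm-up
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # scaled step
    return w

w = adam(lambda w: 2 * w, w=5.0)  # drive w^2 toward its minimizer w = 0
```

Dropping the m average (beta1 = 0) recovers an RMSprop-style update; dropping the v scaling recovers plain momentum SGD.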
-
Lecture 17: Overfitting and Regularization
Overfitting: We talk about three key concepts, namely model complexity, data size, and co-adaptation, which all contribute to overfitting. We talk about solutions to handle each of these situations. We learn how to restrict the co-adaptation behavior of the model parameters; this is called regularization. We talk about the well-known approaches of weight decay and Lasso regularization.
[link]
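The difference between weight decay (L2) and Lasso (L1) shows up directly in the gradient step: L2 adds a pull proportional to the weight, L1 a constant pull toward zero. The base gradient, learning rate, and regularization coefficient below are illustrative assumptions:

```python
# A minimal sketch contrasting the gradient-step corrections of
# weight decay (L2) and Lasso (L1) regularization.

def step_l2(w, g, lr=0.1, lam=0.01):
    # L2 penalty lam * w^2 adds 2 * lam * w to the gradient:
    # weights shrink proportionally toward 0 ("weight decay").
    return w - lr * (g + 2 * lam * w)

def step_l1(w, g, lr=0.1, lam=0.01):
    # L1 penalty lam * |w| adds lam * sign(w): a constant pull
    # toward 0, which tends to produce sparse weights.
    sign = (w > 0) - (w < 0)
    return w - lr * (g + lam * sign)

w2 = step_l2(1.0, 0.0)  # 1 - 0.1 * 0.02 = 0.998
w1 = step_l1(1.0, 0.0)  # 1 - 0.1 * 0.01 = 0.999
```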
-
Lecture 18: Dropout and Data Augmentation
Dropout: Dropout is a special form of regularization in which we randomly drop neurons in each iteration. We discuss dropout and learn how to implement it. We then talk about Data Augmentation and Generation. We go over the notion of data from a statistical viewpoint, which enables us to understand Data Augmentation, Data Generation and Data Distribution. We also briefly discuss data cleaning.
[link]
Lecture Notes:
Further Reads:
- Dropout 1 Paper Improving neural networks by preventing co-adaptation of feature detectors published in 2012 by G. Hinton et al. proposing Dropout
- Dropout 2 Paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting published in 2014 by N. Srivastava et al. providing some analysis and illustrations on Dropout
- Data: Chapter 8 of the Book Patterns, predictions, and actions: A story about machine learning by Moritz Hardt and B. Recht published in 2021
- Data Processing in Python Open Book Minimalist Data Wrangling with Python by Marek Gagolewski going through data processing in Python
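One common way to implement dropout is the "inverted" variant: drop each activation with probability p during training and rescale the survivors by 1/(1-p) so the expected activation is unchanged at test time. The rate, seed, and inputs below are illustrative assumptions:

```python
# A minimal sketch of inverted dropout on a vector of hidden activations.
import random

def dropout(h, p=0.5, train=True, seed=0):
    if not train:
        return list(h)  # at test time, dropout is disabled
    rng = random.Random(seed)
    # Zero each unit with probability p; rescale survivors by 1 / (1 - p).
    return [0.0 if rng.random() < p else hi / (1 - p) for hi in h]

h = [1.0, 2.0, 3.0, 4.0]
h_train = dropout(h)              # some entries zeroed, the rest doubled
h_test = dropout(h, train=False)  # unchanged at inference time
```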
-
Lecture 19: Normalization
Norm: We talk about standardization and data normalization.
[link]
Lecture Notes:
Further Reads:
- Data Processing in Python Open Book Minimalist Data Wrangling with Python by Marek Gagolewski going through data processing in Python
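Standardization shifts each feature to zero mean and scales it to unit variance, with the statistics computed from the training data. This is a minimal sketch with illustrative values:

```python
# A minimal sketch of feature standardization (zero mean, unit variance).
import numpy as np

def standardize(x, mu=None, sigma=None):
    """Standardize x; pass the training-set mu/sigma to reuse on new data."""
    mu = x.mean() if mu is None else mu
    sigma = x.std() if sigma is None else sigma
    return (x - mu) / sigma, mu, sigma

x = np.array([10.0, 20.0, 30.0])
x_std, mu, sigma = standardize(x)  # mean 0, standard deviation 1
```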
-
Lecture 20: Batch Normalization
Batch-Norm: We study Batch Normalization. We see how the backpropagation gets impacted when we do batch normalization.
[link]
Lecture Notes:
Further Reads:
- Batch-Norm Paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift published in 2015 by S. Ioffe and C. Szegedy proposing Batch Normalization
- Batch-Norm Meaning Paper How Does Batch Normalization Help Optimization? published in 2018 by S. Santurkar et al. discussing why Batch Normalization works: they argue that the main reason is that the loss landscape becomes much smoother
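The batch-normalization forward pass for a single feature standardizes over the mini-batch and then applies a learnable rescale and shift. The gamma, beta, and batch values below are illustrative assumptions:

```python
# A minimal sketch of the batch-norm forward pass for one feature
# across a mini-batch.
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean()                          # batch mean
    var = x.var()                          # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize over the batch
    return gamma * x_hat + beta            # learnable rescale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x)  # approximately zero mean and unit variance
```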
-
Lecture 21: Convolutional Layers
CNN I: We start with CNNs. We see that convolution is a scanning process with a linear filter. This operation gives us a form of shift invariance, enabling us to accomplish visual learning tasks at significantly lower complexity. We then go through the convolution operation.
[link]
Lecture Notes:
Further Reads:
- Hubel and Wiesel Study Paper Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex published in 1962 by D. Hubel and T. Wiesel elaborating their findings on visual understanding
- Neocognitron Paper Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position published in 1980 by K. Fukushima proposing the Neocognitron as a computational model for visual learning
- Backpropagating on LeNet Paper Backpropagation Applied to Handwritten Zip Code Recognition published in 1989 by Y. LeCun et al. developing backpropagation for LeNet
- LeNet Paper Gradient-Based Learning Applied to Document Recognition published in 1998 by Y. LeCun et al. discussing LeNet
- Convolution: Chapter 9 - Sections 9.1 and 9.2 of [GYC]
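The scanning process described above can be sketched with two loops: slide the filter over the image and take an inner product at each position (this is the cross-correlation convention used in CNNs, with no padding and stride 1). The image and filter are illustrative:

```python
# A minimal sketch of 2D convolution: scanning a linear filter over an image.
import numpy as np

def conv2d(img, kernel):
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Inner product of the filter with the patch under it.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)  # a 4x4 ramp image
edge = np.array([[1.0, -1.0]])       # horizontal difference filter
out = conv2d(img, edge)              # every horizontal step in img is -1
```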
-
Lecture 22: Multi-channel Convolution and Pooling
CNN II: We extend the basic 2D convolution to the multi-channel case, where a multi-channel input is processed into a multi-channel output. We learn about pooling and its functionality in CNNs. We further learn how basic flattening lets us perform final processing on the features extracted by the CNN layers.
[link]
Lecture Notes:
Further Reads:
- Convolution: Chapter 9 - Sections 9.1 and 9.2 of [GYC]
- Multi-channel Convolution: Chapter 10 - Sections 10.2.3 to 10.2.5 of [BB]
- Pooling: Chapter 10 - Section 10.2.6 of [BB]
- Flattening: Chapter 10 - Sections 10.2.7 and 10.2.8 of [BB]
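Pooling downsamples a feature map while keeping the strongest responses in each window. This is a minimal sketch of 2x2 max pooling with stride 2 on an illustrative feature map:

```python
# A minimal sketch of 2x2 max pooling with stride 2.
import numpy as np

def max_pool(x, size=2):
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Keep only the strongest response in each size x size window.
            out[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

x = np.array([[1.0, 2.0, 0.0, 1.0],
              [3.0, 4.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 6.0],
              [1.0, 0.0, 7.0, 8.0]])
pooled = max_pool(x)  # [[4, 1], [1, 8]]
```

Flattening the pooled map (e.g., `pooled.ravel()`) then yields the feature vector passed to the final fully-connected layers.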
-
Lecture 23: Deep CNNs
CNN III: We learn about deep CNNs and how they can be interpreted as locally connected MLPs. We also learn about the receptive field and the intuition it provides for building deep CNNs. We investigate the example of VGG-16, which helps us clearly understand the details of CNNs.
[link]
Lecture Notes:
Further Reads:
- Convolution: Chapter 9 - Sections 9.4 and 9.6 of [GYC]
- VGG Paper Very Deep Convolutional Networks for Large-Scale Image Recognition published in 2014 by K. Simonyan and A. Zisserman proposing VGG Architectures
-
Lecture 24: Backpropagation Through CNNs
Training CNNs: We learn how to backpropagate through the various layers of a CNN. Interestingly, backpropagation through convolutional layers is again described by convolution. This makes the implementation of backpropagation over CNNs very efficient.
[link]
Lecture Notes:
Further Reads:
- LeCun’s Paper Paper Gradient-based learning applied to document recognition published in 1998 by Y. LeCun et al. summarizing the learning process in CNNs
- Efficient Backpropagation on CNN Paper High Performance Convolutional Neural Networks for Document Processing published in 2006 by K. Chellapilla et al. discussing efficient backpropagation on CNNs.
-
Lecture 25: Vanishing Gradient in Deep Networks
ResNet: We learn about the vanishing and exploding gradient effects in deep neural networks. We see the source of this behavior and understand how to address it. This motivates us to learn residual learning.
[link]
Lecture Notes:
Further Reads:
- ResNet Paper Deep Residual Learning for Image Recognition published in 2015 by K. He et al. proposing ResNet
-
Lecture 26: Skip Connection and ResNet
ResNet: We learn residual learning, an approach to handle the vanishing-gradient behavior. We then study ResNet, a well-known deep network that uses residual connections.
[link]
Lecture Notes:
Further Reads:
- ResNet Paper Deep Residual Learning for Image Recognition published in 2015 by K. He et al. proposing ResNet
- ResNet-1001 Paper Identity Mappings in Deep Residual Networks published in 2016 by K. He et al. demonstrating how deep ResNet can go
- U-Net Paper U-Net: Convolutional Networks for Biomedical Image Segmentation published in 2015 by O. Ronneberger et al. proposing U-Net
- DenseNet Paper Densely Connected Convolutional Networks published in 2017 by G. Huang et al. proposing DenseNet
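The skip connection is a one-line change: the block computes only a residual F(x) and adds the identity path, so gradients can flow through unattenuated. The weights, sizes, and ReLU choice in this sketch are illustrative assumptions:

```python
# A minimal sketch of a residual block: output = x + F(x).
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(W1 @ x, 0.0)  # first transform with ReLU
    f = W2 @ h                   # residual branch F(x)
    return x + f                 # skip connection adds the identity path

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(4, 4)), np.zeros((4, 4))
y = residual_block(x, W1, W2)    # with W2 = 0 the block is exactly the identity
```

This is why very deep ResNets remain trainable: a block can default to the identity, and the gradient of the output with respect to x always contains the direct identity term.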
-
Lecture 27: RNNs
Seq Models: We study sequence data and their processing by NNs. We see that due to temporal dependencies, we cannot process the samples separately; at the same time, one-shot processing is not feasible. We hence need to develop a sequence model. We study the most basic design for sequence processing, i.e., the recurrent NN. We see that finding the right choice of state in these architectures is rather crucial.
[link]
Lecture Notes:
Further Reads:
- Jordan Network Paper Attractor dynamics and parallelism in a connectionist sequential machine published in 1986 by M. Jordan proposing his RNN
- Elman Network Paper Finding structure in time published in 1990 by J. Elman proposing a revision to Jordan Network
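The recurrent idea fits in one line: the new hidden state is a function of the old state and the current input, so the state carries information across time. The sizes, random weights, and tanh choice in this Elman-style sketch are illustrative assumptions:

```python
# A minimal sketch of a vanilla (Elman-style) recurrent step.
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    # New state from the old state plus the current input.
    return np.tanh(W_h @ h + W_x @ x + b)

rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), np.zeros(3)
h = np.zeros(3)
for x in [np.ones(2), np.zeros(2), np.ones(2)]:  # a length-3 input sequence
    h = rnn_step(h, x, W_h, W_x, b)              # state updated at each step
```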
-
Lecture 28: Learning through Time
RNNs: We now think of general sequence processing and specify the different types of problems we could have. We check our basic RNN solution and see how learning is carried out in these models.
[link]
Lecture Notes:
Further Reads:
- BPTT Paper Backpropagation through time: What it does and how to do it published in 1990 by P. Werbos explaining BPTT
- Seq Models Article The Unreasonable Effectiveness of Recurrent Neural Networks written in May 2015 by A. Karpathy discussing different types of sequence problems
-
Lecture 29: Training RNNs
RNNs III: We see that sequential processing of data requires backpropagation through time. This leads to finite memory due to the vanishing-gradient behavior.
[link]
Lecture Notes:
Further Reads:
- Vanishing Gradient with BPTT Paper On the difficulty of training recurrent neural networks published in 2013 by R. Pascanu et al. discussing challenges in training with BPTT
- Truncated BPTT Paper An efficient gradient-based algorithm for on-line training of recurrent network trajectories published in 1990 by R. Williams and J. Peng explaining truncated BPTT
-
Lecture 30: Gated Architectures
Gating: We see the gating principle as a key idea to extend memory through time, leading to the LSTM and GRU architectures. We study computational models with gating, look into GRU and LSTM in particular, study their training, and see how we can make them bidirectional.
[link]
Lecture Notes:
Further Reads:
- Gating Principle Chapter Long Short-Term Memory published in 2012 in book Supervised Sequence Labelling with Recurrent Neural Networks by A. Graves explaining Gating idea
- LSTM Paper Long short-term memory published in 1997 by S. Hochreiter and J. Schmidhuber proposing LSTM
- GRU Paper On the Properties of Neural Machine Translation: Encoder-Decoder Approaches published in 2014 by K. Cho et al. proposing GRU
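The gating principle is easiest to see in a GRU cell: an update gate z blends the old state with a candidate state, so driving z toward 0 preserves the memory unchanged. All weight matrices, sizes, and inputs in this sketch are illustrative assumptions (biases omitted for brevity):

```python
# A minimal sketch of one GRU step, showing the gating principle.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, p):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde  # gated blend of old state and candidate

rng = np.random.default_rng(0)
p = {k: rng.normal(size=(3, 3)) for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
h = gru_step(np.zeros(3), np.ones(3), p)  # one step from a zero initial state
```

An LSTM follows the same principle with separate input, forget, and output gates acting on an additional cell state.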
-
Lecture 31: Correspondence Problem and CTC
CTC: This lecture discusses the correspondence issue arising in some sequence applications. We briefly discuss the connectionist temporal classification (CTC) method, which addresses this issue.
[link]
Lecture Notes:
Details:
- CTC Algorithm This is a recorded lecture on CTC
Further Reads:
- CTC Paper Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks published in 2006 by A. Graves et al. proposing CTC Algorithm
-
Lecture 32: Seq2Seq - Part I: Language Model
Seq2Seq - LM: We start with Seq2Seq models. We build our first Seq2Seq model, which is a basic language model. We see how we can train it and what kind of abilities this model will have.
[link]
Lecture Notes:
Further Reads:
