This page is for the new (2020) course on Machine Learning.
Class Notices:
The class outline is here as a PDF file, and here in HTML.
A brief list of the topics we will treat is as follows.
Method of Evaluation:
Last year, we decided that the most useful way of evaluating our abilities with machine learning would be to get hold of a suitable data set and train an algorithm for some appropriate task. We also decided that, although individual students are very welcome to carry out a project alone, it is quite in order for students to work in pairs and submit a single project for the two participants.
This year, the rules for the project are the same. Students may undertake their projects either alone or in pairs. Projects should be submitted to me by email by the end of the year, that is, by December 31, 2020.
People who would like to work with a partner, but have not yet found one, should email me to that effect, and I will try to find one. Contrariwise, people who would like to volunteer to serve as another person's partner should also email me.
Resources:
We will follow the textbook Deep Learning, by Goodfellow, Bengio, and Courville. I do not expect to have time to cover any more than the first two main sections of the book. The book is available online at this site. It is also available as a hardback physical book.
We will also use Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurélien Géron, O'Reilly 2019, in parallel with the other textbook. This is a revised and updated version of a book that, over the past two years, has turned out to be a really valuable resource.
Here are other resources. See the course outline for more information about them.
More resources
This link is to a set of short additional notes for the course. It begins with a discussion of the singular value decomposition, generalised matrix inverses, and principal components analysis.
The resources and links given here are things I found or was told about since setting up the resources above, and are not in the course outline.
Data and Python files
Although it is not hard to download the MNIST data directly from within Python, it may be more convenient to get the data from the MNIST_data directory. The names of the files are mostly self-explanatory; note that the files whose names begin with t10k contain the test data.
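For those curious about what is inside those files: they use the IDX binary format, in which a big-endian magic number and the array dimensions precede the raw bytes. Here is a minimal sketch of a parser, demonstrated on a small synthetic buffer rather than the real files (the function names are my own, not from the course code):

```python
import struct
import numpy as np

def load_idx_images(raw):
    """Parse an IDX image buffer: magic 2051, then count, rows, cols."""
    magic, n, rows, cols = struct.unpack(">IIII", raw[:16])
    assert magic == 2051, "not an IDX image file"
    return np.frombuffer(raw, dtype=np.uint8, offset=16).reshape(n, rows, cols)

def load_idx_labels(raw):
    """Parse an IDX label buffer: magic 2049, then count."""
    magic, n = struct.unpack(">II", raw[:8])
    assert magic == 2049, "not an IDX label file"
    return np.frombuffer(raw, dtype=np.uint8, offset=8)

# Synthetic demonstration: two 2x2 "images" and their labels
images_raw = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))
labels_raw = struct.pack(">II", 2049, 2) + bytes([3, 7])
images = load_idx_images(images_raw)
labels = load_idx_labels(labels_raw)
print(images.shape)     # (2, 2, 2)
print(labels.tolist())  # [3, 7]
```

For the real files, you would read the (gzipped) file contents into `raw` first; the parsing is then the same.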
This Python file contains code for loading the files in the MNIST_data directory, and then performing some of the tasks we spoke about during the second class, taken from Chapter 3 of Géron's "Hands-on" book. In order to run it successfully, you will need to install a number of Python libraries. If you are still using Python2 (not recommended), the tool for installing libraries is pip; if you have moved on to Python3, it is called pip3.
Then, from a terminal, issue the command
pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
If you like, you can add tensorflow to the list of modules to install. We will want it later.
It is advisable, although it is not necessary, to do all of the above in a virtual environment, in order not to mess up anything else you may be doing with Python on your computer. First install another package:
pip3 install --user --upgrade virtualenv
Then, create a directory (or folder, as some call it) where you wish to work, and, from within this directory, issue the command
virtualenv -p python3 venv
You can alter venv to any other name you like. Then, from the same place where you created the virtual environment, do
source venv/bin/activate
You are now in your virtual environment, and can proceed to install the various packages mentioned above. Once you are finished, you can just type deactivate, and you have left the virtual environment.
Log of material covered
The first class was later than usual: the course meets on Mondays, and, on account of Labour Day, September 14 was the first Monday of term. We began by looking at the Preface of Géron's book, and then went on to cover much of his Chapter 1. We stopped just before the section entitled "Main Challenges of Machine Learning".
On September 21, we began by completing the study of the first chapter of Géron's book. This chapter provides a good overview of what machine learning is, and what the challenges are that face designers of algorithms intended to achieve machine learning.
We then skipped to the Deep Learning text, and embarked on Chapter 5, on "Machine Learning Basics". Much of this chapter duplicates what is in the first chapter of Géron, and so we skipped over it very lightly. We spent some time on subsection 5.3.1, on cross validation, in particular on the algorithm, presented in pseudo-code, for k-fold cross validation.
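The k-fold procedure from subsection 5.3.1 corresponds to something like the following sketch, here using scikit-learn's KFold and the iris data purely as a stand-in (the data set and model are my choices, not the book's):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Split the data into k non-overlapping folds; each fold serves once
# as the test set while the remaining k-1 folds are used for training
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# The cross-validation estimate of generalisation error is the average
print(np.mean(scores))
```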
Since, except for Chapter 5, the first part of this book presents mathematical preliminaries, we have not spent time on it, and will not do so unless and until a need arises. We did however look briefly at section 3.1, entitled "Why Probability?", and took note of the different perspectives of frequentist probability and Bayesian probability, both of which are useful to us in different contexts. Then on to section 3.10, where we learned about the logistic and softplus functions. Next was section 3.13, on information theory, self-information, and the Kullback-Leibler divergence.
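The functions from sections 3.10 and 3.13 are easy to compute directly. A small numerical sketch (my own illustration): the softplus function log(1 + e^x) has the logistic sigmoid as its derivative, and the KL divergence between identical distributions is zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + e^x), a smooth approximation to max(0, x)
    return np.log1p(np.exp(x))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i log(p_i / q_i), for discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Check numerically that softplus'(x) = sigmoid(x)
x, h = 1.5, 1e-6
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(abs(numeric - sigmoid(x)) < 1e-6)        # True
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
```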
Finally, we returned to Géron's Chapter 2, in which he leads us through how to set up a machine-learning project. We just began this, and will continue with it next week.
Chapter 2 of the Hands-on book occupied us for all of the class on September 28. It deals, in sometimes exhausting detail, with the many steps needed for a machine-learning project. So far, we got through the preliminary steps, some of them in numerous alternative versions.
These steps are, as enumerated at the beginning of the chapter: look at the big picture; get the data; discover and visualise the data to gain insights; prepare the data for the machine-learning algorithms; select a model and train it; fine-tune the model; present the solution; and launch, monitor, and maintain the system.
What remains is: selecting and training a model; fine-tuning it; presenting the solution; and launching, monitoring, and maintaining the system.
Only the first of these should take much time next week.
We began on October 5 with material from Chapter 2 of the Deep Learning book, on the singular value decomposition, generalised inverses of matrices (called "pseudo-inverses" in the book), and in particular the Moore-Penrose inverse, and Principal Component Analysis (PCA). I used a note, available here, to help explain this material.
After that, we went back to the Hands-on book to finish off Chapter 2, with the end-to-end project on house prices. Chapter 3 deals with a classical problem for machine learning, recognising handwritten digits, using the MNIST data set. Unlike the housing problem, which is a regression task, this is a classification task, best treated with different algorithms. We defined precision and recall as two measures of how accurately a classification is done when there are only two categories, and defined a confusion matrix in a more general context.
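For the two-category case, precision and recall follow directly from the entries of the confusion matrix. A small sketch with made-up labels (the "is it a 5?" framing follows Géron's chapter, but these particular numbers are my own):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions for a binary "is it a 5?" task
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# precision = TP / (TP + FP): of the instances called positive, how many are
# recall    = TP / (TP + FN): of the true positives, how many were found
print(precision_score(y_true, y_pred))  # 4 / 5 = 0.8
print(recall_score(y_true, y_pred))     # 4 / 5 = 0.8
```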
After a long break, we began on October 19 by looking at sections 5.9 and 5.10 of the Deep Learning book, thereby concluding what I wanted to cover in the first part of the book. We then embarked on Part II, and, in Chapter 6, on Deep Feedforward Networks, covered section 6.1, on the impossibility of learning the XOR pattern by a linear function, and how to do so with one hidden layer and the ReLU (rectified linear unit) activation function. We made progress with section 6.2, on Gradient-Based Learning, and got as far as subsection 6.2.2.1. We will resume by discussing subsection 6.2.2.2, on Sigmoid Units.
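The XOR construction can be checked in a few lines. This is essentially the hand-constructed solution presented in section 6.1 of the book: one hidden layer of two ReLU units suffices where no linear function can succeed.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-constructed weights for XOR, following the book's section 6.1
W = np.array([[1.0, 1.0], [1.0, 1.0]])  # input -> hidden
c = np.array([0.0, -1.0])               # hidden biases
w = np.array([1.0, -2.0])               # hidden -> output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
out = relu(X @ W + c) @ w
print(out)  # [0. 1. 1. 0.], exactly the XOR pattern
```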
In the Hands-On book, we finished Chapter 3 on the handwritten digits, and, in Chapter 4, looked again at Stochastic Gradient Descent. Géron gives a comparison of different methods of working with linear regressions. The next topic is Polynomial regression, which allows us to see more about over- and under-fitting.
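Polynomial regression makes under- and over-fitting easy to see. A sketch with synthetic quadratic data (my own example, in the spirit of Géron's Chapter 4): a degree-1 model underfits, degree 2 matches the data-generating process, and degree 30 has far more capacity than the data warrant.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy quadratic data
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.3, 100)

scores = {}
for degree in (1, 2, 30):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    scores[degree] = model.score(X, y)  # R^2 on the training data

print(scores)
# Degree 1 scores poorly (underfitting); degree 2 scores well; degree 30
# can chase the noise, which is the classic warning sign of overfitting
```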
Again on October 26 we started with the Deep Learning textbook, and completed the essential content of Chapter 6. We looked at a number of functions used as activation functions: sigmoid (or logistic), ReLU, softmax, softplus, hyperbolic tangent (tanh). We took a quick look at the universal approximation theorem, and then began our study of the back-propagation step in training a multilayer perceptron (MLP).
Géron gave us a few more paragraphs on back propagation. He then introduced us to the Fashion MNIST data set, which poses a classification task just like that with the handwritten digits. We were taken through a number of ways in which we could construct, compile, and run an MLP for the problem, using Keras. It turned out to be no harder to set up a model for a regression task, for which the California housing data set was used.
November 2 was the day I had a good deal of trouble with Zoom, and this led to a somewhat perturbed class. Nevertheless, we made good progress with Chapter 7 of the Deep Learning book, on regularisation. We looked at the L^{2} norm penalty, also called weight decay, and then the L^{1} penalty, associated with the LASSO, which can lead to a sparse representation by setting some of the connection weights to zero. Working through sections 7.3 through 7.7, we covered data augmentation ("fake" data), parameter sharing, noise robustness, and multi-task learning.
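The contrast between the two penalties is easy to demonstrate with scikit-learn's Ridge and Lasso estimators (the data here are synthetic, my own example): the L^{2} penalty shrinks all the weights, while the L^{1} penalty sets the irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
# Only the first two of the ten features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: small but nonzero weights
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives weights to zero

print(np.sum(ridge.coef_ == 0))  # typically 0: nothing is exactly zero
print(np.sum(lasso.coef_ == 0))  # most of the 8 irrelevant weights are zero
```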
In the Hands-On book, we saw that there is a tool called TensorBoard, which lets us draw pretty pictures of the course of learning. That was the last topic of interest in Chapter 10. Chapter 11 considers some of a long list of problems that can adversely affect learning with multi-layer perceptrons. The problem of vanishing gradients can be helped by avoiding activation functions that saturate too easily. Against this problem, and also the exploding-gradients problem, a very effective technique is batch normalisation.
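The core of batch normalisation is simple: normalise each feature over the mini-batch, then scale and shift by learned parameters. Here is a minimal NumPy sketch of the training-time computation (in a real layer, gamma and beta are learned, and running statistics are kept for use at test time; this sketch fixes gamma = 1 and beta = 0):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch, then scale and shift."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(1)
z = rng.normal(loc=50.0, scale=10.0, size=(32, 4))  # badly scaled activations
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # each feature now has mean ~0
print(out.std(axis=0))   # and standard deviation ~1
```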
Although there was still a lot of trouble with Zoom on November 9, the recording is complete; indeed, it continues all through our ten-minute break, so feel free to fast-forward through that! We finished off Chapter 7 in the Deep Learning book, in which the main topics were Early Stopping, which not only saves computation time but also serves as a regulariser; Bagging, an example of an ensemble method that uses model averaging; and, finally, dropout, a regulariser that implicitly averages a very great number of models and turns out to be very effective in practice.
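At training time, dropout is usually implemented in the "inverted" form, which scales the surviving units so that expected activations are unchanged and nothing special need be done at test time. A short sketch (my own illustration):

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, and scale
    the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep

rng = np.random.default_rng(0)
a = np.ones(10000)
dropped = dropout(a, rate=0.5, rng=rng)
print((dropped == 0).mean())  # close to 0.5: half the units are silenced
print(dropped.mean())         # close to 1.0: the expectation is preserved
```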
In the Hands-On book, we started with transfer learning, which allows layers from a model pretrained on one task to be reused for similar tasks with different inputs. It can also be used in circumstances in which we have a lot of unlabelled instances and only a few labelled ones.
The same Chapter 11 in Hands-On proceeds to the next topic, that of optimisers. From the basic stochastic gradient descent algorithm, we learned about momentum optimisation, AdaGrad, RMSProp, Adam, Nadam, etc. After seeing how to code these, the next topic was again dropout, and we saw how to code that.
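As an example of the family, here is a bare-bones sketch of momentum optimisation on a one-dimensional quadratic (my own toy problem): the velocity accumulates an exponentially decaying average of past gradients, which speeds progress along consistent directions.

```python
def grad(w):
    # Gradient of f(w) = (w - 3)^2, which is minimised at w = 3
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
lr, beta = 0.1, 0.9   # learning rate and momentum coefficient
for _ in range(100):
    v = beta * v - lr * grad(w)  # update the velocity
    w = w + v                    # move by the velocity
print(w)  # close to 3.0, the minimiser
```

AdaGrad, RMSProp, Adam, and Nadam all elaborate on this pattern by also adapting the step size per parameter.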
In the Deep Learning book, we skipped over most of Chapter 8 on optimisation, since we had covered that material the previous week in Hands-On. Most of November 16 was devoted to convolutional nets. First, in Chapter 9 of Deep Learning, there was a definition of the mathematical operation of convolution. This can be deployed to great advantage when the input has something like a spatial structure, where it makes sense to introduce the idea of locality, and allow only units that are close together to have an impact on a unit in the next higher layer. Convolution is often combined with pooling, which can induce invariance to small changes in space. But valid convolution reduces the number of units in each layer as you go upwards, and so zero-padding can be introduced in order to maintain the same number of units in each layer.
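The effect of "valid" convolution and of zero-padding on layer sizes can be seen in one dimension (the helper function here is my own; like most deep-learning libraries, it actually computes cross-correlation):

```python
import numpy as np

def valid_conv1d(x, k):
    """'Valid' 1-D convolution: the output is shorter than the input."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])

print(valid_conv1d(x, k))        # length 3: the layer shrinks
padded = np.pad(x, 1)            # zero-padding on both ends
print(valid_conv1d(padded, k))   # length 5: "same" size is preserved

# Max pooling, window 2 and stride 2: small spatial shifts matter less
pooled = x[: len(x) // 2 * 2].reshape(-1, 2).max(axis=1)
print(pooled)
```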
It is Chapter 14 of the Hands-On book that deals with convolutional nets. In this chapter, the concept of feature maps is introduced, as a way of carrying a stack of maps along with each layer, different maps concentrating on different aspects of the input. This chapter has quite a long catalogue of nets, with different architectures, that have competed in a challenge on image classification over the years. These nets have become steadily deeper and more complex, but have also become strikingly more effective.
On November 23, we continued work on convolutional nets (CNN), and started work on recurrent nets (RNN). We began with section 9.6 of the Deep Learning book, in which a hint was given about the recurrent property. In section 9.8, some attempts to make CNNs efficient by speeding up convolution are described. One is the Fourier transform, which converts convolution into multiplication, and another is the use of separable kernels, which are outer products of vectors. The last thing in Chapter 9 that we studied was section 9.10, in which a "cartoon" version of the visual system in the mammalian brain was given, with analogies to what happens with artificial CNNs.
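The convolution theorem mentioned above is easy to verify numerically: convolution in the signal domain equals pointwise multiplication in the Fourier domain. A sketch with NumPy's FFT routines (random data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
k = rng.standard_normal(16)

# Direct full linear convolution, of length 64 + 16 - 1
direct = np.convolve(x, k)

# The same result via the Fourier transform: pad both to the full length,
# multiply the transforms pointwise, and transform back
n = len(x) + len(k) - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
print(np.allclose(direct, via_fft))  # True
```

For large inputs, the FFT route costs O(n log n) rather than the O(nm) of direct convolution, which is the source of the speed-up.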
We then skipped over to halfway through Chapter 14 of Hands-On, and saw prize-winning innovations like inception modules, residual units with skip connections, Xception, which uses separable kernels, and the SENet (squeeze and excitation). After that came the task of object recognition, using bounding boxes, and semantic segmentation.
After the break, we returned to Deep Learning, Chapter 10, on recurrent nets. It was seen that one special type of RNN, with hidden-to-hidden unit connections, could simulate a Turing machine. However, the back-propagation stage of training such a network is computationally costly. Other less costly possibilities cut the hidden-hidden connections, and use instead an indirect connection via the output layer. A useful add-on with these models is called teacher forcing.
We continued study of Chapter 10 in the Deep Learning book on November 30. There were several sections on different types of RNN; we started with section 10.2.3. After we completed section 10.2, in section 10.3 we looked at bidirectional RNNs, and saw that we could go to four directions or even more. Section 10.4 introduces the encoder-decoder architecture, much used in machine translation. The next two sections are on deep recurrent networks and recursive networks, which have a tree structure. The next several sections deal with long-term dependencies, which pose quite a challenge for recurrent nets, and outline various proposed methods to try to take proper account of them in an RNN. To study the LSTM units and the GRUs, which both introduce the idea of gates allowing for both long and short memory, we switched to Hands-On, where the graphical representations of these units are easier to understand.
We stuck with Hands-On after the break, but went back to Chapter 6, on Decision trees. These constitute a technique that can be interpreted easily (white box), rather than the black boxes of deep learning. They risk overfitting, but can be regularised in several ways. In Chapter 7, ensemble methods are presented, whereby several weak learners can be combined to yield a stronger learner. We revisited bagging, as an ensemble method, and were about to embark on Random Forests, which are very effective learners in many circumstances.
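Both ideas are a few lines in scikit-learn. In this sketch (iris data as a stand-in, my own choice), max_depth is one of the regularisation handles that keeps a tree from growing until it memorises the training set, and a Random Forest is an ensemble of such trees trained by bagging:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# A single, regularised (max_depth) white-box tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
# An ensemble of trees, each trained on a bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(tree_acc, forest_acc)
```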
We started a little late on December 3, but still covered a good deal. In the Deep Learning book, we finished the discussion of memory with time series, with the introduction of the neural Turing machine. This allows for memory to be retained over arbitrarily long time spans, without falling foul of the vanishing-gradients problem. This was followed by a quick look at two more topics from Chapter 12, in section 12.5. They were Recommender systems, used by advertisers to target their advertisements, and Exploration versus Exploitation.
Then back to Hands-On. We completed Chapter 7, starting with Random Forests, and looking at a number of other ensemble methods such as boosting, bagging, and stacking. We then went on to Chapter 8, on dimension reduction. In that chapter, a very large number of techniques are presented, all trying to overcome the curse of dimensionality. PCA is one of these, but there is a whole catalogue of others that can be useful in specific circumstances. Next week, we will move on to Chapter 16.
December 7 was the last class, and was given over entirely to the Hands-On book. In Chapter 16, the focus is on natural-language processing. The first topic is a model that tries to predict the next character in some text. It is trained on the complete works of Shakespeare, as published in modern editions, with Shakespeare's somewhat random spelling modernised and made uniform. The model can generate random text that bears some affinities to Shakespeare's work. A distinction is made between a stateless RNN and a stateful RNN. Their training instances are set up quite differently, but the stateful RNN can learn from longer strings of text.
Next came a model for which the features are words, not single characters. We made a swift detour to Chapter 13, where the concept of an embedding is explained. An embedding is a representation of a feature as a vector. It can be learned, and can embody quite a lot of information. It is something that can be produced by an encoder-decoder mechanism, and can be an essential part of models that can translate from one language to another.
In Chapter 17, the theme of representational learning is followed up, with autoencoders, which embody an encoder followed by a decoder. In training, the targets are the same as the inputs, but, by imposing constraints, the autoencoder can learn an efficient representation of the inputs, which can then be put to many different uses.
The newest kind of network, and the least like the others we have studied, is the GAN, for generative adversarial network. It contains two models, the generator and the discriminator, and they are adversarial in that they engage in a zero-sum game. The aim for the discriminator is to distinguish "real" from "fake" instances. The generator produces exclusively fake instances, but tries to fool the discriminator into thinking that they are real. A GAN can be very difficult to train, but much progress has been made.
Recordings
The recording of the first class, on September 14, can be viewed by clicking here. There is also an audio-only version, available here.
For September 21, the recording can be viewed by clicking here. Audio only is here.
For September 28, the recording can be viewed by clicking here. Audio only is here.
For October 5, the recording can be viewed by clicking here. Audio only is here.
For October 19, the recording can be viewed by clicking here. Audio only is here.
For October 26, the recording can be viewed by clicking here. Audio only is here.
For November 2, the recording of the part of the class after the problem with Zoom can be viewed by clicking here. Audio only is here.
For November 9, the recording can be viewed by clicking here. Audio only is here.
For November 16, the recording can be viewed by clicking here. Audio only is here.
For November 23, the recording can be viewed by clicking here. Audio only is here.
For November 30, the recording can be viewed by clicking here. Audio only is here.
For December 3, the recording can be viewed by clicking here. Audio only is here.
For December 7, the recording can be viewed by clicking here. Audio only is here.
To send me email, click here or write directly to
Russell.Davidson@mcgill.ca.
URL: https://russell-davidson.arts.mcgill.ca/e706