This is part 3/3 of a series on deep belief networks. Part 1 focused on the building blocks of deep neural nets – logistic regression and gradient descent. Part 2 focused on how to use logistic regression as a building block to create neural networks, and how to train them. Part 3 will focus on answering the question “What is a deep belief network?” and on the algorithms we use for training and prediction.
This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python.
What is a deep belief network / deep neural network?
In its simplest form, a deep belief network looks exactly like the artificial neural networks we learned about in part 2! As long as there is at least 1 hidden layer, the model is considered to be “deep”. (I Googled around on this topic for quite a while; it seems people just started using the term “deep learning” as a buzzword for any kind of neural network one day, regardless of the number of layers.)
It is common to use more than 1 hidden layer, and new research has been exploring architectures different from the simple “feedforward” neural network we have been studying. Recurrent neural networks have become very popular in recent years. These networks have “feedback” connections and maintain a “memory” of past inputs. We will not talk about these in this post.
Ok, so then how is this different from part 2?
One reason deep learning has come to prominence in the past decade is due to increased computational power. It used to be that computers were just too slow to handle training large networks, especially in computer vision where each pixel of an image is an input. We have new libraries that take advantage of the GPU (graphics processing unit), which can do floating point math much faster than the CPU.
Note that because the architecture of the deep belief network is exactly the same, the flow of data from input to output (i.e. prediction) is exactly the same.
The only part that’s different is how the network is trained.
One problem with traditional multilayer perceptrons / artificial neural networks is that backpropagation can often lead to “local minima”. This happens when your “error surface” contains multiple grooves; as you perform gradient descent, you fall into one of the grooves, but it is not the lowest possible groove.

Deep belief networks solve this problem by using an extra step called “pre-training”. Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there.
So what is this pre-training step and how does it work?
To understand this, we first need to learn about “Restricted Boltzmann Machines” or RBMs.
[Strictly speaking, multiple layers of RBMs would create a deep belief network – this is an unsupervised model. A supervised model with a softmax output would be called a deep neural network.]
Restricted Boltzmann Machines
Going back to our original simple neural network, let’s draw out the RBM. I’ve circled it in green here.

The RBM contains all the x’s, all the z’s, and the W in between. That’s pretty much all there is to it. An RBM is simply two layers of a neural network and the weights between them.
In an RBM we still refer to the x’s as the “input layer” and the z’s as the “hidden layer”. If you’ve ever learned about PCA, SVD, latent semantic analysis, or Hidden Markov Models – the idea of “hidden” or “latent” variables should be familiar to you.
As a simple example, you might observe that the ground is wet. You could have multiple hidden or latent variables, one representing the fact that it’s raining, another representing the fact that your neighbor is watering her garden.
In a sense they are the hidden causes or “base” facts that generate the observations that you measure.
Since RBMs are just a “slice” of a neural network, deep neural networks can be considered to be a bunch of RBMs “stacked” together.
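To make the “stacking” idea concrete, here is a minimal sketch of greedy layer-wise pre-training in NumPy: each RBM is trained on the hidden activations produced by the one below it. The function train_rbm is hypothetical – it stands in for whatever RBM training routine you use (we will get to that below) – and the layer sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_stack(X, hidden_layer_sizes, train_rbm):
    """Greedily train one RBM per hidden layer.

    X: (N, D) matrix of training inputs.
    train_rbm(data, n_hidden) is assumed to return (W, a, b) for one RBM.
    """
    params = []
    data = X
    for n_hidden in hidden_layer_sizes:
        W, a, b = train_rbm(data, n_hidden)  # hypothetical RBM trainer
        params.append((W, a, b))
        data = sigmoid(data.dot(W) + b)      # hidden activations feed the next RBM
    return params
```

Each (W, a, b) would then initialize one layer of the deep neural network, and backpropagation fine-tunes the whole thing afterward.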
Variables in a Restricted Boltzmann Machine
In this section we will look more closely at what an RBM is – which variables it contains and why they make sense – through a probabilistic model, similar to what we did for logistic regression in part 1.
Although not shown explicitly, each layer of the RBM will have its own bias weights – the weight matrix W is the only parameter shared between the two layers. We will denote these bias weights as “a” for the visible units, and “b” for the hidden units.
We’re going to rename some variables to match what they are called in most tutorials and articles on the Internet. We’ll denote the “visible” vectors (i.e. inputs) by v and index each element of v by i. We’ll denote the “hidden” units by h and index each element by j.

Using our new variables v, h, a, b, and including w(i,j) as before, we can define the “energy” of a network as:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j
In vector / matrix notation this can be written as:

E(v, h) = -a^T v - b^T h - v^T W h
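To make the notation concrete, here is the energy computed in NumPy for a single visible vector v and hidden vector h (a minimal sketch; the variable names mirror the equations above):

```python
import numpy as np

def energy(v, h, W, a, b):
    # E(v, h) = -a'v - b'h - v'Wh
    # shapes: v is (D,), h is (M,), W is (D, M), a is (D,), b is (M,)
    return -a.dot(v) - b.dot(h) - v.dot(W).dot(h)
```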
We can define the probability of observing an input v with hidden vector h as:

P(v, h) = \frac{e^{-E(v, h)}}{Z}
Where Z = \sum_{v, h} e^{-E(v, h)} is a normalizing constant (the “partition function”) that makes the probabilities of all configurations sum to 1.
We can get the marginal distribution P(v) by summing over h:

P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}
Similar to logistic regression, we can write down the conditional probabilities P(v(i) = 1 | h) and P(h(j) = 1 | v) – both turn out to be sigmoids:

P(v_i = 1 \mid h) = \sigma\left( a_i + \sum_j w_{ij} h_j \right)

P(h_j = 1 \mid v) = \sigma\left( b_j + \sum_i w_{ij} v_i \right)

where \sigma(x) = 1 / (1 + e^{-x}) is the familiar sigmoid function.
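In code, both conditionals are just sigmoids applied to a weighted sum from the other layer – a minimal sketch using the same shapes as above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_v_given_h(h, W, a):
    # P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j), computed for all i at once
    return sigmoid(a + W.dot(h))

def p_h_given_v(v, W, b):
    # P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij), computed for all j at once
    return sigmoid(b + v.dot(W))
```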
To train the network we again want to maximize some objective function. What should that be in this case?
Given that all we have are a bunch of training inputs, we simply want to maximize the joint probability of those inputs, i.e.

\prod_{v \in V} P(v)
Equivalently, we can maximize the log probability:

\sum_{v \in V} \log P(v)
Where V is of course the set of all training inputs.
Note that we do not use any training targets – we simply want to model the input. Thus, the RBM is an unsupervised learning algorithm, like the Gaussian Mixture Model, for example.
The learning algorithm used to train RBMs is called “contrastive divergence”. (We cannot simply run gradient ascent on the log probability directly: the exact gradient involves an expectation over the model’s own distribution, which is intractable to compute, so contrastive divergence approximates it by sampling.)
Contrastive Divergence
Contrastive divergence is highly non-trivial compared to an algorithm like gradient descent, which involves just taking the derivative of the objective function.
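Just to give a rough feel for what it does, here is a minimal sketch of one CD-1 update (the simplest variant, with a single step of Gibbs sampling) for one training vector v0. The learning rate and variable names are my own choices, not anything standardized:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, learning_rate=0.1):
    # "positive" phase: hidden probabilities and a sample, driven by the data
    ph0 = sigmoid(b + v0.dot(W))
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)

    # "negative" phase: reconstruct the visible units, then recompute the hiddens
    pv1 = sigmoid(a + W.dot(h0))
    v1 = (np.random.rand(*pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1.dot(W))

    # move toward the data statistics and away from the reconstruction statistics
    W += learning_rate * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += learning_rate * (v0 - v1)
    b += learning_rate * (ph0 - ph1)
    return W, a, b
```

The update pulls the weights toward the statistics of the data (v0, ph0) and pushes them away from the statistics of the model’s own reconstruction (v1, ph1) – that is the “contrastive” part.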
If you are going to use deep belief networks on some task, you probably do not want to reinvent the wheel. There are packages out there, such as Theano, pylearn2, and Torch7 – where a lot of people who are experts at this stuff have already written and optimized the code for performance.
Learning how to use those packages will take some effort in itself – so unless you are going to do research I would recommend holding off on understanding the technical details of contrastive divergence.
You still have a lot to think about – what learning rate should you choose? How many layers should your network have? How many units per layer? What about regularization and momentum?
These are not easy questions to answer, and only through experience will you get a “feel” for it.
Where to learn more
This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python. We fully derive and implement the contrastive divergence algorithm, so you can see it run yourself! We’ll also demonstrate how it helps us get around the “vanishing gradient problem”.