This is part 3/3 of a series on deep belief networks. Part 1 focused on the building blocks of deep neural nets – logistic regression and gradient descent. Part 2 focused on how to use logistic regression as a building block to create neural networks, and how to train them. Part 3 will focus on answering the question “What is a deep belief network?” and on the algorithms we use for training and prediction.
This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python.
What is a deep belief network / deep neural network?
In its simplest form, a deep belief network looks exactly like the artificial neural networks we learned about in part 2! As long as there is at least 1 hidden layer, the model is considered to be “deep”. (I Googled around on this topic for quite a while; it seems people just started using the term “deep learning” as a buzzword for any kind of neural network one day, regardless of the number of layers.)
It is common to use more than 1 hidden layer, and new research has been exploring architectures different from the simple “feedforward” neural network we have been studying. Recurrent neural networks have become very popular in recent years. These networks have “feedback” connections and maintain a “memory” of past inputs. We will not talk about these in this post.
Ok, so then how is this different from part 2?
One reason deep learning has come to prominence in the past decade is due to increased computational power. It used to be that computers were just too slow to handle training large networks, especially in computer vision where each pixel of an image is an input. We have new libraries that take advantage of the GPU (graphics processing unit), which can do floating point math much faster than the CPU.
Note that because the architecture of the deep belief network is exactly the same, the flow of data from input to output (i.e. prediction) is exactly the same.
The only part that’s different is how the network is trained.
One problem with traditional multilayer perceptrons / artificial neural networks is that backpropagation can often lead to “local minima”. This happens when your “error surface” contains multiple grooves; as you perform gradient descent, you fall into one of the grooves, but it is not the lowest possible groove.

Deep belief networks solve this problem by using an extra step called “pre-training”. Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there.
So what is this pre-training step and how does it work?
To understand this, we first need to learn about “Restricted Boltzmann Machines” or RBMs.
[Strictly speaking, multiple layers of RBMs would create a deep belief network – this is an unsupervised model. A supervised model with a softmax output would be called a deep neural network.]
Restricted Boltzmann Machines
Going back to our original simple neural network, let’s draw out the RBM. I’ve circled it in green here.

The RBM contains all the x’s, all the z’s, and the W in between. That’s pretty much all there is to it. An RBM is simply two layers of a neural network and the weights between them.
In an RBM we still refer to the x’s as the “input layer” and the z’s as the “hidden layer”. If you’ve ever learned about PCA, SVD, latent semantic analysis, or Hidden Markov Models – the idea of “hidden” or “latent” variables should be familiar to you.
As a simple example, you might observe that the ground is wet. You could have multiple hidden or latent variables, one representing the fact that it’s raining, another representing the fact that your neighbor is watering her garden.
In a sense they are the hidden causes or “base” facts that generate the observations that you measure.
Since RBMs are just a “slice” of a neural network, deep neural networks can be considered to be a bunch of RBMs “stacked” together.
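To make the “stacking” idea concrete, here is a minimal sketch of greedy layer-wise pre-training in NumPy: each RBM is trained on the hidden activations produced by the one below it. The function train_rbm is hypothetical – it stands in for whatever RBM training routine you use (we will get to that below) – and the layer sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_stack(X, hidden_layer_sizes, train_rbm):
    """Greedily train one RBM per hidden layer.

    X: (N, D) matrix of training inputs.
    train_rbm(data, n_hidden) is assumed to return (W, a, b) for one RBM.
    """
    params = []
    data = X
    for n_hidden in hidden_layer_sizes:
        W, a, b = train_rbm(data, n_hidden)  # hypothetical RBM trainer
        params.append((W, a, b))
        data = sigmoid(data.dot(W) + b)      # hidden activations feed the next RBM
    return params
```

Each (W, a, b) would then initialize one layer of the deep neural network, and backpropagation fine-tunes the whole thing afterward.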
Variables in a Restricted Boltzmann Machine
In this section we will look more closely at what an RBM is – which variables it contains and why they make sense – through a probabilistic model, similar to what we did for logistic regression in part 1.
Although not shown explicitly, each layer of the RBM will have its own bias weights – the weight matrix W is the only parameter shared between the two layers. We will denote these bias weights as “a” for the visible units, and “b” for the hidden units.
We’re going to rename some variables to match what they are called in most tutorials and articles on the Internet. We’ll denote the “visible” vectors (i.e. inputs) by v and index each element of v by i. We’ll denote the “hidden” units by h and index each element by j.

Using our new variables v, h, a, b, and including w(i,j) as before, we can define the “energy” of a network as:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j
In vector / matrix notation this can be written as:

E(v, h) = -a^T v - b^T h - v^T W h
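To make the notation concrete, here is the energy computed in NumPy for a single visible vector v and hidden vector h (a minimal sketch; the variable names mirror the equations above):

```python
import numpy as np

def energy(v, h, W, a, b):
    # E(v, h) = -a'v - b'h - v'Wh
    # shapes: v is (D,), h is (M,), W is (D, M), a is (D,), b is (M,)
    return -a.dot(v) - b.dot(h) - v.dot(W).dot(h)
```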
We can define the probability of observing an input v with hidden vector h as:

P(v, h) = \frac{e^{-E(v, h)}}{Z}
Where Z = \sum_{v, h} e^{-E(v, h)} is a normalizing constant (the “partition function”) that makes the probabilities of all configurations sum to 1.
We can get the marginal distribution P(v) by summing over h:

P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}
Similar to logistic regression, we can write down the conditional probabilities P(v(i) = 1 | h) and P(h(j) = 1 | v) – both turn out to be sigmoids:

P(v_i = 1 \mid h) = \sigma\left( a_i + \sum_j w_{ij} h_j \right)

P(h_j = 1 \mid v) = \sigma\left( b_j + \sum_i w_{ij} v_i \right)

where \sigma(x) = 1 / (1 + e^{-x}) is the familiar sigmoid function.
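In code, both conditionals are just sigmoids applied to a weighted sum from the other layer – a minimal sketch using the same shapes as above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_v_given_h(h, W, a):
    # P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j), computed for all i at once
    return sigmoid(a + W.dot(h))

def p_h_given_v(v, W, b):
    # P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij), computed for all j at once
    return sigmoid(b + v.dot(W))
```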
To train the network we again want to maximize some objective function. What should that be in this case?
Given that all we have are a bunch of training inputs, we simply want to maximize the joint probability of those inputs, i.e.

\prod_{v \in V} P(v)
Equivalently, we can maximize the log probability:

\sum_{v \in V} \log P(v)
Where V is of course the set of all training inputs.
Note that we do not use any training targets – we simply want to model the input. Thus, the RBM is an unsupervised learning algorithm, like the Gaussian Mixture Model, for example.
The learning algorithm used to train RBMs is called “contrastive divergence”. (We cannot simply run gradient ascent on the log probability directly: the exact gradient involves an expectation over the model’s own distribution, which is intractable to compute, so contrastive divergence approximates it by sampling.)
Contrastive Divergence
Contrastive divergence is highly non-trivial compared to an algorithm like gradient descent, which involves just taking the derivative of the objective function.
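Just to give a rough feel for what it does, here is a minimal sketch of one CD-1 update (the simplest variant, with a single step of Gibbs sampling) for one training vector v0. The learning rate and variable names are my own choices, not anything standardized:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, learning_rate=0.1):
    # "positive" phase: hidden probabilities and a sample, driven by the data
    ph0 = sigmoid(b + v0.dot(W))
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)

    # "negative" phase: reconstruct the visible units, then recompute the hiddens
    pv1 = sigmoid(a + W.dot(h0))
    v1 = (np.random.rand(*pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1.dot(W))

    # move toward the data statistics and away from the reconstruction statistics
    W += learning_rate * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += learning_rate * (v0 - v1)
    b += learning_rate * (ph0 - ph1)
    return W, a, b
```

The update pulls the weights toward the statistics of the data (v0, ph0) and pushes them away from the statistics of the model’s own reconstruction (v1, ph1) – that is the “contrastive” part.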
If you are going to use deep belief networks on some task, you probably do not want to reinvent the wheel. There are packages out there, such as Theano, pylearn2, and Torch7 – where a lot of people who are experts at this stuff have already written and optimized the code for performance.
Learning how to use those packages will take some effort in itself – so unless you are going to do research I would recommend holding off on understanding the technical details of contrastive divergence.
You still have a lot to think about – what learning rate should you choose? How many layers should your network have? How many units per layer? What about regularization and momentum?
These are not easy questions to answer, and only through experience will you get a “feel” for it.
Where to learn more
This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python. We fully derive and implement the contrastive divergence algorithm, so you can see it run yourself! We’ll also demonstrate how it helps us get around the “vanishing gradient problem”.