# NEW Deep Learning Course: GANs and Variational Autoencoders

August 1, 2017

You asked for it, and it’s here!

I am pleased to announce my latest course:

Deep Learning: Variational Autoencoders and GANs

If you don’t want to read all my crazy writing and you just want to grab your coupon, click here: https://www.udemy.com/deep-learning-gans-and-variational-autoencoders/?couponCode=IAMAVIP

GANs have been called one of the most interesting developments in deep learning in 2016.

This is coming from Yann LeCun – one of the grandmasters of deep learning.

Most of you already know why GANs are cool – that’s why you’ve been asking me for this course.

But just in case you don’t:

GANs are notable for being able to produce extremely high-quality, high-resolution, sharp samples.

We’ve had neural networks (and non-deep learning ML algorithms) that can generate samples for decades… but none come close to the quality of images generated by GANs.

This is some Jason Bourne-level stuff… you know how they “enhance” a tiny / blurry image from some government spy camera?

Guess what can actually do that? GANs.

Fun fact: a group of Harvard PhDs JUST did an AMA on Reddit this week. Check out what one had to say about Unsupervised Deep Learning:

There are a ton of more cool applications that we’ll discuss in the course like reinforcement learning. This stuff is basically the latest-and-greatest in deep learning.

Why variational autoencoders and not just GANs? Both of these neural networks fall into the category of “deep neural network samplers” – they both attempt to learn the structure of data in some way, which you can then use to generate new data that mimics what was learned.

I think variational autoencoders are super cool because they combine 2 of my favorite subjects: deep learning and Bayesian machine learning.

They were also invented at approximately the same time and are always mentioned in the context of one another, so in some sense they belong in the same “family” of algorithms.

Another cool thing about this course: a surprising lack of prerequisites! Technically, this will be Deep Learning part 8 and Unsupervised Deep Learning part 2. But, you won’t need to know anything from Deep Learning part 5, 6, or 7, nor Unsupervised Deep Learning part 1 (which is also Deep Learning part 4). You will need to know how to build convolutional neural networks and have a working understanding of Bayes classifiers, but that’s pretty much it! Not that learning how to build convolutional neural networks was an easy place to get to, but now that you’re there, you can breathe easy.

Now, this course is going to be available to all students on Udemy, but if you’re receiving this email, that means you’ve already taken a course of mine, which is why I am offering you something exclusive: The VIP version of this course.

It will ONLY be available to students who use this link, which applies a special coupon that I can check that you used:

https://www.udemy.com/deep-learning-gans-and-variational-autoencoders/?couponCode=IAMAVIP

What’s in it?

VIP EXTRAS

Here’s what you get if you sign up for the VIP version of the course:

Because I am such a geek, I decided to use LaTeX to create short, concise tutorials for both the GAN and variational autoencoder.

These will be great to help you review the material and ingest it in a different format, no doubt increasing your understanding of what you learn in the course.

You can take it with you on the train and read it at your leisure!

Students are constantly asking me for PDFs just like this. Ask and you shall receive!

But that’s not all…

One of the COOLEST applications of neural networks that just “learn the structure of data” (as opposed to trying to assign labels to it) is STYLE TRANSFER.

Ever wanted to know what the New York City skyline would look like if it were painted by Picasso?

Now you can find out!

Style transfer networks are neural networks that learn the “essence” or “style” of one image, and then have the ability to apply that same style to new images.

I find this to be one of the most FASCINATING applications of using deep learning to learn the structure of data.

Of course… for most of us, such a neural network would take around 4 months to train…

So here’s what you get for signing up for the VIP special:

A SUPER SIMPLE script you can just run, which automatically downloads pre-trained neural network weights for 3 different styles (Dora Maar, Rain Princess, and Starry Night), which you can then use to apply those styles to ANY input image within SECONDS.

The neural network accepts any size input image because all the weights are convolutional filters!

How cool is that?

And it’s in Tensorflow, so all you Windows users out there don’t feel left out. =)

And remember: these VIP specials are only available IF you use the VIP COUPON (IAMAVIP) – so make sure you use the coupons / links in this email, otherwise, you will not get the VIP bonuses!

https://www.udemy.com/deep-learning-gans-and-variational-autoencoders/?couponCode=IAMAVIP

Quick note: If you don’t receive the VIP extras right away, don’t worry. I will be going through the list myself, you WILL get them.

#deep learning #gans #unsupervised learning #variational autoencoders

# Boston Dynamics – Introducing Handle

February 28, 2017

Amazing!

#artificial intelligence #boston dynamics #deep learning #reinforcement learning #robots

# New course! Reinforcement Learning in Python

January 27, 2017

I would like to announce my latest course – Artificial Intelligence: Reinforcement Learning in Python.

This has been one of my most requested topics since I started covering deep learning. This course has been brewing in the background for months.

The result: This is my most MASSIVE course yet.

Usually, my courses will introduce you to a handful of new algorithms (which is a lot for people to handle already). This course covers SEVENTEEN (17!) new algorithms.

This will keep you busy for a LONG time.

If you’re used to supervised and unsupervised machine learning, realize this: Reinforcement Learning is a whole new ball game.

There are so many new concepts to learn, and so much depth. It’s COMPLETELY different from anything you’ve seen before.

That’s why we build everything slowly, from the ground up.

There’s tons of new theory, but as you’ve come to expect, anytime we introduce new theory it is accompanied by full code examples.

What is Reinforcement Learning? It’s the technology behind self-driving cars, AlphaGo, video game-playing programs, and more.

You’ll learn that while deep learning has been very useful for tasks like driving and playing Go, it’s in fact just a small part of the picture.

Reinforcement Learning provides the framework that allows deep learning to be useful.

Without reinforcement learning, all we have is a basic (albeit very accurate) labeling machine.

With Reinforcement Learning, you have intelligence.

Reinforcement Learning has even been used to model processes in psychology and neuroscience. It’s truly the closest thing we have to “machine intelligence” and “general AI”.

What are you waiting for? Sign up now!!

COUPON:

https://www.udemy.com/artificial-intelligence-reinforcement-learning-in-python/?couponCode=EARLYBIRDSITE

#artificial intelligence #deep learning #reinforcement learning

# New course – Natural Language Processing: Deep Learning in Python part 6

August 9, 2016

[Scroll to the bottom for the early bird discount if you already know what this course is about]

In this course we are going to look at advanced NLP using deep learning.

Previously, you learned about some of the basics, like how many NLP problems are just regular machine learning and data science problems in disguise, and simple, practical methods like bag-of-words and term-document matrices.

These allowed us to do some pretty cool things, like detect spam emails, write poetry, spin articles, and group together similar words.

In this course I’m going to show you how to do even more awesome things. We’ll learn not just 1, but 4 new architectures in this course.

First up is word2vec.

In this course, I’m going to show you exactly how word2vec works, from theory to implementation, and you’ll see that it’s merely the application of skills you already know.

Word2vec is interesting because it magically maps words to a vector space where you can find analogies, like:

• king – man = queen – woman
• France – Paris = England – London
• December – Novemeber = July – June

We are also going to look at the GLoVe method, which also finds word vectors, but uses a technique called matrix factorization, which is a popular algorithm for recommender systems.

Amazingly, the word vectors produced by GLoVe are just as good as the ones produced by word2vec, and it’s way easier to train.

We will also look at some classical NLP problems, like parts-of-speech tagging and named entity recognition, and use recurrent neural networks to solve them. You’ll see that just about any problem can be solved using neural networks, but you’ll also learn the dangers of having too much complexity.

Lastly, you’ll learn about recursive neural networks, which finally help us solve the problem of negation in sentiment analysis. Recursive neural networks exploit the fact that sentences have a tree structure, and we can finally get away from naively using bag-of-words.

All of the materials required for this course can be downloaded and installed for FREE. We will do most of our work in Numpy and Matplotlib,and Theano. I am always available to answer your questions and help you along your data science journey.

See you in class!

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=EARLYBIRDSITE

UPDATE: New coupon if the above is sold out:

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=SLOWBIRD_SITE

#deep learning #GLoVe #natural language processing #nlp #python #recursive neural networks #tensorflow #theano #word2vec

# New course – Deep Learning part 5: Recurrent Neural Networks in Python

July 14, 2016

New course out today – Recurrent Neural Networks in Python: Deep Learning part 5.

If you already know what the course is about (recurrent units, GRU, LSTM), grab your 50% OFF coupon and go!:

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

Like the course I just released on Hidden Markov Models, Recurrent Neural Networks are all about learning sequences – but whereas Markov Models are limited by the Markov assumption, Recurrent Neural Networks are not – and as a result, they are more expressive, and more powerful than anything we’ve seen on tasks that we haven’t made progress on in decades.

Sequences appear everywhere – stock prices, language, credit scoring, and webpage visits.

Recurrent neural networks have a history of being very hard to train. It hasn’t been until recently that we’ve found ways around what is called the vanishing gradient problem, and since then, recurrent neural networks have become one of the most popular methods in deep learning.

If you took my course on Hidden Markov Models, we are going to go through a lot of the same examples in this class, except that our results are going to be a lot better.

Our classification accuracies will increase, and we’ll be able to create vectors of words, or word embeddings, that allow us to visualize how words are related on a graph.

We’ll see some pretty interesting results, like that our neural network seems to have learned that all religions and languages and numbers are related, and that cities and countries have hierarchical relationships.

If you’re interested in discovering how modern deep learning has propelled machine learning and data science to new heights, this course is for you.

I’ll see you in class.

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

#data science #deep learning #gru #lstm #machine learning #word vectors

# New Course: Unsupervised Machine Learning – Hidden Markov Models in Python

June 13, 2016

EARLY BIRD 50% OFF COUPON: CLICK HERE

Hidden Markov Models are all about learning sequences.

A lot of the data that would be very useful for us to model is in sequences. Stock prices are sequences of prices. Language is a sequence of words. Credit scoring involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you’re going to default. In short, sequences are everywhere, and being able to analyze them is an important skill in your data science toolbox.

The easiest way to appreciate the kind of information you get from a sequence is to consider what you are reading right now. If I had written the previous sentence backwards, it wouldn’t make much sense to you, even though it contained all the same words. So order is important.

While the current fad in deep learning is to use recurrent neural networks to model sequences, I want to first introduce you guys to a machine learning algorithm that has been around for several decades now – the Hidden Markov Model.

This course follows directly from my first course in Unsupervised Machine Learning for Cluster Analysis, where you learned how to measure the probability distribution of a random variable. In this course, you’ll learn to measure the probability distribution of a sequence of random variables.

You guys know how much I love deep learning, so there is a little twist in this course. We’ve already covered gradient descent and you know how central it is for solving deep learning problems. I claimed that gradient descent could be used to optimize any objective function. In this course I will show you how you can use gradient descent to solve for the optimal parameters of an HMM, as an alternative to the popular expectation-maximization algorithm.

We’re going to do it in Theano, which is a popular library for deep learning. This is also going to teach you how to work with sequences in Theano, which will be very useful when we cover recurrent neural networks and LSTMs.

This course is also going to go through the many practical applications of Markov models and hidden Markov models. We’re going to look at a model of sickness and health, and calculate how to predict how long you’ll stay sick, if you get sick. We’re going to talk about how Markov models can be used to analyze how people interact with your website, and fix problem areas like high bounce rate, which could be affecting your SEO. We’ll build language models that can be used to identify a writer and even generate text – imagine a machine doing your writing for you.

We’ll look at what is possibly the most recent and prolific application of Markov models – Google’s PageRank algorithm. And finally we’ll discuss even more practical applications of Markov models, including generating images, smartphone autosuggestions, and using HMMs to answer one of the most fundamental questions in biology – how is DNA, the code of life, translated into physical or behavioral attributes of an organism?

All of the materials of this course can be downloaded and installed for FREE. We will do most of our work in Numpy and Matplotlib, along with a little bit of Theano. I am always available to answer your questions and help you along your data science journey.

Sign up now and get 50% off by clicking HERE

#data science #deep learning #hidden markov models #machine learning #recurrent neural networks #theano

# New Deep Learning course on Udemy

February 26, 2016

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

#adagrad #aws #batch gradient descent #deep learning #ec2 #gpu #machine learning #nesterov momentum #numpy #nvidia #python #rmsprop #stochastic gradient descent #tensorflow #theano

# Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

• Are interested in deep learning (this tutorial uses gradient descent)
• Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
• Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
• Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z = T.dot(X, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)


I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2$$

Where $$\hat{X}$$ is the reconstruction of $$X$$. If there is only one eigenvector, let’s call this $$v$$, then this becomes:

$$J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2$$

This is equivalent to the Frobenius norm, so we can write:

$$J = |X – Xvv^T|_F$$

One identity of the Frobenius norm is:

$$|A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) }$$

Which means we can rewrite the loss function as:

$$J = Tr( (X – Xvv^T)^T(X – Xvv^T) )$$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on $$v$$, you can use matrix algebra to rearrange this to get:

$$v^* = argmin\{-Tr(X^TXvv^T) \}$$

Which again using reordering would be equivalent to maximizing:

$$v^* = argmax\{ v^TX^TXv \}$$

The corresponding eigenvalue would then be:

$$\lambda = v^TX^TXv$$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$v \leftarrow v + \eta \nabla_v(v^TX^TXv)$$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i )$$

Next, note that to implement this algorithm you never need to actually calculate the covariance $$X^T X$$. If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get $$Xv$$, which is only of size $$N \times 1$$. You can then “dot” this with itself to get a scalar, which is only an $$O(N)$$ operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")
Xv = T.dot(X, v)
cost = T.dot(Xv.T, Xv) - np.sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i))

gv = T.grad(cost, v)


Note that we re-normalize the eigenvector on each step, so that $$v^T v = 1$$.

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)

train = theano.function(
inputs=[X],
)


Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
... initialize variables and expressions ...
... initialize theano train function ...
while t < max_iterations and change in v < tol:
outputs = train(data)

... return eigenvalues and eigenvectors ...


This is not really trivial but at the same time it's a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course on Udemy "Data Science: Deep Learning in Python" and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano

# A Tutorial on Autoencoders for Deep Learning

December 31, 2015

Despite its somewhat initially-sounding cryptic name, autoencoders are a fairly basic machine learning model (and the name is not cryptic at all when you know what it does).

Autoencoders belong to the neural network family, but they are also closely related to PCA (principal components analysis).

Some facts about the autoencoder:

• It is an unsupervised learning algorithm (like PCA)
• It minimizes the same objective function as PCA
• It is a neural network
• The neural network’s target output is its input

The last point is key here. This is the architecture of an autoencoder:

So the dimensionality of the input is the same as the dimensionality of the output, and essentially what we want is x’ = x.

It can be shown that the objective function for PCA is:

$$J = \sum_{n=1}^{N} |x(n) – \hat{x}(n)|^2$$

Where the prediction $$\hat{x}(n) = Q^{-1}Qx(n)$$.

Q can be the full transformation matrix (which would result in getting exactly the old x back), or it can be a “rank k” matrix (i.e. keeping the k-most relevant eigenvectors), which would then result in only an approximation of x.

So the objective function can be written as:

$$J = \sum_{n=1}^{N} |x(n) – Q^{-1}Qx(n)|^2$$

Recall that to get the value at the hidden layer, we simply multiply the input->hidden weights by the input.

Like so:

$$z = f(Wx)$$

And to get the value at the output, we multiply the hidden->output weights by the hidden layer values, like so:

$$y = g(Vz)$$

The choice of $$f$$ and $$g$$ is up to us, we just have to know how to take the derivative for backpropagation.

We are of course free to make them “identity” functions, such that:

$$y = g(V f(Wx)) = VWx$$

This gives us the objective:

$$J = \sum_{n=1}^{N} |x(n) – VWx(n)|^2$$

Which is the same as PCA!

## If autoencoders are similar to PCA, why do we need autoencoders?

Autoencoders are much more flexible than PCA.

Recall that with neural networks we have an activation function – this can be a “ReLU” (aka. rectifier), “tanh” (hyperbolic tangent), or sigmoid.

This introduces nonlinearities in our encoding, whereas PCA can only represent linear transformations.

The network representation also means you can stack autoencoders to form a deep network.

## Cool theory bro, but what can autoencoders actually do for me?

Good question!

Similar to PCA – autoencoders can be used for finding a low-dimensional representation of your input data. Why is this useful?

Some of your features may be redundant or correlated, resulting in wasted processing time and overfitting in your model (too many parameters).

It is thus ideal to only include the features we need.

If your “reconstruction” of x is very accurate, that means your low-dimensional representation is good.

You can then use this transformation as input into another model.

## Training an autoencoder

Since autoencoders are really just neural networks where the target output is the input, you actually don’t need any new code.

Suppose we’re working with a sci-kit learn-like interface.

model.fit(X, Y)


You would just have:

model.fit(X, X)


Pretty simple, huh?

All the usual neural network training strategies work with autoencoders too:

• backpropagation
• regularization
• dropout
• RBM pre-training

If you want to get good with autoencoders – I would recommend trying to take some data and an existing neural network package you’re comfortable with – and see what low-dimensional representation you can come up with. How many dimensions are there?

Autoencoders are part of a family of unsupervised deep learning methods, which I cover in-depth in my course, Unsupervised Deep Learning in Python. We discuss how to stack autoencoders to build deep belief networks, and compare them to RBMs which can be used for the same purpose. We derive all the equations and write all the code from scratch – no shortcuts. Ask me for a coupon so I can give you a discount!

P.S. “Autoencoders” means “encodes itself”. Not so cryptic now, right?

#autoencoders #deep learning #machine learning #pca #principal components analysis #unsupervised learning

# Deep Learning Tutorial part 3/3: Deep Belief Networks

June 15, 2015

This is part 3/3 of a series on deep belief networks. Part 1 focused on the building blocks of deep neural nets – logistic regression and gradient descent. Part 2 focused on how to use logistic regression as a building block to create neural networks, and how to train them. Part 3 will focus on answering the question: “What is a deep belief network?” and the algorithms we use to do training and prediction.

This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python.

## What is a deep belief network / deep neural network?

In its simplest form, a deep belief network looks exactly like the artificial neural networks we learned about in part 2! As long as there is at least 1 hidden layer, the model is considered to be “deep”. (I Googled around on this topic for quite awhile, it seems people just started using the term “deep learning” on any kind of neural network one day as a buzzword, regardless of the number of layers.)

It is common to use more than 1 hidden layer, and new research has been exploring different architectures than the simple “feedforward” neural network which we have been studying. Recurrent neural networks have become very popular in recent years. These networks contain “feedback” connections and contain a “memory” of past inputs. We will not talk about these in this post.

Ok, so then how is this different than part 2?

One reason deep learning has come to prominence in the past decade is due to increased computational power. It used to be that computers were just too slow to handle training large networks, especially in computer vision where each pixel of an image is an input. We have new libraries that take advantage of the GPU (graphics processing unit), which can do floating point math much faster than the CPU.

Note that because the architecture of the deep belief network is exactly the same, the flow of data from input to output (i.e. prediction) is exactly the same.

The only part that’s different is how the network is trained.

One problem with traditional multilayer perceptrons / artificial neural networks is that backpropagation can often lead to “local minima”. This is when your “error surface” contains multiple grooves and as you perform gradient descent, you fall into a groove, but it’s not the lowest possible groove.

Deep belief networks solve this problem by using an extra step called “pre-training”. Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there.

So what is this pre-training step and how does it work?

To understand this, we first need to learn about “Restricted Boltzmann Machines” or RBMs.

[Strictly speaking, multiple layers of RBMs would create a deep belief network – this is an unsupervised model. A supervised model with a softmax output would be called a deep neural network.]

## Restricted Boltzmann Machines

Going back to our original simple neural network, let’s draw out the RBM. I’ve circled it in green here.

The RBM contains all the x’s, all the z’s, and the W in between. That’s pretty much all there is to it. An RBM is simply two layers of a neural network and the weights between them.

In an RBM we still refer to the x’s as the “input layer” and the z’s as the “hidden layer”. If you’ve ever learned about PCA, SVD, latent semantic analysis, or Hidden Markov Models – the idea of “hidden” or “latent” variables should be familiar to you.

As a simple example, you might observe that the ground is wet. You could have multiple hidden or latent variables, one representing the fact that it’s raining, another representing the fact that your neighbor is watering her garden.

In a sense they are the hidden causes or “base” facts that generate the observations that you measure.

Since RBMs are just a “slice” of a neural network, deep neural networks can be considered to be a bunch of RBMs “stacked” together.

## Variables in a Restricted Boltzmann Machine

In this section we will look more closely at what an RBM is – what variables are contained and why that makes sense – through a probabilistic model – similar to what we did for logistic regression in part 1.

Although not shown explicitly, each layer of the RBM will have its own bias weights – W is the only weight shared between them. We will denote these bias weight as “a” for the visible units, and “b” for the hidden units.

We’re going to rename some variables to match what they are called in most tutorials and articles on the Internet. We’ll denote the “visible” vectors (i.e. inputs) by v and index each element of v by i. We’ll denote the “hidden” units by h and index each element by j.

Using our new variables, v, h, a, b, and including w(i,j) as before – we can define the “energy” of a network as:

In vector / matrix notation this can be written as:

We can define the probability of observing an input v with hidden vector h as:

Where Z is a normalizing constant so that the sum of all events = 1.

We can get the marginal distribution P(v) by summing over h:

Similar to logistic regression, we can define the conditional probabilities P(v(i) = 1 | h) and P(h(j) = 1 | v):

To train the network we again want to maximize some objective function. What should that be in this case?

Given that all we have are a bunch of training inputs, we simply want to maximize the joint probability of those inputs, i.e.

Equivalently, we can maximize the log probability:

Where V is of course the set of all training inputs.

Note that we do not use any training targets – we simply want to model the input. Thus, RBM is an unsupervised learning algorithm, like the Gaussian Mixture Model, for example.

The learning algorithm used to train RBMs is called “contrastive divergence”.

## Contrastive Divergence

Contrastive divergence is highly non-trivial compared to an algorithm like gradient descent, which involved just taking the derivative of the objective function.

If you are going to use deep belief networks on some task, you probably do not want to reinvent the wheel. There are packages out there, such as Theano, pylearn2, and Torch7 – where a lot of people who are experts at this stuff have already written and optimized the code for performance.

Learning how to use those packages will take some effort in itself – so unless you are going to do research I would recommend holding off on understanding the technical details of contrastive divergence.

You still have a lot to think about – what learning rate should you choose? How many layers should your network have? How many units per layer? What about regularization and momentum?

These are not easy questions to answer, and only through experience will you get a “feel” for it.