# New Years Udemy Coupons! All Udemy Courses only \$10

January 1, 2017

Act fast! These \$10 Udemy Coupons expire in 10 days.

Ensemble Machine Learning: Random Forest and AdaBoost

Deep Learning Prerequisites: Linear Regression in Python

https://www.udemy.com/data-science-linear-regression-in-python/?couponCode=BOXINGDAY

Deep Learning Prerequisites: Logistic Regression in Python

https://www.udemy.com/data-science-logistic-regression-in-python/?couponCode=BOXINGDAY

Deep Learning in Python

https://www.udemy.com/data-science-deep-learning-in-python/?couponCode=BOXINGDAY

Practical Deep Learning in Theano and TensorFlow

https://www.udemy.com/data-science-deep-learning-in-theano-tensorflow/?couponCode=BOXINGDAY

Deep Learning: Convolutional Neural Networks in Python

https://www.udemy.com/deep-learning-convolutional-neural-networks-theano-tensorflow/?couponCode=BOXINGDAY

Unsupervised Deep Learning in Python

https://www.udemy.com/unsupervised-deep-learning-in-python/?couponCode=BOXINGDAY

Deep Learning: Recurrent Neural Networks in Python

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=BOXINGDAY

Advanced Natural Language Processing: Deep Learning in Python

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=BOXINGDAY

Easy Natural Language Processing in Python

https://www.udemy.com/data-science-natural-language-processing-in-python/?couponCode=BOXINGDAY

Cluster Analysis and Unsupervised Machine Learning in Python

https://www.udemy.com/cluster-analysis-unsupervised-machine-learning-python/?couponCode=BOXINGDAY

Unsupervised Machine Learning: Hidden Markov Models in Python

https://www.udemy.com/unsupervised-machine-learning-hidden-markov-models-in-python/?couponCode=BOXINGDAY

Data Science: Supervised Machine Learning in Python

https://www.udemy.com/data-science-supervised-machine-learning-in-python/?couponCode=BOXINGDAY

Bayesian Machine Learning in Python: A/B Testing

https://www.udemy.com/bayesian-machine-learning-in-python-ab-testing/?couponCode=BOXINGDAY

SQL for Newbs and Marketers

https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data/?couponCode=BOXINGDAY

How to get ANY course on Udemy for \$10 (please use my coupons above for my courses):

# New course – Natural Language Processing: Deep Learning in Python part 6

August 9, 2016

[Scroll to the bottom for the early bird discount if you already know what this course is about]

In this course we are going to look at advanced NLP using deep learning.

Previously, you learned about some of the basics, like how many NLP problems are just regular machine learning and data science problems in disguise, and simple, practical methods like bag-of-words and term-document matrices.

These allowed us to do some pretty cool things, like detect spam emails, write poetry, spin articles, and group together similar words.

In this course I’m going to show you how to do even more awesome things. We’ll learn not just 1, but 4 new architectures in this course.

First up is word2vec.

In this course, I’m going to show you exactly how word2vec works, from theory to implementation, and you’ll see that it’s merely the application of skills you already know.

Word2vec is interesting because it magically maps words to a vector space where you can find analogies, like:

• king – man = queen – woman
• France – Paris = England – London
• December – November = July – June
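To make the arithmetic concrete, here is a minimal sketch of how an analogy like the first one can be looked up. The 2-D vectors are made up by hand for illustration (they are not the output of word2vec), and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Toy, hand-picked 2-D "embeddings" (axis 0: royalty, axis 1: gender).
# These are NOT learned word2vec vectors; they only illustrate the arithmetic.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(u, w):
    """Cosine similarity between two vectors."""
    return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))

def analogy(a, b, c, vectors):
    """Return the word d that best satisfies a - b = d - c, i.e. d is closest to a - b + c."""
    target = vectors[a] - vectors[b] + vectors[c]
    # Exclude the three query words so the answer is a new word
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

With real embeddings the dictionary would hold thousands of learned vectors, but the lookup is the same idea: do the vector arithmetic, then find the nearest remaining word.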

We are also going to look at the GloVe method, which also finds word vectors, but uses a technique called matrix factorization, which is a popular algorithm for recommender systems.

Amazingly, the word vectors produced by GloVe are just as good as the ones produced by word2vec, and GloVe is way easier to train.

We will also look at some classical NLP problems, like parts-of-speech tagging and named entity recognition, and use recurrent neural networks to solve them. You’ll see that just about any problem can be solved using neural networks, but you’ll also learn the dangers of having too much complexity.

Lastly, you’ll learn about recursive neural networks, which finally help us solve the problem of negation in sentiment analysis. Recursive neural networks exploit the fact that sentences have a tree structure, and we can finally get away from naively using bag-of-words.

See you in class!

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=EARLYBIRDSITE

UPDATE: New coupon if the above is sold out:

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=SLOWBIRD_SITE

#deep learning #GLoVe #natural language processing #nlp #python #recursive neural networks #tensorflow #theano #word2vec

# New course – Deep Learning part 5: Recurrent Neural Networks in Python

July 14, 2016

New course out today – Recurrent Neural Networks in Python: Deep Learning part 5.

If you already know what the course is about (recurrent units, GRU, LSTM), grab your 50% OFF coupon and go:

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

Like the course I just released on Hidden Markov Models, Recurrent Neural Networks are all about learning sequences – but whereas Markov Models are limited by the Markov assumption, Recurrent Neural Networks are not – and as a result, they are more expressive, and more powerful than anything we’ve seen on tasks that we haven’t made progress on in decades.

Sequences appear everywhere – stock prices, language, credit scoring, and webpage visits.

Recurrent neural networks have a history of being very hard to train. It hasn’t been until recently that we’ve found ways around what is called the vanishing gradient problem, and since then, recurrent neural networks have become one of the most popular methods in deep learning.

If you took my course on Hidden Markov Models, we are going to go through a lot of the same examples in this class, except that our results are going to be a lot better.

Our classification accuracies will increase, and we’ll be able to create vectors of words, or word embeddings, that allow us to visualize how words are related on a graph.

We’ll see some pretty interesting results, like that our neural network seems to have learned that all religions and languages and numbers are related, and that cities and countries have hierarchical relationships.

If you’re interested in discovering how modern deep learning has propelled machine learning and data science to new heights, this course is for you.

I’ll see you in class.

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

#data science #deep learning #gru #lstm #machine learning #word vectors

# New course: Unsupervised Deep Learning in Python

May 15, 2016

This course is the next logical step in my deep learning, data science, and machine learning series. I’ve done a lot of courses about deep learning, and I just released a course about unsupervised learning, where I talked about clustering and density estimation. So what do you get when you put these 2 together? Unsupervised deep learning!

In this course we’ll start with some very basic stuff – principal components analysis (PCA), and a popular nonlinear dimensionality reduction technique known as t-SNE (t-distributed stochastic neighbor embedding).

Next, we’ll look at a special type of unsupervised neural network called the autoencoder. After describing how an autoencoder works, I’ll show you how you can link a bunch of them together to form a deep stack of autoencoders, which leads to better performance of a supervised deep neural network. Autoencoders are like a non-linear form of PCA.

Last, we’ll look at restricted Boltzmann machines (RBMs). These are yet another popular unsupervised neural network, that you can use in the same way as autoencoders to pretrain your supervised deep neural network. I’ll show you an interesting way of training restricted Boltzmann machines, known as Gibbs sampling, a special case of Markov Chain Monte Carlo, and I’ll demonstrate how even though this method is only a rough approximation, it still ends up reducing other cost functions, such as the one used for autoencoders. This method is also known as Contrastive Divergence or CD-k. As in physical systems, we define a concept called free energy and attempt to minimize this quantity.

Finally, we’ll bring all these concepts together and I’ll show you visually what happens when you use PCA and t-SNE on the features that the autoencoders and RBMs have learned, and we’ll see that even without labels the results suggest that a pattern has been found.

All the materials used in this course are FREE. Since this course is the 4th in the deep learning series, I will assume you already know calculus, linear algebra, and Python coding. You’ll want to install Numpy and Theano for this course. These are essential items in your data analytics toolbox.

If you are interested in deep learning and you want to learn about modern deep learning developments beyond just plain backpropagation, including using unsupervised neural networks to interpret what features can be automatically and hierarchically learned in a deep learning system, this course is for you.

Get your EARLY BIRD coupon for 50% off here: https://www.udemy.com/unsupervised-deep-learning-in-python/?couponCode=EARLYBIRD

# New Deep Learning course on Udemy

February 26, 2016

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and can prevent you from having to be too conservative with your learning rate. You will also learn about adaptive learning rate techniques like AdaGrad and RMSprop, which can also help speed up your training.
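As a rough illustration of the momentum idea, here is a minimal sketch of the classical momentum update on a toy quadratic. The function name, the hyperparameter values, and the toy objective are all assumptions chosen for this example, not code from the course.

```python
import numpy as np

def gradient_descent_momentum(grad, w0, learning_rate=0.1, mu=0.9, n_steps=300):
    """Minimize a function, given its gradient, using classical momentum."""
    w = np.asarray(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        # The velocity is a decaying sum of past (negative) gradients
        velocity = mu * velocity - learning_rate * grad(w)
        w = w + velocity
    return w

# Toy objective f(w) = sum(w**2): its gradient is 2w and its minimum is at w = 0
w_final = gradient_descent_momentum(lambda w: 2 * w, w0=[5.0, -3.0])
print(w_final)
```

Setting `mu=0` recovers plain gradient descent, which makes it easy to compare the two on your own problems.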

In my last course, I just wanted to give you a little sneak peek at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

# A Tutorial on Autoencoders for Deep Learning

December 31, 2015

Despite their initially cryptic-sounding name, autoencoders are a fairly basic machine learning model (and the name is not cryptic at all once you know what they do).

Autoencoders belong to the neural network family, but they are also closely related to PCA (principal components analysis).

• It is an unsupervised learning algorithm (like PCA)
• It minimizes the same objective function as PCA
• It is a neural network
• The neural network’s target output is its input

The last point is key here. This is the architecture of an autoencoder:

So the dimensionality of the input is the same as the dimensionality of the output, and essentially what we want is x’ = x.

It can be shown that the objective function for PCA is:

$$J = \sum_{n=1}^{N} |x(n) - \hat{x}(n)|^2$$

Where the prediction $$\hat{x}(n) = Q^{-1}Qx(n)$$.

Q can be the full transformation matrix (which would result in getting exactly the old x back), or it can be a “rank k” matrix (i.e. keeping the k-most relevant eigenvectors), which would then result in only an approximation of x.

So the objective function can be written as:

$$J = \sum_{n=1}^{N} |x(n) - Q^{-1}Qx(n)|^2$$

Recall that to get the value at the hidden layer, we simply multiply the input->hidden weights by the input.

Like so:

$$z = f(Wx)$$

And to get the value at the output, we multiply the hidden->output weights by the hidden layer values, like so:

$$y = g(Vz)$$

The choice of $$f$$ and $$g$$ is up to us, we just have to know how to take the derivative for backpropagation.

We are of course free to make them “identity” functions, such that:

$$y = g(V f(Wx)) = VWx$$

This gives us the objective:

$$J = \sum_{n=1}^{N} |x(n) - VWx(n)|^2$$

Which is the same as PCA!
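To see the PCA side of this objective in numbers, here is a minimal Numpy sketch. It assumes the rows of Q are the orthonormal eigenvectors of the covariance matrix (so the transpose of Q plays the role of the inverse), and it uses random toy data; the names `Q_full` and `reconstruction_error` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data, one sample per row
X = X - X.mean(axis=0)                                    # PCA assumes centered data

# Rows of Q_full are the eigenvectors of the covariance matrix, most relevant first
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
Q_full = eigvecs[:, order].T

def reconstruction_error(X, Q):
    """J = sum_n |x(n) - Q^T Q x(n)|^2, where Q has orthonormal rows."""
    X_hat = X @ Q.T @ Q
    return float(np.sum((X - X_hat) ** 2))

J_full = reconstruction_error(X, Q_full)        # full transformation: x comes back exactly
J_rank2 = reconstruction_error(X, Q_full[:2])   # rank 2: only an approximation of x
print(J_full, J_rank2)
```

In autoencoder terms, W corresponds to Q and V to its transpose; a linear autoencoder trained on the same objective is free to find any weights that span the same subspace.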

## If autoencoders are similar to PCA, why do we need autoencoders?

Autoencoders are much more flexible than PCA.

Recall that with neural networks we have an activation function – this can be a “ReLU” (aka. rectifier), “tanh” (hyperbolic tangent), or sigmoid.

This introduces nonlinearities in our encoding, whereas PCA can only represent linear transformations.

The network representation also means you can stack autoencoders to form a deep network.

## Cool theory bro, but what can autoencoders actually do for me?

Good question!

Similar to PCA – autoencoders can be used for finding a low-dimensional representation of your input data. Why is this useful?

Some of your features may be redundant or correlated, resulting in wasted processing time and overfitting in your model (too many parameters).

It is thus ideal to only include the features we need.

If your “reconstruction” of x is very accurate, that means your low-dimensional representation is good.

You can then use this transformation as input into another model.

## Training an autoencoder

Since autoencoders are really just neural networks where the target output is the input, you actually don’t need any new code.

Suppose we’re working with a scikit-learn-like interface.

model.fit(X, Y)


You would just have:

model.fit(X, X)


Pretty simple, huh?
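Here is a small self-contained sketch of that idea. `TinyNet` is a hypothetical class written just for this example (it is not a scikit-learn estimator); it assumes a tanh hidden layer, a linear output, squared error, no bias terms, and plain gradient descent, and the toy data is generated at random.

```python
import numpy as np

class TinyNet:
    """Hypothetical one-hidden-layer regressor with a scikit-learn-like interface."""

    def __init__(self, n_hidden, learning_rate=0.02, n_epochs=2000, seed=0):
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.rng = np.random.default_rng(seed)

    def fit(self, X, Y):
        n_in, n_out = X.shape[1], Y.shape[1]
        self.W = self.rng.normal(scale=0.1, size=(n_in, self.n_hidden))
        self.V = self.rng.normal(scale=0.1, size=(self.n_hidden, n_out))
        for _ in range(self.n_epochs):
            Z = np.tanh(X @ self.W)        # hidden layer values
            error = Z @ self.V - Y         # linear output minus target
            # Gradients of the mean squared error (constant factor folded into the learning rate)
            grad_V = Z.T @ error / len(X)
            grad_W = X.T @ ((error @ self.V.T) * (1 - Z ** 2)) / len(X)
            self.V -= self.learning_rate * grad_V
            self.W -= self.learning_rate * grad_W
        return self

    def transform(self, X):
        """Low-dimensional representation (the hidden layer)."""
        return np.tanh(X @ self.W)

    def predict(self, X):
        return self.transform(X) @ self.V

# Toy data: 6 features, but only 2 underlying dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

model = TinyNet(n_hidden=2).fit(X, X)   # the target is the input
codes = model.transform(X)
mse = float(np.mean((model.predict(X) - X) ** 2))
baseline = float(np.mean(X ** 2))       # error of always predicting zero
print(codes.shape, mse, baseline)
```

The only autoencoder-specific line is `fit(X, X)`; everything else is an ordinary neural network.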

All the usual neural network training strategies work with autoencoders too:

• backpropagation
• regularization
• dropout
• RBM pre-training

If you want to get good with autoencoders – I would recommend trying to take some data and an existing neural network package you’re comfortable with – and see what low-dimensional representation you can come up with. How many dimensions are there?

Autoencoders are part of a family of unsupervised deep learning methods, which I cover in-depth in my course, Unsupervised Deep Learning in Python. We discuss how to stack autoencoders to build deep belief networks, and compare them to RBMs which can be used for the same purpose. We derive all the equations and write all the code from scratch – no shortcuts. Ask me for a coupon so I can give you a discount!

P.S. “Autoencoder” means “encodes itself”. Not so cryptic now, right?

#autoencoders #deep learning #machine learning #pca #principal components analysis #unsupervised learning

# Logistic Regression in Python video course

November 11, 2015

Hi all!

Do you ever get tired of reading walls of text, and just want a nice video or 10 to explain to you the magic of logistic regression and how to program it with Python?

Look no further, that video course is here.

#big data #data science #logistic regression #neural networks #numpy #python

# Deep Learning Tutorial part 3/3: Deep Belief Networks

June 15, 2015

This is part 3/3 of a series on deep belief networks. Part 1 focused on the building blocks of deep neural nets – logistic regression and gradient descent. Part 2 focused on how to use logistic regression as a building block to create neural networks, and how to train them. Part 3 will focus on answering the question: “What is a deep belief network?” and the algorithms we use to do training and prediction.

This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python.

## What is a deep belief network / deep neural network?

In its simplest form, a deep belief network looks exactly like the artificial neural networks we learned about in part 2! As long as there is at least 1 hidden layer, the model is considered to be “deep”. (I Googled around on this topic for quite a while; it seems people just started using the term “deep learning” for any kind of neural network one day as a buzzword, regardless of the number of layers.)

It is common to use more than 1 hidden layer, and new research has been exploring different architectures than the simple “feedforward” neural network which we have been studying. Recurrent neural networks have become very popular in recent years. These networks contain “feedback” connections and contain a “memory” of past inputs. We will not talk about these in this post.

Ok, so then how is this different than part 2?

One reason deep learning has come to prominence in the past decade is increased computational power. It used to be that computers were just too slow to handle training large networks, especially in computer vision where each pixel of an image is an input. We have new libraries that take advantage of the GPU (graphics processing unit), which can do floating point math much faster than the CPU.

Note that because the architecture of the deep belief network is exactly the same, the flow of data from input to output (i.e. prediction) is exactly the same.

The only part that’s different is how the network is trained.

One problem with traditional multilayer perceptrons / artificial neural networks is that backpropagation can often lead to “local minima”. This is when your “error surface” contains multiple grooves and as you perform gradient descent, you fall into a groove, but it’s not the lowest possible groove.

Deep belief networks solve this problem by using an extra step called “pre-training”. Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there.

So what is this pre-training step and how does it work?

To understand this, we first need to learn about “Restricted Boltzmann Machines” or RBMs.

[Strictly speaking, multiple layers of RBMs would create a deep belief network – this is an unsupervised model. A supervised model with a softmax output would be called a deep neural network.]

## Restricted Boltzmann Machines

Going back to our original simple neural network, let’s draw out the RBM. I’ve circled it in green here.

The RBM contains all the x’s, all the z’s, and the W in between. That’s pretty much all there is to it. An RBM is simply two layers of a neural network and the weights between them.

In an RBM we still refer to the x’s as the “input layer” and the z’s as the “hidden layer”. If you’ve ever learned about PCA, SVD, latent semantic analysis, or Hidden Markov Models – the idea of “hidden” or “latent” variables should be familiar to you.

As a simple example, you might observe that the ground is wet. You could have multiple hidden or latent variables, one representing the fact that it’s raining, another representing the fact that your neighbor is watering her garden.

In a sense they are the hidden causes or “base” facts that generate the observations that you measure.

Since RBMs are just a “slice” of a neural network, deep neural networks can be considered to be a bunch of RBMs “stacked” together.

## Variables in a Restricted Boltzmann Machine

In this section we will look more closely at what an RBM is – what variables are contained and why that makes sense – through a probabilistic model – similar to what we did for logistic regression in part 1.

Although not shown explicitly, each layer of the RBM will have its own bias weights – W is the only weight shared between them. We will denote these bias weights as “a” for the visible units, and “b” for the hidden units.

We’re going to rename some variables to match what they are called in most tutorials and articles on the Internet. We’ll denote the “visible” vectors (i.e. inputs) by v and index each element of v by i. We’ll denote the “hidden” units by h and index each element by j.

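Using our new variables, v, h, a, b, and including w(i,j) as before – we can define the “energy” of a network as:

$$E(v, h) = -\sum_{i} a(i) v(i) - \sum_{j} b(j) h(j) - \sum_{i} \sum_{j} v(i) w(i,j) h(j)$$

In vector / matrix notation this can be written as:

$$E(v, h) = -a^T v - b^T h - v^T W h$$

We can define the probability of observing an input v with hidden vector h as:

$$P(v, h) = \frac{1}{Z} e^{-E(v, h)}$$

Where Z is a normalizing constant so that the probabilities of all events sum to 1.

We can get the marginal distribution P(v) by summing over h:

$$P(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}$$

Similar to logistic regression, we can define the conditional probabilities P(v(i) = 1 | h) and P(h(j) = 1 | v):

$$P(v(i) = 1 \mid h) = \sigma\left(a(i) + \sum_{j} w(i,j) h(j)\right)$$

$$P(h(j) = 1 \mid v) = \sigma\left(b(j) + \sum_{i} w(i,j) v(i)\right)$$

To train the network we again want to maximize some objective function. What should that be in this case?

Given that all we have are a bunch of training inputs, we simply want to maximize the joint probability of those inputs, i.e.

$$\max_{W, a, b} \prod_{v \in V} P(v)$$

Equivalently, we can maximize the log probability:

$$\max_{W, a, b} \sum_{v \in V} \log P(v)$$

Where V is of course the set of all training inputs.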

Note that we do not use any training targets – we simply want to model the input. Thus, RBM is an unsupervised learning algorithm, like the Gaussian Mixture Model, for example.
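To make these quantities less abstract, here is a minimal Numpy sketch for a tiny RBM. It assumes the standard binary RBM energy E(v, h) = -a^T v - b^T h - v^T W h, uses random weights, and keeps the network small enough that Z can be computed by brute-force enumeration (which is not feasible for real networks). All function names are made up for this example.

```python
from itertools import product

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v^T W h for binary vectors v and h."""
    return float(-(a @ v) - (b @ h) - (v @ W @ h))

def p_h_given_v(v, W, b):
    """P(h(j) = 1 | v) for every hidden unit j."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v(i) = 1 | h) for every visible unit i."""
    return sigmoid(a + W @ h)

# Tiny RBM with random parameters: 3 visible units, 2 hidden units
rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
W = rng.normal(size=(n_visible, n_hidden))
a = rng.normal(size=n_visible)
b = rng.normal(size=n_hidden)

# Enumerate every binary state to get the normalizing constant Z
states_v = [np.array(s, dtype=float) for s in product([0, 1], repeat=n_visible)]
states_h = [np.array(s, dtype=float) for s in product([0, 1], repeat=n_hidden)]
Z = sum(np.exp(-energy(v, h, W, a, b)) for v in states_v for h in states_h)

def p_joint(v, h):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(-energy(v, h, W, a, b)) / Z

def p_visible(v):
    """Marginal P(v), obtained by summing the joint over all hidden states."""
    return sum(p_joint(v, h) for h in states_h)

print(sum(p_visible(v) for v in states_v))  # the marginals sum to 1 (up to rounding)
```

Because everything is enumerated, you can check that the sigmoid formulas for the conditionals agree with the ratios computed directly from the joint distribution.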

The learning algorithm used to train RBMs is called “contrastive divergence”.

## Contrastive Divergence

Contrastive divergence is highly non-trivial compared to an algorithm like gradient descent, which involved just taking the derivative of the objective function.

If you are going to use deep belief networks on some task, you probably do not want to reinvent the wheel. There are packages out there, such as Theano, pylearn2, and Torch7 – where a lot of people who are experts at this stuff have already written and optimized the code for performance.

Learning how to use those packages will take some effort in itself – so unless you are going to do research I would recommend holding off on understanding the technical details of contrastive divergence.
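If you are curious anyway, here is a rough sketch of what a single step of contrastive divergence with one Gibbs step (CD-1) can look like in Numpy. It assumes binary units and the sigmoid conditionals of a standard RBM; the function names, the toy patterns, and the hyperparameters are all made up for illustration, and an optimized library implementation will differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V0, W, a, b, learning_rate, rng):
    """One CD-1 step on a batch V0 of binary visible vectors (one per row)."""
    # Positive phase: hidden probabilities and a binary sample given the data
    ph0 = sigmoid(b + V0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Update: data-driven statistics minus reconstruction-driven statistics
    n = len(V0)
    W = W + learning_rate * (V0.T @ ph0 - v1.T @ ph1) / n
    a = a + learning_rate * (V0 - v1).mean(axis=0)
    b = b + learning_rate * (ph0 - ph1).mean(axis=0)
    return W, a, b

def reconstruction_error(V, W, a, b):
    """Mean squared error between the inputs and their mean-field reconstruction."""
    hidden = sigmoid(b + V @ W)
    return float(np.mean((V - sigmoid(a + hidden @ W.T)) ** 2))

# Toy dataset: two binary patterns, 10 copies each
rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)

W = rng.normal(scale=0.01, size=(6, 2))
a = np.zeros(6)
b = np.zeros(2)

error_before = reconstruction_error(data, W, a, b)
for _ in range(1000):
    W, a, b = cd1_update(data, W, a, b, learning_rate=0.1, rng=rng)
error_after = reconstruction_error(data, W, a, b)
print(error_before, error_after)
```

Note that the reconstruction error is only a convenient thing to monitor here; it is not the quantity that contrastive divergence is derived from.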

You still have a lot to think about – what learning rate should you choose? How many layers should your network have? How many units per layer? What about regularization and momentum?

These are not easy questions to answer, and only through experience will you get a “feel” for it.

This and other related topics are covered in-depth in my course, Unsupervised Deep Learning in Python. We fully derive and implement the contrastive divergence algorithm, so you can see it run yourself! We’ll also demonstrate how it helps us get around the “vanishing gradient problem”.

#ann #artificial intelligence #artificial neural networks #dbn #deep learning #gradient descent #machine learning #mlp #Multilayer Perceptron #rbm #restricted Boltzmann machines

# Deep Learning Tutorial part 2/3: Artificial Neural Networks

June 15, 2015

This is part 2/3 of a series on deep learning and deep belief networks.

See part 3 here.

This section will focus on artificial neural networks (ANNs) by building upon the logistic regression model we learned about last time. It’ll be a little shorter because we already built the foundation for some very important topics in part 1 – namely the objective / error function and gradient descent.

We will focus on 2 main functions of ANNs – the forward pass (prediction) and backpropagation (learning). Your scikit-learn analogues would be model.predict() and model.fit().

As with logistic regression, we have some set of training samples, X1, …, Xn, and we will use gradient descent to learn the weights of our model. We then test our model by computing predicted outputs given some test inputs (the forward pass) and comparing them to the true outputs.

This topic is covered in-depth in my course, Data Science: Deep Learning in Python. We derive all the equations by hand, step-by-step, and we implement everything using Numpy and Python. To solidify the concepts, we apply the method to some real-world problems, including an e-commerce dataset and facial expression recognition.

## Prediction

As with logistic regression, we will start with a diagram / schematic of a neural network.

We call the column of x’s the “input layer”, the column of z’s the “hidden layer”, and the column of y’s the “output layer”.

As in part 1, we will only use one y (binary classification) for most of the tutorial. Recall that the only difference is that when you have more than one output, you use the “softmax” output function. The methods (calculating the gradients for gradient descent) remain the same.

Each of the variables can be computed as follows:

z1 = sigma( x1*w(1,1) + x2*w(2,1) )

z2 = sigma( x1*w(1,2) + x2*w(2,2) )

y = sigma( z1*v(1) + z2*v(2) )

We can combine each of the weights w(i,j) into a matrix W – this is useful for coding in languages like Python and MATLAB where matrix and vector operations are much faster than for-loops. The size of W will be N x M where N is the number of x’s and M is the number of z’s.

Similarly, v(j) can be combined into a vector V of size M.

If we had more than one output for y, V would be a matrix of size M x P, where P is the number of y’s.
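As a sketch of what the matrix form buys us, here is the forward pass in Numpy, checked against the element-by-element formulas for z1, z2, and y. The weight values are arbitrary numbers chosen for the example, and, like the formulas in this post, the sketch leaves out bias terms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W, V):
    """X: (samples, N), W: (N, M), V: (M,)  ->  Z: (samples, M), y: (samples,)."""
    Z = sigmoid(X @ W)   # all hidden units for all samples at once
    y = sigmoid(Z @ V)   # one output per sample
    return Z, y

# W[i-1, j-1] holds w(i,j) and V[j-1] holds v(j); the values are arbitrary
W = np.array([[0.5, -1.0],
              [1.5,  2.0]])
V = np.array([1.0, -2.0])
X = np.array([[1.0, 2.0]])   # a single sample (x1, x2)

Z, y = forward(X, W, V)

# The same computation written out one unit at a time
x1, x2 = X[0]
z1 = sigmoid(x1 * W[0, 0] + x2 * W[1, 0])
z2 = sigmoid(x1 * W[0, 1] + x2 * W[1, 1])
y_manual = sigmoid(z1 * V[0] + z2 * V[1])
print(Z[0], y[0], (z1, z2), y_manual)
```

The matrix version handles any number of samples, inputs, and hidden units without changing a line.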

As in part 1, “sigma” refers to the sigmoid function, but other functions may be used. The hyperbolic tangent, or “tanh”, is sometimes used – it is just a scaled and shifted version of the sigmoid, since tanh(x) = 2 sigma(2x) – 1. Both make it relatively easy to compute the derivatives for gradient descent later on.

If you look at how we compute z1, z2, and y closely – you’ll recognize that these are all just the logistic regression formula. In fact, an artificial neural network is just a combination of multiple logistic regression units put together.

This is the neural network with one logistic unit highlighted:

One way we interpret this is that z1 is some “feature” extracted from (x1,x2), weighted by (w(1,1),w(2,1)), and similarly for z2.

Then y is a logistic regression on (z1,z2) – the features learned from the input.

This all raises the question – why use neural networks in the first place if we are just going to add a bunch of parameters and make it look more complicated?

Recall that logistic regression only worked on linearly separable problems. For example, you couldn’t train a logistic regression unit to learn the XOR function because you can’t draw a line between the classes.

What you could do if you really wanted to use logistic regression, is create another input x3 = x1*x2. As an exercise, convince yourself that this works. Hint: try [w0,w1,w2,w3] = [-0.5,1,1,-2].
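If you would rather check the hint by running it, here is a short sketch. It assumes w0 is the bias term and that we predict class 1 whenever the sigmoid output is above 0.5.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Weights from the hint; w0 is assumed to be the bias term
w0, w1, w2, w3 = -0.5, 1.0, 1.0, -2.0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = []
for x1, x2 in inputs:
    x3 = x1 * x2   # the hand-made extra feature
    p = sigmoid(w0 + w1 * x1 + w2 * x2 + w3 * x3)
    predictions.append(int(p > 0.5))

print(predictions)  # [0, 1, 1, 0], which is XOR
```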

The problem with the above approach is that you had to come up with the extra feature (x3) manually. You don’t know ahead of time what will work and what won’t. Real machine learning problems can have hundreds or thousands of inputs – you can’t try every combination possible. We haven’t even considered other functions. What about sin(x)? x^2 or x^3? log(x)? There are infinitely many features we could extract.

The beauty of neural networks is that they learn these features automatically. As an exercise, try manually assigning weights to a neural network with 3 hidden units that can compute the XOR function at y.

Another way of stating what we have just learned – artificial neural networks can learn nonlinear functions.

## Learning aka. Backpropagation

Learning the weights for a neural network is very similar to logistic regression. We will follow the same method here – write out the objective function we want to minimize, calculate its derivative with respect to the parameter we want to update, and use the gradient descent algorithm to perform the weight update.

In fact, the steps remain the same:

for i = 1…number of epochs:
    error = negative log-likelihood aka. -L(Y|X,W,V)
    w = w - learning rate * error gradient wrt w
    v = v - learning rate * error gradient wrt v


The only difference now is that the likelihood depends on W (which was 1-D for logistic regression and is now 2-D) and V – since Y depends on W and V.

Even the objective function J remains the same as with logistic regression – it only depends on the output y and the target t – and will be the squared error or cross-entropy depending on the problem.

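Calculating the gradient for any v(j) is simple because y depends directly on V and by the chain rule:

$$\frac{\partial J}{\partial v(j)} = \sum_{r=1}^{R} \frac{\partial J}{\partial y(r)} \frac{\partial y(r)}{\partial v(j)} = \sum_{r=1}^{R} \left( y(r) - t(r) \right) z(j, r)$$

Here we’ve assumed we’re using the cross-entropy error with a sigmoid output, R is the total number of training samples and we index them by r (running out of letters!), and z(j, r) is the value of hidden unit j for sample r.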

The gradient for W is a little more complicated because it involves calculating the “total derivative”. If you have more than one output y(k), k=1…P – then the objective function will depend on all the y’s. At the same time, each y(k) will depend on the same w(i,j).

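In general, if you have a function f(x,y) where x(t) is a function of t and y(t) is a function of t, you can write the “total derivative” of f(x,y) as:

$$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$$

For a vector x with N components, the above can be generalized to:

$$\frac{df}{dt} = \sum_{i=1}^{N} \frac{\partial f}{\partial x(i)} \frac{dx(i)}{dt}$$

If we replace f() with the objective function J(), t with the weight w(i,j), and each component x(i) with the outputs of the neural network y(k), k = 1…P, we get the following:

$$\frac{\partial J}{\partial w(i,j)} = \sum_{k=1}^{P} \frac{\partial J}{\partial y(k)} \frac{\partial y(k)}{\partial w(i,j)}$$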

Note that we can expand the right-most derivative so that we take the derivative of y(k) with respect to z(j), multiplied by the derivative of z(j) with respect to w(i,j). The latter term does not depend on k, so it can be removed from the summation.
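Putting the update loop and both gradients together, here is a minimal Numpy sketch of backpropagation for one hidden layer. It assumes sigmoid units everywhere, the cross-entropy error summed over samples and outputs, and no bias terms (a constant input column stands in for the bias); the function names and the XOR toy data are choices made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W, V):
    Z = sigmoid(X @ W)   # hidden layer, shape (R, M)
    Y = sigmoid(Z @ V)   # outputs, shape (R, P)
    return Z, Y

def cross_entropy(T, Y):
    """Negative log-likelihood of the targets T given the outputs Y."""
    return float(-np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y)))

def gradients(X, T, W, V):
    Z, Y = forward(X, W, V)
    delta = Y - T                  # derivative of J wrt the output pre-activations
    grad_V = Z.T @ delta           # sums over the training samples r
    # Sum over the outputs k first, then multiply by the sigmoid derivative
    # of z(j), which does not depend on k
    grad_W = X.T @ ((delta @ V.T) * Z * (1 - Z))
    return grad_W, grad_V

def train(X, T, n_hidden, learning_rate=0.1, n_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    V = rng.normal(size=(n_hidden, T.shape[1]))
    losses = []
    for _ in range(n_epochs):
        grad_W, grad_V = gradients(X, T, W, V)
        W = W - learning_rate * grad_W
        V = V - learning_rate * grad_V
        losses.append(cross_entropy(T, forward(X, W, V)[1]))
    return W, V, losses

# XOR inputs with a constant third column acting as a bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W, V, losses = train(X, T, n_hidden=4)
print(losses[0], losses[-1])
```

A useful habit when writing gradients by hand is to compare them against a finite-difference estimate of the error before trusting the training loop.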

Although this may seem now like a straightforward application of vector calculus – don’t be fooled – it took researchers many years to figure out how to solve this problem. Read more on Wikipedia.

## Multi-layer neural networks

So far we have looked at neural networks with only one hidden layer, but neural networks can have any number of hidden layers, with any number of dimensions per layer. (You will need to apply the total derivative rule recursively for each layer going backward).

You may want to do your own research as to what type of architectures will work best for your problem.

Neural networks almost give us too many choices – how many layers should I have? 1? 3? 100? How many units per layer? 500? 10000? 10001?

Of course, adding layers and units will only increase the time it takes to train your neural network. Every layer you add will result in an increase of N1 x N2 parameters to your model – where N1 is the number of inputs into the layer and N2 is the number of units in the layer that receives the inputs.

Thus neural networks can be very prone to overfitting. Suppose we are training a network with one hidden layer, where the input is a 32 x 32 image, the hidden layer has 500 units (i.e. 500 features extracted), and the output is 10 (because the images are handwritten digits from 0 to 9).

That’s 32 x 32 x 500 parameters for W, and 500 x 10 parameters for V. That’s 517 000 parameters!

One “rule of thumb” I’ve seen is that you want the number of training samples to be at least 10x the number of parameters. So for the example above, you’d want at least approximately 5.2 million samples to train from.
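The arithmetic is easy to reproduce for your own architecture. This small sketch counts only the weights in W and V (no bias terms, as in the example above) and then applies the 10x rule of thumb.

```python
n_inputs = 32 * 32   # one input per pixel
n_hidden = 500
n_outputs = 10

params_W = n_inputs * n_hidden    # input -> hidden weights
params_V = n_hidden * n_outputs   # hidden -> output weights
total = params_W + params_V

print(params_W, params_V, total)  # 512000 5000 517000
print(10 * total)                 # 5170000, i.e. roughly 5.2 million samples
```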

So you don’t want to needlessly add more layers and more units to your neural network just to make it more expressive.

One well-known result from neural network literature is that neural networks with as few as one hidden layer are “universal approximators” (i.e. they can approximate any function). Source: http://www.sciencedirect.com/science/article/pii/0893608089900208

This topic is covered in-depth in my course, Data Science: Deep Learning in Python. We derive all the equations by hand, step-by-step, and we implement everything using Numpy and Python. To solidify the concepts, we apply the method to some real-world problems, including an e-commerce dataset and facial expression recognition.


# Deep Learning Tutorial part 1/3: Logistic Regression

April 22, 2015

This is part 1/3 of a series on deep learning and deep belief networks. I’ve wanted to do this for a long time because learning about neural networks introduces a lot of topics and algorithms that are useful in machine learning in general.

Unfortunately, while the material I’ve read focusing on logistic regression and the multilayer perceptron (building blocks of the deep belief network) is great and accessible to a wide audience, I’ve found that most of the material I’ve encountered about deep learning is highly technical and hard to follow.

So, I’ve decided to create this series in order to teach the most practical aspects of deep learning and neural networks – enough so that you can implement one yourself, but not so much that you’ll get bogged down by all the theory.

Part 1 will focus on logistic regression. Part 2 will focus on the multilayer perceptron (a.k.a. artificial neural network) and backpropagation. Part 3 will focus on restricted Boltzmann machines and deep networks. Each is designed to be a stepping stone to the next.

The topic of this post (logistic regression) is covered in-depth in my online course, Deep Learning Prerequisites: Logistic Regression in Python. We derive all the equations step-by-step, and fully implement all the code in Python and Numpy. To solidify the concepts, we apply the method to real world datasets, including an e-commerce dataset and facial expression recognition.

Let us begin.

Logistic Regression doesn’t do Regression

Despite its name, Logistic Regression is actually a classification algorithm.

This means the output gives us a label, not a real number.

HOWEVER: the methods you read about in this series can be applied to both regression and classification. Just the equations for the outputs and the error function differ. I will note these differences where appropriate, but the tutorials will focus on classification.

Diagram of how Logistic Regression works

I’ve included a few pictures here so you get used to looking at how we visualize a neural network.

Here’s one where X (input) is 3-dimensional and Y (output) is 2-dimensional.

Here’s one where the weights use the symbol theta and the summation operation and sigmoid function are shown explicitly.

Here’s one where the weights use the variable “w” and the bias is explicitly shown as “b”. Here the sigmoid function uses the Greek letter “phi”, but more often you see the letter “sigma”.

A Little Math

So what do these diagrams mean about how we calculate the output from a set of inputs?

Notice first that we can have more than one output Y.

For K classes/labels, as in the digit recognition problem, we would have K outputs, and Y(k) = 1 if the label is the kth digit, otherwise it is 0.
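This encoding is often called “one-hot” or indicator encoding. Here is a minimal sketch of it in Numpy; the function name `one_hot` and the sample labels are my own choices for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an N x K indicator matrix: row n has a 1 in column labels[n]."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, num_classes))
    Y[np.arange(labels.size), labels] = 1
    return Y

# Three made-up digit labels, K = 10 classes
Y = one_hot([3, 0, 9], num_classes=10)
print(Y.shape)  # (3, 10)
print(Y[0])     # 1 in position 3, 0 everywhere else
```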

The only exception is the 2-class case. In this situation, we only need 1 output because Y = 1 is the first class and Y = 0 is the second class.

We’ll focus on this scenario first.

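The equation in its compact form is this:

$$ y = \sigma(w^T x) $$

The inside part is the dot product of the weights and the input:

$$ w^T x = \sum_{i=0}^{D} w_i x_i = w_0 x_0 + w_1 x_1 + \dots + w_D x_D $$

As in linear regression, we assume there is an x0 and that it is 1, so that w0 acts as the bias term.

The “sigma” part is the sigmoid function:

$$ \sigma(a) = \frac{1}{1 + e^{-a}} $$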

If we graph the sigmoid, it looks like this:

There are 2 things we can tell from the above equation:

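1) For logistic regression to work well, the classes must be (at least approximately) linearly separable. This is because setting the dot product between “w” and “x” to zero defines a line/plane, and that line/plane is the boundary between the two classes.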

(i.e. ax + by + c = 0)

w0 + w1x1 + w2x2 + … = 0 is the plane (more correctly, hyperplane) here.

So here is a situation where logistic regression would work well:

Here is a situation where it wouldn’t work well.

But we will cover that more in parts 2 and 3.

2) The sigmoid means the output Y is between 0 and 1.

So if w*x = 0, we land right on the hyperplane, and Y = 0.5.

If w*x > 0, we get Y > 0.5, and vice versa for w*x < 0.

As w*x approaches infinity, Y approaches 1, and as w*x approaches negative infinity, Y approaches 0.
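To make these properties concrete, here is a small sketch of the forward calculation in Numpy. The weights and inputs are made up, and `predict_proba` is just a name I picked; the only thing taken from the text is the recipe itself: prepend x0 = 1, take the dot product, apply the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w):
    """X is N x D (no bias column); w has D + 1 entries, with w[0] the bias."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])  # prepend x0 = 1 to every sample
    return sigmoid(Xb.dot(w))

w = np.array([-1.0, 2.0, 0.5])   # made-up weights: w0, w1, w2
X = np.array([[0.5, 0.0],        # w*x = 0     -> right on the hyperplane
              [3.0, 1.0],        # w*x = 5.5   -> positive side
              [-2.0, 0.0]])      # w*x = -5    -> negative side

p = predict_proba(X, w)
print(p)  # first entry is 0.5, second is above 0.5, third is below 0.5
```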

Probabilistic Interpretation

Because Y is between 0 and 1, we can interpret it as a probability.

This makes more sense if you consider the following:

If we fall right on the barrier/plane between the two classes – our probability of being in either class is 0.5.

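If we are further away from that barrier, the probability of being in the class on our side of the barrier increases toward 1 (and the probability of being in the other class decreases toward 0).

We usually interpret the output Y as P(Y=1|X), the probability of the first class given the input.

Note that while we use some probabilistic concepts here, the way in which we use them is different than for, say, a Bayesian classifier.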

Also note that P(Y=0|X) = 1 – P(Y=1|X).

Maximizing the Likelihood

We have seen squared error used as an error function before, as with linear regression.

In fact, if we were doing regression, we could use the same thing here.

For classification, we take a different approach.

You may have seen this error function before:

$$ J = -\sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log(1 - y_n) \right] $$

t is the target and y is the output of the network/model.

This is called the cross-entropy error.
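Here is a minimal sketch of the binary cross-entropy in Numpy, assuming the usual form: the negative sum over samples of t log(y) + (1 - t) log(1 - y). The clipping constant `eps` is my own addition to avoid taking log(0); the targets and outputs are made up.

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """Binary cross-entropy summed over all samples."""
    t = np.asarray(t, dtype=float)
    y = np.clip(np.asarray(y, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = [1, 0, 1]                    # made-up targets
confident_right = [0.9, 0.1, 0.8]
confident_wrong = [0.1, 0.9, 0.2]

print(cross_entropy(t, confident_right))  # small error
print(cross_entropy(t, confident_wrong))  # much larger error
```

Notice that outputs close to the targets give a small error, while confident mistakes are punished heavily.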

Where does this come from?

Let us go back to first principles.

Instead of minimizing error, we maximize likelihood. This seems like a logical place to start: choosing the model parameters under which the labels we actually observed are most probable.

Consider N IID (independent and identically distributed) training samples and corresponding labels (we’ll call them “t” here).

The likelihood of the model given the entire dataset can be represented by this equation. We can use the product rule because each sample is independent.

$$ P(t_1, \dots, t_N \mid x_1, \dots, x_N, w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} $$

(Sidenote 1: This is the same thing we do when we want to, say, find the maximum likelihood estimate for the mean. We calculate the joint probability, a.k.a. the likelihood P(data|mean), and find the “argmax” mean that gives us the highest likelihood, hence the term “maximum likelihood”.)

(Sidenote 2: This is the same likelihood you see when we do Bayesian inference – posterior ~ likelihood x prior or P(param | data) ~ P(data | param) P(param))

(Sidenote 3: If you wanted to do regression, you would simply not have a sigmoid at the end, and you would use the squared error. The exponential of the negative squared error is (up to a constant) a Gaussian, because in regression we often assume the error is Gaussian distributed. By making these 2 changes, we would just be doing linear regression.)

Recall y = P(y=1|x).

The target t can be 1 or 0.

When t is 1, only the left part of the product matters (the right side evaluates to 1). All the y’s here are the probability that the output of the network is 1. Given that the target is 1, we want to maximize this probability.

When t is 0, only the right part of the product matters. Recall that 1-y is the probability that the output of the network is 0. So when t = 0, we want to maximize this probability.

Since each sample is independent, we can get the joint probability by multiplying all the individual probabilities together.

2 key points:

1) There is no analytic solution, so we must use iterative methods. In this tutorial we will cover gradient descent, but there are others (such as conjugate gradient and Newton’s method).

The added advantage of learning gradient descent now is that it is also used to train neural networks.

2) As is usual with these ML problems, we will work with the log likelihood instead of the likelihood. Just try taking the derivative of both, and you will see why.

If you take the log of the above expression, notice you’ll get the same error function we started with, except for the sign!

We take the negative because we want something to minimize. We call this the “error” or “cost” function.

Maximizing the likelihood is equivalent to minimizing the negative likelihood.

It is also equivalent to minimizing the negative log-likelihood. This is because log() is a monotonically increasing function.
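If you want to convince yourself numerically, here is a small sketch with made-up targets and model outputs. It assumes the likelihood is the product over samples of y^t (1-y)^(1-t), as described above; the function names are mine.

```python
import numpy as np

def likelihood(t, y):
    """Product over samples of y^t * (1 - y)^(1 - t)."""
    return np.prod(y**t * (1 - y)**(1 - t))

def negative_log_likelihood(t, y):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1, 0, 1, 1])                 # made-up targets
y_better = np.array([0.8, 0.3, 0.6, 0.9])  # outputs of a decent model
y_worse = np.array([0.6, 0.5, 0.5, 0.6])   # outputs of a weaker model

# Taking -log of the likelihood gives the negative log-likelihood
print(-np.log(likelihood(t, y_better)), negative_log_likelihood(t, y_better))

# The model with the higher likelihood has the lower negative log-likelihood
print(likelihood(t, y_better) > likelihood(t, y_worse))
print(negative_log_likelihood(t, y_better) < negative_log_likelihood(t, y_worse))
```

In practice we compute the sum of logs directly, because the product of many probabilities quickly becomes too small to represent accurately.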

How do we actually minimize the negative log-likelihood if we can’t simply set the derivative = 0 and solve for the weights?

This is where gradient descent enters the picture.

Note that gradient descent is just a numerical method – it can be applied whenever you want to solve for the minima of a function, not just for machine learning. If you have never studied numerical methods, it is analogous to Newton’s method for solving for the zeros of a polynomial.

Here is a picture of what we’re trying to do:

We start at some random weight, w = random().

Then we update the weight by going in the direction opposite to the derivative (slope) of the error function, which we have previously stated is the negative log-likelihood. Moving against the slope takes us downhill, toward the minimum.

With squared error it is easy to see that the error function is quadratic, and so we are descending down a parabola in that case. The minimum is global.

With log-likelihood the extremum is also global, because the negative log-likelihood is convex in the weights.

It may help to consider the function E(y,t) = t log(y) + (1-t) log(1-y) to see why.
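For those who want to see the derivative worked out, here is a sketch using the chain rule. I am assuming the error J is the negative log-likelihood summed over N samples, with y_n the sigmoid of a_n = w^T x_n; the names J and a_n are my own notation.

```latex
J = -\sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log(1 - y_n) \right],
\qquad y_n = \sigma(a_n), \qquad a_n = w^T x_n

\frac{\partial J}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1 - t_n}{1 - y_n}
                                = \frac{y_n - t_n}{y_n (1 - y_n)}

\frac{\partial y_n}{\partial a_n} = y_n (1 - y_n),
\qquad \frac{\partial a_n}{\partial w_j} = x_{nj}

\frac{\partial J}{\partial w_j}
  = \sum_{n=1}^{N} \frac{\partial J}{\partial y_n}
                   \frac{\partial y_n}{\partial a_n}
                   \frac{\partial a_n}{\partial w_j}
  = \sum_{n=1}^{N} (y_n - t_n) \, x_{nj}
```

The y(1-y) terms cancel, which is one reason the cross-entropy pairs so nicely with the sigmoid.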

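The equation for updating the weights is:

$$ w_j^{(t+1)} = w_j^{(t)} - \eta \frac{\partial J}{\partial w_j} $$

where J is the error function (the negative log-likelihood). For the cross-entropy error, the derivative works out to $\sum_{n} (y_n - t_n) x_{nj}$.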

Here j indexes the dimension, so j = 1…D.

t indexes the iteration number (not to be confused with the other t, which was the target).

“Eta” is called the “learning rate”. This parameter determines how far along the error surface we travel on each iteration. Bigger values mean we go further, which means our weights might converge to the final solution faster, but it also means we may “overshoot” that solution.
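To get a feel for the learning rate, here is a toy sketch that runs gradient descent on E(w) = w^2 (a stand-in for a real error surface, chosen only because its derivative 2w is easy to write down). The learning rates are made up for illustration.

```python
def gradient_descent_1d(eta, w0=1.0, steps=20):
    """Minimize E(w) = w**2 starting from w0, using a fixed learning rate."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - eta * grad   # step against the slope
    return w

for eta in [0.1, 0.45, 1.1]:
    print(eta, gradient_descent_1d(eta))
# eta = 0.1:  slowly approaches the minimum at 0
# eta = 0.45: approaches the minimum much faster
# eta = 1.1:  overshoots on every step and moves further away from 0
```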

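Since w is a vector, we can usually speed up our code by doing vector operations (e.g. in MATLAB or Python). In this case, we can use this equation:

$$ w \leftarrow w - \eta X^T (Y - T) $$

where X is the N x D matrix of inputs, Y is the vector of N outputs, and T is the vector of N targets.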

The full training algorithm is:

for i = 1…number of epochs:
    error = negative log-likelihood ( -L(Y|X,w) )
    w = w - learning rate * error gradient
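Here is one way the loop above might look in Numpy. Treat it as a sketch under a few assumptions of mine: the data is synthetic (two Gaussian clouds), the gradient is the vectorized form X^T(Y - T), and the function names, learning rate and number of epochs are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def add_bias(X):
    """Prepend a column of ones so that w[0] acts as the bias."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_logistic(X, T, learning_rate=0.001, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    Xb = add_bias(X)
    w = rng.standard_normal(Xb.shape[1]) * 0.01  # small random start
    errors = []
    for _ in range(epochs):
        Y = sigmoid(Xb.dot(w))
        Yc = np.clip(Y, 1e-12, 1 - 1e-12)        # avoid log(0)
        # negative log-likelihood (cross-entropy) for this epoch
        errors.append(-np.sum(T * np.log(Yc) + (1 - T) * np.log(1 - Yc)))
        # step against the gradient of the error
        w = w - learning_rate * Xb.T.dot(Y - T)
    return w, errors

# Synthetic 2-class data: one cloud centred at (2, 2), one at (-2, -2)
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 2)) + 2,
               rng.standard_normal((50, 2)) - 2])
T = np.concatenate([np.ones(50), np.zeros(50)])

w, errors = fit_logistic(X, T)
accuracy = np.mean((sigmoid(add_bias(X).dot(w)) > 0.5) == (T == 1))
print(errors[0], errors[-1], accuracy)
```

Plotting `errors` against the epoch number is a quick way to check that the learning rate is sensible.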

The number of epochs is yet another parameter. There are many ways to determine when to stop the gradient descent process – but using the same number of epochs all the time is the simplest way.

Some other methods you may want to look into:

• Stopping when the gradient is small enough
• Stopping when the training error is no longer decreasing, or is approaching 0
• Stopping when the error on a held-out validation set starts to increase (overfitting)
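As an example of the first criterion, here is a sketch that stops once the gradient is small enough. The tiny 1-D dataset, the tolerance and the function names are all made up for illustration; the classes overlap on purpose, so that the weights settle at a finite value.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient(w, Xb, T):
    """Gradient of the negative log-likelihood with respect to w."""
    return Xb.T.dot(sigmoid(Xb.dot(w)) - T)

def fit_until_small_gradient(Xb, T, learning_rate=0.1, tol=1e-3,
                             max_epochs=10000):
    w = np.zeros(Xb.shape[1])
    n_epochs = 0
    while n_epochs < max_epochs:
        grad = gradient(w, Xb, T)
        if np.linalg.norm(grad) < tol:  # gradient is small enough: stop
            break
        w = w - learning_rate * grad
        n_epochs += 1
    return w, n_epochs

# Made-up 1-D data where the two classes overlap around x = 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
T = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
Xb = np.column_stack([np.ones_like(x), x])  # first column is x0 = 1

w, n_epochs = fit_until_small_gradient(Xb, T)
print(w, n_epochs)
```

The `max_epochs` cap is there as a safety net in case the tolerance is never reached.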

Sometimes we call things like learning rate and epochs “hyperparameters”. These are parameters that are not part of the model itself, but can still be optimized, perhaps via cross-validation.

Biological Inspiration

In computational neuroscience, a logistic regression unit is sometimes referred to as a “neuron”. How are the two related?

Here is a diagram of a typical neuron.

Some notable components:

• Dendrites: These are the “inputs” into the neuron – they take electrical signals from other neurons’ axons.
• Cell body / Nucleus: This part of the neuron “sums up” all the inputs and propagates this summed signal to the axon.
• Axon: This is the “output” of the neuron. It sends the signal from this neuron to other neurons’ dendrites.

So dendrites are our logistic unit’s X, and axons are the Y.

The brain is essentially a network of neurons, or rather, a neural network. An artificial neural network, which is the topic discussed in Part 2 of this tutorial, is a network of connected logistic regression units.

Another notable feature of neurons is the behavior of the “action potential”.

Observe a typical amplitude/potential (voltage) vs. time signal:

Notice how the potential rises gradually and then spikes. We call this the “all-or-nothing” principle. If the sum of the inputs to the neuron is high enough, a spike is generated. Otherwise, the voltage stays relatively low.

This is reflected in the logistic unit’s binary output. The output of a sigmoid is interpreted as P(Y=1|X) – the probability of being “on”, or in other words, the probability that a spike is generated.

Inhibitory vs. Excitatory neurons:

It is well-known that the signal a neuron sends can either “excite” or “inhibit” the receiving neuron. These are reflected in the logistic model by the weights. A positive weight is excitatory. A negative weight is inhibitory.

Researchers have tried to create models with “spiking” neurons; however, it has been difficult to get them to actually learn anything.