# New course – Natural Language Processing: Deep Learning in Python part 6

August 9, 2016

[Scroll to the bottom for the early bird discount if you already know what this course is about]

In this course we are going to look at advanced NLP using deep learning.

Previously, you learned about some of the basics, like how many NLP problems are just regular machine learning and data science problems in disguise, and simple, practical methods like bag-of-words and term-document matrices.

These allowed us to do some pretty cool things, like detect spam emails, write poetry, spin articles, and group together similar words.

In this course I’m going to show you how to do even more awesome things. We’ll learn not just 1, but 4 new architectures in this course.

First up is word2vec.

In this course, I’m going to show you exactly how word2vec works, from theory to implementation, and you’ll see that it’s merely the application of skills you already know.

Word2vec is interesting because it magically maps words to a vector space where you can find analogies, like:

• king – man = queen – woman
• France – Paris = England – London
• December – Novemeber = July – June

We are also going to look at the GLoVe method, which also finds word vectors, but uses a technique called matrix factorization, which is a popular algorithm for recommender systems.

Amazingly, the word vectors produced by GLoVe are just as good as the ones produced by word2vec, and it’s way easier to train.

We will also look at some classical NLP problems, like parts-of-speech tagging and named entity recognition, and use recurrent neural networks to solve them. You’ll see that just about any problem can be solved using neural networks, but you’ll also learn the dangers of having too much complexity.

Lastly, you’ll learn about recursive neural networks, which finally help us solve the problem of negation in sentiment analysis. Recursive neural networks exploit the fact that sentences have a tree structure, and we can finally get away from naively using bag-of-words.

See you in class!

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=EARLYBIRDSITE

UPDATE: New coupon if the above is sold out:

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=SLOWBIRD_SITE

#deep learning #GLoVe #natural language processing #nlp #python #recursive neural networks #tensorflow #theano #word2vec

# New Course: Unsupervised Machine Learning – Hidden Markov Models in Python

June 13, 2016

Hidden Markov Models are all about learning sequences.

A lot of the data that would be very useful for us to model is in sequences. Stock prices are sequences of prices. Language is a sequence of words. Credit scoring involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you’re going to default. In short, sequences are everywhere, and being able to analyze them is an important skill in your data science toolbox.

The easiest way to appreciate the kind of information you get from a sequence is to consider what you are reading right now. If I had written the previous sentence backwards, it wouldn’t make much sense to you, even though it contained all the same words. So order is important.

While the current fad in deep learning is to use recurrent neural networks to model sequences, I want to first introduce you guys to a machine learning algorithm that has been around for several decades now – the Hidden Markov Model.

This course follows directly from my first course in Unsupervised Machine Learning for Cluster Analysis, where you learned how to measure the probability distribution of a random variable. In this course, you’ll learn to measure the probability distribution of a sequence of random variables.

You guys know how much I love deep learning, so there is a little twist in this course. We’ve already covered gradient descent and you know how central it is for solving deep learning problems. I claimed that gradient descent could be used to optimize any objective function. In this course I will show you how you can use gradient descent to solve for the optimal parameters of an HMM, as an alternative to the popular expectation-maximization algorithm.

We’re going to do it in Theano, which is a popular library for deep learning. This is also going to teach you how to work with sequences in Theano, which will be very useful when we cover recurrent neural networks and LSTMs.

This course is also going to go through the many practical applications of Markov models and hidden Markov models. We’re going to look at a model of sickness and health, and calculate how to predict how long you’ll stay sick, if you get sick. We’re going to talk about how Markov models can be used to analyze how people interact with your website, and fix problem areas like high bounce rate, which could be affecting your SEO. We’ll build language models that can be used to identify a writer and even generate text – imagine a machine doing your writing for you.

We’ll look at what is possibly the most recent and prolific application of Markov models – Google’s PageRank algorithm. And finally we’ll discuss even more practical applications of Markov models, including generating images, smartphone autosuggestions, and using HMMs to answer one of the most fundamental questions in biology – how is DNA, the code of life, translated into physical or behavioral attributes of an organism?

#data science #deep learning #hidden markov models #machine learning #recurrent neural networks #theano

# New Deep Learning course on Udemy

February 26, 2016

Shut up and gimme the link!: https://deeplearningcourses.com/c/data-science-deep-learning-in-theano-tensorflow/

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

# Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

• Are interested in deep learning (this tutorial uses gradient descent)
• Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
• Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
• Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z = T.dot(X, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)


I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2$$

Where $$\hat{X}$$ is the reconstruction of $$X$$. If there is only one eigenvector, let’s call this $$v$$, then this becomes:

$$J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2$$

This is equivalent to the Frobenius norm, so we can write:

$$J = |X – Xvv^T|_F$$

One identity of the Frobenius norm is:

$$|A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) }$$

Which means we can rewrite the loss function as:

$$J = Tr( (X – Xvv^T)^T(X – Xvv^T) )$$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on $$v$$, you can use matrix algebra to rearrange this to get:

$$v^* = argmin\{-Tr(X^TXvv^T) \}$$

Which again using reordering would be equivalent to maximizing:

$$v^* = argmax\{ v^TX^TXv \}$$

The corresponding eigenvalue would then be:

$$\lambda = v^TX^TXv$$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$v \leftarrow v + \eta \nabla_v(v^TX^TXv)$$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i )$$

Next, note that to implement this algorithm you never need to actually calculate the covariance $$X^T X$$. If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get $$Xv$$, which is only of size $$N \times 1$$. You can then “dot” this with itself to get a scalar, which is only an $$O(N)$$ operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")
Xv = T.dot(X, v)
cost = T.dot(Xv.T, Xv) - np.sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i))



Note that we re-normalize the eigenvector on each step, so that $$v^T v = 1$$.

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)

train = theano.function(
inputs=[X],
)


Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
... initialize variables and expressions ...
... initialize theano train function ...
while t < max_iterations and change in v < tol:
outputs = train(data)

... return eigenvalues and eigenvectors ...


This is not really trivial but at the same time it’s a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course “Unsupervised Deep Learning in Python” and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano