January 27, 2017

I would like to announce my latest course – **Artificial Intelligence: Reinforcement Learning in Python**.

This has been one of my most requested topics since I started covering **deep learning**. This course has been brewing in the background for months.

The result: This is my most MASSIVE course yet.

Usually, my courses will introduce you to a handful of new algorithms (which is a lot for people to handle already). This course covers SEVENTEEN (17!) new algorithms.

This will keep you busy for a LONG time.

If you’re used to supervised and unsupervised machine learning, realize this: Reinforcement Learning is a whole new ball game.

There are so many new concepts to learn, and so much depth. It’s COMPLETELY different from anything you’ve seen before.

That’s why we build everything slowly, from the ground up.

There’s tons of new theory, but as you’ve come to expect, anytime we introduce new theory it is accompanied by full code examples.

What is Reinforcement Learning? It’s the technology behind self-driving cars, AlphaGo, video game-playing programs, and more.

You’ll learn that while deep learning has been very useful for tasks like driving and playing Go, it’s in fact just a small part of the picture.

Reinforcement Learning provides the *framework* that allows deep learning to be useful.

Without reinforcement learning, all we have is a basic (albeit very accurate) labeling machine.

*With* Reinforcement Learning, you have intelligence.

Reinforcement Learning has even been used to model processes in psychology and neuroscience. It’s truly the closest thing we have to “machine intelligence” and “general AI”.

What are you waiting for? Sign up now!!

COUPON:

https://www.udemy.com/artificial-intelligence-reinforcement-learning-in-python/?couponCode=EARLYBIRDSITE

#artificial intelligence #deep learning #reinforcement learning
Go to comments
December 25, 2016

[Skip to the bottom if you just want the coupon]

This course is all about **ensemble methods**.

We’ve already learned some classic machine learning models like **k-nearest neighbor** and **decision tree**. We’ve studied their limitations and drawbacks.

But what if we could combine these models to eliminate those limitations and produce a much more powerful classifier or regressor?

In this course you’ll study ways to combine models like decision trees and logistic regression to build models that can reach much higher accuracies than the base models they are made of.

In particular, we will study the **Random Forest** and **AdaBoost** algorithms in detail.

To motivate our discussion, we will learn about an important topic in statistical learning, the **bias-variance trade-off**. We will then study the **bootstrap** technique and **bagging** as methods for reducing both bias and variance simultaneously.

We’ll do plenty of **experiments** and use these algorithms on **real datasets** so you can see first-hand how powerful they are.

Since **deep learning** is so popular these days, we will study some interesting commonalities between random forests, AdaBoost, and deep learning neural networks.

https://www.udemy.com/machine-learning-in-python-random-forest-adaboost/?couponCode=EARLYBIRDSITE2

Go to comments
November 17, 2016

[If you already know you want to sign up for my **Bayesian machine learning** course, just scroll to the bottom to get your $10 coupon!]

Boy, do I have some exciting news today!

You guys have already been keeping up with my **deep learning** series.

Hopefully, you’ve noticed that I’ve been releasing non-deep learning machine learning courses as well, in parallel (and they often tie into the deep learning series quite nicely).

Well today, I am announcing the start of a BRAND NEW series on Bayesian machine learning.

Bayesian methods require an entirely new way of thinking – a paradigm shift.

But don’t worry, it’s not just all theory.

In fact, the first course I’m releasing in the series is VERY practical – it’s on** A/B testing**.

Every online advertiser, e-commerce store, marketing team, etc etc etc. does A/B testing.

But did you know that traditional A/B testing is both horribly confusing and inefficient?

Did you know that there are cool, new adaptive methods inspired by reinforcement learning that improve on those old crusty tests?

(Those old methods, and the way they are traditionally taught, are probably the reason you cringe when you hear the word “statistics”)

Well, Bayesian methods not only represent a state-of-the-art solution to many A/B testing challenges, they are also surprisingly theoretically simpler!

You’ll end the course by doing your own simulation – comparing and contrasting the various adaptive A/B testing algorithms (including the final Bayesian method).

This is VERY practical stuff and any digital media, newsfeed, or advertising startup will be EXTREMELY IMPRESSED if you know this stuff.

This WILL advance your career, and any company would be lucky to have someone that knows this stuff on their team.

Awesome coincidence #1: As I mentioned above, a lot of these techniques cross-over with reinforcement learning, so if you are itching for a preview of my upcoming deep reinforcement learning course, this will be very interesting for you.

Awesome coincidence #2: Bayesian learning also crosses over with deep learning, one example being the variational autoencoder, which I may incorporate into a more advanced deep learning course in the future. They heavily rely on concepts from both Bayesian learning AND deep learning, and are very powerful state-of-the-art algorithms.

~~Due to all the black Friday madness going on, I am going to do a ONE-TIME ONLY $10 special for this course. With my coupons, the price will remain at $10, even if Udemy’s site-wide sale price goes up (which it will).~~

See you in class!

As promised, here is the coupon:

~~https://www.udemy.com/bayesian-machine-learning-in-python-ab-testing/?couponCode=EARLYBIRDSITE2~~

UPDATE: The Black Friday sale is over, but the early bird coupon is still up for grabs:

https://www.udemy.com/bayesian-machine-learning-in-python-ab-testing/?couponCode=EARLYBIRDSITE

LAST THING: Udemy is currently having an awesome Black Friday sale. $10 for ANY course starting Nov 15, but the price goes up by $1 every 2 days, so you need to ACT FAST.

I was going to tell you earlier but I was hard at work on my course. =)

Just click this link to get ANY course on Udemy for $10 (+$1 every 2 days): http://bit.ly/2fY3y5M

#bayesian #data science #machine learning #statistics
Go to comments
September 16, 2016

If you don’t want to read about the course and just want the 88% OFF coupon code, skip to the bottom.

In recent years, we’ve seen a resurgence in AI, or artificial intelligence, and machine learning.

Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on-par with human experts.

Google’s AlphaGo program was able to beat a world champion in the strategy game go using deep reinforcement learning.

Machine learning is even being used to program self driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.

Google famously announced that they are now “machine learning first”, meaning that machine learning is going to get a lot more attention now, and this is what’s going to drive innovation in the coming years.

Machine learning is used in many industries, like finance, online advertising, medicine, and robotics.

It is a widely applicable tool that will benefit you no matter what industry you’re in, and it will also open up a ton of career opportunities once you get good.

Machine learning also raises some philosophical questions. Are we building a machine that can think? What does it mean to be conscious? Will computers one day take over the world?

The best part about this course is that it requires WAY less math than my usual courses; just some basic probability and geometry, no calculus!

In this course, we are first going to discuss the **K-Nearest Neighbor** algorithm. It’s extremely simple and intuitive, and it’s a great first classification algorithm to learn. After we discuss the concepts and implement it in code, we’ll look at some ways in which KNN can fail.

It’s important to know both the advantages and disadvantages of each algorithm we look at.

Next we’ll look at the **Naive Bayes** Classifier and the General Bayes Classifier. This is a very interesting algorithm to look at because it is grounded in probability.

We’ll see how we can transform the Bayes Classifier into a linear and quadratic classifier to speed up our calculations.

Next we’ll look at the famous **Decision Tree** algorithm. This is the most complex of the algorithms we’ll study, and most courses you’ll look at won’t implement them. We will, since I believe implementation is good practice.

The last algorithm we’ll look at is the **Perceptron** algorithm. Perceptrons are the ancestor of neural networks and deep learning, so they are important to study in the context of machine learning.

One we’ve studied these algorithms, we’ll move to more practical machine learning topics. Hyperparameters, cross-validation, feature extraction, feature selection, and multiclass classification.

We’ll do a comparison with **deep learning** so you understand the pros and cons of each approach.

We’ll discuss the **Sci-Kit Learn** library, because even though implementing your own algorithms is fun and educational, you should use optimized and well-tested code in your actual work.

We’ll cap things off with a very practical, real-world example by writing a web service that runs a machine learning model and makes predictions. This is something that real companies do and make money from.

All the materials for this course are FREE. You can download and install **Python**, **Numpy**, and **Scipy** with simple commands on **Windows**, **Linux**, or **Mac**.

https://www.udemy.com/data-science-supervised-machine-learning-in-python/?couponCode=EARLYBIRDSITE

UPDATE: New coupon if the above is sold out:

https://www.udemy.com/data-science-supervised-machine-learning-in-python/?couponCode=SLOWBIRD_SITE

#data science #machine learning #matplotlib #numpy #pandas #python
Go to comments
August 9, 2016

[Scroll to the bottom for the early bird discount if you already know what this course is about]

In this course we are going to look at **advanced** NLP using **deep learning**.

Previously, you learned about some of the basics, like how many NLP problems are just regular** machine learning** and **data science** problems in disguise, and simple, practical methods like **bag-of-words** and term-document matrices.

These allowed us to do some pretty cool things, like** detect spam** emails, **write poetry**, **spin articles**, and group together similar words.

In this course I’m going to show you how to do even more awesome things. We’ll learn not just 1, but **4** new architectures in this course.

First up is **word2vec**.

In this course, I’m going to show you exactly how word2vec works, from theory to implementation, and you’ll see that it’s merely the application of skills you already know.

Word2vec is interesting because it magically maps words to a vector space where you can find analogies, like:

- king – man = queen – woman
- France – Paris = England – London
- December – Novemeber = July – June

We are also going to look at the **GLoVe** method, which also finds word vectors, but uses a technique called** matrix factorization**, which is a popular algorithm for **recommender systems**.

Amazingly, the word vectors produced by GLoVe are just as good as the ones produced by word2vec, and it’s way easier to train.

We will also look at some classical NLP problems, like **parts-of-speech tagging** and **named entity recognition**, and use** recurrent neural networks** to solve them. You’ll see that just about any problem can be solved using neural networks, but you’ll also learn the dangers of having too much complexity.

Lastly, you’ll learn about **recursive neural networks**, which finally help us solve the problem of negation in **sentiment analysis**. Recursive neural networks exploit the fact that sentences have a tree structure, and we can finally get away from naively using bag-of-words.

All of the materials required for this course can be downloaded and installed for FREE. We will do most of our work in **Numpy** and **Matplotlib**,and **Theano**. I am always available to answer your questions and help you along your data science journey.

See you in class!

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=EARLYBIRDSITE

UPDATE: New coupon if the above is sold out:

https://www.udemy.com/natural-language-processing-with-deep-learning-in-python/?couponCode=SLOWBIRD_SITE

#deep learning #GLoVe #natural language processing #nlp #python #recursive neural networks #tensorflow #theano #word2vec
Go to comments
July 14, 2016

New course out today – Recurrent Neural Networks in Python: Deep Learning part 5.

If you already know what the course is about (recurrent units, GRU, LSTM), grab your 50% OFF coupon and go!:

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

Like the course I just released on Hidden Markov Models, Recurrent Neural Networks are all about learning sequences – but whereas Markov Models are limited by the Markov assumption, Recurrent Neural Networks are not – and as a result, they are more expressive, and more powerful than anything we’ve seen on tasks that we haven’t made progress on in decades.

Sequences appear everywhere – stock prices, language, credit scoring, and webpage visits.

Recurrent neural networks have a history of being very hard to train. It hasn’t been until recently that we’ve found ways around what is called the vanishing gradient problem, and since then, recurrent neural networks have become one of the most popular methods in deep learning.

If you took my course on Hidden Markov Models, we are going to go through a lot of the same examples in this class, except that our results are going to be a lot better.

Our classification accuracies will increase, and we’ll be able to create vectors of words, or word embeddings, that allow us to visualize how words are related on a graph.

We’ll see some pretty interesting results, like that our neural network seems to have learned that all religions and languages and numbers are related, and that cities and countries have hierarchical relationships.

If you’re interested in discovering how modern deep learning has propelled machine learning and data science to new heights, this course is for you.

I’ll see you in class.

Click here for 50% OFF:

https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python/?couponCode=WEBSITE

#data science #deep learning #gru #lstm #machine learning #word vectors
Go to comments
June 13, 2016

EARLY BIRD 50% OFF COUPON: CLICK HERE

**Hidden Markov Models** are all about learning sequences.

A lot of the data that would be very useful for us to model is in sequences. **Stock prices** are sequences of prices. Language is a sequence of words. **Credit scoring** involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you’re going to default. In short, sequences are everywhere, and being able to analyze them is an important skill in your **data science** toolbox.

The easiest way to appreciate the kind of information you get from a sequence is to consider what you are reading right now. If I had written the previous sentence backwards, it wouldn’t make much sense to you, even though it contained all the same words. So order is important.

While the current fad in **deep learning **is to use **recurrent neural networks** to model sequences, I want to first introduce you guys to a machine learning algorithm that has been around for several decades now – the Hidden Markov Model.

This course follows directly from my first course in **Unsupervised Machine Learning for Cluster Analysis**, where you learned how to measure the **probability distribution** of a **random variable**. In this course, you’ll learn to measure the probability distribution of a sequence of random variables.

You guys know how much I love **deep learning**, so there is a little twist in this course. We’ve already covered **gradient descent** and you know how central it is for solving deep learning problems. I claimed that gradient descent could be used to optimize any objective function. In this course I will show you how you can use gradient descent to solve for the optimal parameters of an HMM, as an alternative to the popular **expectation-maximization** algorithm.

We’re going to do it in Theano, which is a popular library for deep learning. This is also going to teach you how to work with sequences in Theano, which will be very useful when we cover **recurrent neural networks** and **LSTMs**.

This course is also going to go through the many practical applications of Markov models and hidden Markov models. We’re going to look at a model of sickness and health, and calculate how to predict how long you’ll stay sick, if you get sick. We’re going to talk about how Markov models can be used to analyze how people interact with your website, and fix problem areas like high **bounce rate**, which could be affecting your **SEO**. We’ll build language models that can be used to identify a writer and even generate text – imagine a machine doing your writing for you.

We’ll look at what is possibly the most recent and prolific application of Markov models – **Google’s PageRank** algorithm. And finally we’ll discuss even more practical applications of Markov models, including generating images, **smartphone** **autosuggestions**, and using HMMs to answer one of the most fundamental questions in **biology** – how is **DNA**, the code of life, translated into physical or behavioral attributes of an organism?

All of the materials of this course can be downloaded and installed for FREE. We will do most of our work in **Numpy** and **Matplotlib**, along with a little bit of **Theano**. I am always available to answer your questions and help you along your data science journey.

Sign up now and get 50% off by clicking HERE

#data science #deep learning #hidden markov models #machine learning #recurrent neural networks #theano
Go to comments
May 15, 2016

This course is the next logical step in my **deep learning, data science,** and **machine learning** series. I’ve done a lot of courses about deep learning, and I just released a course about **unsupervised learning**, where I talked about **clustering** and **density estimation**. So what do you get when you put these 2 together? Unsupervised deep learning!

In these course we’ll start with some very basic stuff – **principal components analysis (PCA)**, and a popular nonlinear dimensionality reduction technique known as **t-SNE (t-distributed stochastic neighbor embedding)**.

Next, we’ll look at a special type of unsupervised neural network called the **autoencoder**. After describing how an autoencoder works, I’ll show you how you can link a bunch of them together to form a deep stack of autoencoders, that leads to better performance of a supervised** deep neural network**. Autoencoders are like a non-linear form of PCA.

Last, we’ll look at **restricted Boltzmann machines (RBMs)**. These are yet another popular unsupervised neural network, that you can use in the same way as autoencoders to **pretrain** your supervised deep neural network. I’ll show you an interesting way of training restricted Boltzmann machines, known as **Gibbs sampling**, a special case of **Markov Chain Monte Carlo,** and I’ll demonstrate how even though this method is only a rough approximation, it still ends up reducing other cost functions, such as the one used for autoencoders. This method is also known as **Contrastive Divergence** or **CD-k**. As in physical systems, we define a concept called **free energy** and attempt to minimize this quantity.

Finally, we’ll bring all these concepts together and I’ll show you visually what happens when you use PCA and t-SNE on the features that the autoencoders and RBMs have learned, and we’ll see that even without labels the results suggest that a pattern has been found.

All the materials used in this course are FREE. Since this course is the 4th in the deep learning series, I will assume you already know calculus, linear algebra, and **Python** coding. You’ll want to install **Numpy** and**Theano** for this course. These are essential items in your **data analytics** toolbox.

If you are interested in deep learning and you want to learn about modern deep learning developments beyond just plain **backpropagation**, including using unsupervised neural networks to interpret what features can be automatically and hierarchically learned in a deep learning system, this course is for you.

Get your EARLY BIRD coupon for 50% off here: https://www.udemy.com/unsupervised-deep-learning-in-python/?couponCode=EARLYBIRD

Go to comments
April 25, 2016

This article will be of interest to you if you want to learn about recommender systems and predicting movie ratings (or book ratings, or product ratings, or any other kind of rating).

Contests like the $1 million Netflix Challenge are an example of what collaborative filtering can be used for.

## Problem Setup

Let’s use the “users rating movies” example for this tutorial. After some Internet searching, we can determine that there are approximately 500, 000 movies in existence. Let’s also suppose that your very popular movie website has 1 billion users (Facebook has 1.6 billion users as of 2015, so this number is plausible).

How many possible user-movie ratings can you have? That is \( 10^9 \times 5 \times 10^5 = 5 \times 10^{14} \). That’s a lot of ratings! Way too much to fit into your RAM, in fact.

But that’s just one problem.

How many movies have you seen in your life? Of those movies, what percentage of them have you rated? The number is miniscule. In fact, *most* users have not rated *most* movies.

This is why recommender systems exist in the first place – so we can recommend you movies that you haven’t seen yet, that we know you’ll like.

So if you were to create a user-movie matrix of movie ratings, most of it would just have missing values.

However, that’s not to say there isn’t a pattern to be found.

Suppose we look at a subset of movie ratings, and we find the following:

Batman

Batman Returns

Batman Begins

The Dark Knight

Batman v. Superman

Where we’ve used N/A to show that a movie has not yet been rated by a user.

If we used the “cosine distance” ( \( \frac{u^T v}{ |u||v| } \) ) on the vectors created by looking at only the common movies, we could see that Guy A and Guy B have similar tastes. We could then surmise, based on this closeness, that Guy A might rate the Batman movie a “4”, and Guy B might rate Batman Returns a “4”. And since this is a pretty high rating, we might want to recommend these movies to these users.

This is the idea behind collaborative filtering.

## Enter Matrix Factorization

Matrix factorization solves the above problems by reducing the number of free parameters (so the total number of parameters is much smaller than #users times #movies), and by fitting these parameters to the data (ratings) that do exist.

What is matrix factorization?

Think of factorization in general:

15 = 3 x 5 (15 is made up of the factors 3 and 5)

\( x^2 + x = x(x + 1) \)

We can do the same thing with matrices:

$$\left( \begin{matrix}3 & 4 & 5 \\ 6 & 8 & 10 \end{matrix} \right) = \left( \begin{matrix}1 \\ 2 \end{matrix} \right) \left( \begin{matrix}3 & 4 & 5 \end{matrix} \right) $$

In fact, this is exactly what we do in matrix factorization. We “pretend” the big ratings matrix (the one that can’t fit into our RAM) is actually made up of 2 smaller matrices multiplied together.

Remember that to do a valid matrix multiply, the inner dimensions must match. What is the size of this dimension? We call it “K”. It is unknown, but we can choose it via possibly cross-validation so that our model generalizes well.

If we have \( M \) users and \( N \) ratings, then the total number of parameters in our model is \( MK + NK \). If we set \( K = 10 \), the total number of parameters we’d have for the user-movie problem would be \( 10^{10} + 5 \times 10^6 \), which is still approximately \( 10^{10} \), which is a factor of \( 10^4 \) smaller than before.

This is a big improvement!

So now we have:

$$ A \simeq \hat{ A } = UV $$

If you were to picture the matrices themselves, they would look like this:

Because I am lazy and took this image from elsewhere on the Internet, the “d” here is what I am calling “K”. And their “R” is my “A”.

You know that with any machine learning algorithm we have 2 procedures – the fitting procedure and the prediction procedure.

For the fitting procedure, we want every known \( A_{ij} \) to be as close to \( \hat{A}_{ij} = u_i^Tv_j \) as possible. \( u_i \) is the ith row of \( U \). \( v_j \) is the jth column of \( V \).

For the prediction procedure, we won’t have an \( A_{ij} \), but we can use \( \hat{A}_{ij} = u_i^Tv_j \) to tell us what user i *might* rate movie j given the existing patterns.

## The Cost Function

A natural cost function for this problem is the squared error. Think of it as a regression. This is just:

$$ J = \sum_{(i, j) \in \Omega} (A_{ij} – \hat{A}_{ij})^2 $$

Where \( \Omega \) is the set of all pairs \( (i, j) \) where user i has rated movie j.

Later, we will use \( \Omega_i \) to be the set of all j’s (movies) that user i has rated, and we will use \( \Omega_j \) to be the set of all i’s (users) that have rated movie j.

## Coordinate Descent

What do you do when you want to minimize a function? Take the derivative and set it to 0, of course. No need to use anything more complicated if the simple approach is solvable and performs well. It is also possible to use gradient descent on this problem by taking the derivative and then taking small steps in that direction.

You will notice that there are 2 derivatives to take here. The first is \( \partial{J} / \partial{u} \).

The other is \( \partial{J} / \partial{v} \). After calculating the derivatives and solving for \( u \) and \( v \), you get:

$$ u_i = ( \sum_{j \in \Omega_i} v_j v_j^T )^{-1} \sum_{j \in \Omega_i} A_{ij} v_j $$

$$ v_j = ( \sum_{i \in \Omega_j} u_i u_i^T )^{-1} \sum_{i \in \Omega_j} A_{ij} u_i $$

So you take both derivatives. You set both to 0. You solve for the optimal u and v. Now what?

The answer is: coordinate descent.

You first update \( u \) using the current setting of \( v \), then you update \( v \) using the current setting of \( u \). The order doesn’t matter, just that you alternate between the two.

There is a mathematical guarantee that J will improve on each iteration.

This technique is also known as **alternating least squares**. (This makes sense because we’re minimizing the squared error and updating \( u \) and \( v \) in an alternating fashion.)

## Bias Parameters

As with other methods like linear regression and logistic regression, we can add bias parameters to our model to improve accuracy. In this case our model becomes:

$$ \hat{A}_{ij} = u_i^T v_j + b_i + c_j + \mu $$

Where \( \mu \) is the global mean (average of all known ratings).

You can interpret \( b_i \) as the bias of a user. A negative bias means this user just hates movies more than the average person. A positive bias would mean the opposite. Similarly, \( c_j \) is the bias of a movie. A positive bias would mean, “Wow, this movie is good, regardless of who is watching it!” A negative bias would be a movie like* Avatar: The Last Airbender*.

We can re-calculate the optimal settings for each parameter (again by taking the derivatives and setting them to 0) to get:

$$ u_i = ( \sum_{j \in \Omega_i} v_j v_j^T )^{-1} \sum_{j \in \Omega_i} (A_{ij} – b_i – c_j – \mu )v_j $$

$$ v_j = ( \sum_{i \in \Omega_j} u_i u_i^T )^{-1} \sum_{i \in \Omega_j}(A_{ij} – b_i – c_j – \mu )u_i $$

$$ b_i = \frac{1}{| \Omega_i |}\sum_{j \in \Omega_i} A_{ij} – u_i^Tv_j – c_j – \mu $$

$$ c_j= \frac{1}{| \Omega_j |}\sum_{i \in \Omega_j} A_{ij} – u_i^Tv_j – b_i – \mu $$

## Regularization

With the above model, you may encounter what is called the “singular covariance” problem. This is what happens when you can’t invert the matrix that appears in the updates for \( u \) and \( v \).

The solution is again, similar to what you would do in linear regression or logistic regression: Add a squared error term with a weight \( \lambda \) that keeps the parameters small.

In terms of the likelihood, the previous formulation assumes that the difference between \( A_{ij} \) and \( \hat{A}_{ij} \) is normally distributed, while the cost function with regularization is like adding a normally-distributed prior on each parameter centered at 0.

i.e. \( u_i, v_j, b_i, c_j \sim N(0, 1/\lambda) \).

So the cost function becomes:

$$ J = \sum_{(i, j) \in \Omega} (A_{ij} – \hat{A}_{ij})^2 + \lambda(||U||_F + ||V||_F + ||b||^2 + ||c||^2) $$

Where \( ||X||_F \) is the Frobenius norm of \( X \).

For each parameter, setting the derivative with respect to that parameter, setting it to 0 and solving for the optimal value yields:

$$ u_i = ( \sum_{j \in \Omega_i} v_j v_j^T + \lambda{I})^{-1} \sum_{j \in \Omega_i} (A_{ij} – b_i – c_j – \mu )v_j $$

$$ v_j = ( \sum_{i \in \Omega_j} u_i u_i^T + \lambda{I})^{-1} \sum_{i \in \Omega_j}(A_{ij} – b_i – c_j – \mu )u_i $$

$$ b_i = \frac{1}{1 + \lambda} \frac{1}{| \Omega_i |}\sum_{j \in \Omega_i} A_{ij} – u_i^Tv_j – c_j – \mu $$

$$ c_j= \frac{1}{1 + \lambda} \frac{1}{| \Omega_j |}\sum_{i \in \Omega_j} A_{ij} – u_i^Tv_j – b_i – \mu $$

## Python Code

The simplest way to implement the above formulas would be to just code them directly.

Initialize your parameters as follows:

U = np.random.randn(M, K) / K
V = np.random.randn(K, N) / K
B = np.zeros(M)
C = np.zeros(N)

Next, you want \( \Omega_i \) and \( \Omega_j \) to be easily accessible, so create dictionaries “ratings_by_i” where “i” is the key, and the value is an array of all the (j, r) pairs that user i has rated (r is the rating). Do the same for “ratings_by_j”.

Then, your updates would be as follows:

for t in xrange(T):
# update B
for i in xrange(M):
if i in ratings_by_i:
accum = 0
for j, r in ratings_by_i[i]:
accum += (r - U[i,:].dot(V[:,j]) - C[j] - mu)
B[i] = accum / (1 + reg) / len(ratings_by_i[i])
# update U
for i in xrange(M):
if i in ratings_by_i:
matrix = np.zeros((K, K)) + reg*np.eye(K)
vector = np.zeros(K)
for j, r in ratings_by_i[i]:
matrix += np.outer(V[:,j], V[:,j])
vector += (r - B[i] - C[j] - mu)*V[:,j]
U[i,:] = np.linalg.solve(matrix, vector)
# update C
for j in xrange(N):
if j in ratings_by_j:
accum = 0
for i, r in ratings_by_j[j]:
accum += (r - U[i,:].dot(V[:,j]) - B[i] - mu)
C[j] = accum / (1 + reg) / len(ratings_by_j[j])
# update V
for j in xrange(N):
if j in ratings_by_j:
matrix = np.zeros((K, K)) + reg*np.eye(K)
vector = np.zeros(K)
for i, r in ratings_by_j[j]:
matrix += np.outer(U[i,:], U[i,:])
vector += (r - B[i] - C[j] - mu)*U[i,:]
V[:,j] = np.linalg.solve(matrix, vector)

And that’s all there is to it!

For more free machine learning and data science tutorials, sign up for my newsletter.

Go to comments
April 20, 2016

[Scroll to the bottom if you want to jump straight to the coupon]

**Cluster analysis** is a staple of **unsupervised machine learning** and **data science**.

It is very useful for **data mining** and **big data** because it **automatically finds patterns** in the data, without the need for labels, unlike supervised machine learning.

In a real-world environment, you can imagine that a robot or an **artificial intelligence** won’t always have access to the optimal answer, or maybe there isn’t an optimal correct answer. You’d want that robot to be able to explore the world on its own, and learn things just by looking for patterns.

Do you ever wonder how we get the data that we use in our supervised machine learning algorithms?

We always seem to have a nice CSV or a table, complete with Xs and corresponding Ys.

If you haven’t been involved in acquiring data yourself, you might not have thought about this, but someone has to make this data!

Those “Y”s have to come from somewhere, and a lot of the time that involves manual labor.

Sometimes, you don’t have access to this kind of information or it is infeasible or costly to acquire.

But you still want to have some idea of the structure of the data. If you’re doing **data analytics** automating **pattern recognition** in your data would be invaluable.

This is where unsupervised machine learning comes into play.

In this course we are first going to talk about clustering. This is where instead of training on labels, we try to create our own labels! We’ll do this by grouping together data that looks alike.

There are 2 methods of clustering we’ll talk about: **k-means clustering** and **hierarchical clustering**.

Next, because in machine learning we like to talk about probability distributions, we’ll go into **Gaussian mixture models** and **kernel density estimation**, where we talk about how to “learn” the probability distribution of a set of data.

One interesting fact is that under certain conditions, Gaussian mixture models and k-means clustering are exactly the same! You can think of GMMs as a “souped up” version of k-means. We’ll prove how this is the case.

All the algorithms we’ll talk about in this course are staples in machine learning and data science, so if you want to know how to automatically find patterns in your data with data mining and pattern extraction, without needing someone to put in manual work to label that data, then this course is for you.

All the materials for this course are FREE. You can download and install **Python, Numpy, and Scipy** with simple commands on **Windows, Linux, or Mac**.

50% OFF COUPON: https://www.udemy.com/cluster-analysis-unsupervised-machine-learning-python/?couponCode=EARLYBIRD

#agglomerative clustering #cluster analysis #data mining #data science #expectation-maximization #gaussian mixture model #hierarchical clustering #k-means clustering #kernel density estimation #pattern recognition #udemy #unsupervised machine learning
Go to comments