# New Years 2019

### How to meet your New Years resolutions in 2019

Firstly, I’d like to wish everyone on this list a happy new year. We are off to a great start. The new year is a time to set goals, turn things around, and be better than we were before.

What better way than to learn from thousands of experts around the world who are the best at what they do? Luckily, I’ve got something that will make it just a little easier.

I know a lot of you have been waiting for this – well here it is – the LOWEST price possible on ALL Udemy courses (yes, the whole site!)

For the next 10 days, ALL courses on Udemy (not just mine) are available for just $9.99!

For my courses, please use the Udemy coupons below (included in the links below), or if you want, enter the coupon code: JAN2019.

For prerequisite courses (math, stats, Python programming) and all other courses (Bitcoin, meditation, yoga, guitar, photography, whatever else you want to learn), follow the links at the bottom (or go to my website).

Since ALL courses on Udemy are on sale, if you want any course not listed here, just click the general (site-wide) link, and search for courses from that page.

https://www.udemy.com/recommender-systems/?couponCode=JAN2019

https://www.udemy.com/deep-learning-advanced-nlp/?couponCode=JAN2019

### PREREQUISITE COURSE COUPONS

And just as important, $9.99 coupons for some helpful prerequisite courses. You NEED to know this stuff to understand machine learning in-depth:

General (site-wide): http://bit.ly/2oCY14Z
Python http://bit.ly/2pbXxXz
Calc 1 http://bit.ly/2okPUib
Calc 2 http://bit.ly/2oXnhpX
Calc 3 http://bit.ly/2pVU0gQ
Linalg 1 http://bit.ly/2oBBir1
Linalg 2 http://bit.ly/2q5SGEE
Probability (option 1) http://bit.ly/2p8kcC0
Probability (option 2) http://bit.ly/2oXa2pb
Probability (option 3) http://bit.ly/2oXbZSK

### OTHER UDEMY COURSE COUPONS

As you know, I’m the “Lazy Programmer”, not just the “Lazy Data Scientist” – I love all kinds of programming!

iOS courses:
https://lazyprogrammer.me/ios

Android courses:
https://lazyprogrammer.me/android

Ruby on Rails courses:
https://lazyprogrammer.me/ruby-on-rails

Python courses:
https://lazyprogrammer.me/python

Big Data (Spark + Hadoop) courses:

Javascript, ReactJS, AngularJS courses:
https://lazyprogrammer.me/javascript

### EVEN MORE COOL STUFF

Into Yoga in your spare time? Photography? Painting? There are courses, and I’ve got coupons! If you find a course on Udemy that you’d like a coupon for, just let me know and I’ll hook you up!

# Black Friday 2018 – Udemy’s BIGGEST Sale of the YEAR is back!

November 14, 2018

#### Deep Learning and AI Courses for just $9.99

I know a lot of you have been waiting for this – well here it is – the LOWEST price possible on ALL Udemy courses (yes, the whole site!)

For the next 7 days, ALL courses on Udemy (not just mine) are available for just $9.99!

For my courses, please use the coupons below (included in the links below), or if you want, enter the coupon code: NOV2018.

For prerequisite courses (math, stats, Python programming) and all other courses (yoga, guitar, photography, whatever else you want to learn), follow the links at the bottom.

Since ALL courses on Udemy are on sale, if you want any course not listed here, just click the general (site-wide) link, and search for courses from that page.

https://www.udemy.com/recommender-systems/?couponCode=NOV2018

## Problem Setup

Let’s use the “users rating movies” example for this tutorial. After some Internet searching, we can determine that there are approximately 500,000 movies in existence. Let’s also suppose that your very popular movie website has 1 billion users (Facebook has 1.6 billion users as of 2015, so this number is plausible).

How many possible user-movie ratings can you have? That is $$10^9 \times 5 \times 10^5 = 5 \times 10^{14}$$. That’s a lot of ratings! Way too many to fit into your RAM, in fact.
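To put that number in perspective, here is a rough back-of-the-envelope calculation. The 4 bytes per rating is my own assumption (one 32-bit float per entry); the user and movie counts are the ones from the text.

```python
# Back-of-the-envelope size of a dense user-movie ratings matrix.
num_users = 10**9          # 1 billion users (figure from the text)
num_movies = 5 * 10**5     # 500,000 movies (figure from the text)
bytes_per_rating = 4       # assumption: one 32-bit float per rating

num_entries = num_users * num_movies
total_bytes = num_entries * bytes_per_rating

print(f"entries: {num_entries:.1e}")
print(f"dense storage: {total_bytes / 1e15:.1f} PB")
```

Under that assumption the dense matrix would need on the order of petabytes, which is why storing it directly is not an option.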

But that’s just one problem.

How many movies have you seen in your life? Of those movies, what percentage of them have you rated? The number is minuscule. In fact, most users have not rated most movies.

This is why recommender systems exist in the first place – so we can recommend you movies that you haven’t seen yet, that we know you’ll like.

So if you were to create a user-movie matrix of movie ratings, most of it would just have missing values.

However, that’s not to say there isn’t a pattern to be found.

Suppose we look at a subset of movie ratings, and we find the following:

| | Batman | Batman Returns | Batman Begins | The Dark Knight | Batman v. Superman |
|---|---|---|---|---|---|
| Guy A | N/A | 4 | 5 | 5 | 2 |
| Guy B | 4 | N/A | 5 | 5 | 1 |

Where we’ve used N/A to show that a movie has not yet been rated by a user.

If we used the “cosine similarity” ( $$\frac{u^T v}{ |u||v| }$$ ) on the vectors created by looking at only the common movies, we could see that Guy A and Guy B have similar tastes. We could then surmise, based on this closeness, that Guy A might rate the Batman movie a “4”, and Guy B might rate Batman Returns a “4”. And since this is a pretty high rating, we might want to recommend these movies to these users.
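Here is a small sketch of that calculation for Guy A and Guy B, using only the three movies both have rated. The function name and the use of `np.nan` for missing ratings are my own choices for illustration.

```python
import numpy as np

# Ratings from the example above; np.nan marks a movie the user has not rated.
# Order: Batman, Batman Returns, Batman Begins, The Dark Knight, Batman v. Superman
guy_a = np.array([np.nan, 4, 5, 5, 2], dtype=float)
guy_b = np.array([4, np.nan, 5, 5, 1], dtype=float)

def cosine_similarity_on_common(u, v):
    """Compute u^T v / (|u||v|) using only the entries both users have rated."""
    common = ~np.isnan(u) & ~np.isnan(v)
    u_c, v_c = u[common], v[common]
    return u_c.dot(v_c) / (np.linalg.norm(u_c) * np.linalg.norm(v_c))

similarity = cosine_similarity_on_common(guy_a, guy_b)
print(round(similarity, 3))
```

The result is very close to 1, which is what “similar tastes” means here: the two rating vectors point in nearly the same direction.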

This is the idea behind collaborative filtering.

## Enter Matrix Factorization

Matrix factorization solves the above problems by reducing the number of free parameters (so the total number of parameters is much smaller than #users times #movies), and by fitting these parameters to the data (ratings) that do exist.

What is matrix factorization?

Think of factorization in general:

15 = 3 x 5 (15 is made up of the factors 3 and 5)

$$x^2 + x = x(x + 1)$$

We can do the same thing with matrices:

$$\left( \begin{matrix}3 & 4 & 5 \\ 6 & 8 & 10 \end{matrix} \right) = \left( \begin{matrix}1 \\ 2 \end{matrix} \right) \left( \begin{matrix}3 & 4 & 5 \end{matrix} \right)$$
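You can check this factorization numerically. The variable names are just for illustration.

```python
import numpy as np

left = np.array([[1], [2]])      # 2 x 1 column
right = np.array([[3, 4, 5]])    # 1 x 3 row
product = left @ right           # 2 x 3 matrix

print(product)
print(left.size + right.size, "numbers stored instead of", product.size)
```

Even in this tiny case the two factors hold fewer numbers than the matrix they reproduce, and the gap grows quickly as the matrix gets bigger.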

In fact, this is exactly what we do in matrix factorization. We “pretend” the big ratings matrix (the one that can’t fit into our RAM) is actually made up of 2 smaller matrices multiplied together.

Remember that to do a valid matrix multiply, the inner dimensions must match. What is the size of this dimension? We call it “K”. It is unknown, but we can choose it, possibly via cross-validation, so that our model generalizes well.

If we have $$M$$ users and $$N$$ movies, then the total number of parameters in our model is $$MK + NK$$. If we set $$K = 10$$, the total number of parameters we’d have for the user-movie problem would be $$10^{10} + 5 \times 10^6$$, which is still approximately $$10^{10}$$. That is smaller than before by a factor of about $$5 \times 10^4$$.
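The arithmetic is easy to verify. This sketch just plugs in the numbers from the problem setup.

```python
M = 10**9        # users
N = 5 * 10**5    # movies
K = 10           # latent dimensionality

num_params = M * K + N * K    # entries in U plus entries in V
dense_entries = M * N         # entries in the full ratings matrix
reduction = dense_entries / num_params

print(f"parameters: {num_params:.3e}")
print(f"reduction factor: {reduction:.0f}")
```

The ratio comes out just under $$5 \times 10^4$$, so the saving is on the order of $$10^4$$.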

This is a big improvement!

So now we have:

$$A \simeq \hat{ A } = UV$$

If you were to picture the matrices themselves, they would look like this:

Because I am lazy and took this image from elsewhere on the Internet, the “d” here is what I am calling “K”. And their “R” is my “A”.

You know that with any machine learning algorithm we have 2 procedures – the fitting procedure and the prediction procedure.

For the fitting procedure, we want every known $$A_{ij}$$ to be as close to $$\hat{A}_{ij} = u_i^Tv_j$$ as possible. $$u_i$$ is the ith row of $$U$$. $$v_j$$ is the jth column of $$V$$.

For the prediction procedure, we won’t have an $$A_{ij}$$, but we can use $$\hat{A}_{ij} = u_i^Tv_j$$ to tell us what user i might rate movie j given the existing patterns.
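As a quick illustration of the prediction step, here is a sketch with random (untrained) factors. The toy sizes and the `predict` helper are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 4, 5, 2                  # toy sizes, chosen only for illustration
U = rng.standard_normal((M, K))    # row i is u_i
V = rng.standard_normal((K, N))    # column j is v_j

def predict(U, V, i, j):
    """Predicted rating A_hat[i, j] = u_i^T v_j."""
    return U[i, :].dot(V[:, j])

A_hat = U @ V                      # every prediction at once (only feasible for small M, N)
print(predict(U, V, 1, 3), A_hat[1, 3])
```

For a real dataset you would only ever compute individual entries like `predict(U, V, i, j)`, since the full product is the matrix that doesn’t fit in RAM.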

## The Cost Function

A natural cost function for this problem is the squared error. Think of it as a regression. This is just:

$$J = \sum_{(i, j) \in \Omega} (A_{ij} - \hat{A}_{ij})^2$$

Where $$\Omega$$ is the set of all pairs $$(i, j)$$ where user i has rated movie j.

Later, we will use $$\Omega_i$$ to be the set of all j’s (movies) that user i has rated, and we will use $$\Omega_j$$ to be the set of all i’s (users) that have rated movie j.
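In code, the sum over $$\Omega$$ is just a loop over the ratings you actually have. This sketch assumes the known ratings are stored in a dict keyed by (i, j), which is a hypothetical layout chosen for brevity.

```python
import numpy as np

def squared_error(ratings, U, V):
    """J = sum over observed (i, j) of (A_ij - u_i^T v_j)^2.

    `ratings` is a hypothetical dict mapping (i, j) -> rating;
    its keys play the role of Omega.
    """
    return sum((r - U[i, :].dot(V[:, j])) ** 2 for (i, j), r in ratings.items())

ratings = {(0, 1): 4.0, (0, 2): 5.0, (1, 0): 4.0, (1, 2): 5.0}
U = np.ones((2, 2))
V = np.ones((2, 3))    # with all-ones factors and K = 2, every prediction is 2
J = squared_error(ratings, U, V)
print(J)
```

Missing ratings never enter the sum, so they cost nothing and need no storage.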

## Coordinate Descent

What do you do when you want to minimize a function? Take the derivative and set it to 0, of course. No need to use anything more complicated if the simple approach is solvable and performs well. It is also possible to use gradient descent on this problem by taking the derivative and then taking small steps in the opposite direction.

You will notice that there are 2 derivatives to take here. The first is $$\partial{J} / \partial{u}$$.

The other is $$\partial{J} / \partial{v}$$. After calculating the derivatives and solving for $$u$$ and $$v$$, you get:

$$u_i = ( \sum_{j \in \Omega_i} v_j v_j^T )^{-1} \sum_{j \in \Omega_i} A_{ij} v_j$$

$$v_j = ( \sum_{i \in \Omega_j} u_i u_i^T )^{-1} \sum_{i \in \Omega_j} A_{ij} u_i$$
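In case the jump is not obvious, here is a sketch of the steps for $$u_i$$ (the steps for $$v_j$$ are symmetric). Holding every $$v_j$$ fixed, only the terms for user i depend on $$u_i$$:

$$\frac{\partial J}{\partial u_i} = -2 \sum_{j \in \Omega_i} (A_{ij} - u_i^T v_j) v_j = 0$$

Since $$u_i^T v_j$$ is a scalar, $$(u_i^T v_j) v_j = (v_j v_j^T) u_i$$, so rearranging gives:

$$( \sum_{j \in \Omega_i} v_j v_j^T ) u_i = \sum_{j \in \Omega_i} A_{ij} v_j$$

Multiplying both sides by the inverse of the $$K \times K$$ matrix on the left yields the update for $$u_i$$.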

So you take both derivatives. You set both to 0. You solve for the optimal u and v. Now what?

You first update $$u$$ using the current setting of $$v$$, then you update $$v$$ using the current setting of $$u$$. The order doesn’t matter, just that you alternate between the two.

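There is a mathematical guarantee that J will never increase from one iteration to the next, since each update exactly minimizes J with respect to the parameters being updated.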

This technique is also known as alternating least squares. (This makes sense because we’re minimizing the squared error and updating $$u$$ and $$v$$ in an alternating fashion.)
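To see the alternation in action, here is a minimal sketch on a toy ratings matrix with two entries treated as missing. The data, the boolean mask, and the helper names are all made up for illustration; there are no bias terms or regularization yet.

```python
import numpy as np

# Toy ratings; the mask marks which entries count as observed (Omega).
A = np.array([[5, 4, 1, 1, 3],
              [4, 5, 1, 2, 3],
              [1, 1, 5, 4, 2],
              [2, 1, 4, 5, 3]], dtype=float)
observed = np.ones(A.shape, dtype=bool)
observed[0, 0] = observed[1, 1] = False    # pretend these two ratings are missing

M, N = A.shape
K = 2
rng = np.random.default_rng(0)
U = rng.standard_normal((M, K))
V = rng.standard_normal((K, N))

def cost(A, observed, U, V):
    """Squared error over the observed entries only."""
    err = (A - U @ V)[observed]
    return float(err @ err)

costs = [cost(A, observed, U, V)]
for t in range(10):
    # update each u_i with V held fixed
    for i in range(M):
        movies = np.flatnonzero(observed[i])         # Omega_i
        Vi = V[:, movies]                            # K x |Omega_i|
        U[i] = np.linalg.solve(Vi @ Vi.T, Vi @ A[i, movies])
    # update each v_j with U held fixed
    for j in range(N):
        users = np.flatnonzero(observed[:, j])       # Omega_j
        Uj = U[users]                                # |Omega_j| x K
        V[:, j] = np.linalg.solve(Uj.T @ Uj, Uj.T @ A[users, j])
    costs.append(cost(A, observed, U, V))

print(costs)
```

Printing `costs` lets you watch J go down (or stay flat) after every full pass. Note that `np.linalg.solve` fails if the $$K \times K$$ matrix is singular, which can happen when a user or movie has fewer than K ratings.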

## Bias Parameters

As with other methods like linear regression and logistic regression, we can add bias parameters to our model to improve accuracy. In this case our model becomes:

$$\hat{A}_{ij} = u_i^T v_j + b_i + c_j + \mu$$

Where $$\mu$$ is the global mean (average of all known ratings).

You can interpret $$b_i$$ as the bias of a user. A negative bias means this user just hates movies more than the average person. A positive bias would mean the opposite. Similarly, $$c_j$$ is the bias of a movie. A positive bias would mean, “Wow, this movie is good, regardless of who is watching it!” A negative bias would be a movie like Avatar: The Last Airbender.

We can re-calculate the optimal settings for each parameter (again by taking the derivatives and setting them to 0) to get:

$$u_i = ( \sum_{j \in \Omega_i} v_j v_j^T )^{-1} \sum_{j \in \Omega_i} (A_{ij} - b_i - c_j - \mu )v_j$$

$$v_j = ( \sum_{i \in \Omega_j} u_i u_i^T )^{-1} \sum_{i \in \Omega_j}(A_{ij} - b_i - c_j - \mu )u_i$$

$$b_i = \frac{1}{| \Omega_i |}\sum_{j \in \Omega_i} (A_{ij} - u_i^Tv_j - c_j - \mu)$$

$$c_j= \frac{1}{| \Omega_j |}\sum_{i \in \Omega_j} (A_{ij} - u_i^Tv_j - b_i - \mu)$$

## Regularization

With the above model, you may encounter what is called the “singular covariance” problem. This is what happens when you can’t invert the matrix that appears in the updates for $$u$$ and $$v$$.

The solution is, again, similar to what you would do in linear regression or logistic regression: add a squared-magnitude penalty term with a weight $$\lambda$$ that keeps the parameters small.

In terms of the likelihood, the previous formulation assumes that the difference between $$A_{ij}$$ and $$\hat{A}_{ij}$$ is normally distributed, while the cost function with regularization is like adding a normally-distributed prior on each parameter centered at 0.

i.e. $$u_i, v_j, b_i, c_j \sim N(0, 1/\lambda)$$.

So the cost function becomes:

$$J = \sum_{(i, j) \in \Omega} (A_{ij} - \hat{A}_{ij})^2 + \lambda(||U||_F^2 + ||V||_F^2 + ||b||^2 + ||c||^2)$$

Where $$||X||_F$$ is the Frobenius norm of $$X$$.
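Here is a sketch of this cost in code, again assuming a hypothetical dict of (i, j) -> rating for the known ratings. The penalty uses the squared Frobenius norm (the sum of squared entries), which is the form consistent with the $$\lambda I$$ term that shows up in the updates.

```python
import numpy as np

def regularized_cost(ratings, U, V, b, c, mu, reg):
    """Squared error over observed ratings plus reg * (sum of squared parameters)."""
    sse = 0.0
    for (i, j), r in ratings.items():
        pred = U[i, :].dot(V[:, j]) + b[i] + c[j] + mu
        sse += (r - pred) ** 2
    penalty = (U ** 2).sum() + (V ** 2).sum() + (b ** 2).sum() + (c ** 2).sum()
    return sse + reg * penalty

# Tiny check: one user, one movie, one rating.
ratings = {(0, 0): 3.0}
U = np.ones((1, 2))
V = np.ones((2, 1))
b = np.zeros(1)
c = np.zeros(1)
J_reg = regularized_cost(ratings, U, V, b, c, mu=0.0, reg=0.5)
print(J_reg)
```

In the tiny check the prediction is 2, so the squared error is 1, and the penalty is 0.5 times the 4 unit-valued parameters.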

For each parameter, taking the derivative of J with respect to that parameter, setting it to 0, and solving for the optimal value yields:

$$u_i = ( \sum_{j \in \Omega_i} v_j v_j^T + \lambda{I})^{-1} \sum_{j \in \Omega_i} (A_{ij} - b_i - c_j - \mu )v_j$$

$$v_j = ( \sum_{i \in \Omega_j} u_i u_i^T + \lambda{I})^{-1} \sum_{i \in \Omega_j}(A_{ij} - b_i - c_j - \mu )u_i$$

$$b_i = \frac{1}{| \Omega_i | +\lambda}\sum_{j \in \Omega_i} (A_{ij} - u_i^Tv_j - c_j - \mu)$$

$$c_j= \frac{1}{| \Omega_j | +\lambda}\sum_{i \in \Omega_j} (A_{ij} - u_i^Tv_j - b_i - \mu)$$

## Python Code

The simplest way to implement the above formulas would be to just code them directly.

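```python
import numpy as np

# M users, N movies, K latent dimensions (set these for your data)
U = np.random.randn(M, K) / K   # user factors, one row per user
V = np.random.randn(K, N) / K   # movie factors, one column per movie
B = np.zeros(M)                 # user biases
C = np.zeros(N)                 # movie biases
```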


Next, you want $$\Omega_i$$ and $$\Omega_j$$ to be easily accessible, so create dictionaries “ratings_by_i” where “i” is the key, and the value is an array of all the (j, r) pairs that user i has rated (r is the rating). Do the same for “ratings_by_j”.
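One way to build these dictionaries is sketched below. It assumes your raw data is a list of (user index, movie index, rating) triples; the `triples` name and its contents are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw data: (user index, movie index, rating) triples.
triples = [(0, 1, 4.0), (0, 2, 5.0), (1, 0, 4.0), (1, 2, 5.0)]

ratings_by_i = defaultdict(list)   # user i  -> list of (j, r)
ratings_by_j = defaultdict(list)   # movie j -> list of (i, r)
for i, j, r in triples:
    ratings_by_i[i].append((j, r))
    ratings_by_j[j].append((i, r))

# Convert to plain dicts so a lookup of a missing key cannot silently add it.
ratings_by_i = dict(ratings_by_i)
ratings_by_j = dict(ratings_by_j)

# Global mean of all known ratings (the mu in the model).
mu = sum(r for _, _, r in triples) / len(triples)
print(ratings_by_i, ratings_by_j, mu)
```

Users or movies with no ratings simply never appear as keys, so a membership check on the dictionary is enough to skip them during training.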

```python
# T = number of training iterations, reg = regularization weight (lambda),
# mu = global mean of all known ratings
for t in range(T):

    # update B (user biases)
    for i in range(M):
        if i in ratings_by_i:
            accum = 0
            for j, r in ratings_by_i[i]:
                accum += (r - U[i,:].dot(V[:,j]) - C[j] - mu)
            B[i] = accum / (len(ratings_by_i[i]) + reg)

    # update U (user factors)
    for i in range(M):
        if i in ratings_by_i:
            matrix = np.zeros((K, K)) + reg*np.eye(K)
            vector = np.zeros(K)
            for j, r in ratings_by_i[i]:
                matrix += np.outer(V[:,j], V[:,j])
                vector += (r - B[i] - C[j] - mu)*V[:,j]
            U[i,:] = np.linalg.solve(matrix, vector)

    # update C (movie biases)
    for j in range(N):
        if j in ratings_by_j:
            accum = 0
            for i, r in ratings_by_j[j]:
                accum += (r - U[i,:].dot(V[:,j]) - B[i] - mu)
            C[j] = accum / (len(ratings_by_j[j]) + reg)

    # update V (movie factors)
    for j in range(N):
        if j in ratings_by_j:
            matrix = np.zeros((K, K)) + reg*np.eye(K)
            vector = np.zeros(K)
            for i, r in ratings_by_j[j]:
                matrix += np.outer(U[i,:], U[i,:])
                vector += (r - B[i] - C[j] - mu)*U[i,:]
            V[:,j] = np.linalg.solve(matrix, vector)
```


And that’s all there is to it!
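Once training is done, you will probably want to make predictions and check how well the model fits. Here is a minimal sketch using the model equation with biases. The helper names and the toy parameter values are assumptions for illustration, not trained results.

```python
import numpy as np

def predict(U, V, B, C, mu, i, j):
    """A_hat[i, j] = u_i^T v_j + b_i + c_j + mu."""
    return U[i, :].dot(V[:, j]) + B[i] + C[j] + mu

def mean_squared_error(ratings_by_i, U, V, B, C, mu):
    """Average squared error over all known ratings."""
    errors = [(r - predict(U, V, B, C, mu, i, j)) ** 2
              for i, pairs in ratings_by_i.items()
              for j, r in pairs]
    return float(np.mean(errors))

# Toy parameters (not trained) just to exercise the functions.
ratings_by_i = {0: [(1, 4.0), (2, 5.0)], 1: [(0, 4.0), (2, 5.0)]}
U = np.zeros((2, 2))
V = np.zeros((2, 3))
B = np.array([0.5, -0.5])
C = np.zeros(3)
mu = 4.5

mse = mean_squared_error(ratings_by_i, U, V, B, C, mu)
print(predict(U, V, B, C, mu, 0, 0), mse)
```

To judge generalization rather than fit, you would compute the same error on ratings that were held out of training.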