August 25, 2016

For some reason Udemy announced a promotion but when you go to the site it doesn’t appear.

Just use this link to get ANY course on Udemy for $10:

Go to commentsAugust 25, 2016

For some reason Udemy announced a promotion but when you go to the site it doesn’t appear.

Just use this link to get ANY course on Udemy for $10:

Go to commentsApril 25, 2016

This article will be of interest to you if you want to learn about recommender systems and predicting movie ratings (or book ratings, or product ratings, or any other kind of rating).

Contests like the $1 million Netflix Challenge are an example of what collaborative filtering can be used for.

Let’s use the “users rating movies” example for this tutorial. After some Internet searching, we can determine that there are approximately 500, 000 movies in existence. Let’s also suppose that your very popular movie website has 1 billion users (Facebook has 1.6 billion users as of 2015, so this number is plausible).

How many possible user-movie ratings can you have? That is \( 10^9 \times 5 \times 10^5 = 5 \times 10^{14} \). That’s a lot of ratings! Way too much to fit into your RAM, in fact.

But that’s just one problem.

How many movies have you seen in your life? Of those movies, what percentage of them have you rated? The number is miniscule. In fact, *most* users have not rated *most* movies.

This is why recommender systems exist in the first place – so we can recommend you movies that you haven’t seen yet, that we know you’ll like.

So if you were to create a user-movie matrix of movie ratings, most of it would just have missing values.

However, that’s not to say there isn’t a pattern to be found.

Suppose we look at a subset of movie ratings, and we find the following:

Batman

Batman Returns

Batman Begins

The Dark Knight

Batman v. Superman

Guy A

N/A

4

5

5

2

Guy B

4

N/A

5

5

1

Where we’ve used N/A to show that a movie has not yet been rated by a user.

If we used the “cosine distance” ( \( \frac{u^T v}{ |u||v| } \) ) on the vectors created by looking at only the common movies, we could see that Guy A and Guy B have similar tastes. We could then surmise, based on this closeness, that Guy A might rate the Batman movie a “4”, and Guy B might rate Batman Returns a “4”. And since this is a pretty high rating, we might want to recommend these movies to these users.

This is the idea behind collaborative filtering.

Matrix factorization solves the above problems by reducing the number of free parameters (so the total number of parameters is much smaller than #users times #movies), and by fitting these parameters to the data (ratings) that do exist.

What is matrix factorization?

Think of factorization in general:

15 = 3 x 5 (15 is made up of the factors 3 and 5)

\( x^2 + x = x(x + 1) \)

We can do the same thing with matrices:

$$\left( \begin{matrix}3 & 4 & 5 \\ 6 & 8 & 10 \end{matrix} \right) = \left( \begin{matrix}1 \\ 2 \end{matrix} \right) \left( \begin{matrix}3 & 4 & 5 \end{matrix} \right) $$

In fact, this is exactly what we do in matrix factorization. We “pretend” the big ratings matrix (the one that can’t fit into our RAM) is actually made up of 2 smaller matrices multiplied together.

Remember that to do a valid matrix multiply, the inner dimensions must match. What is the size of this dimension? We call it “K”. It is unknown, but we can choose it via possibly cross-validation so that our model generalizes well.

If we have \( M \) users and \( N \) ratings, then the total number of parameters in our model is \( MK + NK \). If we set \( K = 10 \), the total number of parameters we’d have for the user-movie problem would be \( 10^{10} + 5 \times 10^6 \), which is still approximately \( 10^{10} \), which is a factor of \( 10^4 \) smaller than before.

This is a big improvement!

So now we have:

$$ A \simeq \hat{ A } = UV $$

If you were to picture the matrices themselves, they would look like this:

Because I am lazy and took this image from elsewhere on the Internet, the “d” here is what I am calling “K”. And their “R” is my “A”.

You know that with any machine learning algorithm we have 2 procedures – the fitting procedure and the prediction procedure.

For the fitting procedure, we want every known \( A_{ij} \) to be as close to \( \hat{A}_{ij} = u_i^Tv_j \) as possible. \( u_i \) is the ith row of \( U \). \( v_j \) is the jth column of \( V \).

For the prediction procedure, we won’t have an \( A_{ij} \), but we can use \( \hat{A}_{ij} = u_i^Tv_j \) to tell us what user i *might* rate movie j given the existing patterns.

A natural cost function for this problem is the squared error. Think of it as a regression. This is just:

$$ J = \sum_{(i, j) \in \Omega} (A_{ij} – \hat{A}_{ij})^2 $$

Where \( \Omega \) is the set of all pairs \( (i, j) \) where user i has rated movie j.

Later, we will use \( \Omega_i \) to be the set of all j’s (movies) that user i has rated, and we will use \( \Omega_j \) to be the set of all i’s (users) that have rated movie j.

What do you do when you want to minimize a function? Take the derivative and set it to 0, of course. No need to use anything more complicated if the simple approach is solvable and performs well. It is also possible to use gradient descent on this problem by taking the derivative and then taking small steps in that direction.

You will notice that there are 2 derivatives to take here. The first is \( \partial{J} / \partial{u} \).

The other is \( \partial{J} / \partial{v} \). After calculating the derivatives and solving for \( u \) and \( v \), you get:

$$ u_i = ( \sum_{j \in \Omega_i} v_j v_j^T )^{-1} \sum_{j \in \Omega_i} A_{ij} v_j $$

$$ v_j = ( \sum_{i \in \Omega_j} u_i u_i^T )^{-1} \sum_{i \in \Omega_j} A_{ij} u_i $$

So you take both derivatives. You set both to 0. You solve for the optimal u and v. Now what?

The answer is: coordinate descent.

You first update \( u \) using the current setting of \( v \), then you update \( v \) using the current setting of \( u \). The order doesn’t matter, just that you alternate between the two.

There is a mathematical guarantee that J will improve on each iteration.

This technique is also known as **alternating least squares**. (This makes sense because we’re minimizing the squared error and updating \( u \) and \( v \) in an alternating fashion.)

As with other methods like linear regression and logistic regression, we can add bias parameters to our model to improve accuracy. In this case our model becomes:

$$ \hat{A}_{ij} = u_i^T v_j + b_i + c_j + \mu $$

Where \( \mu \) is the global mean (average of all known ratings).

You can interpret \( b_i \) as the bias of a user. A negative bias means this user just hates movies more than the average person. A positive bias would mean the opposite. Similarly, \( c_j \) is the bias of a movie. A positive bias would mean, “Wow, this movie is good, regardless of who is watching it!” A negative bias would be a movie like* Avatar: The Last Airbender*.

We can re-calculate the optimal settings for each parameter (again by taking the derivatives and setting them to 0) to get:

$$ u_i = ( \sum_{j \in \Omega_i} v_j v_j^T )^{-1} \sum_{j \in \Omega_i} (A_{ij} – b_i – c_j – \mu )v_j $$

$$ v_j = ( \sum_{i \in \Omega_j} u_i u_i^T )^{-1} \sum_{i \in \Omega_j}(A_{ij} – b_i – c_j – \mu )u_i $$

$$ b_i = \frac{1}{| \Omega_i |}\sum_{j \in \Omega_i} A_{ij} – u_i^Tv_j – c_j – \mu $$

$$ c_j= \frac{1}{| \Omega_j |}\sum_{i \in \Omega_j} A_{ij} – u_i^Tv_j – b_i – \mu $$

With the above model, you may encounter what is called the “singular covariance” problem. This is what happens when you can’t invert the matrix that appears in the updates for \( u \) and \( v \).

The solution is again, similar to what you would do in linear regression or logistic regression: Add a squared error term with a weight \( \lambda \) that keeps the parameters small.

In terms of the likelihood, the previous formulation assumes that the difference between \( A_{ij} \) and \( \hat{A}_{ij} \) is normally distributed, while the cost function with regularization is like adding a normally-distributed prior on each parameter centered at 0.

i.e. \( u_i, v_j, b_i, c_j \sim N(0, 1/\lambda) \).

So the cost function becomes:

$$ J = \sum_{(i, j) \in \Omega} (A_{ij} – \hat{A}_{ij})^2 + \lambda(||U||_F + ||V||_F + ||b||^2 + ||c||^2) $$

Where \( ||X||_F \) is the Frobenius norm of \( X \).

For each parameter, setting the derivative with respect to that parameter, setting it to 0 and solving for the optimal value yields:

$$ u_i = ( \sum_{j \in \Omega_i} v_j v_j^T + \lambda{I})^{-1} \sum_{j \in \Omega_i} (A_{ij} – b_i – c_j – \mu )v_j $$

$$ v_j = ( \sum_{i \in \Omega_j} u_i u_i^T + \lambda{I})^{-1} \sum_{i \in \Omega_j}(A_{ij} – b_i – c_j – \mu )u_i $$

$$ b_i = \frac{1}{1 + \lambda} \frac{1}{| \Omega_i |}\sum_{j \in \Omega_i} A_{ij} – u_i^Tv_j – c_j – \mu $$

$$ c_j= \frac{1}{1 + \lambda} \frac{1}{| \Omega_j |}\sum_{i \in \Omega_j} A_{ij} – u_i^Tv_j – b_i – \mu $$

The simplest way to implement the above formulas would be to just code them directly.

Initialize your parameters as follows:

U = np.random.randn(M, K) / K V = np.random.randn(K, N) / K B = np.zeros(M) C = np.zeros(N)

Next, you want \( \Omega_i \) and \( \Omega_j \) to be easily accessible, so create dictionaries “ratings_by_i” where “i” is the key, and the value is an array of all the (j, r) pairs that user i has rated (r is the rating). Do the same for “ratings_by_j”.

Then, your updates would be as follows:

for t in xrange(T): # update B for i in xrange(M): if i in ratings_by_i: accum = 0 for j, r in ratings_by_i[i]: accum += (r - U[i,:].dot(V[:,j]) - C[j] - mu) B[i] = accum / (1 + reg) / len(ratings_by_i[i]) # update U for i in xrange(M): if i in ratings_by_i: matrix = np.zeros((K, K)) + reg*np.eye(K) vector = np.zeros(K) for j, r in ratings_by_i[i]: matrix += np.outer(V[:,j], V[:,j]) vector += (r - B[i] - C[j] - mu)*V[:,j] U[i,:] = np.linalg.solve(matrix, vector) # update C for j in xrange(N): if j in ratings_by_j: accum = 0 for i, r in ratings_by_j[j]: accum += (r - U[i,:].dot(V[:,j]) - B[i] - mu) C[j] = accum / (1 + reg) / len(ratings_by_j[j]) # update V for j in xrange(N): if j in ratings_by_j: matrix = np.zeros((K, K)) + reg*np.eye(K) vector = np.zeros(K) for i, r in ratings_by_j[j]: matrix += np.outer(U[i,:], U[i,:]) vector += (r - B[i] - C[j] - mu)*U[i,:] V[:,j] = np.linalg.solve(matrix, vector)

And that’s all there is to it!

For more free machine learning and data science tutorials, sign up for my newsletter.

Go to comments

April 2, 2016

I was aiming to get this course out before the end of March, and it is now April. So you know I put it some extra work to make it as awesome as possible.

Course summary (scroll down for coupons):

This is the 3rd part in my Data Science and Machine Learning series on Deep Learning in Python. At this point, you already know a lot about neural networks and deep learning, including not just the basics like backpropagation, but how to improve it using modern techniques like momentum and adaptive learning rates. You’ve already written deep neural networks in **Theano** and **TensorFlow**, and you know how to run code using the GPU.

This course is all about how to use deep learning for **computer vision** using **convolutional neural networks**. These are the state of the art when it comes to **image classification** and they beat vanilla deep networks at tasks like MNIST.

In this course we are going to up the ante and look at the **StreetView House Number (SVHN) **dataset – which uses larger color images at various angles – so things are going to get tougher both computationally and in terms of the difficulty of the classification task. But we will show that convolutional neural networks, or CNNs, are capable of handling the challenge!

Because **convolution** is such a central part of this type of neural network, we are going to go in-depth on this topic. It has more applications than you might imagine, such as **modeling artificial organs** like the pancreas and the heart. I’m going to show you how to build convolutional filters that can be applied to **audio**, like the echo effect, and I’m going to show you how to build filters for **image effects, **like the **Gaussian blur **and **edge detection**.

We will also do some **biology** and talk about how convolutional neural networks have been inspired by the **animal visual cortex**.

After describing the architecture of a convolutional neural network, we will jump straight into code, and I will show you how to extend the deep neural networks we built last time (in part 2) with just a few new functions to turn them into CNNs. We will then test their performance and show how convolutional neural networks written in both Theano and TensorFlow can outperform the accuracy of a plain neural network on the StreetView House Number dataset.

All the materials for this course are FREE. You can download and install Python, Numpy, Scipy, Theano, and TensorFlow with simple commands shown in previous courses.

Coupons:

If other people beat you to that one:

And if you were ultra slow:

Go to commentsMarch 30, 2016

Quick note to announce 2 free books out this week (3/28 – 4/1):

Deep Learning Prerequisites https://kdp.amazon.com/amazon-dp-action/us/bookshelf.marketplacelink/B01D7GDRQ2

Do you find deep learning difficult?

So you want to learn about deep learning and neural networks, but you don’t have a clue what machine learning even is. This book is for you.

Perhaps you’ve already tried to read some tutorials about deep learning, and were just left scratching your head because you did not understand any of it. This book is for you.

This book was designed to contain all the prerequisite information you need for my next book, Deep Learning in Python: Master Data Science and Machine Learning with Modern Neural Networks written in Python, Theano, and TensorFlow.

There are many techniques that you should be comfortable with before diving into deep learning. For example, the “backpropagation” algorithm is just gradient descent, which is the same technique that is used to solve logistic regression.

SQL for Marketers https://kdp.amazon.com/amazon-dp-action/us/bookshelf.marketplacelink/B01D42UBP4

Do you want to know how to optimize your sales funnel using SQL, look at the seasonal trends in your industry, and run a SQL query on Hadoop? Then join me now in my new book, SQL for marketers: Dominate data analytics, data science, and big data.

Go to comments

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

- Are interested in deep learning (this tutorial uses gradient descent)
- Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
- Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
- Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np import theano import theano.tensor as T X = T.matrix('X') Q = T.matrix('Q') Z = T.dot(X, Q) transform = theano.function(inputs=[X,Q], outputs=Z) X_val = np.random.randn(100,10) Q_val = np.random.randn(10,10) Z_val = transform(X_val, Q_val)

I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$ J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2 $$

Where \( \hat{X} \) is the reconstruction of \( X \). If there is only one eigenvector, let’s call this \( v \), then this becomes:

$$ J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2 $$

This is equivalent to the Frobenius norm, so we can write:

$$ J = |X – Xvv^T|_F $$

One identity of the Frobenius norm is:

$$ |A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) } $$

Which means we can rewrite the loss function as:

$$ J = Tr( (X – Xvv^T)^T(X – Xvv^T) ) $$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on \( v \), you can use matrix algebra to rearrange this to get:

$$ v^* = argmin\{-Tr(X^TXvv^T) \} $$

Which again using reordering would be equivalent to maximizing:

$$ v^* = argmax\{ v^TX^TXv \} $$

The corresponding eigenvalue would then be:

$$ \lambda = v^TX^TXv $$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$ v \leftarrow v + \eta \nabla_v(v^TX^TXv) $$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$ v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i ) $$

Next, note that to implement this algorithm you **never **need to actually calculate the covariance \( X^T X \). If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get \( Xv \), which is only of size \( N \times 1 \). You can then “dot” this with itself to get a scalar, which is only an \( O(N) \) operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v") Xv = T.dot(X, v) cost = T.dot(Xv.T, Xv) - np.sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i)) gv = T.grad(cost, v)

Note that we re-normalize the eigenvector on each step, so that \( v^T v = 1 \).

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv update_expression = y / y.norm(2) train = theano.function( inputs=[X], outputs=[your outputs], updates=((v, update_expression),) )

Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find): ... initialize variables and expressions ... ... initialize theano train function ... while t < max_iterations and change in v < tol: outputs = train(data) ... return eigenvalues and eigenvectors ...

This is not really trivial but at the same time it's a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course on Udemy "Data Science: Deep Learning in Python" and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theanoGo to comments

December 12, 2015

I have to admit that, despite all this science-y stuff I do, I still have my vices.

You know when you’re just sitting there waiting for the train to take you home, and you wish you had a game to play on your iPhone? Then you go home and download that game, fully intending to only play it when you’re bored or waiting? Then you end up sacrificing the time you should have spent doing important things playing that game instead?

That’s **The Simpsons: Tapped Out** for me (along with new contender Marvel Contest of Champions, but that’s another story).

Every holiday, and even sometimes when it’s not a holiday, The Simpsons has a special holiday update. This year’s Christmas special lets you play a classic game originally known as the Prisoner’s Dilemma.

It is the canonical example of a “game” in **Game Theory**. **John Nash**, the subject of the movie **A Beautiful Mind**, made fundamental contributions to the theory. It has applications in military strategy, politics, and even sports.

In Tapped Out, every player has their own town with their own characters and houses, etc. You can visit other towns to get points.

In the Christmas update, when you visit a friend’s town, you drop a present off, and you are asked if you want to make the present “nice” (Lisa), or “naughty” (Bart). Naturally.

When you receive a present, you also have to choose nice/Lisa, or naughty/Bart.

So there are 4 possibilities:

- You choose Bart, other player chooses Lisa
- You choose Bart, other player chooses Bart
- You choose Lisa, other player chooses Lisa
- You choose Lisa, other player chooses Bart

But your choice affects how many points you receive.

This is a picture of the game that explains how it is played.

So as you can see, the point system is as follows:

- You choose Bart, other player chooses Lisa – You get 25, other player gets 1
- You choose Bart, other player chooses Bart – You get 5, other player gets 5
- You choose Lisa, other player chooses Lisa – You get 15, other player gets 15
- You choose Lisa, other player chooses Bart – You get 1, other player gets 25

In the original prisoner’s dilemma, the problem is posed as follows:

2 prisoners are given a choice of whether to cooperate with or betray the other. If they both cooperate, they both serve 1 year. If they both betray each other, they both serve 2 years. But if one betrays and one cooperates, then the one who betrays goes free, while the one who cooperated gets 3 years.

The problem can be summarized in a table like this:

Where P, R, S, and T are called “payoffs”. To be a prisoner’s dilemma game, we must have T > R > P > S, which is true for The Simpsons: Tapped Out version also.

Once you understand the problem, the question becomes – how do I make a choice that will maximize my winnings?

The Prisoner’s Dilemma is a dilemma because it seems obvious that to maximize both our winnings (altruism), we should cooperate.

But if I know you’re rational and will cooperate with me, and I am selfish, then I will defect, so I win more and you get nothing.

The problem is, if we both think that way, then we will both defect and we will both end up with less.

I would also be hesitant to cooperate because if you are selfish, then you will defect and I will end up with nothing.

The only “rational” solution is to be selfish and betray. Why? Because it is the only choice for which choosing the other thing will result in a worse outcome.

Regardless of what you do – if I choose to betray, it is better than if I had chosen to cooperate.

If you choose cooperate – I choose betray, I get 25 points. I choose cooperate, I only get 15.

If you choose betray – I choose betray, I get 5 points, but if I choose cooperate, I only get 1.

So no matter what you choose, my choice of betray is better than my choice of cooperate.

Of course, since I can drop off multiple presents to my neighbors’ towns, there isn’t only one “game” to play. In fact, multiple or iterated games make the prisoner’s dilemma even more interesting.

There is a bigger social element to repeated games.

If you betray me, then now I know you are a betrayer, so I will not cooperate with you.

If you cooperate with me, then perhaps I can trust you, and we can cooperate with each other until the end of the holiday special. That will result in more points for both of us.

But not as many for me compared to what I’d get if I continually betrayed you.

Then again, if I kept betraying you, you would stop cooperating with me.

You see how this gets complex?

Research done by Robert Axelrod in his 1984 book **The Evolution of Cooperation** helps to shed light on this issue.

He studied competing strategies in a tournament and discovered that purely selfish strategies performed very poorly in the long run, and that more altruistic strategies performed better, even when measured from a selfish perspective (how many points did I win?).

In terms of natural selection, this may explain why we have evolved over time to be both altruistic and self-interested.

The winning strategy was called “tit for tat”, which in short, means you just play the move your opponent previously played. The tit for tat strategy can be improved upon slightly by adding a small random chance that you will forgive your opponent if they betray you, and cooperate.

Interestingly, Axelrod’s analysis of the top strategies have a very “human” component. The conditions for a successful strategy can be summarized as:

- Be nice. Cooperate unless your opponent betrays you.
- Have a spine. If your opponent betrays you, don’t keep cooperating with them.
- Forgive. Don’t fall into a cycle of repeated betrayals.
- Don’t be jealous. Don’t strive to get more points than the other player and don’t be outcome dependent.

In The Simpsons: Tapped Out, you don’t have an infinite number of games to learn from or an infinite number of gifts to give. You are limited to 5 per day. Thus your ultimate strategy will probably be adaptive as well – choose only the 5 players you want to mutually benefit from your gift giving and choose the ones who have a high probability of cooperating.

Let’s be real though. I don’t think the people who are playing this game are really considering this stuff. For the first few hours I just choose Bart, because that was the naughty one.

So yeah.

#a beautiful mind #game theory #john nash #prisoner's dilemma #tapped out #the simpsons #tstoGo to comments

December 4, 2015

Was having this conversation recently about what a z-score is.

What is a z-score? Why is it mentioned so often in statistics? You’ve probably heard of it in the context of the Student t-test.

If you are a machine learning guy more than a statistics guy, like me, you’ve probably heard you should “standardize” or “normalize” your variables before putting them into a machine learning model.

For example, if you’re doing linear regression and \( X_1 \) varies from \( 0..1 \) and \( X_2 \) varies from \( 0..1000 \), the weights \( \beta_1 \) and \( \beta_2 \) will give an inaccurate picture as to their importance.

Well, the z-score actually *is* the standardized variable.

I would avoid this terminology if possible, because in machine learning we usually think of a “score” as the output of a model, i.e. I’m getting the “score” of a set of links because I want to know in what order to rank them.

$$ z = \frac{x – \mu}{\sigma} $$

So instead of thinking of “z-scores”, think of “I need to standardize my variables before using them in my machine learning model so that the effect of any one variable is on the same scale as the others”.

Do you find statistics terminology confusing? “Z score”? “F statistic”? Share your thoughts in the comments!

#frequentist statistics #machine learning #statistics #z scoreGo to comments

December 3, 2015

This article has been a long time coming. I wrote a shitty version years ago, but wanted to update it with new and current info, in a more organized and less shitty format.

In the current environment you are probably mostly concerned with “big data”, where both for-profit companies and the government download 1000s of TBs of data about you everyday. New and fancy technologies are popping up all the time, marketers and spammers love writing about them on LinkedIn, and gullible executives think they are must-haves.

The talking heads at your workplace might say, “we need to build a scalable product!”, or some such. So you end up creating a Hadoop cluster with a few tiny chunks of data and the overhead of your MapReduce actually takes longer than a for-loop by itself would have.

With all this fanciness you lose sight of the simple solutions – such as flat files, SQLite, and SQL. This article is a short survey of existing data solutions (both big data and small data) and at what scale they are appropriate for use.

You are probably familiar with writing code in your first semester C++ class like this:

char* bob = "Bob"; char* jane = "Jane"; printf("Hi %s! Hi %s!\n", bob, jane);

In the real world, your code has to work on more cases than just Bob and Jane. Maybe you are writing an automated Twitter script that programmatically direct messages people when they start following you. If you use Twitter you’ve probably been annoyed at least a few times by this type of spam.

Working off this example, suppose you (the spammer) decides that you’re going to be somewhat nice and try not to spam people more than once.

So you would like to save the usernames you’ve direct messaged somewhere. Enter the flat file.

Flat files are great for storing small data or where you don’t have to look stuff up. Just load the whole file into an array line by line, and do what you need to do.

In our case, we might load the data into a “set” datastructure so that when we want to look up a username, it’s an O(1) search.

Flat files are great for server configurations. As are JSON.

For scripts that automate something in your personal life, flat files are usually adequate.

A problem arises when you want to load your entire dataset into memory (like a set or a hash), and it doesn’t fit. Remember, your hard drive is on the order of 1TB large. Your RAM is on the order of 8GB, much of which is used by the OS (or most if you’re using Mac).

Enter the database. Databases are stored on disk. i.e. They are just a file or set of files.

The magic happens when you want to find something. Usually you’d have to look through the entire database if you didn’t have some “index” (think like the index at the back of a large textbook) to tell you where everything was.

Databases index a whole bunch of metadata so that looking for stuff is really fast. You’ll often see the term “balanced tree” in reference to database indexes. These are better than regular binary trees where searching is worst case O(N).

Also called “RDBMS”, short for “relational database management system” (they loved verbose terminology in the 80s and 90s), relational databases usually store things in tables.

Examples: MySQL, PostgreSQL.

For example, you might have one table that stores every user’s ID, name, email, and password.

But you might have another table that stores friendships, so that would store the first user’s ID, and the second user’s ID.

Quite appropriately, relational databases keep track of “relationships”, so that, suppose you deleted the user with ID = 3. That would delete all the rows from the friendships table that contain user ID = 3 also, so that in the application, there won’t be any errors when it’s looking for the friends of user ID = 5, who is friends with user ID = 3, when the actual user with ID = 3 has already been deleted.

There is a special relational database called SQLite3. It works on “small data”, so it’s very appropriate for applications on your phone, for instance. iPhone apps on iOS use SQLite3. Many apps on your computer use SQLite3 without you even knowing it.

SQLite3 is stored locally on your machine, whereas bigger relational databases like Postgres can be stored either on your machine or on another machine over the Internet.

Relational databases sort of hit a wall when data got too big to store in one database. Advertising companies can collect 1TB of data per day. In effect, you’d fill up an entire database in that one day. What do you do the next day? And the next?

Hadoop is the open source version of Google’s “Google File System” (GFS) and MapReduce framework.

Suppose for instance that your hard drives have a 1% chance of failing on any given day, and that your data is stored on 1000 hard drives. That means every day, 10 hard drives will fail. How do you make sure you don’t lose this data? You replicate it.

Some very smart people have determined how many copies of your data must be stored so that, even though hard drives are basically guaranteed to fail, you will never lose your data.

In addition to data replication, the data is also spread across multiple “chunks”. So multiple chunks (really files) make up one original data file.

MapReduce is a framework (a.k.a. a fancy way of writing a for loop), that distributes copies of the same program onto multiple machines, where each machine works on different chunks than the other machines.

Ideally, if you use N machines your running time would be reduced by 1/N, but there is lots of overhead that comes with coordinating the work that is done by each machine and merging it all together at the end.

Spark is seen as the “successor” to Hadoop MapReduce. I find that in general Spark jobs are a little easier to write. Note that it’s a framework, NOT a database, but I list it here to ease the confusion.

We will return to Hadoop later, but first, more “big data” generation technologies.

One database that became popular when startups started acquiring lots of data is MongoDB. MongoDB, unlike the other databases we’ve talked about, is not relational. In MongoDB, we don’t have “tables”, we have “collections”. In MongoDB, we don’t have “rows”, we have “documents”.

Documents are JSON documents. The nice thing about MongoDB is that you use Javascript to interact with it.

Startups started using the MEAN stack, which is made up of: MongoDB, ExpressJS, AngularJS, and NodeJS, for an all-Javascript environment.

MongoDB and similar databases don’t guarantee “consistency”. If you’re a bank, and I take out $50 so that my total balance is now $5, I don’t want someone else trying to take out $50 at the same time and putting my balance in the negative.

With MongoDB, I could take out $50, but some other user might still read that same document and see that my account still has $55, and hence try to take out another $50, even though this user read the database after I did my withdrawal.

In many applications this doesn’t matter and it’s good for performance.

MongoDB also allows “replication” and “sharding”.

“Replication” means you can have “masters” and “slaves” which store the same data. Different instances of the application can read from different slaves to decrease the load on any one machine running MongoDB.

“Sharding” means splitting up the data so that certain IDs go on one machine, while other IDs go to another. This also decreases the load.

Often times, people make the mistake of using MongoDB, because it’s new and cool, when their data is actually relational. What happens? They often end up having to program those relationships themselves in the application, which is more tedious and cumbersome than you might imagine.

Some people say “Redis is like a big key-value store”. At a very high level this is indeed what Redis does, and it does so very fast. If you know you don’t have “relationships” in your data, and you know you won’t need to store, query, and update JSON-like structures, then Redis is a great choice. You can also use sharding and replication with Redis, so it can store more stuff than would fit on just one hard drive.

Back to Hadoop…

Hadoop is not a database. The “Hadoop File System” (or HDFS) is the open source analogue of Google’s GFS. A database exists “on top of” a file system. For example, Postgres can exist on top of your “FAT32” file system. It’s a program that coordinates the storage and retrieval of data.

There are indeed databases that can work on top of HDFS/GFS.

Some examples are: Google’s BigTable and Hadoop’s HBase.

They allow you to do “queries”, like you do with SQL, as opposed to MapReduce’s plain for-loop-like structure.

Lessons I think we can learn from other business’ experiences:

1) Don’t use something just because it’s cool and new.

2) Don’t use big data tech when you don’t have big data.

3) Even if you think you have big data, check to see if it’s really that big.

4) Be honest with yourself about how long it’ll take to get big and whether it’s worth investing in a big data solution now.

5) Don’t forget about SQL.

Did this answer all the questions you ever had about databases? Do you have any stories to share about how you once chose a database you thought would be awesome and it not only let you down but caused you to divert your attention for weeks or months just trying to fix its issues? Do you like using stuff even if it’s still at version 0.1? Let me know in the comments!

Update: I don’t mean to suggest that MySQL and Postgres do not support master-slave configurations; they do. And despite MySQL not being what is traditionally thought of as a big data solution, Facebook famously altered MySQL to work on their backend (and they have more data than most companies doing big data).

#big data #big-oh #bigtable #cassandra #Databases #hbase #mapreduce #mongodb #MySQL #nosql #postgresql #redis #spark #sql #sqliteGo to comments

November 23, 2015

My brand-spanking new blog software told me that this title was too long: GoDaddy vs. BlueHost vs. DreamHost vs. Hostgator vs. Namecheap

Read all the way to the end if you want to find out who I chose, why it took me so long to choose, and for some nice discount buttons.

If you noticed, there’ve been some subtle changes on this site lately.

In actuality, the site has gone through a complete overhaul. I tried my hardest to create my own WordPress theme to be exactly like the old theme (which was hosted on Tumblr), with a few enhancements.

So as you probably already know, WordPress needs to be hosted on paid servers that have PHP installed. The typical players are GoDaddy et al. (the ones listed above, except for Namecheap). For the somewhat technical, your site needs to live on boxes like these:

I moved all of my domain names to Namecheap from GoDaddy awhile ago when I read about some of their shady business practices (they initially supported SOPA).

So I’m browsing around, basically just price-shopping. I know all these companies have bad reviews for poor customer service. I try to do my due diligence by noting the percentage of downtime and cPanel features. But really I just want a good price on hosting so I can test out new sites cheaply and efficiently.

Most of these guys have a 1-3 year discount rate, something like $6.99/month if you sign up for 3 years. Then it goes up to $11.99/month after that.

I’m not going to list the exact prices here because they are always changing, but approx. $5+ to approx. $10+ after the discount period is typical. The problem is, most of these sites don’t post their non-discount rate.

As a sidenote, I’ve noticed lately that a lot of sites, including banks, who you’d think would be more professional, have a lot of footnotes. Like this.^{1}

And then, nothing on the page actually explains what it refers to. It’s like saying, “read the fine print if you want to understand how we’re going to fuck you”, and then just excluding that fine print, in hopes you’ll just forget about it.

Alright, lemme talk to their support guys to tally up what everyone’s non-discounted rate is. This helped me narrow down my options a little bit. HostGator’s support was very unhelpful. I partly wrote this post to tell you not to use HostGator. After some time doing this it hit me… Namecheap offers hosting.

I’ve been using Namecheap for years for my domains, and I’ve found their user interface and support to be above par. I look up their rate for shared web hosting and to my surprise, it’s $19.88 for the first year. After this discount period, it goes up to $78.88, which is $6.57 / month, which is way better than what everyone I mention above provides.

If you use the coupon code COOLDAYS you can get 20% off, which means you’ll only pay $15. This is basically just above $1 / month. I think (not 100% sure) the coupon code only lasts until the end of November, so get that shit now.

And since Namecheap sucks at advertising their hosting, which is why I almost signed up for an inferior service instead of theirs, I am going to post some buttons here in case you’re looking for a new domain, web hosting, and a free WhoIsGuard (so people can’t stalk you on the net, since by default your full name, address, and email are shown when someone looks you up).

Full disclosure, these are referral links, so buying from these links helps me too and helps keep this site up. =)

It’s entirely possible I haven’t covered a really awesome, cheap web host in this post, so if I missed something, please let me know in the comments below.

#bluehost #cheap web hosting #coupon code #dreamhost #godaddy #hostgator #namecheap #web hostingGo to comments

November 20, 2015

I remember learning about principal components analysis for the very first time. I remember thinking it was very confusing, and that I didn’t know what it had to do with eigenvalues and eigenvectors (I’m not even sure I remembered what eigenvalues and eigenvectors were at the time).

I assure you that in hindsight, understanding PCA, despite its very scientific-sounding name, is not that difficult to understand at all. There are two “categories” I think it makes sense to split the discussion of PCA into.

1) What does it do? A written and visual description of how PCA is used.

2) The math behind PCA. A few equations showing that the things mentioned in (1) are true.

**Linear transformation**

PCA finds a matrix \( Q \) that, when multiplied by your original data matrix \( X \), gives you a linearly transformed data matrix \( Z \), where:

$$ Z = XQ $$

In other words, for a single sample vector \( x \), we can obtain its transformation \( z = Qx \).

The \( Q \) and \( x \) appear in a different order here because when we load a data matrix, it’s usually \( N \times D \), so each sample vector is \( 1 \times D \), but when we talk about vectors by themselves we usually think of them as \( D \times 1 \).

The interesting thing about PCA is how it chooses \( Q \).

**Dimensionality reduction**

When we think of machine learning models we often study them in the context of 1-3 dimensions. But if you take a dataset like MNIST, where each image is 28×28 pixels, then each input is 784 dimensions. 28×28 is a tiny image. The new iPhone 6 has a resolution of 1080 × 1920 pixels. Each image would thus have 2073600 \( \approx \) 2 million dimensions.

PCA reduces dimensionality by moving as much “information” as possible into as few dimensions as possible. We measure information via unpredictability, i.e. variance. The end result is that our transformed matrix \( Z \) has most of its variance in the first column, less variance in the second column, even less variance in the third column, etc.

You will see later how our choice of \( Q \) accomplishes this.

**De-correlation**

\( D \approx 2000000 \) is a much bigger problem than the ones you learn about in the classroom. But often there are **correlations** in the data that make many of the dimensions redundant. This is related to the problem of linear regression – if we can accurately predict one dimension of the inputs, let’s say, \( x_3 \), in terms of another, \( x_5 \), then one of those dimensions is redundant.

The final result of removing these correlations is that every dimension of \( Z \) is uncorrelated with the other.

The image above shows a scatterplot of the original data on the original x-y axis. The arrows on the plot itself show that the new v1-v2 axis is a) **rotated** (matrix multiplication rotates a vector) b) still perpendicular, and c) the data relative to the new axis is normally distributed and the distributions on each dimension are independent and uncorrelated.

**Visualization**

Once we’ve transformed \( X \) into \( Z \), noting that most of the “information” from \( X \) is in the first few dimensions of \( Z \) – let’s say… 90% of the variance is in the first 2 dimensions – we can do a 2-D scatterplot of Z to see the structure of our data. Humans have a hard time visualization anything greater than 2 dimensions.

We can consider the first 2 dimensions to represent the real underlying trend in the data, and the other dimensions just small perturbations or noise.

**Pre-processing**

You may have heard that too many parameters in your model can lead to **overfitting**. So if you are planning to use a logistic regression model, and the dimensionality of each input is 2 million, then the number of weights in your model will also be 2 million. This is not good news if you have only a few thousand samples. One rule of thumb is we would like to have 10x the number of samples compared to the number of parameters in our model.

So instead of going out and finding 20 million samples, we can use PCA to reduce the dimensionality of our data to say, 20, and then we only need 200 samples for our model.

You can also use PCA to pre-process data before using an unsupervised learning algorithm, like k-means clustering. PCA, by the way, is also an unsupervised algorithm.

Everybody’s favorite part!

Ok, so this requires both (a) some statistics knowledge (knowing how to find the covariance matrix), and (b) some linear algebra knowledge (knowing what eigenvalues and eigenvectors are, and basic matrix manipulation).

**Covariance**

I stated above that in the rotated v1-v2 coordinate system, the data on the v1 axis was not correlated with the data on the v2 axis.

Intuitively, when you have a 2-D Gaussian distribution, the data looks like an ellipse. If that ellipse is perpendicular to the x-y grid, then the x and y components are independent and uncorrelated.

The image below shows what happens with different values of the correlation between \( X_i \) and \( X_j \) (called ‘c’ in the image):

(If you really know your statistics, then you will recall that independence *implies* 0 correlation, but 0 correlation does not imply independence, *unless* the distribution is a joint Gaussian.)

So, in the first image, since the ellipse is not perpendicular to the x-y axis, the distributions p(x) and p(y) are not independent. But in the rotated v1-v2 axis, the distributions p(v1) and p(v2) are independent and uncorrelated.

If this all sounds crazy, just remember this one fact: if 2 variables \( X_i \) and \( X_j \) are uncorrelated, their corresponding element in the covariance matrix, \( \Sigma_{ij} = 0 \). And since the covariance matrix is symmetric (for our purposes), \( \Sigma_{ji} = 0 \) also.

For every variable to be uncorrelated to every other variable, all the elements of \( \Sigma \) will be 0, except those along the diagonal, \( \Sigma_{ii} = \sigma_i^2 \), which is just the regular variance for each dimension.

How do we calculate the covariance matrix from the data? Simple!

$$ \Sigma_X = \frac{1}{N} (X – \mu_X)^T(X – \mu_X) \tag{1}$$

In non-matrix form this is:

$$ \Sigma_{ij} = \frac{1}{N} \sum_{n=1}^{N} (x(n)_i – \mu_i)(x(n)_j – \mu_j) $$

I’ve labeled the equation because we’re going to return to it. You can verify that \( \Sigma \) is \( D \times D \) since those are the outer dimensions of the matrix multiply.

Note the mathematical sleight of hand I used above. \( X \) is \( N \times D \) and the mean \( \mu_X \) is \( 1 \times D \), so technically you can’t subtract them under the rules of matrix algebra, but let’s assume that the above notation means each row of \( X \) has \( \mu_X \) subtracted from it.

**Eigenvalues and Eigenvectors**

This is where the real magic of PCA happens.

For no particular reason at all, suppose we compute the eigenvalues and eigenvectors of \( \Sigma_X \).

If you don’t know what eigenvalues/eigenvectors are: remember that usually, when we multiply a vector by a matrix, it changes the direction of the vector.

So for example, the vector \( (1,2) \) is pointing in the same direction as \( (2,4) = 2(1,2) \). But if we were to multiply \( (1,2) \) by a matrix:

$$ \begin{pmatrix} 5 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} $$

The result is a vector in a different direction. Eigenvalues \( \lambda \) and eigenvectors \( v \) have the property that, if you multiply \( \Sigma_Xv \), this is the same by multiplying the eigenvector by a constant – the eigenvalue, i.e. the eigenvector does not change direction, it only gets shorter or longer by a factor of \( \lambda \). In other words:

$$ \Sigma_Xv = \lambda v $$

How many pairs of these eigenvectors and eigenvalues can we find? It just so happens that there are \( D \) such pairs. We’ll say, each eigenvalue \( \lambda_i \) has a corresponding eigenvector \( v_i \).

Now again, for no particular reason at all, we are going to combine all the \( v_i \) into a matrix, let’s call it \( V \).

We *could* line up the \( v_i \) in any random order, but instead we are going to choose a very particular order – we are going to line them up such that the corresponding eigenvalues \( \lambda_i \) are in descending order. In other words:

$$ \lambda_1 > \lambda_2 > \lambda_3 > … > \lambda_D $$

So:

$$ V = \begin{pmatrix}

| & | & | & … & | \\

v_1 & v_2 & v_3 & … & v_D \\

| & | & | & … & |

\end{pmatrix}

$$

We are also going to combine the eigenvalues into a diagonal matrix like this:

$$ \Lambda = \begin{pmatrix}

\lambda_1 & 0 & … & 0 \\

0 & \lambda_2 & … & 0 \\

… & … & … & … \\

0 & 0 & … & \lambda_D

\end{pmatrix}

$$

As a sidenote, don’t worry about how we find the eigenvalues and eigenvectors. That is a huge problem in and of itself. The method you used in high school – solving a polynomial to get the eigenvalues, plugging the eigenvalues into the eigenvalue equation to get the eigenvectors, etc. doesn’t really translate to computer code. There exist some very involved algorithms to solve this problem, but we’ll just use numpy.

In any case, it is easy to prove to yourself, with a simple 2-D example, that:

$$ \Sigma_X V = V \Lambda \tag{2}$$

Just one more ingredient. We want to show that \( V \) is **orthonormal**. In terms of the individual eigenvectors this means that \( {v_i}^T v_j = 0 \) if \( i \ne j \), and \( {v_i}^T v_i = 1 \). In matrix form this means \( V^T V = I \), or in other words, \( V^T = V^{-1} \).

Proof of orthogonality: Consider \( \lambda_j {v_i}^T v_j \) for \( i \ne j \). \( \lambda_j {v_i}^T v_j = {v_i}^T (\lambda_j v_j) = {v_i}^T \Sigma_X v_j = (\Sigma_X^T v_i)^T v_j \). But we know that \( \Sigma_X \) is symmetric so this is \( (\Sigma_X v_i)^T v_j = (\lambda_i v_i)^T v_j = \lambda_i {v_i}^T v_j \). Since \( \lambda_i \ne \lambda_j \), this can only be true if \( {v_i}^T v_j = 0 \).

Normalizing eigenvectors is easy since they are not unique – just choose values so that their length is 1.

Finally, using \( (1) \), consider the covariance of the transformed data, \( Z \).

$$ \begin{align*}

\Sigma_Z &= \frac{1}{N} (Z – \mu_Z)^T (Z – \mu_Z) \\

&= \frac{1}{N} (XQ – \mu_XQ)^T (XQ – \mu_XQ) \\

&= \frac{1}{N} Q^T(X – \mu_X)^T (X – \mu_X)Q \\

&= Q^T \Sigma_X Q

\end{align*}

$$

Now, suppose we choose \( Q = V \). Then, by plugging in \( (2) \), we get:

$$ \begin{align*}

\Sigma_Z &= V^T \Sigma_X V \\

&= V^T V \Lambda \\

&= \Lambda

\end{align*}

$$

So what does this tell us? Since \( \Lambda \) is a diagonal matrix, there are no correlations in the transformed data. The variance of each dimension of \( Z \) is equal to the eigenvalues. In addition, because we sorted the eigenvalues in descending order, the first dimension of \( Z \) has the most variance, the second dimension has the second-most variance, etc. So most of the information is kept in the leading columns, as promised.

You can verify this in Python:

import numpy as np from sklearn.decomposition import PCA import matplotlib.pyplot as plt pca = PCA() X = np.random.random((100,10)) # generate an N = 100, D = 10 random data matrix Z = pca.fit_transform(X) # visualize the covariance of Z plt.imshow(np.cov(Z.T)) plt.show()

Observe that the off-diagonals are 0 and that the variances are in decreasing order.

You can also confirm that \( Q^T Q = I \).

QTQ = pca.components_.T.dot(pca.components_) plt.imshow(QTQ) plt.show() print np.around(QTQ, decimals=2)

Remember to leave a comment below if you have any questions on the above material. If you want to go more in depth on this and other data science topics (both with the math and the code), check out some of my data science video courses online.

#anomaly detection #covariance #dimensionality reduction #eigenvalues #eigenvectors #linear algebra #machine learning #pca #principal components analysis #statisticsGo to comments