Linear Regression in the Wild – AV1: Next Generation Video Codec

April 10, 2018

A lot of students come up to me and ask about when they’re going to learn the latest and greatest new deep learning algorithm.

Sometimes, it’s easy to forget how applicable even the most basic of tools are.

As you know, I consider Linear Regression to be the best starting point for deep learning and machine learning in general.

And wouldn’t you know, here it is being used in the most advanced, state-of-the-art video codec we have today:

next generation video: Introducing AV1

Check it out!


This new state-of-the-art video codec is based on research done by multiple big companies, such as Google, Cisco, and Mozilla.

As you can see, the final equation is just a line (\( y = mx + b \)).

$$ CfL(\alpha) = \alpha L^{AC} + DC $$

Go to comments

$10 Deep Learning Courses (+more) July Special part 2!

July 24, 2017

Since I am still busy hacking away at my next course, we are going to do another HUGE sale. ALL courses on Udemy are now $10. Take this opportunity to grab as many courses as you can because you never know when the next sale is going to be!

As usual, I’m providing $10 coupons for all my courses in the links below. Please use these links and share them with your friends!

You can also just type in the coupon code “JUL456”.

The $10 promo doesn’t come around often, so make sure you pick up everything you are interested in, or could become interested in later this year. The promo goes until July 31. Don’t wait!

At the end of this post, I’m going to provide you with some additional links to get machine learning prerequisites (calculus, linear algebra, Python, etc…) for $10 too!

If you don’t know what order to take the courses in, please check here:
Here are the links for my courses:

Deep Learning Prerequisites: Linear Regression in Python

Deep Learning Prerequisites: Logistic Regression in Python

Deep Learning in Python

Practical Deep Learning in Theano and TensorFlow

Deep Learning: Convolutional Neural Networks in Python

Unsupervised Deep Learning in Python

Deep Learning: Recurrent Neural Networks in Python

Advanced Natural Language Processing: Deep Learning in Python

Advanced AI: Deep Reinforcement Learning in Python

Easy Natural Language Processing in Python

Cluster Analysis and Unsupervised Machine Learning in Python

Unsupervised Machine Learning: Hidden Markov Models in Python

Data Science: Supervised Machine Learning in Python

Bayesian Machine Learning in Python: A/B Testing

Ensemble Machine Learning in Python: Random Forest and AdaBoost

Artificial Intelligence: Reinforcement Learning in Python

SQL for Newbs and Marketers


And last but not least, $10 coupons for some helpful prerequisite courses. You NEED to know this stuff before you study machine learning:

General (site-wide):
Calc 1
Calc 2
Calc 3
Linalg 1
Linalg 2
Probability (option 1)
Probability (option 2)
Probability (option 3)
Probability (option 4)


As you know, I’m the “Lazy Programmer”, not just the “Lazy Data Scientist” – I love all kinds of programming!

And I’ve got sales for everything:

Ruby on Rails
Big Data (Spark, Hadoop)
Javascript, ReactJS, AngularJS

Remember, these links will self-destruct on July 31 (7 days). Act NOW!

Go to comments

AI news of the week: Tensorflow Object Detection API and Free Online Data Science Books

July 11, 2017


Tensorflow Object Detection API

Summary: Creating accurate machine learning models capable of localizing and identifying multiple objects in a single image remains a core challenge in computer vision. The TensorFlow Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models. At Google we’ve certainly found this codebase to be useful for our computer vision needs, and we hope that you will as well.

TechCrunch Article


Free Online Data Science Books

The title says they are “mathematics” textbooks but there’s tons of material for data scientists (who do lots of math) too.

Particularly relevant: probability, statistics, linear algebra, graph theory, inference, learning algorithms, convex optimization, statistical signal processing, and more!


Go to comments

New machine learning course! Cluster Analysis and Unsupervised Machine Learning in Python

April 20, 2016


[Scroll to the bottom if you want to jump straight to the coupon]

Cluster analysis is a staple of unsupervised machine learning and data science.

It is very useful for data mining and big data because it automatically finds patterns in the data, without the need for labels, unlike supervised machine learning.

In a real-world environment, you can imagine that a robot or an artificial intelligence won’t always have access to the optimal answer, or maybe there isn’t an optimal correct answer. You’d want that robot to be able to explore the world on its own, and learn things just by looking for patterns.

Do you ever wonder how we get the data that we use in our supervised machine learning algorithms?

We always seem to have a nice CSV or a table, complete with Xs and corresponding Ys.

If you haven’t been involved in acquiring data yourself, you might not have thought about this, but someone has to make this data!

Those “Y”s have to come from somewhere, and a lot of the time that involves manual labor.

Sometimes, you don’t have access to this kind of information or it is infeasible or costly to acquire.

But you still want to have some idea of the structure of the data. If you’re doing data analytics automating pattern recognition in your data would be invaluable.

This is where unsupervised machine learning comes into play.

In this course we are first going to talk about clustering. This is where instead of training on labels, we try to create our own labels! We’ll do this by grouping together data that looks alike.

There are 2 methods of clustering we’ll talk about: k-means clustering and hierarchical clustering.

Next, because in machine learning we like to talk about probability distributions, we’ll go into Gaussian mixture models and kernel density estimation, where we talk about how to “learn” the probability distribution of a set of data.

One interesting fact is that under certain conditions, Gaussian mixture models and k-means clustering are exactly the same! You can think of GMMs as a “souped up” version of k-means. We’ll prove how this is the case.

All the algorithms we’ll talk about in this course are staples in machine learning and data science, so if you want to know how to automatically find patterns in your data with data mining and pattern extraction, without needing someone to put in manual work to label that data, then this course is for you.

All the materials for this course are FREE. You can download and install Python, Numpy, and Scipy with simple commands on Windows, Linux, or Mac.


#agglomerative clustering #cluster analysis #data mining #data science #expectation-maximization #gaussian mixture model #hierarchical clustering #k-means clustering #kernel density estimation #pattern recognition #udemy #unsupervised machine learning

Go to comments

New Deep Learning course on Udemy

February 26, 2016


Shut up and gimme the link!:

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

#adagrad #aws #batch gradient descent #deep learning #ec2 #gpu #machine learning #nesterov momentum #numpy #nvidia #python #rmsprop #stochastic gradient descent #tensorflow #theano

Go to comments

Covariance Matrix: Divide by N or N-1?

February 14, 2016

This is a statistics post. It’s probably very boring. I am posting it for my own reference, because I seem to forget how this is derived every time I need it.

Sometimes you see the sample variance defined as:

$$ \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (X_n – \mu)^2 $$

But you might also see it defined as:

$$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (X_n – \hat{\mu})^2 $$

Where as usual the “hat” symbol means that is our prediction.

Why do statisticians sometimes divide by N, and sometimes divide by N-1?

The same question arises for the calculation of the sample covariance matrix, and this is what we will work with in this post.


This has to do with whether you want your estimate to be a biased estimate or an unbiased estimate.

For any parameter \( \theta \), our estimate \( \hat{ \theta } \) is unbiased if:

$$ E\{ \hat{ \theta } – \theta \} = 0 $$

In this tutorial we will calculate the bias of the sample covariance on the multivariate Gaussian, which is defined as:

$$ p(x | \mu, \sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} exp( -\frac{1}{2} (x – \mu)^T \Sigma^{-1} (x – \mu) ) $$

The maximum likelihood estimates of \( \mu \) and \( \Sigma \) can be found by taking the derivative of the log-likelihood and setting it to 0.

The likelihood is:

$$ p(X | \mu, \Sigma) = \prod_{n=1}^{N} p(x_n | \mu, \Sigma) $$

So the log-likelihood (expanded) is:

$$ L(X | \mu, \Sigma) = -\frac{ND}{2} log(2\pi)  -\frac{N}{2} log(|\Sigma|)  -\sum_{n=1}^{N} \frac{1}{2}(x_n – \mu)^T \Sigma^{-1} (x_n – \mu) $$

To take “vector derivatives” and “matrix derivatives” you’ll want to consult the Matrix Cookbook.

If you made it this far it is almost trivial to calculate \( \frac{ \partial L }{ \partial \mu } = 0 \) to get the usual result:

$$ \hat{ \mu } = \frac{1}{N} \sum_{n=1}^{N} x_n $$

To get the sample covariance, we calculate:

$$ \frac{ \partial L }{ \partial \Sigma } = -\frac{N}{2} (\Sigma^{-1})^T – \frac{1}{2} \sum_{n=1}^{N} – \Sigma^{-T} (x_n – \mu) (x_n – \mu)^T \Sigma^{-T} $$

Set that to 0 and solve for \( \Sigma \) to get:

$$ \hat{ \Sigma } = \frac{1}{N} \sum_{n=1}^{N} (x_n – \hat{\mu}) (x_n – \hat{\mu})^T $$

Note that we are assuming we don’t have the “true mean”, so we are estimating the mean using the maximum likelihood estimate before calculating the maximum likelihood estimate for the covariance.

Now we will show that \( E\{ \hat{ \Sigma } \}  \neq \Sigma \). By definition:

$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} (x_n – \hat{\mu})(x_n – \hat{\mu})^T \} $$


$$ E\{ \hat{ \Sigma } \} = \frac{1}{N} E\{ \sum_{n} ((x_n – \mu) – (\hat{\mu} – \mu))((x_n – \mu) – (\hat{\mu} – \mu)))^T \} $$

Expand that:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \frac{1}{N} E\{ \sum_{n} (\hat{\mu} – \mu)) (\hat{\mu} – \mu))^T – (x_n – \mu)(\hat{\mu} – \mu))^T – (\hat{\mu} – \mu))(x_n – \mu)^T \} $$

Multiply out the terms:

$$ E\{ \hat{ \Sigma } \} = \Sigma  + E\{ \hat{\mu}\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ x_n\hat{\mu}^T \} – \frac{1}{N} \sum_{n} E\{ \hat{\mu} x_n^T \} + \mu\mu^T $$

We can combine the expected values to get:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – E\{ \hat{\mu}\hat{\mu}^T \} $$

Now the exercise becomes finding the expected value on the right side.

We need some identities.

First, for \( m \neq n \):

$$ E\{ x_m x_n^T \} = E\{ x_m \} E\{ x_n^T \} =  \mu\mu^T $$

Because each sample is IID: independent and identically distributed.

Next, the definition of covariance:

$$ \Sigma = E\{ (x – \mu)(x – \mu)^T \} = E\{ xx^T \} – \mu\mu^T $$

We can rearrange this to get:

$$ E\{ xx^T \} = \Sigma + \mu\mu^T $$

The term \( E\{\hat{\mu}\hat{\mu}^T \} \) can be expanded as:

$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N^2} E\{ (x_1 + x_2 + … + x_N)(x_1 + x_2 + … + x_N)^T \} $$

When expanding the multiplication, there are \( N \) terms that are the same, so that would be a \( N E\{ x_n x_n^T \} \) contribution. There are \( N(N-1) \) terms that are different, and since different terms are independent that is a \( N(N-1)\mu\mu^T \) contribution.

So in total:

$$ E\{\hat{\mu}\hat{\mu}^T \} =\frac{1}{N^2}(N(\Sigma + \mu\mu^T) + N(N-1)\mu\mu^T) $$


$$ E\{\hat{\mu}\hat{\mu}^T \} = \frac{1}{N}\Sigma + \mu\mu^T $$

Plugging this back into the expression for the bias:

$$ E\{ \hat{ \Sigma } \} = \Sigma + \mu\mu^T – \frac{1}{N}\Sigma – \mu\mu^T $$


$$ E\{ \hat{ \Sigma } \} = \frac{N-1}{N} \Sigma \neq \Sigma $$

So, if you want the unbiased estimator, you can multiply the biased maximum likelihood estimator by \( \frac{N}{N-1} \), which gives the expected unbiased formula.


#covariance #maximum likelihood #MLE #multivariate Gaussian #multivariate normal #sample variance #statistics #unbiased estimator

Go to comments

The Simpsons: Tapped Out and Prisoner’s Dilemma

December 12, 2015

I have to admit that, despite all this science-y stuff I do, I still have my vices.

You know when you’re just sitting there waiting for the train to take you home, and you wish you had a game to play on your iPhone? Then you go home and download that game, fully intending to only play it when you’re bored or waiting? Then you end up sacrificing the time you should have spent doing important things playing that game instead?

That’s The Simpsons: Tapped Out for me (along with new contender Marvel Contest of Champions, but that’s another story).


Every holiday, and even sometimes when it’s not a holiday, The Simpsons has a special holiday update. This year’s Christmas special lets you play a classic game originally known as the Prisoner’s Dilemma.

It is the canonical example of a “game” in Game Theory. John Nash, the subject of the movie A Beautiful Mind, made fundamental contributions to the theory. It has applications in military strategy, politics, and even sports.

So how does it work?

In Tapped Out, every player has their own town with their own characters and houses, etc. You can visit other towns to get points.

In the Christmas update, when you visit a friend’s town, you drop a present off, and you are asked if you want to make the present “nice” (Lisa), or “naughty” (Bart). Naturally.

When you receive a present, you also have to choose nice/Lisa, or naughty/Bart.

So there are 4 possibilities:

  • You choose Bart, other player chooses Lisa
  • You choose Bart, other player chooses Bart
  • You choose Lisa, other player chooses Lisa
  • You choose Lisa, other player chooses Bart

But your choice affects how many points you receive.

This is a picture of the game that explains how it is played.

The Simpsons Tapped Out Christmas Update

So as you can see, the point system is as follows:

  • You choose Bart, other player chooses Lisa – You get 25, other player gets 1
  • You choose Bart, other player chooses Bart – You get 5, other player gets 5
  • You choose Lisa, other player chooses Lisa – You get 15, other player gets 15
  • You choose Lisa, other player chooses Bart – You get 1, other player gets 25

In the original prisoner’s dilemma, the problem is posed as follows:

2 prisoners are given a choice of whether to cooperate with or betray the other. If they both cooperate, they both serve 1 year. If they both betray each other, they both serve 2 years. But if one betrays and one cooperates, then the one who betrays goes free, while the one who cooperated gets 3 years.

The problem can be summarized in a table like this:

Screen Shot 2015-12-12 at 4.11.12 AM

Where P, R, S, and T are called “payoffs”. To be a prisoner’s dilemma game, we must have T > R > P > S, which is true for The Simpsons: Tapped Out version also.

Once you understand the problem, the question becomes – how do I make a choice that will maximize my winnings?

How do I maximize my winnings in prisoner’s dilemma?

The Prisoner’s Dilemma is a dilemma because it seems obvious that to maximize both our winnings (altruism), we should cooperate.

But if I know you’re rational and will cooperate with me, and I am selfish, then I will defect, so I win more and you get nothing.

The problem is, if we both think that way, then we will both defect and we will both end up with less.

I would also be hesitant to cooperate because if you are selfish, then you will defect and I will end up with nothing.

The only “rational” solution is to be selfish and betray. Why? Because it is the only choice for which choosing the other thing will result in a worse outcome.

Regardless of what you do – if I choose to betray, it is better than if I had chosen to cooperate.

If you choose cooperate – I choose betray, I get 25 points. I choose cooperate, I only get 15.

If you choose betray – I choose betray, I get 5 points, but if I choose cooperate, I only get 1.

So no matter what you choose, my choice of betray is better than my choice of cooperate.

Of course, since I can drop off multiple presents to my neighbors’ towns, there isn’t only one “game” to play. In fact, multiple or iterated games make the prisoner’s dilemma even more interesting.

Repeated Games

There is a bigger social element to repeated games.

If you betray me, then now I know you are a betrayer, so I will not cooperate with you.

If you cooperate with me, then perhaps I can trust you, and we can cooperate with each other until the end of the holiday special. That will result in more points for both of us.

But not as many for me compared to what I’d get if I continually betrayed you.

Then again, if I kept betraying you, you would stop cooperating with me.

You see how this gets complex?


What is the best strategy?

Research done by Robert Axelrod in his 1984 book The Evolution of Cooperation helps to shed light on this issue.

He studied competing strategies in a tournament and discovered that purely selfish strategies performed very poorly in the long run, and that more altruistic strategies performed better, even when measured from a selfish perspective (how many points did I win?).

In terms of natural selection, this may explain why we have evolved over time to be both altruistic and self-interested.

The winning strategy was called “tit for tat”, which in short, means you just play the move your opponent previously played. The tit for tat strategy can be improved upon slightly by adding a small random chance that you will forgive your opponent if they betray you, and cooperate.

Interestingly, Axelrod’s analysis of the top strategies have a very “human” component. The conditions for a successful strategy can be summarized as:

  • Be nice. Cooperate unless your opponent betrays you.
  • Have a spine. If your opponent betrays you, don’t keep cooperating with them.
  • Forgive. Don’t fall into a cycle of repeated betrayals.
  • Don’t be jealous. Don’t strive to get more points than the other player and don’t be outcome dependent.


Limited number of games

In The Simpsons: Tapped Out, you don’t have an infinite number of games to learn from or an infinite number of gifts to give. You are limited to 5 per day. Thus your ultimate strategy will probably be adaptive as well – choose only the 5 players you want to mutually benefit from your gift giving and choose the ones who have a high probability of cooperating.


Let’s be real though. I don’t think the people who are playing this game are really considering this stuff. For the first few hours I just choose Bart, because that was the naughty one.

So yeah.

#a beautiful mind #game theory #john nash #prisoner's dilemma #tapped out #the simpsons #tsto

Go to comments

Big Mac Index

July 26, 2014


Food for thought: Economists use the price of a Big Mac to determine if a country’s currency is undervalued or overvalued.

#big mac #economics #gdp #mcdonalds

Go to comments

Linear Regression

July 25, 2014

Code for this tutorial is here:


Prerequisites for understanding this material:

  • calculus (taking partial derivatives)

Linear regression is one of the simplest machine learning techniques you can use. It is often useful as a baseline relative to more powerful techniques.

To start, we will look at a simple 1-D case.

Like all regressions, we wish to map some input X to some input Y.


Y = f(X)

With linear regression:

Y = aX + b

Or we can say:

h(X) = aX + b

Where “h” is our “hypothesis”.

You may recall from your high school studies that this is just the equation for a straight line.


When X is 1-D, or when “Y has one explanatory variable”, we call this “simple linear regression”.

When we use linear regression, we are using it to model linear relationships, or what we think may be linear relationships.

As with all supervised machine learning problems, we are given labeled data points:

(X1, Y1), (X2, Y2), (X3, Y3), …, (Xn, Yn)

And we will try to fit the line (aX + b) as best we can to these data points.


This means we have to optimize the parameters “a” and “b”.

How do we do this?

We will define an error function and then find the “a” and “b” that will make the error as small as possible.

You will see that many regression problems work this way.

What is our error function?

We could use the difference between the predicted Y and the actual Y like so:


But if we had equal amounts of errors where Y was bigger than the prediction, and where Y was smaller than the prediction, then the errors would cancel out, even though the absolute difference in errors is large.

Typically in machine learning, the squared error is a good place to start.


Now, whether or not the difference in the actual and predicted output is positive or negative, its contribution to the total error is still positive.

We call this sum the “sum of squared errors”.

Recall that we want to minimize it.

Recall from calculus that to minimize something, you want to take its derivative.


Because there are two parameters, we have to take the derivatives both with respect to a and with respect to b, set them to 0, and solve for a and b.

Luckily, because the error function is a quadratic it increases as (a,b) get further and further away from the minimum.

As an exercise I will let you calculate the derivatives.

You will get 2 equations (the derivatives) and 2 unknowns (a, b). From high school math you should know how to solve this by rearranging the terms.

Note that these equations can be solved analytically. Meaning you can just plug and chug the values of your inputs and get the final value of a and b by blindly using a formula.

Note that this method is also called “ordinary least squares”.

Measuring the error (R-squared)

To determine how well our model fits the data, we need a measure called the “R-square”.

Note that in classification problems, we can simply use the “classification rate”, which is the number of correctly classified inputs divided by the total number of inputs. With the real-valued outputs we have in regression, this is not possible.

Here are the equations we use to predict the R-square.


SS(residual) is the sum of squared error between the actual and predicted output. This is the same as the error we were trying to minimize before!

SS(total) is the sum of squared error between each sample output and the mean of all the sample outputs, i.e. What the residual error would be if we just predicted the average output every time.

So the R-square then, is just how much better our model is compared to predicting the mean each time. If we just predicted the mean each time, the R-square would be 1-1=0. If our model is perfect, then the R-square would be 1-0=1.

Something to think about: If our model performs worse than predicting the mean each time, what would be the R-square value?

Limitations of Linear Regression

  • It only models linear equations. You can model higher order polynomials (link to later post) but the model is still linear in its parameters.
  • It is sensitive to outliers. Meaning if we have one data point very far away from all the others, it could “pull” the regression line in its direction, away from all the other data points, just to minimize the error.


Learn more about Linear Regression in the course Deep Learning Prerequisites: Linear Regression in Python.

#calculus #linear regression #machine learning #statistics

Go to comments

Deep Learning and Artificial Intelligence Newsletter

Get discount coupons, free machine learning material, and new course announcements