A Tutorial on Autoencoders for Deep Learning

December 31, 2015

Despite its somewhat initially-sounding cryptic name, autoencoders are a fairly basic machine learning model (and the name is not cryptic at all when you know what it does).

Autoencoders belong to the neural network family, but they are also closely related to PCA (principal components analysis).

Some facts about the autoencoder:

  • It is an unsupervised learning algorithm (like PCA)
  • It minimizes the same objective function as PCA
  • It is a neural network
  • The neural network’s target output is its input

The last point is key here. This is the architecture of an autoencoder:


So the dimensionality of the input is the same as the dimensionality of the output, and essentially what we want is x’ = x.

It can be shown that the objective function for PCA is:

$$ J = \sum_{n=1}^{N} |x(n) – \hat{x}(n)|^2 $$

Where the prediction \( \hat{x}(n) = Q^{-1}Qx(n) \).

Q can be the full transformation matrix (which would result in getting exactly the old x back), or it can be a “rank k” matrix (i.e. keeping the k-most relevant eigenvectors), which would then result in only an approximation of x.

So the objective function can be written as:

$$ J = \sum_{n=1}^{N} |x(n) – Q^{-1}Qx(n)|^2 $$

Now let’s return to autoencoders.

Recall that to get the value at the hidden layer, we simply multiply the input->hidden weights by the input.

Like so:

$$ z = f(Wx) $$

And to get the value at the output, we multiply the hidden->output weights by the hidden layer values, like so:

$$ y = g(Vz) $$

The choice of \( f \) and \( g \) is up to us, we just have to know how to take the derivative for backpropagation.

We are of course free to make them “identity” functions, such that:

$$ y = g(V f(Wx)) = VWx $$

This gives us the objective:

$$ J = \sum_{n=1}^{N} |x(n) – VWx(n)|^2 $$

Which is the same as PCA!


If autoencoders are similar to PCA, why do we need autoencoders?

Autoencoders are much more flexible than PCA.

Recall that with neural networks we have an activation function – this can be a “ReLU” (aka. rectifier), “tanh” (hyperbolic tangent), or sigmoid.

This introduces nonlinearities in our encoding, whereas PCA can only represent linear transformations.

The network representation also means you can stack autoencoders to form a deep network.


Cool theory bro, but what can autoencoders actually do for me?

Good question!

Similar to PCA – autoencoders can be used for finding a low-dimensional representation of your input data. Why is this useful?

Some of your features may be redundant or correlated, resulting in wasted processing time and overfitting in your model (too many parameters).

It is thus ideal to only include the features we need.

If your “reconstruction” of x is very accurate, that means your low-dimensional representation is good.

You can then use this transformation as input into another model.


Training an autoencoder

Since autoencoders are really just neural networks where the target output is the input, you actually don’t need any new code.

Suppose we’re working with a sci-kit learn-like interface.

Instead of:

model.fit(X, Y)

You would just have:

model.fit(X, X)

Pretty simple, huh?

All the usual neural network training strategies work with autoencoders too:

  • backpropagation
  • regularization
  • dropout

If you want to get good with autoencoders – I would recommend trying to take some data and an existing neural network package you’re comfortable with – and see what low-dimensional representation you can come up with. How many dimensions are there?


Where can I learn more?

Autoencoders are part of a family of unsupervised deep learning methods, which I cover in-depth in my course, Unsupervised Deep Learning in Python. We discuss how to stack autoencoders to build deep belief networks, and compare them to RBMs which can be used for the same purpose. We derive all the equations and write all the code from scratch – no shortcuts. Ask me for a coupon so I can give you a discount!

P.S. “Autoencoders” means “encodes itself”. Not so cryptic now, right?

Leave a comment!

#autoencoders #deep learning #machine learning #pca #principal components analysis #unsupervised learning

Go to comments

The Simpsons: Tapped Out and Prisoner’s Dilemma

December 12, 2015

I have to admit that, despite all this science-y stuff I do, I still have my vices.

You know when you’re just sitting there waiting for the train to take you home, and you wish you had a game to play on your iPhone? Then you go home and download that game, fully intending to only play it when you’re bored or waiting? Then you end up sacrificing the time you should have spent doing important things playing that game instead?

That’s The Simpsons: Tapped Out for me (along with new contender Marvel Contest of Champions, but that’s another story).


Every holiday, and even sometimes when it’s not a holiday, The Simpsons has a special holiday update. This year’s Christmas special lets you play a classic game originally known as the Prisoner’s Dilemma.

It is the canonical example of a “game” in Game Theory. John Nash, the subject of the movie A Beautiful Mind, made fundamental contributions to the theory. It has applications in military strategy, politics, and even sports.

So how does it work?

In Tapped Out, every player has their own town with their own characters and houses, etc. You can visit other towns to get points.

In the Christmas update, when you visit a friend’s town, you drop a present off, and you are asked if you want to make the present “nice” (Lisa), or “naughty” (Bart). Naturally.

When you receive a present, you also have to choose nice/Lisa, or naughty/Bart.

So there are 4 possibilities:

  • You choose Bart, other player chooses Lisa
  • You choose Bart, other player chooses Bart
  • You choose Lisa, other player chooses Lisa
  • You choose Lisa, other player chooses Bart

But your choice affects how many points you receive.

This is a picture of the game that explains how it is played.

The Simpsons Tapped Out Christmas Update

So as you can see, the point system is as follows:

  • You choose Bart, other player chooses Lisa – You get 25, other player gets 1
  • You choose Bart, other player chooses Bart – You get 5, other player gets 5
  • You choose Lisa, other player chooses Lisa – You get 15, other player gets 15
  • You choose Lisa, other player chooses Bart – You get 1, other player gets 25

In the original prisoner’s dilemma, the problem is posed as follows:

2 prisoners are given a choice of whether to cooperate with or betray the other. If they both cooperate, they both serve 1 year. If they both betray each other, they both serve 2 years. But if one betrays and one cooperates, then the one who betrays goes free, while the one who cooperated gets 3 years.

The problem can be summarized in a table like this:

Screen Shot 2015-12-12 at 4.11.12 AM

Where P, R, S, and T are called “payoffs”. To be a prisoner’s dilemma game, we must have T > R > P > S, which is true for The Simpsons: Tapped Out version also.

Once you understand the problem, the question becomes – how do I make a choice that will maximize my winnings?

How do I maximize my winnings in prisoner’s dilemma?

The Prisoner’s Dilemma is a dilemma because it seems obvious that to maximize both our winnings (altruism), we should cooperate.

But if I know you’re rational and will cooperate with me, and I am selfish, then I will defect, so I win more and you get nothing.

The problem is, if we both think that way, then we will both defect and we will both end up with less.

I would also be hesitant to cooperate because if you are selfish, then you will defect and I will end up with nothing.

The only “rational” solution is to be selfish and betray. Why? Because it is the only choice for which choosing the other thing will result in a worse outcome.

Regardless of what you do – if I choose to betray, it is better than if I had chosen to cooperate.

If you choose cooperate – I choose betray, I get 25 points. I choose cooperate, I only get 15.

If you choose betray – I choose betray, I get 5 points, but if I choose cooperate, I only get 1.

So no matter what you choose, my choice of betray is better than my choice of cooperate.

Of course, since I can drop off multiple presents to my neighbors’ towns, there isn’t only one “game” to play. In fact, multiple or iterated games make the prisoner’s dilemma even more interesting.

Repeated Games

There is a bigger social element to repeated games.

If you betray me, then now I know you are a betrayer, so I will not cooperate with you.

If you cooperate with me, then perhaps I can trust you, and we can cooperate with each other until the end of the holiday special. That will result in more points for both of us.

But not as many for me compared to what I’d get if I continually betrayed you.

Then again, if I kept betraying you, you would stop cooperating with me.

You see how this gets complex?


What is the best strategy?

Research done by Robert Axelrod in his 1984 book The Evolution of Cooperation helps to shed light on this issue.

He studied competing strategies in a tournament and discovered that purely selfish strategies performed very poorly in the long run, and that more altruistic strategies performed better, even when measured from a selfish perspective (how many points did I win?).

In terms of natural selection, this may explain why we have evolved over time to be both altruistic and self-interested.

The winning strategy was called “tit for tat”, which in short, means you just play the move your opponent previously played. The tit for tat strategy can be improved upon slightly by adding a small random chance that you will forgive your opponent if they betray you, and cooperate.

Interestingly, Axelrod’s analysis of the top strategies have a very “human” component. The conditions for a successful strategy can be summarized as:

  • Be nice. Cooperate unless your opponent betrays you.
  • Have a spine. If your opponent betrays you, don’t keep cooperating with them.
  • Forgive. Don’t fall into a cycle of repeated betrayals.
  • Don’t be jealous. Don’t strive to get more points than the other player and don’t be outcome dependent.


Limited number of games

In The Simpsons: Tapped Out, you don’t have an infinite number of games to learn from or an infinite number of gifts to give. You are limited to 5 per day. Thus your ultimate strategy will probably be adaptive as well – choose only the 5 players you want to mutually benefit from your gift giving and choose the ones who have a high probability of cooperating.


Let’s be real though. I don’t think the people who are playing this game are really considering this stuff. For the first few hours I just choose Bart, because that was the naughty one.

So yeah.

#a beautiful mind #game theory #john nash #prisoner's dilemma #tapped out #the simpsons #tsto

Go to comments

What the hell is a “z-score”?

December 4, 2015

Was having this conversation recently about what a z-score is.

What is a z-score? Why is it mentioned so often in statistics? You’ve probably heard of it in the context of the Student t-test.

If you are a machine learning guy more than a statistics guy, like me, you’ve probably heard you should “standardize” or “normalize” your variables before putting them into a machine learning model.

For example, if you’re doing linear regression and \( X_1 \) varies from \( 0..1 \) and \( X_2 \) varies from \( 0..1000 \), the weights \( \beta_1 \) and \( \beta_2 \) will give an inaccurate picture as to their importance.

Well, the z-score actually is the standardized variable.

I would avoid this terminology if possible, because in machine learning we usually think of a “score” as the output of a model, i.e. I’m getting the “score” of a set of links because I want to know in what order to rank them.

$$ z = \frac{x – \mu}{\sigma} $$

So instead of thinking of “z-scores”, think of “I need to standardize my variables before using them in my machine learning model so that the effect of any one variable is on the same scale as the others”.

Do you find statistics terminology confusing? “Z score”? “F statistic”? Share your thoughts in the comments!

#frequentist statistics #machine learning #statistics #z score

Go to comments

Why databases?

December 3, 2015

This article has been a long time coming. I wrote a shitty version years ago, but wanted to update it with new and current info, in a more organized and less shitty format.

In the current environment you are probably mostly concerned with “big data”, where both for-profit companies and the government download 1000s of TBs of data about you everyday. New and fancy technologies are popping up all the time, marketers and spammers love writing about them on LinkedIn, and gullible executives think they are must-haves.

The talking heads at your workplace might say, “we need to build a scalable product!”, or some such. So you end up creating a Hadoop cluster with a few tiny chunks of data and the overhead of your MapReduce actually takes longer than a for-loop by itself would have.

With all this fanciness you lose sight of the simple solutions – such as flat files, SQLite, and SQL. This article is a short survey of existing data solutions (both big data and small data) and at what scale they are appropriate for use.

Why do you even need data storage?

You are probably familiar with writing code in your first semester C++ class like this:

char* bob = "Bob";
char* jane = "Jane";
printf("Hi %s! Hi %s!\n", bob, jane);

In the real world, your code has to work on more cases than just Bob and Jane. Maybe you are writing an automated Twitter script that programmatically direct messages people when they start following you. If you use Twitter you’ve probably been annoyed at least a few times by this type of spam.

Working off this example, suppose you (the spammer) decides that you’re going to be somewhat nice and try not to spam people more than once.

So you would like to save the usernames you’ve direct messaged somewhere. Enter the flat file.

Flat files

flat file

Flat files are great for storing small data or where you don’t have to look stuff up. Just load the whole file into an array line by line, and do what you need to do.

In our case, we might load the data into a “set” datastructure so that when we want to look up a username, it’s an O(1) search.

Flat files are great for server configurations. As are JSON.

For scripts that automate something in your personal life, flat files are usually adequate.

A problem arises when you want to load your entire dataset into memory (like a set or a hash), and it doesn’t fit. Remember, your hard drive is on the order of 1TB large. Your RAM is on the order of 8GB, much of which is used by the OS (or most if you’re using Mac).

Why databases?


Enter the database. Databases are stored on disk. i.e. They are just a file or set of files.

The magic happens when you want to find something. Usually you’d have to look through the entire database if you didn’t have some “index” (think like the index at the back of a large textbook) to tell you where everything was.

Databases index a whole bunch of metadata so that looking for stuff is really fast. You’ll often see the term “balanced tree” in reference to database indexes. These are better than regular binary trees where searching is worst case O(N).

Relational Databases

relational database

Also called “RDBMS”, short for “relational database management system” (they loved verbose terminology in the 80s and 90s), relational databases usually store things in tables.

Examples: MySQL, PostgreSQL.

For example, you might have one table that stores every user’s ID, name, email, and password.

But you might have another table that stores friendships, so that would store the first user’s ID, and the second user’s ID.

Quite appropriately, relational databases keep track of “relationships”, so that, suppose you deleted the user with ID = 3. That would delete all the rows from the friendships table that contain user ID = 3 also, so that in the application, there won’t be any errors when it’s looking for the friends of user ID = 5, who is friends with user ID = 3, when the actual user with ID = 3 has already been deleted.

Relational small data

iphone sqlite

There is a special relational database called SQLite3. It works on “small data”, so it’s very appropriate for applications on your phone, for instance. iPhone apps on iOS use SQLite3. Many apps on your computer use SQLite3 without you even knowing it.

SQLite3 is stored locally on your machine, whereas bigger relational databases like Postgres can be stored either on your machine or on another machine over the Internet.

Relational big data

big data graph

Relational databases sort of hit a wall when data got too big to store in one database. Advertising companies can collect 1TB of data per day. In effect, you’d fill up an entire database in that one day. What do you do the next day? And the next?

Big data – Hadoop


Hadoop is the open source version of Google’s “Google File System” (GFS) and MapReduce framework.

Suppose for instance that your hard drives have a 1% chance of failing on any given day, and that your data is stored on 1000 hard drives. That means every day, 10 hard drives will fail. How do you make sure you don’t lose this data? You replicate it.

Some very smart people have determined how many copies of your data must be stored so that, even though hard drives are basically guaranteed to fail, you will never lose your data.

In addition to data replication, the data is also spread across multiple “chunks”. So multiple chunks (really files) make up one original data file.

MapReduce is a framework (a.k.a. a fancy way of writing a for loop), that distributes copies of the same program onto multiple machines, where each machine works on different chunks than the other machines.

Ideally, if you use N machines your running time would be reduced by 1/N, but there is lots of overhead that comes with coordinating the work that is done by each machine and merging it all together at the end.



Spark is seen as the “successor” to Hadoop MapReduce. I find that in general Spark jobs are a little easier to write. Note that it’s a framework, NOT a database, but I list it here to ease the confusion.

We will return to Hadoop later, but first, more “big data” generation technologies.



One database that became popular when startups started acquiring lots of data is MongoDB. MongoDB, unlike the other databases we’ve talked about, is not relational. In MongoDB, we don’t have “tables”, we have “collections”. In MongoDB, we don’t have “rows”, we have “documents”.

Documents are JSON documents. The nice thing about MongoDB is that you use Javascript to interact with it.

Startups started using the MEAN stack, which is made up of: MongoDB, ExpressJS, AngularJS, and NodeJS, for an all-Javascript environment.

MongoDB and similar databases don’t guarantee “consistency”. If you’re a bank, and I take out $50 so that my total balance is now $5, I don’t want someone else trying to take out $50 at the same time and putting my balance in the negative.

With MongoDB, I could take out $50, but some other user might still read that same document and see that my account still has $55, and hence try to take out another $50, even though this user read the database after I did my withdrawal.

In many applications this doesn’t matter and it’s good for performance.

MongoDB also allows “replication” and “sharding”.

“Replication” means you can have “masters” and “slaves” which store the same data. Different instances of the application can read from different slaves to decrease the load on any one machine running MongoDB.

“Sharding” means splitting up the data so that certain IDs go on one machine, while other IDs go to another. This also decreases the load.

Often times, people make the mistake of using MongoDB, because it’s new and cool, when their data is actually relational. What happens? They often end up having to program those relationships themselves in the application, which is more tedious and cumbersome than you might imagine.



Some people say “Redis is like a big key-value store”. At a very high level this is indeed what Redis does, and it does so very fast. If you know you don’t have “relationships” in your data, and you know you won’t need to store, query, and update JSON-like structures, then Redis is a great choice. You can also use sharding and replication with Redis, so it can store more stuff than would fit on just one hard drive.

Back to Hadoop…

Hadoop is not a database

Hadoop is not a database. The “Hadoop File System” (or HDFS) is the open source analogue of Google’s GFS. A database exists “on top of” a file system. For example, Postgres can exist on top of your “FAT32” file system. It’s a program that coordinates the storage and retrieval of data.

There are indeed databases that can work on top of HDFS/GFS.

Some examples are: Google’s BigTable and Hadoop’s HBase.

They allow you to do “queries”, like you do with SQL, as opposed to MapReduce’s plain for-loop-like structure.

Which do you choose?

Lessons I think we can learn from other business’ experiences:

1) Don’t use something just because it’s cool and new.
2) Don’t use big data tech when you don’t have big data.
3) Even if you think you have big data, check to see if it’s really that big.
4) Be honest with yourself about how long it’ll take to get big and whether it’s worth investing in a big data solution now.
5) Don’t forget about SQL.

Did this answer all the questions you ever had about databases? Do you have any stories to share about how you once chose a database you thought would be awesome and it not only let you down but caused you to divert your attention for weeks or months just trying to fix its issues? Do you like using stuff even if it’s still at version 0.1? Let me know in the comments!

Update: I don’t mean to suggest that MySQL and Postgres do not support master-slave configurations; they do. And despite MySQL not being what is traditionally thought of as a big data solution, Facebook famously altered MySQL to work on their backend (and they have more data than most companies doing big data).

#big data #big-oh #bigtable #cassandra #Databases #hbase #mapreduce #mongodb #MySQL #nosql #postgresql #redis #spark #sql #sqlite

Go to comments

Deep Learning and Artificial Intelligence Newsletter

Get discount coupons, free machine learning material, and new course announcements