This is an announcement along with free and discount coupons for my new course, SQL for Marketers: Dominate data analytics, data science, and big data

More and more companies these days are learning that they need to make DATA-DRIVEN decisions.

With big data and data science on the rise, we have more data than we know what to do with.

One of the basic languages of data analytics is SQL, which is used for many popular databases including MySQL, Postgres, Microsoft SQL Server, Oracle, and even big data solutions like Hive and Cassandra.

I’m going to let you in on a little secret. Most high-level marketers and product managers at big tech companies know how to manipulate data to gain important insights. No longer do you have to wait around the entire day for some software engineer to answer your questions – now you can find the answers directly, by yourself, using SQL.

Do you want to know how to optimize your sales funnel using SQL, look at the seasonal trends in your industry, and run a SQL query on Hadoop? Then join me now in my new class, SQL for Marketers: Dominate data analytics, data science, and big data!

P.S. If you haven’t yet signed up for my newsletter at lazyprogrammer [dot] me, you’ll want to do so before Monday, especially if you want to learn more about deep learning, because I have a special announcement coming up that will NOT be announced on Udemy.

This article has been a long time coming. I wrote a shitty version years ago, but wanted to update it with new and current info, in a more organized and less shitty format.

In the current environment you are probably mostly concerned with “big data”, where both for-profit companies and the government download thousands of terabytes of data about you every day. New and fancy technologies are popping up all the time, marketers and spammers love writing about them on LinkedIn, and gullible executives think they are must-haves.

The talking heads at your workplace might say, “we need to build a scalable product!”, or some such. So you end up creating a Hadoop cluster for a few tiny chunks of data, and the overhead of your MapReduce job means it actually takes longer than a plain for-loop would have.

With all this fanciness you lose sight of the simple solutions – such as flat files, SQLite, and SQL. This article is a short survey of existing data solutions (both big data and small data) and at what scale they are appropriate for use.

Why do you even need data storage?

You are probably familiar with writing code in your first semester C++ class like this:

const char* bob = "Bob";
const char* jane = "Jane";
printf("Hi %s! Hi %s!\n", bob, jane);

In the real world, your code has to work on more cases than just Bob and Jane. Maybe you are writing an automated Twitter script that programmatically direct messages people when they start following you. If you use Twitter you’ve probably been annoyed at least a few times by this type of spam.

Working off this example, suppose you (the spammer) decide that you’re going to be somewhat nice and try not to spam people more than once.

So you would like to save the usernames you’ve direct messaged somewhere. Enter the flat file.

Flat files

Flat files are great for storing small data, or data you don’t need to look things up in. Just load the whole file into an array line by line, and do what you need to do.

In our case, we might load the data into a “set” data structure so that when we want to look up a username, it’s an O(1) search.
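In Python, that might look like this (the filename is a made-up example):

```python
# Load a flat file of usernames (one per line) into a set for O(1) lookups.
# "messaged_users.txt" is a hypothetical filename for this example.

def load_usernames(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def already_messaged(usernames, username):
    # Set membership is an O(1) average-case operation.
    return username in usernames
```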

Flat files are great for server configurations. So are JSON files.

For scripts that automate something in your personal life, flat files are usually adequate.

A problem arises when you want to load your entire dataset into memory (like a set or a hash), and it doesn’t fit. Remember, your hard drive is on the order of 1TB. Your RAM is on the order of 8GB, much of which is used by the OS (or most of it, if you’re using a Mac).

Why databases?

Enter the database. Databases are stored on disk; i.e., they are just a file or set of files.

The magic happens when you want to find something. Usually you’d have to look through the entire database if you didn’t have some “index” (think like the index at the back of a large textbook) to tell you where everything was.

Databases index a whole bunch of metadata so that looking for stuff is really fast. You’ll often see the term “balanced tree” (or B-tree) in reference to database indexes. These are better than unbalanced binary trees, where searching is worst-case O(N); a balanced tree guarantees O(log N) lookups.
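The speedup an index buys is the same one you get from binary search on sorted data. Here’s a toy illustration with Python’s stdlib bisect module; this is just an analogy for how an index narrows the search, not how any particular database implements its trees:

```python
import bisect

# A toy "index": a sorted list of keys. Binary search finds a key in
# O(log N) steps instead of scanning all N entries.
index = sorted(["alice", "bob", "carol", "dave", "erin"])

def index_contains(sorted_keys, key):
    # bisect_left finds where key WOULD go; check if it's actually there.
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key
```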

Relational Databases

Also called “RDBMS”, short for “relational database management system” (they loved verbose terminology in the 80s and 90s), relational databases usually store things in tables.

Examples: MySQL, PostgreSQL.

For example, you might have one table that stores every user’s ID, name, email, and password.

But you might have another table that stores friendships, so that would store the first user’s ID, and the second user’s ID.

Quite appropriately, relational databases keep track of “relationships”. Suppose you delete the user with ID = 3. The database can then automatically delete every row in the friendships table that references user ID = 3, so the application won’t hit errors later when it looks up the friends of, say, user ID = 5 (who was friends with user ID = 3) after the actual user with ID = 3 has been deleted.
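Here’s a minimal sketch of that behavior using SQLite (a relational database we’ll meet shortly); the table and column names are made up for the example, and note that SQLite requires foreign keys to be switched on per connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE friendships (
        user1_id INTEGER REFERENCES users(id) ON DELETE CASCADE,
        user2_id INTEGER REFERENCES users(id) ON DELETE CASCADE
    )
""")

conn.executemany("INSERT INTO users VALUES (?, ?)", [(3, "Carol"), (5, "Eve")])
conn.execute("INSERT INTO friendships VALUES (3, 5)")

# Deleting user 3 automatically deletes the friendship row referencing it.
conn.execute("DELETE FROM users WHERE id = 3")
remaining = conn.execute("SELECT COUNT(*) FROM friendships").fetchone()[0]
```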

Relational small data

There is a special relational database called SQLite3. It works on “small data”, so it’s very appropriate for applications on your phone, for instance. iOS apps use SQLite3, and many apps on your computer use SQLite3 without you even knowing it.

SQLite3 is stored locally on your machine, whereas bigger relational databases like Postgres can be stored either on your machine or on another machine over the Internet.

Relational big data

Relational databases sort of hit a wall when data got too big to store in one database. Advertising companies can collect 1TB of data per day. In effect, you’d fill up an entire database in that one day. What do you do the next day? And the next?

Big data – Hadoop

Hadoop is the open source version of Google’s “Google File System” (GFS) and MapReduce framework.

Suppose for instance that your hard drives have a 1% chance of failing on any given day, and that your data is stored on 1000 hard drives. That means that, on average, about 10 hard drives will fail every day. How do you make sure you don’t lose this data? You replicate it.

Some very smart people have determined how many copies of your data must be stored so that, even though hard drives are basically guaranteed to fail, you will never lose your data.

In addition to data replication, the data is also spread across multiple “chunks”. So multiple chunks (really files) make up one original data file.

MapReduce is a framework (a.k.a. a fancy way of writing a for loop) that distributes copies of the same program onto multiple machines, where each machine works on different chunks than the other machines.

Ideally, if you use N machines your running time would be cut by a factor of N, but there is lots of overhead that comes with coordinating the work that is done by each machine and merging it all together at the end.
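You can see the “fancy for loop” shape in about fifteen lines of plain Python. This single-process word count just mimics the map, shuffle, and reduce phases; a real cluster runs each phase across many machines:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (key, value) pairs; here, (word, 1) for every word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all values by key (the framework does this for you)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: combine each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data big deal", "big data"])))
```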

Spark

Spark is seen as the “successor” to Hadoop MapReduce. I find that in general Spark jobs are a little easier to write. Note that it’s a framework, NOT a database, but I list it here to ease the confusion.

We will return to Hadoop later, but first, more “big data” generation technologies.

MongoDB

One database that became popular when startups started acquiring lots of data is MongoDB. MongoDB, unlike the other databases we’ve talked about, is not relational. In MongoDB, we don’t have “tables”, we have “collections”. In MongoDB, we don’t have “rows”, we have “documents”.

Documents are JSON documents. The nice thing about MongoDB is that you use Javascript to interact with it.

Startups started using the MEAN stack, which is made up of: MongoDB, ExpressJS, AngularJS, and NodeJS, for an all-Javascript environment.

MongoDB and similar databases don’t guarantee “consistency”. If you’re a bank, and I take out $50 so that my total balance is now $5, I don’t want someone else trying to take out $50 at the same time and putting my balance in the negative.

With MongoDB, I could take out $50, but some other user might still read that same document and see that my account still has $55, and hence try to take out another $50, even though this user read the database after I did my withdrawal.

In many applications this doesn’t matter and it’s good for performance.
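You don’t even need a database to demonstrate the race described above; here it is in miniature, with a plain Python dict standing in for the document store (the account document is made up):

```python
# Two clients read the same "document", then both write back stale balances.
account = {"balance": 55}

read1 = account["balance"]       # client 1 reads: sees 55
read2 = account["balance"]       # client 2 reads BEFORE client 1 writes: also 55

account["balance"] = read1 - 50  # client 1 withdraws $50, writes 5
account["balance"] = read2 - 50  # client 2 also writes 5: one $50
                                 # withdrawal was silently lost
```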

MongoDB also allows “replication” and “sharding”.

“Replication” means you can have “masters” and “slaves” which store the same data. Different instances of the application can read from different slaves to decrease the load on any one machine running MongoDB.

“Sharding” means splitting up the data so that certain IDs go on one machine, while other IDs go to another. This also decreases the load.
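A minimal sketch of hash-based sharding follows; the shard count is made up, the dicts stand in for separate machines, and real systems also handle rebalancing when shards are added or removed:

```python
NUM_SHARDS = 4

def shard_for(user_id):
    # Hash the ID so users spread roughly evenly across shards.
    return hash(str(user_id)) % NUM_SHARDS

# In practice each shard is a separate machine; dicts stand in here.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(user_id, record):
    shards[shard_for(user_id)][user_id] = record

def get(user_id):
    return shards[shard_for(user_id)].get(user_id)
```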

Oftentimes, people make the mistake of using MongoDB because it’s new and cool, when their data is actually relational. What happens? They often end up having to program those relationships themselves in the application, which is more tedious and cumbersome than you might imagine.

Redis

Some people say “Redis is like a big key-value store”. At a very high level this is indeed what Redis does, and it does so very fast. If you know you don’t have “relationships” in your data, and you know you won’t need to store, query, and update JSON-like structures, then Redis is a great choice. You can also use sharding and replication with Redis, so it can store more stuff than would fit on just one hard drive.

Back to Hadoop…

Hadoop is not a database

Hadoop is not a database. The “Hadoop File System” (or HDFS) is the open source analogue of Google’s GFS. A database exists “on top of” a file system. For example, Postgres can exist on top of your “FAT32” file system. It’s a program that coordinates the storage and retrieval of data.

There are indeed databases that can work on top of HDFS/GFS.

Some examples are: Google’s BigTable and Hadoop’s HBase.

They allow you to do “queries”, like you do with SQL, as opposed to MapReduce’s plain for-loop-like structure.

Which do you choose?

Lessons I think we can learn from other businesses’ experiences:

1) Don’t use something just because it’s cool and new.
2) Don’t use big data tech when you don’t have big data.
3) Even if you think you have big data, check to see if it’s really that big.
4) Be honest with yourself about how long it’ll take to get big and whether it’s worth investing in a big data solution now.
5) Don’t forget about SQL.

Did this answer all the questions you ever had about databases? Do you have any stories to share about how you once chose a database you thought would be awesome and it not only let you down but caused you to divert your attention for weeks or months just trying to fix its issues? Do you like using stuff even if it’s still at version 0.1? Let me know in the comments!

Update: I don’t mean to suggest that MySQL and Postgres do not support master-slave configurations; they do. And despite MySQL not being what is traditionally thought of as a big data solution, Facebook famously altered MySQL to work on their backend (and they have more data than most companies doing big data).

Do you ever get tired of reading walls of text, and just want a nice video or 10 to explain to you the magic of logistic regression and how to program it with Python?

In order to make your operations and data-driven decisions scalable, you need to distribute the processing of your data.

Two popular libraries that do such distributed machine learning are Mahout (which uses MapReduce) and MLlib (which uses Spark, which is sometimes considered a successor to MapReduce).

What I want to do with this tutorial is to show you how easy it is to do distributed machine learning using Spark and EC2.

When I started a recent project of mine, I was distraught at how complicated a Mahout setup could be.

I am not an ops person. I hate installing and configuring things. For something like running distributed k-means clustering, 90% of the work could go into just setting up a Hadoop cluster, installing all the libraries your code needs to run, making sure they are the right versions, etc…

The Hadoop ecosystem is very sensitive to these things, and sometimes MapReduce jobs can be very hard to debug.

With Spark, everything is super easy. Installing Spark and Hadoop is tedious but do-able. Spinning up a cluster is very easy. Running a job is very easy. We will use Python, but you can also use Scala or Java.

Outline of this tutorial:

Install Spark on a driver machine.

Create a cluster.

Run a job.

1. Install Spark

I used an Ubuntu instance on EC2. I’m assuming you already know how to set up a security group, get your PEM, and SSH into the machine.

Once you’ve spun up your AMI, we can begin installing all the stuff we’ll need.

To make this even easier, you could probably do all of this on your local machine; but if you’re using Windows, or you don’t want to mess up your local machine, you’ll want to use an EC2 instance as your driver as described here.

First, set your AWS ID and secret environment variables.

I set my zone as “us-east-1b” but you can set it to a zone of your choice.
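For reference, the spark-ec2 script reads the standard AWS environment variables, and a launch looks something like this (every value here is a placeholder; substitute your own credentials, key pair, and cluster name):

```shell
# Placeholders only: use your own credentials and names.
export AWS_ACCESS_KEY_ID=<your access key id>
export AWS_SECRET_ACCESS_KEY=<your secret key>

# Launch a cluster in us-east-1b (pick a zone of your choice):
./spark-ec2 -k <key pair name> -i <pem file> -z us-east-1b launch <cluster name>
```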

When you’re finished, don’t forget to tear down your cluster! On-demand machines are expensive.

./spark-ec2 destroy <cluster name>

For some reason, numpy isn’t installed when you create a cluster, and the default Python distribution on the m1.large machines is 2.6, while Spark installs its own 2.7. So, even if you easy_install numpy on each of the machines in the cluster, it won’t work for Spark.

You can instead copy the library over to each cluster machine from your driver machine.

Remember you’ll have to instantiate your own SparkContext in this case.

Future Improvements

The goal of this tutorial is to make things easy.

There are many areas for improvement. For instance, on-demand machines on Amazon are the most expensive.

Spark still spins up “m1.large” instances, even though EC2’s current documentation recommends using the better, faster, AND cheaper “m3.large” instance instead.

At the same time, that custom configuration could mean we can’t use the spark-ec2 script to spin up the cluster automatically. There might be an option there to choose. I didn’t really look.

One major reason I wrote this tutorial is because all the information in it is out there in some form, but it is disparate and some of it can be hard to find without knowing what to search for.

So that’s it. The easiest possible way to run distributed machine learning.

The Naive Bayes classifier is a simple classifier that is often used as a baseline for comparison with more complex classifiers.

It is also conceptually very simple and as you’ll see it is just a fancy application of Bayes rule from your probability class.

We will use the famous MNIST data set for this tutorial.

The MNIST dataset is a set of handwritten digits, and our job is to build a computer program that takes as input an image of a digit, and outputs what digit it is.

Recall Bayes rule:

$$ P(c | x) = \frac{P(x | c)P(c)}{P(x)} $$

If you’re like me, you may have found this notation a little confusing at first.

Here \( x \) represents the image, or more precisely, the pixel values of the image formatted as a vector, and \( c \) represents the digit, which can be 0, 1, …, 9.

Sidenote: Images are grayscale and of size 28×28, which, if flattened, yields a vector of length 28×28=784. If you look at the code (linked below) you can see how this is done, and also that we scale the pixel values to be between 0 and 1. Scaling isn’t strictly necessary, but can be useful for many machine learning algorithms.
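In code, the flatten-and-scale step might look like this (pure Python, with a 28×28 image represented as a list of 28 rows of 28 pixel values):

```python
def flatten_and_scale(img):
    # img: 28 rows of 28 integer pixel values in 0..255
    # returns a length-784 vector of floats scaled into [0, 1]
    return [pixel / 255.0 for row in img for pixel in row]

img = [[255] * 28 for _ in range(28)]  # a dummy all-white image
x = flatten_and_scale(img)
```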

We can read the left side \( P(c | x) \) as “the probability that the class is \( c \) given the data \( x \)”. (this is called the “posterior”)

We can read the right side \( P(x | c) \) as “the probability of the data \( x \) given the class \( c \)”. (this is called the “likelihood”)

One little efficiency trick we can do:

We don’t actually care about the value of \( P(c | x) \).

We care about the value of \( c \) itself. That tells us “which digit” the image belongs to.

The class chosen is simply the one that yields the highest posterior probability for that data:

$$ c^* = \arg\max_c P(c | x) $$

You will notice that \( P(x) \) is constant for all values of \( c \) in \( P(c | x) \).

So when I take the argmax over \( \frac{ P(x | c)P(c) }{ P(x) } \) I can ignore \( P(x) \).

As a simple example, suppose I have that \( A > B \), or numerically, say, \( 10 > 5 \).

If I multiply or divide these numbers by a positive constant (\( P(x) \) will always be positive) then the relationship still holds.

Ex. \( 2A > 2B \), or \( 20 > 10 \).

Using this information, we can simplify our problem so that, in order to choose “which digit” given an image, all we need to do is calculate this argmax (notice \( P(x) \) is removed):

$$ c^* = \arg\max_{c} P(x | c)P(c) $$

The next step we can take is to think about how to calculate \( P(c) \).

This is just counting.

If I have 100 students in my class, and I want to figure out the probability that a student is born in January, how can I do that?

I simply count up all the students born in January, and divide by the total number of students.

If I want to know the probability of getting heads when I flip a coin, I flip the coin a bunch of times, and divide the number of heads by the total number of coin flips.
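Here’s the counting method in code, on a made-up list of digit labels:

```python
from collections import Counter

def class_priors(labels):
    # P(c) = (number of examples with class c) / (total number of examples)
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

priors = class_priors([0, 0, 1, 9, 9, 9, 9, 9])
```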

The challenge is choosing a model that accurately fits the data for \( P(x | c) \).

As a thought-exercise, think about how you’d do this naively.

Each image is of size 28×28, which means there are 784 (=28×28) pixels per image. Each pixel can take on integer values in the range 0..255 inclusive.

So, if you modeled this as a discrete probability distribution, you’d have \( 256^{784} \) different possibilities (each of the 784 pixels can take on any of 256 values). That’s way more than the number of images you have (~50,000), and hence you’d never be able to use the “counting method” (used above for calculating \( P(c) \)) to accurately measure those probabilities.

To make the problem tractable and easily computable, we recall that pixels represent light intensity, and light intensity is actually continuous. It’s only discrete inside a computer because computers are discrete.

A reasonable first guess for modeling continuous data is the multivariate Gaussian, also called the multivariate Normal.

Note that because the data are continuous, we are not actually calculating probabilities, but probability densities, on the right for \( P(x | c) \). Luckily, Bayes rule still holds for probability densities.

Another thing to note is that because probabilities are very small when dimensionality is high, we’ll work with log-likelihood instead of likelihood. Then instead of getting numbers very close to 0, which is inaccurate when using a computer to represent them, we’ll just get negative numbers.

The log of the multivariate Gaussian likelihood is:

$$ \log P(x | c) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) $$

Try to derive this yourself. (D is the dimensionality)

By the way, to calculate \( \mu \) and \( \Sigma \), you can use the sample mean and covariance: https://en.wikipedia.org/wiki/Sample_mean_and_covariance. Note that it’s \( P(x | c) \), not just \( P(x) \). So, if we want to calculate the mean and covariance for all the images of the digit 9, then we’d first grab only the images of the digit 9 (ignoring the rest of the images), and calculate the sample mean and covariance from this subset.

Earlier, we wanted the argmax over \( P( x | c)P(c) \). Since \( log(AB) = log(A) + log(B) \), then using log probabilities, we can choose the digit class using:

$$ c^* = \arg\max_c \left[ \log P(x | c) + \log P(c) \right] $$

This works since the \( log() \) function is monotonically increasing. If \( A > B \) then \( log(A) > log(B) \). Try any 2 numbers on your calculator if you don’t believe me.

Now this problem is tractable.

Training the classifier would work as follows:

For each class = 0..9
get all x’s (images) for the class
save the mean and covariance of those x’s with the class

Prediction using the classifier would work as follows:

Given a test point x:
Calculate the probability that x is in each class c
Return the class c that yields the highest posterior probability

What makes a Bayes classifier a Naive Bayes classifier?

So far we have only discussed general Bayes classifiers.

A Naive Bayes classifier is one that assumes all input dimensions of \( x \) are independent.

Recall that when 2 random variables \( A \) and \( B \) are independent, their joint probability can be expressed as the product of their individual probabilities:

$$ P(A, B) = P(A)P(B) $$

For the Gaussian case, this means that instead of having a joint Gaussian with a full covariance matrix \( \Sigma \), we can instead express it as a product of 784 individual univariate Gaussians:

$$ P(x | c) = \prod_{i=1}^{784} \frac{1}{\sqrt{2\pi \sigma_i^2}} \exp\left( -\frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right) $$

One advantage of Naive Bayes is that we don’t have to worry about any interactions between any \( x_i \) and \( x_j \) if \( i \neq j \).

In more practical terms, before we had \( \Sigma \), which is of size \( D \times D = 784^2 \).

Now, we only have \( \sigma_i^2, i=1…784 \).

That’s 784 times fewer numbers to store.

You can still express the joint distribution of 784 individual univariate Gaussians as one big multivariate Gaussian; it just means that the covariance matrix \( \Sigma \) has zeros everywhere except along the diagonal, which stores the 784 univariate variances.

Having 784 individual variances means we don’t have to invert \( \Sigma \) to calculate the PDF or log PDF, which leads to even more savings.
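Putting the training and prediction recipes together, here’s a minimal pure-Python sketch of the Naive (diagonal-variance) classifier on a made-up 2-D dataset. The small variance floor is my addition, to avoid dividing by zero on constant features:

```python
import math
from collections import defaultdict

def train(X, y):
    # For each class, store: log prior, per-dimension means, per-dimension variances.
    by_class = defaultdict(list)
    for x, c in zip(X, y):
        by_class[c].append(x)
    model = {}
    for c, rows in by_class.items():
        n, d = len(rows), len(rows[0])
        means = [sum(r[i] for r in rows) / n for i in range(d)]
        variances = [max(sum((r[i] - means[i]) ** 2 for r in rows) / n, 1e-3)
                     for i in range(d)]  # 1e-3 floor: assumption, see lead-in
        model[c] = (math.log(n / len(X)), means, variances)
    return model

def log_gaussian(x, mu, var):
    # log of the univariate Gaussian density
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def predict(model, x):
    # argmax over classes of log P(c) + sum_i log P(x_i | c)
    return max(model, key=lambda c: model[c][0] + sum(
        log_gaussian(xi, mu, var)
        for xi, mu, var in zip(x, model[c][1], model[c][2])))

X = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
y = [0, 0, 1, 1]
model = train(X, y)
```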

The downside of Naive Bayes is that the Naive assumption (that all input dimensions are independent) is most often incorrect.

So what do we get after training our model?

Visually, we can “see” what the model has learned by plotting the mean \( \mu \) for each class.

Here are the plots you’d get:

As you’d expect, the mean of each class very closely captures what that digit typically looks like.

We get about 94% accuracy on the test set with the full-covariance (non-Naive) Bayes classifier, which is pretty good!

Notice how we achieve only about 80% on the test set with Naive Bayes, due to the fact that the Naive assumption is pretty clearly incorrect. E.g. if we’re looking at a black pixel at one of the corners, aren’t the pixels around it also very likely to be black?

A problem that often arises when you’re dealing with lots of data is that it takes forever to process.

So you SSH into your Amazon EC2 machine, start your script, and go do other things while it’s running.

You check back a few hours later to see that your SSH session seems to be frozen or you’ve gotten a “broken pipe” error.

You log back in to your EC2 machine, only to discover your script has terminated.

What to do…

Screen for Linux

Screen is like having “tabs” for your command line. You can have multiple screens running at any time. They stay active, so even if you exit from a screen, or exit your entire SSH session, whatever you were doing inside that screen will continue.

This is great if you have a script that takes longer than EC2’s allowed session duration.

Start Screen

To start screen, just enter:

screen

Hit enter to exit out of this info screen.

How Screen Works

Once you’ve started screen, there are a few things you can do:

1. Create a screen.

2. Detach from a screen (i.e. go back to your “regular” terminal).

3. Re-attach to a screen (that you’ve previously detached from).

4. List the screens you have open.

This is all you need for the use case described above.

Enter commands in Screen

Once inside Screen, you can tell screen you’re about to enter a command by pressing:

Ctrl + A

Now Screen knows you’re about to enter a command.

Create a Screen

After hitting “Ctrl+A”, hit “C”.

Detach from a Screen

After hitting “Ctrl+A”, hit “D”.

Re-attach to a Screen

screen -r

List the Screens you have open

To list your open screens, enter “screen -ls”. Note that plain “screen -r” only works if you have just one screen open; otherwise, screen will print the list of open sessions instead of re-attaching.

If you want to re-attach to a particular screen, enter:

screen -r 14366.pts-0.affinity-proto

(Obviously you would choose the screen you want to go back to).