# SQL for Marketers: Dominate Data Analytics, Data Science, and Big Data

March 19, 2016

This is an annoucement along with free and discount coupons for my new course, SQL for Marketers: Dominate data analytics, data science, and big data

More and more companies these days are learning that they need to make DATA-DRIVEN decisions.

With big data and data science on the rise, we have more data than we know what to do with.

One of the basic languages of data analytics is SQL, which is used for many popular databases including MySQL, Postgres, Microsoft SQL Server, Oracle, and even big data solutions like Hive and Cassandra.

I’m going to let you in on a little secret. Most high-level marketers and product managers at big tech companies know how to manipulate data to gain important insights. No longer do you have to wait around the entire day for some software engineer to answer your questions – now you can find the answers directly, by yourself, using SQL.

Do you want to know how to optimize your sales funnel using SQL, look at the seasonal trends in your industry, and run a SQL query on Hadoop? Then join me now in my new class, SQL for marketers: Dominate data analytics, data science, and big data!

P.S. If you haven’t yet signed up for my newsletter at lazyprogrammer [dot] me, you’ll want to do so before Monday, especially if you want to learn more about deep learning, because I have a special announcement coming up that will NOT be announced on Udemy.

Here’s the coupons:

FREE coupon for early early birds:

EARLYBIRD (Sold out)

If the first coupon has run out, you may still use the 2nd coupon, which gives you 70% off:

EARLYBIRD2

#aws #big data #cassandra #Data Analytics #ec2 #hadoop #Hive #Microsoft SQL Server #MySQL #Oracle #Postgres #S3 #spark #sql #sqlite

# New Deep Learning course on Udemy

February 26, 2016

This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

#adagrad #aws #batch gradient descent #deep learning #ec2 #gpu #machine learning #nesterov momentum #numpy #nvidia #python #rmsprop #stochastic gradient descent #tensorflow #theano

# Principal Components Analysis in Theano

February 21, 2016

This is a follow-up post to my original PCA tutorial. It is of interest to you if you:

• Are interested in deep learning (this tutorial uses gradient descent)
• Are interested in learning more about Theano (it is not like regular Python, and it is very popular for implementing deep learning algorithms)
• Want to know how you can write your own PCA solver (in the previous post we used a library to get eigenvalues and eigenvectors)
• Work with big data (this technique can be used to process data where the dimensionality is very large – where the covariance matrix wouldn’t even fit into memory)

First, you should be familiar with creating variables and functions in Theano. Here is a simple example of how you would do matrix multiplication:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
Q = T.matrix('Q')
Z = T.dot(X, Q)

transform = theano.function(inputs=[X,Q], outputs=Z)

X_val = np.random.randn(100,10)
Q_val = np.random.randn(10,10)

Z_val = transform(X_val, Q_val)


I think of Theano variables as “containers” for real numbers. They actually represent nodes in a graph. You will see the term “graph” a lot when you read about Theano, and probably think to yourself – what does matrix multiplication or machine learning have to do with graphs? (not graphs as in visual graphs, graphs as in nodes and edges) You can think of any “equation” or “formula” as a graph. Just draw the variables and functions as nodes and then connect them to make the equation using lines/edges. It’s just like drawing a “system” in control systems or a visual representation of a neural network (which is also a graph).

If you have ever done linear programming or integer programming in PuLP you are probably familiar with the idea of “variable” objects and them passing them into a “solver” after creating some “expressions” that represent the constraints and objective of the linear / integer program.

Anyway, onto principal components analysis.

Let’s consider how you would find the leading eigenvalue and eigenvector (the one corresponding to the largest eigenvalue) of a square matrix.

The loss function / objective for PCA is:

$$J = \sum_{n=1}^{N} |x_n – \hat{x}_n|^2$$

Where $$\hat{X}$$ is the reconstruction of $$X$$. If there is only one eigenvector, let’s call this $$v$$, then this becomes:

$$J = \sum_{n=1}^{N} |x_n – x_nvv^T|^2$$

This is equivalent to the Frobenius norm, so we can write:

$$J = |X – Xvv^T|_F$$

One identity of the Frobenius norm is:

$$|A|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij} } = \sqrt{ Tr(A^T A ) }$$

Which means we can rewrite the loss function as:

$$J = Tr( (X – Xvv^T)^T(X – Xvv^T) )$$

Keeping in mind that with the trace function you can re-order matrix multiplications that you wouldn’t normally be able to (matrix multiplication isn’t commutative), and dropping any terms that don’t depend on $$v$$, you can use matrix algebra to rearrange this to get:

$$v^* = argmin\{-Tr(X^TXvv^T) \}$$

Which again using reordering would be equivalent to maximizing:

$$v^* = argmax\{ v^TX^TXv \}$$

The corresponding eigenvalue would then be:

$$\lambda = v^TX^TXv$$

Now that we have a function to maximize, we can simply use gradient descent to do it, similar to how you would do it in logistic regression or in a deep belief network.

$$v \leftarrow v + \eta \nabla_v(v^TX^TXv)$$

Next, let’s extend this algorithm for finding the other eigenvalues and eigenvectors. You essentially subtract the contributions of the eigenvalues you already found.

$$v_i \leftarrow v_i + \eta \nabla_{v_i}(v_i^T( X^TX – \sum_{j=1}^{i-1} \lambda_j v_j v_j^T )v_i )$$

Next, note that to implement this algorithm you never need to actually calculate the covariance $$X^T X$$. If your dimensionality is, say, 1 million, then your covariance matrix will have 1 trillion entries!

Instead, you can multiply by your eigenvector first to get $$Xv$$, which is only of size $$N \times 1$$. You can then “dot” this with itself to get a scalar, which is only an $$O(N)$$ operation.

So how do you write this code in Theano? If you’ve never used Theano for gradient descent there will be some new concepts here.

First, you don’t actually need to know how to differentiate your cost function. You use Theano’s T.grad(cost_function, differentiation_variable).

v = theano.shared(init_v, name="v")
Xv = T.dot(X, v)
cost = T.dot(Xv.T, Xv) - np.sum(evals[j]*T.dot(evecs[j], v)*T.dot(evecs[j], v) for j in xrange(i))

gv = T.grad(cost, v)


Note that we re-normalize the eigenvector on each step, so that $$v^T v = 1$$.

Next, you define your “weight update rule” as an expression, and pass this into the “updates” argument of Theano’s function creator.

y = v + learning_rate*gv
update_expression = y / y.norm(2)

train = theano.function(
inputs=[X],
)


Note that the update variable must be a “shared variable”. With this knowledge in hand, you are ready to implement the gradient descent version of PCA in Theano:

for i in xrange(number of eigenvalues you want to find):
... initialize variables and expressions ...
... initialize theano train function ...
while t < max_iterations and change in v < tol:
outputs = train(data)

... return eigenvalues and eigenvectors ...


This is not really trivial but at the same time it's a great exercise in both (a) linear algebra and (b) Theano coding.

If you are interested in learning more about PCA, dimensionality reduction, gradient descent, deep learning, or Theano, then check out my course on Udemy "Data Science: Deep Learning in Python" and let me know what you think in the comments.

#aws #data science #deep learning #gpu #machine learning #nvidia #pca #principal components analysis #statistics #theano

# How to run distributed machine learning jobs using Apache Spark and EC2 (and Python)

April 5, 2015

This is the age of big data.

Sometimes sci-kit learn doesn’t cut it.

In order to make your operations and data-driven decisions scalable – you need to distribute the processing of your data.

Two popular libraries that do such distributed machine learning are Mahout (which uses MapReduce) and MLlib (which uses Spark, which is sometimes considered as a successor to MapReduce).

What I want to do with this tutorial is to show you how easy it is to do distributed machine learning using Spark and EC2.

When I started a recent project of mine, I was distraught at how complicated a Mahout setup could be.

I am not an ops person. I hate installing and configuring things. For something like running distributed k-means clustering, 90% of the work could go into just setting up a Hadoop cluster, installing all the libraries your code needs to run, making sure they are the right versions, etc…

The Hadoop ecosystem is very sensitive to these things, and sometimes MapReduce jobs can be very hard to debug.

With Spark, everything is super easy. Installing Spark and Hadoop is tedious but do-able. Spinning up a cluster is very easy. Running a job is very easy. We will use Python, but you can also use Scala or Java.

Outline of this tutorial:

1. Install Spark on a driver machine.
2. Create a cluster.
3. Run a job.

## 1. Install Spark

I used an Ubuntu instance on EC2. I’m assuming you already know how to set up a security group, get your PEM, and SSH into the machine.

Once you’ve spun up your AMI, we can begin installing all the stuff we’ll need.

To make this even easier you can probably do this on your local machine, but if for some reason you’re using Windows or you don’t want to mess up your local machine, then you’ll want to do this.

First, set your AWS ID and secret environment variables.

export AWS_ACCESS_KEY_ID=…
export AWS_SECRET_ACCESS_KEY=…

Now install Java:

sudo apt-get update
sudo apt-get install default-jdk maven
export MAVEN_OPTS=”-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m”

For the last line, we will need this RAM available to build Spark, if I remember correctly.

wget http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-1.3.0/spark-1.3.0.tgz
tar -xf spark-1.3.0.tgz
cd spark-1.3.0
mvn -DskipTests clean package

By the time you read this a new version of Spark may be available. You should check.

## 2. Create a Cluster

Assuming you are in the Spark folder now, it is very easy to create a cluster to run your jobs:

./ec2/spark-ec2 -k “Your Key Pair Name” -i /path/to/key.pem -s <number of slaves> launch <cluster name> —copy-aws-credentials -z us-east-1b

I set my zone as “us-east-1b” but you can set it to a zone of your choice.

When you’re finished, don’t forget to tear down your cluster! On-demand machines are expensive.

./spark-ec2 destroy <cluster name>

For some reason, numpy isn’t installed when you create a cluster, and the default Python distribution on the m1.large machines is 2.6, while Spark installs its own 2.7. So, even if you easy_install numpy on each of the machines in the cluster, it won’t work for Spark.

You can instead copy the library over to each cluster machine from your driver machine:

scp -i /path/to/key.pem /usr/lib/python2.7/dist-packages/numpy* [email protected]<cluster-machine>:/usr/lib/python2.7/dist-packages/
scp -r -i /path/to/key.pem /usr/lib/python2.7/dist-packages/numpy [email protected]<cluster-machine>:/usr/lib/python2.7/dist-packages/

You can easily write a script to automatically copy all this stuff over (get the machine URLs from the EC2 console).

## 3. Run a Job

Spark gives you a Python shell.

First, go to your EC2 console and find the URL for your cluster master. SSH into that machine (username is root).

cd spark
MASTER=spark://<cluster-master-ip>:7077 ./pyspark

Import libraries:

from pyspark.mllib.clustering import KMeans
from numpy import array

data = sc.textFile(“s3://<my-bucket>/<path>/*.csv”)

Note 1: You can use a wildcard to grab multiple files into one variable – called an RDD – resilient distributed dataset.

Note 2: Spark gives you a variable called ‘sc’, which is an object of type SparkContext. It specifies the master node, among other things.

Maybe filter out some bad lines:

data = data.filter(lambda line: ‘ERROR’ not in line)

Turn each row into an array / vector observation:

data = data.map(lambda line: array([float(x) for x in line.split()]))

clusters = KMeans.train(parsedData, 2, maxIterations=20,
runs=1, initializationMode=”k-means||”)

Save some output:

sc.parallelize(clusters.centers).saveAsTextFile(”s3://…./output.csv”)

You can also run a standalone Python script using spark-submit instead of the shell.

./bin/spark-submit —master spark://<master-ip>:7077 myscript.py

Remember you’ll have to instantiate your own SparkContext in this case.

## Future Improvements

The goal of this tutorial is to make things easy.

There are many areas for improvement – for instance – on-demand machines on Amazon are the most expensive.

Spark still spins up “m1.large” instances, even though EC2′s current documentation recommends using the better, faster, AND cheaper “m3.large” instance instead.

At the same time, that custom configuration could mean we can’t use the spark-ec2 script to spin up the cluster automatically. There might be an option there to choose. I didn’t really look.

One major reason I wrote this tutorial is because all the information in it is out there in some form, but it is disparate and some of it can be hard to find without knowing what to search for.

So that’s it. The easiest possible way to run distributed machine learning.

How do you do distributed machine learning?

#apache #aws #big data #data science #ec2 #emr #machine learning #python #spark