Lazy Programmer

Your source for the latest in deep learning, big data, data science, and artificial intelligence. Sign up now

Announcing Data Science: Supervised Machine Learning in Python (Less Math, More Action!)

September 16, 2016


If you don’t want to read about the course and just want the 88% OFF coupon code, skip to the bottom.

In recent years, we’ve seen a resurgence in AI, or artificial intelligence, and machine learning.

Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on-par with human experts.

Google’s AlphaGo program was able to beat a world champion in the strategy game go using deep reinforcement learning.

Machine learning is even being used to program self driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.

Google famously announced that they are now “machine learning first”, meaning that machine learning is going to get a lot more attention now, and this is what’s going to drive innovation in the coming years.

Machine learning is used in many industries, like finance, online advertising, medicine, and robotics.

It is a widely applicable tool that will benefit you no matter what industry you’re in, and it will also open up a ton of career opportunities once you get good.

Machine learning also raises some philosophical questions. Are we building a machine that can think? What does it mean to be conscious? Will computers one day take over the world?

The best part about this course is that it requires WAY less math than my usual courses; just some basic probability and geometry, no calculus!

In this course, we are first going to discuss the K-Nearest Neighbor algorithm. It’s extremely simple and intuitive, and it’s a great first classification algorithm to learn. After we discuss the concepts and implement it in code, we’ll look at some ways in which KNN can fail.

It’s important to know both the advantages and disadvantages of each algorithm we look at.

Next we’ll look at the Naive Bayes Classifier and the General Bayes Classifier. This is a very interesting algorithm to look at because it is grounded in probability.

We’ll see how we can transform the Bayes Classifier into a linear and quadratic classifier to speed up our calculations.

Next we’ll look at the famous Decision Tree algorithm. This is the most complex of the algorithms we’ll study, and most courses you’ll look at won’t implement them. We will, since I believe implementation is good practice.

The last algorithm we’ll look at is the Perceptron algorithm. Perceptrons are the ancestor of neural networks and deep learning, so they are important to study in the context of machine learning.

One we’ve studied these algorithms, we’ll move to more practical machine learning topics. Hyperparameters, cross-validation, feature extraction, feature selection, and multiclass classification.

We’ll do a comparison with deep learning so you understand the pros and cons of each approach.

We’ll discuss the Sci-Kit Learn library, because even though implementing your own algorithms is fun and educational, you should use optimized and well-tested code in your actual work.

We’ll cap things off with a very practical, real-world example by writing a web service that runs a machine learning model and makes predictions. This is something that real companies do and make money from.

All the materials for this course are FREE. You can download and install Python, Numpy, and Scipy with simple commands on Windows, Linux, or Mac.

UPDATE: New coupon if the above is sold out:

#data science #machine learning #matplotlib #numpy #pandas #python

Go to comments

New course – Natural Language Processing: Deep Learning in Python part 6

August 9, 2016


[Scroll to the bottom for the early bird discount if you already know what this course is about]

In this course we are going to look at advanced NLP using deep learning.

Previously, you learned about some of the basics, like how many NLP problems are just regular machine learning and data science problems in disguise, and simple, practical methods like bag-of-words and term-document matrices.

These allowed us to do some pretty cool things, like detect spam emails, write poetry, spin articles, and group together similar words.

In this course I’m going to show you how to do even more awesome things. We’ll learn not just 1, but 4 new architectures in this course.

First up is word2vec.

In this course, I’m going to show you exactly how word2vec works, from theory to implementation, and you’ll see that it’s merely the application of skills you already know.

Word2vec is interesting because it magically maps words to a vector space where you can find analogies, like:

  • king – man = queen – woman
  • France – Paris = England – London
  • December – Novemeber = July – June

We are also going to look at the GLoVe method, which also finds word vectors, but uses a technique called matrix factorization, which is a popular algorithm for recommender systems.

Amazingly, the word vectors produced by GLoVe are just as good as the ones produced by word2vec, and it’s way easier to train.

We will also look at some classical NLP problems, like parts-of-speech tagging and named entity recognition, and use recurrent neural networks to solve them. You’ll see that just about any problem can be solved using neural networks, but you’ll also learn the dangers of having too much complexity.

Lastly, you’ll learn about recursive neural networks, which finally help us solve the problem of negation in sentiment analysis. Recursive neural networks exploit the fact that sentences have a tree structure, and we can finally get away from naively using bag-of-words.

All of the materials required for this course can be downloaded and installed for FREE. We will do most of our work in Numpy and Matplotlib,and Theano. I am always available to answer your questions and help you along your data science journey.

See you in class!

UPDATE: New coupon if the above is sold out:

#deep learning #GLoVe #natural language processing #nlp #python #recursive neural networks #tensorflow #theano #word2vec

Go to comments

New Deep Learning course on Udemy

February 26, 2016


This course continues where my first course, Deep Learning in Python, left off. You already know how to build an artificial neural network in Python, and you have a plug-and-play script that you can use for TensorFlow.

You learned about backpropagation (and because of that, this course contains basically NO MATH), but there were a lot of unanswered questions. How can you modify it to improve training speed? In this course you will learn about batch and stochastic gradient descent, two commonly used techniques that allow you to train on just a small sample of the data at each iteration, greatly speeding up training time.

You will also learn about momentum, which can be helpful for carrying you through local minima and prevent you from having to be too conservative with your learning rate. You will also learn aboutadaptive learning rate techniques like AdaGrad and RMSprop which can also help speed up your training.

In my last course, I just wanted to give you a little sneak peak at TensorFlow. In this course we are going to start from the basics so you understand exactly what’s going on – what are TensorFlow variables and expressions and how can you use these building blocks to create a neural network? We are also going to look at a library that’s been around much longer and is very popular for deep learning – Theano. With this library we will also examine the basic building blocks – variables, expressions, and functions – so that you can build neural networks in Theano with confidence.

Because one of the main advantages of TensorFlow and Theano is the ability to use the GPU to speed up training, I will show you how to set up a GPU-instance on AWS and compare the speed of CPU vs GPU for training a deep neural network.

With all this extra speed, we are going to look at a real dataset – the famous MNIST dataset (images of handwritten digits) and compare against various known benchmarks.

#adagrad #aws #batch gradient descent #deep learning #ec2 #gpu #machine learning #nesterov momentum #numpy #nvidia #python #rmsprop #stochastic gradient descent #tensorflow #theano

Go to comments

Data Science: Natural Language Processing in Python

February 11, 2016

Do you want to learn natural language processing from the ground-up?

If you hate math and want to jump into purely practical coding examples, my new course is for you.

You can check it out at Udemy:

I am posting the course summary here also for convenience:


In this course you will build MULTIPLE practical systems using natural language processing, or NLP. This course is not part of my deep learning series, so there are no mathematical prerequisites – just straight up coding in Python. All the materials for this course are FREE.

After a brief discussion about what NLP is and what it can do, we will begin building very useful stuff. The first thing we’ll build is a spam detector. You likely get very little spam these days, compared to say, the early 2000s, because of systems like these.

Next we’ll build a model for sentiment analysis in Python. This is something that allows us to assign a score to a block of text that tells us how positive or negative it is. People have used sentiment analysis on Twitter to predict the stock market.

We’ll go over some practical tools and techniques like the NLTK (natural language toolkit) library and latent semantic analysis or LSA.

Finally, we end the course by building an article spinner. This is a very hard problem and even the most popular products out there these days don’t get it right. These lectures are designed to just get you started and to give you ideas for how you might improve on them yourself. Once mastered, you can use it as an SEO, or search engine optimization tool. Internet marketers everywhere will love you if you can do this for them!

As a thank you for visiting this site, I’ve created a coupon that gets you 70% off.

Click here to get the course for only $15.

#article spinner #latent semantic analysis #latent semantic indexing #machine learning #natural language processing #nlp #pca #python #spam detection #svd

Go to comments

Linear regression video course in Python

November 11, 2015

Hi all!

Do you ever get tired of reading walls of text, and just want a nice video or 10 to explain to you the magic of linear regression and how to program it with Python and numpy?

Look no further, that video course is here.

#linear regression #numpy #python #statistics

Go to comments

Logistic Regression in Python video course

November 11, 2015

Hi all!

Do you ever get tired of reading walls of text, and just want a nice video or 10 to explain to you the magic of logistic regression and how to program it with Python?

Look no further, that video course is here.

#big data #data science #logistic regression #neural networks #numpy #python

Go to comments

How to run distributed machine learning jobs using Apache Spark and EC2 (and Python)

April 5, 2015

This is the age of big data.

Sometimes sci-kit learn doesn’t cut it.

In order to make your operations and data-driven decisions scalable – you need to distribute the processing of your data.

Two popular libraries that do such distributed machine learning are Mahout (which uses MapReduce) and MLlib (which uses Spark, which is sometimes considered as a successor to MapReduce).

What I want to do with this tutorial is to show you how easy it is to do distributed machine learning using Spark and EC2.

When I started a recent project of mine, I was distraught at how complicated a Mahout setup could be.

I am not an ops person. I hate installing and configuring things. For something like running distributed k-means clustering, 90% of the work could go into just setting up a Hadoop cluster, installing all the libraries your code needs to run, making sure they are the right versions, etc…

The Hadoop ecosystem is very sensitive to these things, and sometimes MapReduce jobs can be very hard to debug.

With Spark, everything is super easy. Installing Spark and Hadoop is tedious but do-able. Spinning up a cluster is very easy. Running a job is very easy. We will use Python, but you can also use Scala or Java.

Outline of this tutorial:

  1. Install Spark on a driver machine.
  2. Create a cluster.
  3. Run a job.


1. Install Spark

I used an Ubuntu instance on EC2. I’m assuming you already know how to set up a security group, get your PEM, and SSH into the machine.


Once you’ve spun up your AMI, we can begin installing all the stuff we’ll need.

To make this even easier you can probably do this on your local machine, but if for some reason you’re using Windows or you don’t want to mess up your local machine, then you’ll want to do this.

First, set your AWS ID and secret environment variables.


Now install Java:

sudo apt-get update
sudo apt-get install default-jdk maven
export MAVEN_OPTS=”-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m”

For the last line, we will need this RAM available to build Spark, if I remember correctly.

Now download and build Spark:

tar -xf spark-1.3.0.tgz
cd spark-1.3.0
mvn -DskipTests clean package

By the time you read this a new version of Spark may be available. You should check.


2. Create a Cluster

Assuming you are in the Spark folder now, it is very easy to create a cluster to run your jobs:

./ec2/spark-ec2 -k “Your Key Pair Name” -i /path/to/key.pem -s <number of slaves> launch <cluster name> —copy-aws-credentials -z us-east-1b

I set my zone as “us-east-1b” but you can set it to a zone of your choice.

When you’re finished, don’t forget to tear down your cluster! On-demand machines are expensive.

./spark-ec2 destroy <cluster name>

For some reason, numpy isn’t installed when you create a cluster, and the default Python distribution on the m1.large machines is 2.6, while Spark installs its own 2.7. So, even if you easy_install numpy on each of the machines in the cluster, it won’t work for Spark.

You can instead copy the library over to each cluster machine from your driver machine:

scp -i /path/to/key.pem /usr/lib/python2.7/dist-packages/numpy* root@<cluster-machine>:/usr/lib/python2.7/dist-packages/
scp -r -i /path/to/key.pem /usr/lib/python2.7/dist-packages/numpy root@<cluster-machine>:/usr/lib/python2.7/dist-packages/

You can easily write a script to automatically copy all this stuff over (get the machine URLs from the EC2 console).


3. Run a Job

Spark gives you a Python shell.

First, go to your EC2 console and find the URL for your cluster master. SSH into that machine (username is root).

cd spark
MASTER=spark://<cluster-master-ip>:7077 ./pyspark

Import libraries:

from pyspark.mllib.clustering import KMeans
from numpy import array

Load your data:

data = sc.textFile(“s3://<my-bucket>/<path>/*.csv”)

Note 1: You can use a wildcard to grab multiple files into one variable – called an RDD – resilient distributed dataset.

Note 2: Spark gives you a variable called ‘sc’, which is an object of type SparkContext. It specifies the master node, among other things.

Maybe filter out some bad lines:

data = data.filter(lambda line: ‘ERROR’ not in line)

Turn each row into an array / vector observation:

data = line: array([float(x) for x in line.split()]))

Train your model:

clusters = KMeans.train(parsedData, 2, maxIterations=20,
runs=1, initializationMode=”k-means||”)

Save some output:


You can also run a standalone Python script using spark-submit instead of the shell.

./bin/spark-submit —master spark://<master-ip>:7077

Remember you’ll have to instantiate your own SparkContext in this case.


Future Improvements

The goal of this tutorial is to make things easy.

There are many areas for improvement – for instance – on-demand machines on Amazon are the most expensive.

Spark still spins up “m1.large” instances, even though EC2′s current documentation recommends using the better, faster, AND cheaper “m3.large” instance instead.

At the same time, that custom configuration could mean we can’t use the spark-ec2 script to spin up the cluster automatically. There might be an option there to choose. I didn’t really look.

One major reason I wrote this tutorial is because all the information in it is out there in some form, but it is disparate and some of it can be hard to find without knowing what to search for.

So that’s it. The easiest possible way to run distributed machine learning.

How do you do distributed machine learning?

#apache #aws #big data #data science #ec2 #emr #machine learning #python #spark

Go to comments

Install all your statistics and numerical computation libraries for Python in one go on Ubuntu

July 26, 2014

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

#numpy #python #scientific computing #scipy #statistics

Go to comments