Lazy Programmer

Your source for the latest in deep learning, big data, data science, and artificial intelligence.

New Year's Udemy Coupons! All Udemy Courses only $10

January 1, 2017

Act fast! These $10 Udemy Coupons expire in 10 days.

Ensemble Machine Learning: Random Forest and AdaBoost

Deep Learning Prerequisites: Linear Regression in Python

Deep Learning Prerequisites: Logistic Regression in Python

Deep Learning in Python

Practical Deep Learning in Theano and TensorFlow

Deep Learning: Convolutional Neural Networks in Python

Unsupervised Deep Learning in Python

Deep Learning: Recurrent Neural Networks in Python

Advanced Natural Language Processing: Deep Learning in Python

Easy Natural Language Processing in Python

Cluster Analysis and Unsupervised Machine Learning in Python

Unsupervised Machine Learning: Hidden Markov Models in Python

Data Science: Supervised Machine Learning in Python

Bayesian Machine Learning in Python: A/B Testing

SQL for Newbs and Marketers

How to get ANY course on Udemy for $10 (please use my coupons above for my courses):

Click here for a link to all courses on the site:

Click here for a great calculus prerequisite course:

Click here for a great Python prerequisite course:

Click here for a great linear algebra 1 prerequisite course:

Click here for a great linear algebra 2 prerequisite course:


New course – Natural Language Processing: Deep Learning in Python part 6

August 9, 2016


[Scroll to the bottom for the early bird discount if you already know what this course is about]

In this course we are going to look at advanced NLP using deep learning.

Previously, you learned about some of the basics, like how many NLP problems are just regular machine learning and data science problems in disguise, and simple, practical methods like bag-of-words and term-document matrices.

These allowed us to do some pretty cool things, like detect spam emails, write poetry, spin articles, and group together similar words.

In this course I’m going to show you how to do even more awesome things. We’ll learn not just 1, but 4 new architectures in this course.

First up is word2vec.

In this course, I’m going to show you exactly how word2vec works, from theory to implementation, and you’ll see that it’s merely the application of skills you already know.

Word2vec is interesting because it magically maps words to a vector space where you can find analogies, like:

  • king – man = queen – woman
  • France – Paris = England – London
  • December – November = July – June

We are also going to look at the GloVe method, which also finds word vectors, but uses a technique called matrix factorization, a popular algorithm for recommender systems.

Amazingly, the word vectors produced by GloVe are just as good as the ones produced by word2vec, and it’s way easier to train.

We will also look at some classical NLP problems, like parts-of-speech tagging and named entity recognition, and use recurrent neural networks to solve them. You’ll see that just about any problem can be solved using neural networks, but you’ll also learn the dangers of having too much complexity.

Lastly, you’ll learn about recursive neural networks, which finally help us solve the problem of negation in sentiment analysis. Recursive neural networks exploit the fact that sentences have a tree structure, and we can finally get away from naively using bag-of-words.

All of the materials required for this course can be downloaded and installed for FREE. We will do most of our work in Numpy, Matplotlib, and Theano. I am always available to answer your questions and help you along your data science journey.

See you in class!

UPDATE: New coupon if the above is sold out:

#deep learning #GloVe #natural language processing #nlp #python #recursive neural networks #tensorflow #theano #word2vec


Data Science: Natural Language Processing in Python

February 11, 2016

Do you want to learn natural language processing from the ground-up?

If you hate math and want to jump into purely practical coding examples, my new course is for you.

You can check it out at Udemy:

I am posting the course summary here also for convenience:


In this course you will build MULTIPLE practical systems using natural language processing, or NLP. This course is not part of my deep learning series, so there are no mathematical prerequisites – just straight up coding in Python. All the materials for this course are FREE.

After a brief discussion about what NLP is and what it can do, we will begin building very useful stuff. The first thing we’ll build is a spam detector. You likely get very little spam these days, compared to say, the early 2000s, because of systems like these.

Next we’ll build a model for sentiment analysis in Python. This is something that allows us to assign a score to a block of text that tells us how positive or negative it is. People have used sentiment analysis on Twitter to predict the stock market.

We’ll go over some practical tools and techniques like the NLTK (Natural Language Toolkit) library and latent semantic analysis (LSA).

Finally, we end the course by building an article spinner. This is a very hard problem, and even the most popular products out there these days don’t get it right. These lectures are designed to just get you started and to give you ideas for how you might improve on them yourself. Once mastered, you can use it as an SEO (search engine optimization) tool. Internet marketers everywhere will love you if you can do this for them!

As a thank you for visiting this site, I’ve created a coupon that gets you 70% off.

Click here to get the course for only $15.

#article spinner #latent semantic analysis #latent semantic indexing #machine learning #natural language processing #nlp #pca #python #spam detection #svd


Probability Smoothing for Natural Language Processing

January 23, 2016


Level: Beginner

Topic: Natural language processing (NLP)

This is a very basic technique that can be applied to most machine learning algorithms you will come across when you’re doing NLP.

Suppose for example, you are creating a “bag of words” model, and you have just collected data from a set of documents with a very small vocabulary. Your dictionary looks like this:

{"cat": 10, "dog": 10, "parrot": 10}

You would naturally assume that the probability of seeing the word “cat” is 1/3, and similarly P(dog) = 1/3 and P(parrot) = 1/3.

Now, suppose I want to determine the probability of P(mouse). Since “mouse” does not appear in my dictionary, its count is 0, therefore P(mouse) = 0.

This is a problem!

If you wanted to do something like calculate a likelihood, you’d have $$ P(\text{document}) = P(\text{words that are not mouse}) \times P(\text{mouse}) = 0 $$

This is where smoothing enters the picture.

We simply add 1 to the numerator, and add the vocabulary size (V = the number of distinct words) to the denominator of our probability estimate:

$$ P(\text{word}) = \frac{\text{count}(\text{word}) + 1}{N + V} $$

where N is the total number of words in the sample.

Now our probabilities will approach 0, but never actually reach 0.

For a word we haven’t seen before, the probability is simply:

$$ P(\text{new word}) = \frac{1}{N + V} $$

You can see how this accounts for sample size as well.

If our sample size is small, we will have more smoothing, because N will be smaller.
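To make this concrete, here is a minimal sketch of add-1 smoothing applied to the toy dictionary above (the helper name `smoothed_prob` is mine, not from any library):

```python
# Add-1 (Laplace) smoothing on the toy word-count dictionary from above.
counts = {"cat": 10, "dog": 10, "parrot": 10}

N = sum(counts.values())   # total number of words seen (30)
V = len(counts)            # vocabulary size: number of distinct words (3)

def smoothed_prob(word):
    # count + 1 in the numerator, N + V in the denominator
    return (counts.get(word, 0) + 1) / (N + V)

print(smoothed_prob("cat"))    # 11/33 = 1/3
print(smoothed_prob("mouse"))  # 1/33 — small, but no longer zero
```

Note that seen words give up a little probability mass (11/33 instead of 10/30) so that unseen words can get a non-zero share.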


N-gram probability smoothing for natural language processing

An n-gram (e.g. bigram, trigram) is a sequence of n consecutive words; an n-gram model estimates the probability of a word given the past n − 1 words.

For example, in recent years, \( P(scientist | data) \) has probably overtaken \( P(analyst | data) \).

In general we want to measure:

$$ P(w_i | w_{i-1}) $$

This probably looks familiar if you’ve ever studied Markov models.

You can see how such a model would be useful for, say, article spinning.

You could potentially automate writing content online by learning from a huge corpus of documents, and sampling from a Markov chain to create new documents.

Disclaimer: you will get garbage results, many have tried and failed, and Google already knows how to catch you doing it. It will take much more ingenuity to solve this problem.
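For illustration only (this is a toy, not a real spinner), here is how sampling new text from a bigram Markov chain might look; the tiny corpus and the `generate` helper are made up for this sketch:

```python
import random
from collections import defaultdict

# Toy corpus; a real system would learn from a huge collection of documents.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Bigram transitions: each word maps to the list of words that followed it,
# so sampling from the list is automatically weighted by the observed counts.
transitions = defaultdict(list)
for prev, curr in zip(corpus, corpus[1:]):
    transitions[prev].append(curr)

def generate(start, length=8, seed=0):
    random.seed(seed)  # fixed seed just to make the sketch repeatable
    words = [start]
    for _ in range(length - 1):
        followers = transitions.get(words[-1])
        if not followers:       # dead end: word was never followed by anything
            break
        words.append(random.choice(followers))
    return " ".join(words)

print(generate("the"))
```
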

The maximum likelihood estimate for the above conditional probability is:

$$ P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})} $$

You can see that as we increase the complexity of our model, say, to trigrams instead of bigrams, we would need more data in order to estimate these probabilities accurately.

$$ P(w_i | w_{i-1}, w_{i-2}) = \frac{\text{count}(w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-2}, w_{i-1})} $$
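A minimal sketch of computing the maximum likelihood bigram estimate from raw counts (the toy corpus and `p_mle` helper are mine):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    # count(prev, w) / count(prev); assumes prev was seen in the corpus
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("cat", "the"))  # count(the, cat) = 2, count(the) = 3 -> 2/3
print(p_mle("mat", "the"))  # count(the, mat) = 1, count(the) = 3 -> 1/3
```

Any bigram that never occurred gets probability exactly 0 here, which is precisely the problem smoothing is meant to fix.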

So what do we do?

You could use the simple “add-1” method above (also called Laplace Smoothing), or you can use linear interpolation.

What does this mean? It means we simply make the probability a linear combination of the maximum likelihood estimates of itself and lower order probabilities.

It’s easier to see in math…

$$ P(w_i | w_{i-1}, w_{i-2}) = \lambda_3 P_{ML}(w_i | w_{i-1}, w_{i-2}) + \lambda_2 P_{ML}(w_i | w_{i-1}) + \lambda_1 P_{ML}(w_i) $$

We treat the lambdas like probabilities, so we have the constraints \( \lambda_i \geq 0 \) and \( \sum_i \lambda_i = 1 \).

The question now is, how do we learn the values of lambda?

One method is “held-out estimation” (the same thing you’d do to choose hyperparameters for a neural network). You hold out a part of your training set, and choose the values of lambda that maximize the likelihood (or minimize the error) on that held-out portion.

If you have ever studied constrained optimization (e.g. linear programming), you can see how it is related to solving the above problem.

Another method might be to base it on the counts. This works similarly to the “add-1” method described above. If we have a high count behind \( P_{ML}(w_i | w_{i-1}, w_{i-2}) \), we want to lean on that estimate; if the count is low, we know we have to depend more on \( P_{ML}(w_i) \).
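Here is a minimal sketch of linear interpolation on a toy corpus; the lambdas are picked arbitrarily for illustration (held-out estimation would choose them properly), and the `p_interp` helper is mine:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran on the mat".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_interp(w, w1, w2, lambdas=(0.2, 0.3, 0.5)):
    """Interpolated P(w | w2, w1): lambdas = (l1, l2, l3) weight the
    unigram, bigram, and trigram MLEs and must sum to 1.
    w1 is the previous word (w_{i-1}), w2 the one before it (w_{i-2})."""
    l1, l2, l3 = lambdas
    p_uni = uni[w] / N
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

print(p_interp("sat", "cat", "the"))  # 0.5*0.5 + 0.3*0.5 + 0.2*(1/12)
```

Notice that when the trigram or bigram history was never seen, those terms contribute 0 and the unigram term keeps the estimate non-zero for any word that appears in the corpus at all.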

Good-Turing smoothing and Kneser-Ney smoothing

These are more complicated topics that we won’t cover here, but may be covered in the future if the opportunity arises.

Have you had success with probability smoothing in NLP? Let me know in the comments below!


